June 2021

By Nigel Meakins

Local Development using Databricks Clusters

This entry is part 5 of 6 in the series Development on Databricks

In this post we’ll be looking at using our local development environment, with all the productivity benefits of IDE tools, against a Databricks cluster. We’ve covered how to use a local Spark setup, with benefits such as cost savings and isolated working, but this will only take you so far. You might need collaborative development and testing, or you may be at the point where you simply need the power of a databricks cluster. Whatever the motivation, being able to hook your development workstation into the databricks service is something you’ll want to consider at some point.

Wiring up with Databricks Connect

As the name suggests, this allows a local connection to a databricks cluster, so you can issue actions against your databricks environment from your own machine. To connect your favourite local development tools to your databricks cluster, you’ll need the ‘databricks-connect‘ python package. Before we get too giddy at this prospect, there are however a number of caveats to be aware of.

Available Versions of Databricks Connect

A prerequisite of working with databricks connect against your cluster is that the cluster runtime’s major and minor versions match those of the databricks connect package. Not all versions of the runtime are supported, which is something of a pain and does leave the tail wagging the dog in this regard. In essence, the availability of the databricks connect packages will dictate the runtime version of databricks that you choose for your development cluster. You are of course free to use a later version beyond the development environment, as you won’t need databricks connect outside of this, but that does add a small element of risk around the disparity of runtime versions between environments.

At the time of writing, if you are going to use databricks connect, you essentially have databricks runtimes 7.3 and 8.1 to choose from (unless you stretch back to versions well out of support). If you’d like further information on the versioning aspects, please take a look at https://docs.databricks.com/release-notes/dbconnect/index.html.
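
If you want to double-check what you have installed locally, a quick way is to query the package metadata. Below is a small sanity check (assuming Python 3.8+ for importlib.metadata); the version it prints should match your cluster runtime’s major and minor versions.

from importlib.metadata import version

# the major.minor reported here should match your cluster's databricks runtime, e.g. 8.1.x
print(version("databricks-connect"))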

Scala and Runtime 8.1

As you’re probably going to opt for runtime 8.1, be aware that Scala developers will need to install Scala version 2.12 for local development against the cluster.

Okay, now that we’ve got all the caveats around versioning out of the way, we can crack on.

Setup of Databricks Connect for Azure Databricks

You can find the guide for setting up your client machines and Azure databricks workspace clusters for databricks connect at https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect. If you’ve been running Spark locally, you will need to change your PYSPARK_PYTHON and SPARK_HOME environment variables whilst using databricks connect, as mentioned in the article.

Specifying Environment Variables in PyCharm

If you are using PyCharm, you can set environment variables for each of your run configurations. Go to Run | Edit Configurations… and then, for the relevant configuration, amend your PYSPARK_PYTHON and SPARK_HOME variables as required, as shown below.

PyCharm Edit Configuration Environment Variables

You can easily copy the environment variables from one configuration to another using the UI.

Note: Embedding other existing environment variables within a value, using the usual %MY_ENV_VAR% syntax, appears not to work in PyCharm. I haven’t pursued this one and tend to take the longer way round of copying in the full variable value where I would otherwise have embedded it. For our variables above this is not an issue, but it is something to note if you were, for example, using %SPARK_HOME% in your Path variable value.
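
As a quick sanity check that a run configuration is picking up the intended values, you can print the variables along with the location of the pyspark package actually being imported. This is just an illustrative snippet rather than anything from the setup guide.

import os
import pyspark

# confirm the environment points at the databricks-connect installation
print("SPARK_HOME     =", os.environ.get("SPARK_HOME"))
print("PYSPARK_PYTHON =", os.environ.get("PYSPARK_PYTHON"))
print("pyspark loaded from:", os.path.dirname(pyspark.__file__))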

Time to Check the Small Print

Before you start your local development setup with databricks connect, ensure you read the Limitations section of that guide. Two important development areas that are not supported are worth highlighting here:

  • Spark Structured Streaming is not possible
  • Some elements of Delta Lake APIs won’t work

The docs state that all Delta API calls are unsupported, but I’ve found that some do work, such as DeltaTable.isDeltaTable, whereas DeltaTable.forPath and DeltaTable.forName do not. Also, using spark.sql(“””Create table….. using delta…”””) won’t work (contrary to the docs), complaining that it cannot instantiate a DeltaDataSource provider. You can however still read and write delta tables using spark.read.format(“delta”) and dataframe.write.format(“delta”).save(“path/to/delta_table”).
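
To illustrate, here’s a minimal sketch of the read/write route that does work over databricks connect; the path used is a placeholder, so point it at your own storage.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
delta_path = "dbfs:/FileStore/tmp/delta_demo"  # placeholder path

# writing via the DataFrame writer works over databricks connect
spark.range(5).write.format("delta").mode("overwrite").save(delta_path)

# reading it back also works
spark.read.format("delta").load(delta_path).show()

# by contrast, DeltaTable.forPath(spark, delta_path) fails when called over databricks connect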

This disparity in available APIs may lead you to take a different approach to development when using delta lake, one that doesn’t involve databricks connect. I tend to work on these aspects locally, to the point where I’m happy to push the code up as jars or wheels for executing directly on the cluster.

Okay, if you’re fine with those limitations then there is one additional piece of info required for us to get started with local development using a databricks cluster.

Missing Spark Configuration Setting

The above setup guide neglects to include a Spark configuration value for your cluster relating to the databricks service port. With this added, the required list of cluster configuration settings is as below:

spark.databricks.service.port 8787
spark.databricks.service.server.enabled true

Add these to your cluster via the ‘Advanced Options | Spark | Spark Config’ section and your cluster is all databricks connect friendly and ready to receive.
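
With the settings in place, a trivial action issued from your local environment makes a handy smoke test that work is actually reaching the cluster (the databricks-connect package also provides a ‘databricks-connect test’ command for a fuller check).

from pyspark.sql import SparkSession

# with databricks-connect configured, this session is backed by the remote cluster
spark = SparkSession.builder.getOrCreate()

# a simple action; if the cluster configuration above is in place, this executes remotely
print(spark.range(100).count())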

Create a Separate Python Environment

It makes a lot of sense to create a dedicated Python environment for your databricks connect package, keeping it apart from any ‘native’ PySpark installation you may already have, such as the one used for working locally as per the previous post Local Databricks Development on Windows.
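
If you’d rather script the environment creation, below is a minimal sketch using Python’s built-in venv module; the environment name and pinned version are purely illustrative, so match the version to your chosen cluster runtime.

import subprocess
import sys
import venv

# create an isolated environment for databricks-connect (the name is illustrative)
venv.create("dbconnect-env", with_pip=True)

# pip lives in Scripts on Windows and bin elsewhere
pip_path = r"dbconnect-env\Scripts\pip.exe" if sys.platform == "win32" else "dbconnect-env/bin/pip"

# pin to the major.minor of your chosen runtime, e.g. 8.1
subprocess.check_call([pip_path, "install", "databricks-connect==8.1.*"])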

Connecting to Different Clusters

The databricks connect package uses a file in your home directory called ‘.databricks-connect‘ in order to connect to your cluster. If you are using multiple clusters for different aspects of your work, you could swap various files in and out to manage your cluster connections, but this would be messy, error prone and hard to coordinate. A much better approach is to simply change the Spark configuration settings used for the cluster connection. Credit for this idea goes to Ivan G in his post at https://dev.to/aloneguid/tips-and-tricks-for-using-python-with-databricks-connect-593k. We’ll cover this in the next section.

Determining the Execution Environment

We basically have three scenarios to consider when executing our Spark code for databricks.

  1. Local development executing against a local Spark installation (as covered in the previous post Local Databricks Development on Windows)
  2. Local development executing against a databricks cluster via databricks connect
  3. Execution directly on a databricks cluster, such as with a notebook or job.

Our spark session will be set up differently for each of these scenarios, and it makes sense to have a way of determining programmatically which of them is relevant. There are a number of configuration settings and environment elements that we can examine to deduce this, as outlined below:

  • Only scenario 3, Execution directly on a databricks cluster, will return a name from the spark configuration setting ‘spark.databricks.clusterUsageTags.clusterName‘.
  • Databricks connect uses a different code base for the pyspark package, which includes an additional ‘databricks‘ directory.

I should add that these are the current determinants, and you should test that they still hold with each change in databricks runtime and the related databricks-connect package release.

With those conditions to work with, I have created a ‘SparkSessionUtil‘ class that configures our required Spark Session for us.

Local Spark Setup

For local Spark installations we can turn off a number of settings that don’t apply if we are not on a cluster (there are probably more but these should be a good set to run with).

Databricks Connect against a Specific Workspace Cluster

We can pass a cluster id when working with databricks connect if we want to use a cluster different from the one set in our ‘databricks-connect‘ configuration file.

Direct Cluster Execution

If we are executing directly on our cluster (scenario 3 above), we don’t need to do anything with our Spark session, so we simply return the original session.

Right, finally, some code.

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession


class SparkSessionUtil:
    """
    Helper class for configuring Spark session based on the spark environment being used.
    Determines whether we are using local Spark, databricks-connect or direct execution on a cluster, and sets up
    config settings for local Spark as required.
    """

    DATABRICKS_SERVICE_PORT = "8787"

    @staticmethod
    def get_configured_spark_session(cluster_id=None):
        """
        Determines the execution environment and returns a spark session configured for either local or cluster usage
        accordingly
        :param cluster_id: a cluster_id to connect to if using databricks-connect
        :return: a configured spark session
        """
        spark = SparkSession.builder.getOrCreate()
        if SparkSessionUtil.is_cluster_direct_exec(spark):
            # simply return the existing spark session
            return spark
        conf = SparkConf()
        # copy all the configuration values from the current Spark Context
        for (k, v) in spark.sparkContext.getConf().getAll():
            conf.set(k, v)
        if SparkSessionUtil.is_databricks_connect(spark):
            # set the cluster for execution as required
            # note: we are unable to check whether the cluster_id has changed as this setting is unset at this point
            if cluster_id:
                conf.set("spark.databricks.service.clusterId", cluster_id)
                conf.set("spark.databricks.service.port", DATABRICKS_SERVICE_PORT)
                # stop the spark session context in order to create a new one with the required cluster_id, else we
                # will still use the current cluster_id for execution
                spark.stop()
                con = SparkContext(conf=conf)
                sess = SparkSession(con)
                return sess.builder.config(conf=conf).getOrCreate()
            # no cluster_id supplied, so use the cluster already configured for databricks-connect
            return spark
        else:
            # set up for local spark installation
            conf.set("spark.broadcast.compress", "false")
            conf.set("spark.shuffle.compress", "false")
            conf.set("spark.shuffle.spill.compress", "false")
            conf.set("spark.master", "local[*]")
            return SparkSession.builder.config(conf=conf).getOrCreate()

    @staticmethod
    def is_databricks_connect(spark):
        """
        Determines whether the spark session is using databricks-connect, based on the existence of a 'databricks'
        directory within the SPARK_HOME directory
        :param spark: the spark session
        :return: True if using databricks-connect to connect to a cluster, else False
        """
        return os.path.isdir(os.path.join(os.environ.get('SPARK_HOME'), 'databricks'))

    @staticmethod
    def is_cluster_direct_exec(spark):
        """
        Determines whether executing directly on cluster, based on the existence of the clusterName configuration
        setting
        :param spark: the spark session
        :return: True if executing directly on a cluster, else False
        """
        # Note: using spark.conf.get(...) will cause the cluster to start, whereas spark.sparkContext.getConf().get does
        # not. As we may want to change the clusterid when using databricks-connect we don't want to start the wrong
        # cluster prematurely.
        return spark.sparkContext.getConf().get("spark.databricks.clusterUsageTags.clusterName", None) is not None


# specify a cluster_id if needing to change from the databricks connect configured cluster
spark = SparkSessionUtil.get_configured_spark_session(cluster_id="nnnn-mmmmmm-qqqqxx")

And We’re ‘Go’ for Local Development using Databricks Clusters

Okay, assuming you’ve followed the setup guide at https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect, and added the missing configuration setting above, you should now be good to go. Go forth and develop locally whilst running your code either against your local Spark installation or your amped up, supercharged, megalomaniacal databricks clusters. Can I get a ‘Woop Woop’? No? Oh well, worth a try…

Using Your Own Libraries

One common issue when using libraries that are still somewhat ‘volatile’ is that updating them on the cluster will cause issues for other users. You may be working on elements of the development that use a library version different from the one used by a colleague, such as when referencing in-house libraries that are still evolving.

Databricks uses three scopes for library installation, as summarised below:

Workspace Libraries

These are available across the databricks workspace, and can be referenced when installing onto clusters as required. Please see https://docs.microsoft.com/en-us/azure/databricks/libraries/workspace-libraries for more information.

Cluster Libraries

These have been installed on a cluster and can be referenced from any code running on the cluster. Please see https://docs.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries for more information. The big gotcha with these is the need to restart the cluster if you need to change the library code. This is a disruption most dev teams could do without and you won’t be popular if your libraries are in a state of rapid flux.

Notebook Libraries

These are available within a single notebook, allowing the best isolation from other users’ code and the least disruption. You can read more about them at https://docs.microsoft.com/en-us/azure/databricks/libraries/notebooks-python-libraries. One pain with using %pip for dbfs-located packages results from databricks replacing periods, hyphens and spaces with underscores in the file name. You will need to rename the file back to its correct name, conforming to the wheel package naming standards, before you can use %pip from the notebook.
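
As a hypothetical example of fixing this up from a notebook cell before installing, assuming a wheel uploaded to dbfs:/FileStore/wheels whose name has been mangled (both file names below are made up):

# run in a notebook cell before the %pip install; the file names here are hypothetical
mangled_name = "dbfs:/FileStore/wheels/lib1_0_3_py3_none_any.whl"
correct_name = "dbfs:/FileStore/wheels/lib1-0.3-py3-none-any.whl"
dbutils.fs.mv(mangled_name, correct_name)

# then, in a separate cell:
# %pip install /dbfs/FileStore/wheels/lib1-0.3-py3-none-any.whl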

Adding References to Your Own Versions of Libraries

If you’re not doing notebook-based development, for whatever reason, then the option of upsetting your colleagues by using cluster libraries may not sit well. You can however make the required packages available for local development against databricks by adding a reference to the egg/jar to the SparkContext, using either ‘addPyFile‘ (Python) or ‘addJar‘ (Scala). This is mentioned in the databricks connect setup article at https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect. First, add your files to a location accessible to your workspace clusters, such as a storage mount point or the dbfs:/FileStore.

The only scenario in which we really need to consider volatile library issues is when using databricks connect to execute against a cluster (scenario 2 above). The local Spark installation approach (scenario 1 above) is by its nature not shared, so we’re free to use whatever we like there. And if you are executing directly on a cluster (scenario 3 above), we can usually assume you have installed the required libraries on the cluster, so the packages will be available.

Here’s some code for using our own library versions.

Python

if SparkSessionUtil.is_databricks_connect(spark):
    spark.sparkContext.addPyFile("dbfs:/FileStore/jars/lib1-0.3-py3.8.egg")
    # insert additional libraries here...

from lib1 import mod1

some_blah = mod1.some_method('blah')

Scala

if (SparkSessionUtil.isDatabricksConnect(spark)) {
    spark.sparkContext.addJar("dbfs:/FileStore/jars/lib1_2.12-0.3.jar")
    // insert additional libraries here...
}
import lib1.mod1

val someBlah = mod1.someMethod("blah")

Notice that the import statements now come after the adding of the referenced library files.

If you do find that you need to extend this to scenario 3, you can simply add a condition based on the ‘SparkSessionUtil.is_cluster_direct_exec‘ method. At some point, however, you’ll probably want to use cluster-installed libraries.

If you add the above to your code entry point, including your referenced libraries as necessary, you can then manage your libraries’ versions independently of the cluster. This avoids disruption to other team members and any incoming low-flying donuts/staplers/deckchairs that may result.

Using DBUtils

Limitations with Databricks Connect

You’ll only be able to use the secrets and file system (fs) elements of DBUtils if you are using databricks connect. This shouldn’t be an issue as the other elements are generally more for notebook development.
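
For Python, a minimal sketch of getting hold of DBUtils over databricks connect is shown below, assuming the pyspark distribution shipped with the databricks-connect package (which includes the pyspark.dbutils module).

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # shipped with the databricks-connect pyspark distribution

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# fs and secrets are the elements that work over databricks connect
print(dbutils.fs.ls("dbfs:/FileStore"))
print(dbutils.secrets.listScopes())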

Compiling with Scala 2.12

If you are developing with Scala against Spark 3.0+, such as with databricks runtimes 8.x, you won’t have the required DBUtils library readily available. You can find the DBUtils version for Scala 2.11 on Maven, but 2.12 is not there. It is however available on the driver node of your clusters. You can use the Web Terminal within the Databricks service to connect to the driver node and grab it from there for local development. First you’ll need to copy it from the driver node’s source directory to somewhere more accessible. The commands below will do this for those not familiar with bash.

# Create a directory within the FileStore directory, which is available from our workspace
mkdir -p /dbfs/FileStore/dbutils
# copy from the driver node jars directory to our new accessible dbutils directory
cp /databricks/jars/spark--dbutils--dbutils-api-spark_3.1_2.12_deploy.jar /dbfs/FileStore/dbutils/

You can then either use the databricks CLI to copy across locally, or simply browse to the file in your web browser. If you do the latter you’ll need to grab your workspace URL from the Azure Portal – you’ll find it top right of the workspace resource.

Databricks Workspace URL Azure Portal

Your URL will have a format like that below.

https://adb-myorgidxxxxx.nn.azuredatabricks.net/?o=myorgidxxxxx

You can access the file required by adding the file path, where ‘/files/’ refers to the ‘FileStore’ source, so ‘/files/dbutils/spark--dbutils--dbutils-api-spark_3.1_2.12_deploy.jar’ is our file of interest.

https://adb-myorgidxxxxx.nn.azuredatabricks.net/files/dbutils/spark--dbutils--dbutils-api-spark_3.1_2.12_deploy.jar?o=myorgidxxxxx

You can then use this jar file for local development with Scala 2.12. Notice the naming convention used includes the Scala version. Now can I get that ‘Woop Woop’? I can, oh that’s wonderful. I’m welling up here…

Thanks for Reading

You should hopefully have a good few things to help you with your databricks development by this point in our series. We’ve covered setting up your own local Spark and also local development using databricks clusters, thereby catering for the most productive development scenarios. You’ll be cranking out great code in no time via your favourite development tools, with integrated support for testing frameworks, debugging and all those things that make coding a breeze. Ta ta for now.
