In this post we’ll look at why you may want to consider alternatives to Databricks notebooks for development, and what that means for teams starting out on the unified analytics platform. We’ll go through some of the common areas of difficulty with notebook development on Databricks and how they compare with using an IDE. This post is based on my own experiences of implementing projects with clients on the platform, and as such should be taken as my own opinion on the matter: something to take on board for consideration rather than dogma, mantra, gospel or any other theologically aligned thinking. Ultimately things depend on your own needs and circumstances, and hopefully this article will help raise awareness of other ways of working.
If you’ve done any development on Databricks you’ve probably used Databricks notebooks. They are the first port of call for many of us working with data on the platform. Very popular with Data Scientists in particular, they provide an environment where data transformation, wrangling, munging, or whatever your preferred term is for generally ‘doing stuff’ with data and visualising the results, is readily accessible. There is no software to install, no real configuration, and a generally gentle learning curve to getting working. The team can easily collaborate by creating their notebooks in workspaces and progress on their data journeys unhindered. Similar in many ways to Jupyter notebooks, they offer familiar functionality and compatibility with a well-established way of working that is widely used across Data Science teams. For data exploration, quick analysis exercises and so forth they are very appealing. Many very mature data teams never really leave the notebook environment, finding all they need in this native offering.
Another popular use for notebooks is to quickly ‘proof out’ something before moving onto more in-depth development work. The ease with which you can execute and visualise results makes databricks notebooks ideal for these rapid turn-around development tasks.
Before I find myself cast as the pantomime villain, just to be clear: there is absolutely nothing ‘bad’ about development with just notebooks; after all, they are perfectly capable of implementing code across all aspects of the Databricks analytics platform. They were, however, at least as far as I can tell, never really intended to serve as fully fledged environments for complex development.
As many teams have discovered, when you get into more involved data engineering activities, working with code that would benefit from better structuring, visibility and reuse, you may find yourself wanting to revisit this development approach. From what we’ve experienced, there is a definite tipping point beyond which working this way has a considerable impact on developer productivity. When you try to exercise any of the well-established modern development practices that are second nature to most developers, such as breaking code down into encapsulated modules, refactoring, and writing unit tests alongside your code, notebook-based development poses a lot of challenges. Aspects of development that are taken for granted in a respectable Integrated Development Environment (IDE) are simply not readily available to your notebook endeavours. As things progress you’ll inevitably find maintaining and expanding on your efforts increasingly difficult, and productivity is likely to suffer considerably as a result.
Okay, enough doom mongering for one lifetime. The simple solution when you’re struggling with one tool is of course to change to one better suited to the job at hand. For complex data engineering or analytical development on Databricks, this means pretty much the same as with any substantial code effort: using an IDE.
Whenever we see clients that are set on the attractive simplicity of notebook-based development, we always make them aware of the alternative ways of working that IDEs offer. This allows them to understand the overheads and feasibility of each approach and make an informed decision as to which will best suit their needs. After all, in the short term notebook-based development may be fine, but what will be required as time and demands progress to the next level of development complexity is what will determine the best approach.
Here are some of the development experiences that differ noticeably between the two ways of working.
In Python, Scala, Java and just about any other language you’ll be used to structuring your code into libraries, packages and modules or their equivalents, with best practices for how to structure the files. You’ll be familiar with distributing these logical units and referencing them from within your code using import statements. Navigating your code in the IDE is straightforward, structured into projects, and assisted by search and ‘Go To Definition’ functionality for all your code. You can quickly bounce around from one block of code to another in another file somewhere like an over-caffeinated tech evangelist tipsy on the latest Kool-Aid.
As part of encapsulating your development code, you may have created various Databricks notebooks, perhaps grouping up functions and classes much as you would in standard Python modules. You might even put these in a hierarchy of folders in order to structure them better, as you would with Python packages and modules. This does of course help prevent duplication of code, aid understanding and reuse, and should make your notebook-based development easier to manage.
Note however that you won’t have a true library/package/module hierarchy, as the folder structure doesn’t enforce any concept of scope on your code definitions. Any functions or classes defined in one notebook will clash with same-named definitions in other notebooks regardless of folder structure, unlike properly structured Python or Scala code.
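As a minimal sketch of the problem (the notebook names and the load_data function below are hypothetical), two notebooks that each define a function with the same name will simply overwrite one another once both are brought into the same session:

```python
# Contents of a hypothetical notebook at /ingest/helpers
def load_data(path):
    # Reads raw parquet files
    return spark.read.parquet(path)


# Contents of another hypothetical notebook at /reporting/helpers
def load_data(path):
    # Reads curated CSV extracts
    return spark.read.csv(path, header=True)


# Once both notebooks have been brought into the same session (e.g. via %run),
# only one load_data remains in scope -- whichever was defined last wins,
# regardless of the folder each notebook lives in.
```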
The notebook navigation experience in the workspace isn’t the best, however, and you’ll find yourself opening a lot of tabs as you refer to various elements of your code across different notebooks. The UI isn’t really intended for too much to-ing and fro-ing between notebooks, preferring a more ‘all in one place’ way of working. You may start to get a little dizzy as you find your way around the various browser tabs of your code.
Not much to say here really, you structure your code into projects, packages and modules and add the respective import statements. If you want to refactor files, move things around or whatever, the IDE generally takes care of ensuring that everything still lines up nicely.
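For example, a hypothetical Python project in the IDE might be laid out and referenced along these lines (all names here are made up for illustration):

```python
# A hypothetical project layout:
#
#   my_project/
#       ingestion/
#           __init__.py
#           loaders.py
#       transformation/
#           __init__.py
#           cleansing.py
#       tests/
#           test_cleansing.py
#
# transformation/cleansing.py can then reference the ingestion package directly:
from ingestion.loaders import load_customer_data


def cleanse_customers(path):
    # Load and de-duplicate the customer data
    df = load_customer_data(path)
    return df.dropDuplicates()
```

If you later decide to rename loaders.py or move it into another package, the IDE’s refactoring tools will update the import statements for you.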
In order to make use of your nicely structured notebook development, you’ll need to make sure it has been loaded into your Spark session by running the required notebooks on which your Databricks code depends. You’ll soon find yourself maintaining rather ugly notebooks that are simply wrappers for calling all these other required notebooks, with lots of %run magic statements, such as a ‘run_all’ notebook.
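A sketch of what such a wrapper might look like (the notebook paths here are hypothetical, and each %run sits in its own cell):

```python
# 'run_all' notebook -- nothing but %run statements pulling in the
# notebooks that define our functions and classes
%run ./ingestion/loaders

%run ./transformation/cleansing

%run ./utils/logging_helpers
```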
You then need to make sure you call this wrapper notebook wherever you need to make use of the code.
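Roughly along these lines (again, the function name here is hypothetical):

```python
# First cell of the consuming notebook: pull in every dependency
%run ./run_all

# Later cells can now call anything the wrapped notebooks defined,
# e.g. a function brought in from ./ingestion/loaders
df = load_customer_data('/mnt/raw/customers')
```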
This can introduce some development pains. Neglecting to add the required %run statements for loading new elements of code is easily done, resulting in calls to undefined code. Additionally, should you decide to restructure your code through some attempt at refactoring, you’ll find yourself having to rework a load of paths in these wrappers. It is easy to introduce bugs and harder to improve your code as a result, both of which will impact your notebook development productivity.
In both Databricks notebooks and IDE project development you can of course reference libraries in the standard fashion for the language in question. Again we use an import statement approach, ensuring that the libraries in question are made available on the cluster executing the code or within the session in question.
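For instance (a trivial sketch, assuming the libraries have already been installed on the cluster or in your local environment):

```python
# Identical in a notebook cell or an IDE-authored module, provided the
# libraries are installed wherever the code runs
import pandas as pd
from pyspark.sql import functions as F
```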
I’ll talk later in the series on different ways of making these libraries available to your executing code, as there are some differences around managing the impact of updates to the libraries.
Every modern IDE has some degree of integration with unit testing frameworks built in, either in the core product or through various plugins and extensions. You can often right-click on an element of your code and ‘Generate Unit Tests’ to quickly sketch out some tests and get started on ensuring things actually work. You may be following Test-Driven Development (TDD), working very tightly between code and validation/verification, and your IDE will try to make this a generally happy experience for all involved. Executing your tests, either in isolation or as part of a suite, is simple, efficient and transparent. You can find out more on IDE testing functionality for PyCharm here, JetBrains IntelliJ IDEA for Scala here, and Visual Studio Code Python testing here. I’ll be going into testing in a number of later posts in the series.
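By way of illustration (the module and function under test here are entirely hypothetical), a pytest test authored in the IDE is just another file the IDE can discover, run and debug for you:

```python
# tests/test_paths.py -- testing a hypothetical helper function
from utils.paths import build_landing_path


def test_build_landing_path_includes_date_partition():
    # The helper is expected to build a partitioned landing path
    assert build_landing_path("sales", "2021-01-31") == "/mnt/landing/sales/2021-01-31"
```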
In notebook development on Databricks you’ll preferably be grouping test suites into the same notebook and then executing either the whole notebook or individual cells for specific test cases, and you don’t have this tooling available. You can of course reference unit testing frameworks such as pytest or ScalaTest and crack on with implementing tests, but the ease with which you can exercise them and get feedback is limited. Depending on how much you embrace testing as part of your work (and the general advice is to make it at least a big enough part to avoid embarrassment or the QA firing squad) you may find this has a real impact.
In the IDE world of course we have nice debuggers that allow stepping through, over, around, behind your code like Donnie Yen in an Ip Man movie when you need to figure out why your code has coughed up the odd fur ball. Variables are available for inspection that point you happily along the way to bug squishing and all is happy once more. These tools are pretty indispensable and you soon see why.
In Databricks notebooks you run your cell, check the results and hopefully it did what you expected, or if not and you’re lucky, it is obvious where the development went belly up. However, there are times when you really, really need to see what is happening on each line, and where that bug is creeping in. Not being able to step into your code leaves you splattering your notebook development efforts with print() statements and the like in an attempt to play hunt the gremlin. Then, when you finally find the foible, you go back, remove all those print() splatters and get back to coding. Not great.
I won’t go into all the other advantages of using an IDE as we’ve all got better things to be doing. Suffice to say IDEs such as PyCharm, IntelliJ IDEA, VS Code and the like are generally crammed with a trove of productivity-enhancing tools and functionality that will put your development in a different league. With the community editions having such fantastic features you won’t need to shell out a penny, though for some the non-gratis editions may of course offer something additional worth the price tag.
Here’s a quick summary of the above points:
Functionality | IDE | Notebook |
---|---|---|
Data Visualisation | | ✓ |
Rapid Prototyping | | ✓ |
Ad-hoc Analysis and Exploration | | ✓ |
Code Structure and Navigation | ✓ | |
Referencing Your Code | ✓ | |
Testing | ✓ | |
Debugging | ✓ | |
Code Editing Productivity | ✓ | |
No-one said it has to be either/or of course. For many teams, the IDE is the tool of choice for the more complex library code development, with the productivity gains that IDEs offer. The notebook is there to tap in to these libraries, providing a development interface for further manipulating, exploring and visualising the resultant datasets. Each approach plays to its strengths.
What should you do if you’ve dived in and have a whole slew of notebooks that are the backbone of your data integration and analytics efforts, and are struggling with the development experience? Do not despair! It doesn’t have to descend into a frustrating, opaque, unmanageable quagmire of notebook nastiness. It’s generally not difficult to convert the notebook code to scripts. You can automate the exporting of your notebooks’ source code from workspaces using the databricks CLI workspace export_dir command. This will recursively export the notebooks within the specified folder into .py, .scala, .sql or .r files based on each notebook’s language.
```bash
# export all files recursively to the destination folder
databricks workspace export_dir /Shared/lib1 D:\tmp\shared\lib1
```
You will find that certain cells using magics need to be revisited, but if most of your code is based on the various Spark APIs this shouldn’t amount to much.
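As an example of the kind of change involved (assuming a simple %sql cell and a hypothetical table name), a SQL magic exported into a .py file surfaces as comments and typically gets rewritten as a spark.sql() call:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The exported notebook cell appears as comments in the .py file, e.g.:
# MAGIC %sql
# MAGIC SELECT count(*) FROM sales.orders

# One way to express the same thing in a plain Python module:
order_count = spark.sql("SELECT count(*) FROM sales.orders").collect()[0][0]
```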
From my experience of having worked on projects that are notebooks-based/heavy and on projects that are more IDE based, for me the IDE approach for data engineering is the way to go. When it comes to crafting code, a good ‘fat client’ IDE running on your workstation will make a massive difference, with all the responsiveness and functionality you need right there at your fingertips. Of course this is just my opinion on the matter.
We’ll be going through some of the most common aspects of developing on databricks throughout this series, so that you can see for yourself how to really get to grips with working on this fantastic platform. I hope this post has been of some use in deciding whether to opt for the IDE or remain with notebook-based development for your particular situation.
Thanks for reading and see you in the next post on Development on Databricks.
Most people using Power BI do so with Microsoft technology at the core of their business and IT operations. Power BI is rapidly adding connectors for non-Microsoft technologies and new capabilities on a monthly basis. The combination of new sources, excellent visualisation and modelling, and a low price point is leading to it being used with technologies other than the Microsoft data platform.
This blog is the result of one such project. I am currently using Power BI as the main reporting tool on a Google Cloud Platform (GCP) data project. There are lots of interesting topics to discuss given the technologies being used. However, this post is going to focus on programmatically refreshing your Power BI datasets using Python. “Why Python?” I hear you say. Well, the workflow tool used on GCP is none other than Apache Airflow, implemented as Composer. Apache Airflow is written in Python and you create all of your workflows using Python.
When I looked at the Power BI REST API documentation all of the examples were in C#. After a lot of reading and experimenting (aka hitting my head against my desk), I had the process running the way I wanted. This post is my attempt at creating the document I wish existed when I started.
It is important to understand the main steps involved in this process before we get into the detail. Having this high-level process clearly defined was one of the things missing from the information online.
The first and most important part of this entire process is to create a Power BI app registration. There are multiple ways of doing this and this video from Guy in a Cube will give you all of the information you need. No matter how you choose to do your app registration there are three main things you need to know.
1. The Power BI REST API only supports delegated permissions. That means you need to run any calls to the REST API in the context of a user. For unattended applications, such as our data pipeline step, you need to register your app as a Native app. You only receive a client id when you register it as a Native app; server-side Web apps receive both a client id and a client secret, but that is the wrong type of app for our use case. When you authenticate from your code you will need the client id together with the username and password of the account that has delegated these permissions to the app.
2. You need to ensure you select the correct permissions when registering your app. For our purposes we need access to Read and write all datasets. As always take the approach of providing the minimum permissions needed. You can always add more permissions later in the Azure Portal.
3. This brings us to the most overlooked yet important point, which is granting permissions to the app in the Azure Portal. You need to log into the Azure Portal with the account that will be delegating the permissions to the app. This is the account whose username and password you will pass, along with the client id, to authenticate against Azure AD. If you do not perform this step you will end up with authentication errors. (You might get an authentication token when you authorise, but it won’t contain the correct scope.)
Interacting with the Power BI REST API requires a two-step process.
1. Acquiring an access token from Azure AD by supplying your client id, username and password
2. Calling the desired REST API using the access token received in step 1.
There is a choice in how you can perform step 1. You can choose to either use the ADAL Python library or pure REST calls to obtain the access token.
Before we continue, a word of caution. In the examples provided below the client id, username and password are entered directly in the code. This is for demonstration purposes only; I cannot stress this enough. For production solutions you should use the credential management best practices of your platform.
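As a minimal sketch of the alternative (the environment variable names here are made up), you could at least pull the values from the environment or, better still, your platform’s secret store rather than hard-coding them:

```python
import os

# Hypothetical environment variable names -- in practice prefer your
# platform's secret management (key vault, secret manager, etc.)
client_id = os.environ['POWERBI_CLIENT_ID']
username = os.environ['POWERBI_USERNAME']
password = os.environ['POWERBI_PASSWORD']
```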
The adal library for Python is the official Microsoft Azure Active Directory authentication library. It provides you with everything you need to authenticate against Azure AD using Python. Below is an example of the code you will use to authenticate and get your access token. Keep in mind that we have to pass the username and password along with the Client ID. No other way of authenticating will give us a token with the scopes we need to call the Power BI REST API successfully.
```python
import adal

authority_url = 'https://login.windows.net/common'
resource_url = 'https://analysis.windows.net/powerbi/api'
client_id = <INSERT CLIENT ID>
username = <INSERT USERNAME>
password = <INSERT PASSWORD>

# Authenticate against Azure AD with the username, password and client id
context = adal.AuthenticationContext(authority=authority_url,
                                     validate_authority=True,
                                     api_version=None)
token = context.acquire_token_with_username_password(resource=resource_url,
                                                      client_id=client_id,
                                                      username=username,
                                                      password=password)

access_token = token.get('accessToken')
```
As you can see the adal library makes it extremely easy to authenticate against Azure AD and get the token you need.
An alternative to the ADAL library is normal REST API calls to obtain the token. Below is an example of the code you will use if you were using the requests library to make your REST calls.
```python
import requests

url = 'https://login.microsoftonline.com/common/oauth2/token'
data = {
    'grant_type': 'password',
    'scope': 'https://api.powerbi.com',
    'resource': 'https://analysis.windows.net/powerbi/api',
    'client_id': <INSERT CLIENT ID>,
    'username': <INSERT USERNAME>,
    'password': <INSERT PASSWORD>
}

# Request the token directly from the Azure AD OAuth2 endpoint
r = requests.post(url, data=data)
access_token = r.json().get('access_token')
```
Now that we have our access token, we can move on to the next step, which is refreshing our dataset in Power BI. In order to do this, you will need one or two keys depending on where your dataset is located.
If your dataset is in the workspace of the account under which your app will be running (the “My Workspace” of that account), then you only need the dataset key of your dataset. If, however, your dataset is located in an app workspace, you will need both the group id and the dataset key.
You get these values from the URL in your browser when you navigate to the settings of the dataset. Here are examples for each one of the scenarios.
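The screenshots aren’t reproduced here, but the URLs follow a recognisable pattern, roughly along these lines (treat the exact shape as indicative rather than definitive):

```
# Dataset in an app workspace -- both the group id and the dataset key appear in the URL
https://app.powerbi.com/groups/<group id>/settings/datasets/<dataset key>

# Dataset in "My Workspace" -- only the dataset key appears
https://app.powerbi.com/groups/me/settings/datasets/<dataset key>
```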
Once you have these keys you are ready to construct your API call to refresh your dataset. The Power BI REST API documentation shows the two different API calls.
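For reference, the two refresh calls take the following shapes: the first for a dataset in “My Workspace”, the second for a dataset in an app workspace.

```
POST https://api.powerbi.com/v1.0/myorg/datasets/<dataset key>/refreshes
POST https://api.powerbi.com/v1.0/myorg/groups/<group id>/datasets/<dataset key>/refreshes
```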
Below is an example of the code you would use to refresh a dataset in an App Workspace.
```python
import adal
import requests

authority_url = 'https://login.windows.net/common'
resource_url = 'https://analysis.windows.net/powerbi/api'
client_id = <INSERT CLIENT ID>
username = <INSERT USERNAME>
password = <INSERT PASSWORD>

# Step 1: acquire an access token from Azure AD
context = adal.AuthenticationContext(authority=authority_url,
                                     validate_authority=True,
                                     api_version=None)
token = context.acquire_token_with_username_password(resource=resource_url,
                                                      client_id=client_id,
                                                      username=username,
                                                      password=password)
access_token = token.get('accessToken')

# Step 2: call the refresh endpoint for the dataset in the app workspace
refresh_url = 'https://api.powerbi.com/v1.0/myorg/groups/<INSERT GROUP ID>/datasets/<INSERT DATASET KEY>/refreshes'
header = {'Authorization': f'Bearer {access_token}'}

r = requests.post(url=refresh_url, headers=header)
r.raise_for_status()
```
As you can see, it is extremely easy to refresh your datasets in Power BI using Python and the Power BI REST API. The most common mistakes people make, which account for the majority of forum posts online, come back to the registration steps above: registering the wrong type of app, requesting the wrong permissions, and forgetting to grant the permissions to the app in the Azure Portal.
By getting the steps right you will have no problem in automating your Power BI tasks with Python or any other programming language you choose to use.