In this post we’re discussing how to copy files from SharePoint Online using Azure Data Factory, the Microsoft Graph API and managed identity authentication.
A lot of existing documentation, including the official ADF documentation, references the SharePoint v1 API, but the v1 API is now considered legacy. The recommendation from Microsoft is that all SharePoint API calls use the SharePoint Online v2 API (Microsoft Graph API) instead.
The biggest benefit to this approach is that there is no need to obtain and pass around a bearer token. As a result, we can cut out the work of setting up service principals, managing those service principal credentials in Azure Key Vault and obtaining those service principal credentials at Azure Data Factory runtime.
Copy activities against both the SharePoint v1 and v2 APIs are executed using an HTTP connector. When using the SharePoint v1 API an Authorization header is required, and because the HTTP connector doesn’t support managed identity authentication, a bearer token must be obtained and used in the copy activity. In contrast, the Microsoft Graph API’s short-lived, pre-authenticated download URLs (@microsoft.graph.downloadUrl) do not require an Authorization header to download a file, so we can use an anonymous HTTP connection to complete the download in the copy activity.
Web activities provide native support for user assigned managed identities, and they can be used against both the SharePoint v1 and Graph APIs with managed identity authentication, i.e. without an Authorization header or bearer token.
As a side note, the REST linked service does support managed identity authentication, but it does not support the file-based datasets required for a file-based copy activity.
For the purpose of this blog, we’re going to use a User Assigned Managed Identity. A System Assigned Managed Identity could also be used, with a few small changes to the instructions below.
The required steps are as follows
Programmatic access to the Graph API from Azure Data Factory is achieved using a user assigned managed identity. A user assigned managed identity can be created in the Azure Portal. Detailed steps below:
Example:
It’s not possible to assign Graph API access rights to a user assigned managed identity in the Azure Portal. This must be done programmatically. Based on requirements, the user assigned managed identity can be assigned the following rights.
Permission | Description |
---|---|
Sites.Read.All | Read items in all site collections |
Sites.ReadWrite.All | Read and write items in all site collections |
Sites.Manage.All | Create, edit, and delete items and lists in all site collections |
Sites.FullControl.All | Have full control of all site collections |
Sites.Selected | Access selected site collections |
To copy files and navigate folders in this demo, we are going to assign the Sites.Read.All right. Best practice is to assign least privilege.
Using PowerShell:
Connect-AzureAD

$ManagedIdentity = Get-AzureADServicePrincipal `
    -ObjectId "<Object (principal) ID>"

$GraphAPI = Get-AzureADServicePrincipal `
    -Filter "DisplayName eq 'Microsoft Graph'"

$SitesReadAll = $GraphAPI.AppRoles | where Value -like 'Sites.Read.All'

New-AzureADServiceAppRoleAssignment `
    -Id $SitesReadAll.Id `
    -ObjectId $ManagedIdentity.ObjectId `
    -PrincipalId $ManagedIdentity.ObjectId `
    -ResourceId $GraphAPI.ObjectId
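As a side note, the AzureAD module is being retired in favour of the Microsoft Graph PowerShell SDK. A roughly equivalent assignment using that SDK is sketched below; it’s a sketch only, so verify the cmdlet parameters against the current SDK documentation before relying on it.

# Sketch using the Microsoft Graph PowerShell SDK instead of the AzureAD module.
Connect-MgGraph -Scopes "AppRoleAssignment.ReadWrite.All", "Application.Read.All"

$ManagedIdentity = Get-MgServicePrincipal -ServicePrincipalId "<Object (principal) ID>"
$GraphAPI        = Get-MgServicePrincipal -Filter "DisplayName eq 'Microsoft Graph'"
$SitesReadAll    = $GraphAPI.AppRoles | Where-Object Value -eq 'Sites.Read.All'

New-MgServicePrincipalAppRoleAssignment `
    -ServicePrincipalId $ManagedIdentity.Id `
    -PrincipalId $ManagedIdentity.Id `
    -ResourceId $GraphAPI.Id `
    -AppRoleId $SitesReadAll.Id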
At this point we have a user assigned managed identity with read access to SharePoint Online via the Graph API. Consequently, we can now proceed to set up Azure Data Factory.
Example:
Below is a list of components we’ll need to create in Azure Data Factory for the copy activity.
The sink components are not discussed in this post; additional detail and examples of Azure Data Lake Gen2 connectors can be found here.
Your HTTP linked service should use anonymous authentication and a parameter for passing in a dynamic URL. Configuration details below:
Example:
@dataset().URL
Example:
We now have the necessary source and sink components in place to initiate a copy activity. Let’s put together a pipeline with a few activities that will ultimately download a file. In this pipeline demo we’re going to do the following:
It’s well worth looking at the API documentation referenced below. There are multiple ways to navigate drives and search for files. Consequently, for your use case, you may find some methods more efficient than others.
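For reference, a few of the Graph endpoints available for navigating drives and locating files are listed below as comments. All are relative to https://graph.microsoft.com/v1.0, and the ids and paths shown are placeholders.

# Some alternative Graph endpoints for navigating drives and finding files
# (relative to https://graph.microsoft.com/v1.0; ids and paths are placeholders):
#   GET /sites/{site-id}/drives                       - list the document libraries (drives) in a site
#   GET /drives/{drive-id}/root/children              - list items in the root of a drive
#   GET /drives/{drive-id}/root:/General:/children    - list items in a folder by path
#   GET /drives/{drive-id}/root/search(q='data.csv')  - search a drive for matching items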
In a new empty pipeline create a web activity to fetch the SharePoint SiteID:
Create a SiteURL variable, and use the output of the web activity above to set the variable to the SharePoint SiteID in a Set Variable activity:
@concat( 'https://graph.microsoft.com/v1.0/sites/' ,activity('Get-SharePointSiteID').output.id )
Create a web activity to fetch the list of drives:
@concat( variables('SiteURL') ,'/drives' )
Create a filter activity to filter to the Documents drive
@activity('Get-Drives').output.value
@equals(item().name,'Documents')
Create a web activity to obtain a download URL for a file called data.csv in the General folder.
@concat( variables('SiteURL') ,'/drives/' ,activity('Filter-Documents').output.value[0].id ,'/root:/General/data.csv' )
Create a copy activity that uses an HTTP source, referencing the download URL:
@activity('Get-File').output['@microsoft.graph.downloadUrl']
Configure the sink based on your requirements. If you are using a binary source dataset (as per this demo) you will have to use a binary sink dataset.
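Before wrapping up, if you want to sanity-check the endpoints or the managed identity’s permissions outside of Data Factory, the same call sequence can be scripted. Below is a minimal PowerShell sketch using the Az.Accounts module with an interactive sign-in (run outside ADF the token comes from your own account rather than the managed identity); the tenant host name, site name and file path are placeholders matching this demo.

# Sketch of the pipeline's Graph call sequence, runnable outside ADF for testing.
# 'contoso.sharepoint.com' and 'MySite' are placeholders.
Connect-AzAccount
$token   = (Get-AzAccessToken -ResourceUrl "https://graph.microsoft.com").Token
$headers = @{ Authorization = "Bearer $token" }

# 1. Site ID
$site = Invoke-RestMethod -Headers $headers `
    -Uri "https://graph.microsoft.com/v1.0/sites/contoso.sharepoint.com:/sites/MySite"

# 2. Drives, filtered to the Documents library
$drives    = Invoke-RestMethod -Headers $headers `
    -Uri "https://graph.microsoft.com/v1.0/sites/$($site.id)/drives"
$documents = $drives.value | Where-Object { $_.name -eq 'Documents' }

# 3. Drive item for General/data.csv, which includes the pre-authenticated download URL
$file = Invoke-RestMethod -Headers $headers `
    -Uri "https://graph.microsoft.com/v1.0/drives/$($documents.id)/root:/General/data.csv"

# 4. Download the file - note that no Authorization header is needed here
Invoke-WebRequest -Uri $file.'@microsoft.graph.downloadUrl' -OutFile '.\data.csv'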
Well, that about sums up how we copy files from SharePoint Online using Azure Data Factory and the Microsoft Graph API. I hope this helps you in your quest to remove those nasty bearer tokens from your workflows and move towards a world of managed identity.
At Pivotal BI we do a lot of work on Microsoft Azure data solutions, so naturally being a Microsoft double Gold Partner in data fits well with what our clients ask for.
The world of data is one that is ever evolving, providing new ways of working in the analytics space accelerated by the rich and accessible cloud offerings that a platform like Azure makes available. Being part of the Microsoft Partner ecosystem is something that our clients find has huge benefits. Working closely with the provider of a broad range of integrated services that span the needs of the modern data-driven enterprise allows us to deliver above and beyond the expectations of our customers.
The change in attitudes towards Open Source Software marks a real revolution in thinking at Microsoft, and one that has potential benefits to customers across all business sectors. Leading analytical solutions such as Apache Spark and the Databricks platform provide a plethora of capabilities, from Streaming and Big Data Processing through to Data Ops savvy Machine Learning, that are transforming the way organisations compete in an ever changing marketplace.
We are proud to have achieved Microsoft double Gold Partner status, with competencies in Data Platform and Data Analytics, and to be playing a key part in enabling businesses to truly capitalise on their data assets. We look forward to an exciting future where cloud data platforms help provide greater opportunities for all organisations through improved insight and understanding.
To understand more around what we do and how we do it, please check out our services and capabilities, and how we work together with you to ensure success.
You can find more information on the services and technology that we partner on in data analytics, data management and modern data capabilities at our Microsoft Partner listing here.
To your next data journey,
The Pivotal BI Team.
In this final post in the series I thought it’d be useful to share some little tips and what may be thought of as tricks that I’ve come across when using Azure DevOps and Terraform. So without delay, in no particular order, here are those that are my favourites.
You may find yourself in the situation where you need to capture a Terraform resource configuration value to an Azure DevOps variable. This tends to be for using in a separate downstream task outside of Terraform.
Setting DevOps pipeline variables from within Terraform can be easily achieved using PowerShell and the local-exec provisioner. The following Terraform resource will capture the configuration values for you into DevOps variables.
resource "null_resource" "terraform-to-devops-vars" { triggers = { // always execute uuid_trigger = "${uuid()}" } provisioner "local-exec" { command = <<EOT Write-Host "##vso[task.setvariable variable=ServicePrincipleId]${azuread_service_principal.my_app.id}" Write-Host "##vso[task.setvariable variable=ServicePrinciplePassword]${azuread_service_principal_password.my_app.value}" EOT interpreter = ["PowerShell", "-Command"] } }
The trigger will always fire, as it uses the uuid() function that will always generate a changed value.
If we leave things as they are, we will unfortunately expose our variables within the DevOps outputs, such as the pipeline execution log, which for some values, such as secrets, is going to create a security concern.
There is a native DevOps solution to this, and that is to use the IsSecret flag on the task.setvariable call, as below.
Write-Host "##vso[task.setvariable variable=DatabricksSecret;IsSecret=true]${azuread_service_principal_password.databricks.value}"
This will avoid any ‘leaky values’ and allow variables to capture Terraform values safely for use within the pipeline with no unwanted exposure.
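As a sketch of the consuming side (the variable names follow the example above), a later PowerShell task in the same job could pick the values up as below. Non-secret variables set via task.setvariable surface automatically as environment variables, whereas a value flagged IsSecret has to be passed in explicitly, for example as a task argument using macro syntax.

# Downstream PowerShell task in the same job (a sketch; names follow the example above).
# Non-secret variables surface as environment variables (upper-cased); secret values do
# not, so pass them explicitly, e.g. via the task's arguments:
#   -ServicePrinciplePassword "$(ServicePrinciplePassword)"
param([string] $ServicePrinciplePassword)

$servicePrincipleId = $env:SERVICEPRINCIPLEID
Write-Host "Running post-deployment configuration for service principal $servicePrincipleId"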
Terraform does a great job of determining which resources have changed and need to be updated whenever a ‘Plan’ or ‘Apply’ action is made. There are times however when you will want to include other files in your Terraform configurations, such as when using a JSON file to store a REST payload for use in a script. In order to determine whether resources that use these files need to be included in the deployment, we can check against the MD5 hash of the file to see whether the file has changed. To redeploy the resource when the file has changed, we use a trigger that employs the filemd5() function, as below:
resource "null_resource" "exec_some_rest_with_payload_file" { triggers = { some_payload_changed = "${filemd5("${path.module}\\..\\Payloads\\SomePayload.json")}" } provisioner "local-exec" { command = <<EOT .'${path.module}\..\Scripts\REST\ExecuteSomeRest.ps1' ` -ApiRootUrl "https://${var.location}.${var.some_api_root_url_suffix}" ` -SubscriptionId "${var.subscription_id}" ` -TenantId "${var.tenant_id}" ` -ApplicationId "${var.client_id}" ` -Secret "${var.client_secret}" ` -Payload ""${path.module}\..\Payloads\SomePayload.json" EOT interpreter = ["PowerShell", "-Command"] } }
This ensures that changes to related files used within our deployment are treated in a similar manner to changes to Terraform resource definitions. Nothing too clever, this one, and not really Azure DevOps-specific; just out-of-the-box native Terraform. All the same, it is something very useful that you may not be aware of.
This is one of the Terraform Azure DevOps tips and tricks that I find most useful. I’ve used it a lot when there are elements of API calls involved in the deployment. There are plenty of occasions where we find ourselves using scripts for elements of our deployment. Often we will supply a script with a JSON file or similar that contains a number of Terraform resource attributes for use in the script. A classic example of this is the above payload for a REST request body. These values may not be available until deployment time, however, for example when we need resource module outputs that contain values generated at creation time, such as platform-specific unique ids. Hmmm, what’s a Terraformer to do?
A common technique often used with application or web .config files in the DotNet world is to use placeholder tokens in the config files and then replace these with the required configuration values that are passed in at deployment time. This ‘detokenising’ approach can be employed within Terraform as well. Here’s a simple example of a placeholder from such a file,
"some_platform_resource_id": "#{some_resource_id_as_output}#"
where here we have used ‘#{‘ and ‘}#’ characters to demarcate our placeholders.
We can replace the placeholder tokens in the file using a simple script such as the PowerShell shown below.
param(
    [string] $BaseFilePath,
    [string] $FileFilters,
    [PSCustomObject] $TokenValues
)

Function Update-FileTokens {
    param(
        [string] $BaseFilePath,
        [string] $FileFilters,
        [PSCustomObject] $TokenValues
    )
    <#
        .SYNOPSIS
        Updates placeholder values in a group of files with their replacements.

        .DESCRIPTION
        Calls the Update-Tokens function for each file under the base path that matches the name filters.

        .PARAMETER BaseFilePath
        The path from which to include files, including all subfolders.

        .PARAMETER FileFilters
        A CSV string of the filters to apply to file names.

        .PARAMETER TokenValues
        A hashtable of tokens and the values to replace them with.
    #>
    foreach ($filter in $FileFilters.Split(',')) {
        $fileNames = Get-ChildItem -Path $BaseFilePath -Recurse -Filter $filter | Select-Object FullName

        foreach ($fileName in $fileNames) {
            Write-Host "Started replacing tokens in $($fileName.FullName)."
            Update-Tokens -FilePath $fileName.FullName -TokenValues $TokenValues
            Write-Host "Finished replacing tokens in $($fileName.FullName)."
        }
    }
}

Function Update-Tokens {
    param(
        [string] $FilePath,
        [PSCustomObject] $TokenValues
    )
    <#
        .SYNOPSIS
        Updates placeholder token values in a single file with their replacements.

        .DESCRIPTION
        Replaces each token key found in the file with its corresponding value.

        .PARAMETER FilePath
        The path of the file for token replacements.

        .PARAMETER TokenValues
        A hashtable of tokens and the values to replace them with.
    #>
    $content = (Get-Content -Path $FilePath)

    $TokenValues.GetEnumerator() | ForEach-Object {
        $content = $content -replace $_.Key, $_.Value
    }

    Set-Content -Value $content -Path $FilePath
}

Update-FileTokens -BaseFilePath $BaseFilePath -FileFilters $FileFilters -TokenValues $TokenValues
We pass in a hash table object keyed on the placeholder tokens that we want to replace, such as ‘#{some_resource_id_as_output}#‘ above, with the values of the hash table entries being the replacements we want to substitute for. The above script will update the placeholders with their values in all files that match the BaseFilePath and FileFilters. Pretty straight-forward stuff.
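Called directly from PowerShell, with made-up values purely for illustration, the invocation looks something like this:

# Illustrative values only.
.\Update-FileTokens.ps1 `
    -BaseFilePath '..\Payloads' `
    -FileFilters '*.payload.json,*.config' `
    -TokenValues @{
        '#{some_config_from_var}#'       = 'my-storage-mount'
        '#{some_resource_id_as_output}#' = '/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/some-rg/providers/Microsoft.Databricks/workspaces/some-wsp'
    }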
In order to execute this within Terraform, with the required substitutions made at runtime, we can again use the local-exec provisioner with a PowerShell interpreter, constructing the hash table parameter from our resource attributes and variables and passing this in to the script call. The referencing of the module resource attributes will ensure that the replacements are triggered after these values have been made available so we don’t need any ‘depends_on’ clauses. The following resource snippet shows an example of these placeholders in action.
resource "null_resource" "update-file-tokens-payload-json" { triggers = { // always execute value = "${uuid() }" } provisioner "local-exec" { command = <<EOT .'${path.module}\..\scripts\util\Update-FileTokens.ps1' ` -BaseFilePath '${path.module}\..' ` -FileFilters '*.payload.json' ` -TokenValues @{ '#{some_config_from_var}#' = "${var.dbr_executable_storage_mount_name}" '#{some_resource_id_as_output}#' = "${azurerm_template_deployment.some-arm.some-id-as-output}" '#{some_config_from_secret}#' = "${var.some-secret-value}" } EOT interpreter = ["PowerShell", "-Command"] } }
Once our required file has been processed using our Update-FileTokens.ps1 script, we can use the filemd5() trigger approach shown above to determine whether any resources that use this file need to be redeployed. If the file content has been changed by the detokenising, the resource will be redeployed as required.
Adopting this approach is very useful when using REST API calls with JSON payloads for some elements of the Terraform deployment process. We can keep the payloads in their own JSON files, with any references to Terraform resource outputs and the like as placeholders. Provided we call our Update-FileTokens.ps1 script before these JSON files are used, we are able to treat these API calls like other resource definitions.
Thanks for reading. Quite a long one this time but I do hope the above Terraform and Azure DevOps tips and tricks prove to be of use to you Terraformers out there. Adding these strings to your bow may just help with those situations where Terraform doesn’t immediately offer up an obvious solution to realising your infrastructure management needs.
If you have any helpful techniques or simple tricks and tips to add or any questions on the above I’d love to hear about them in the comments below.
That winds up this series on Terraform on Azure. I’ve really enjoyed sharing my thoughts, opinions and experiences of this great combination of tooling that really empowers you on your Azure journeys. Over to you to stake your claim in the Cloud. May your deployments be idempotent, your Infrastructure as Code transparent and your solutions, well, just plain amazing.
This article discusses the incorporation of Key Vault Secret values in Terraform modules and how they can be used as part of a release pipeline definition on Azure DevOps.
Secret management done right in Azure basically involves Key Vault. If you’re not familiar with this Azure offering, you can get the low-down at the following link:
https://docs.microsoft.com/en-us/azure/key-vault/
This article assumes you have followed best practice regarding securing your state file, as described in Terraform with Azure DevOps: Setup. Outputs relating to Secret values will be stored within the state file, so this is essential for maintaining confidentiality.
There are two key approaches to using Key Vault secrets within your Terraform deployments.
This involves using Terraform data sources to retrieve the required Key Vault Secret values. One of the advantages of this method is that it avoids the need to create variables within Azure DevOps for use within the Terraform modules. This can save a lot of ‘to-ing and fro-ing’ between Terraform modules and the DevOps portal, leaving you to work solely with Terraform for the duration. It also has the advantage of being self-contained within Terraform, allowing for easier testing and portability.
We’ll assume you have created a Key Vault using the azurerm_key_vault resource type, added some secrets using the azurerm_key_vault_secret resource and set an azurerm_key_vault_access_policy for the required Users, Service Principals, Security Groups and/or Azure AD Applications.
If you don’t have the Key Vault and related Secrets available in the current Terraform modules that you are using, you will need to add a data source for these resources in order to reference these. This is typically the case if you have a previously deployed (perhaps centrally controlled) Key Vault and Secrets.
Setting up the Key Vault data source in the same Azure AD tenant is simply a matter of supplying the Key Vault name and Resource Group. Once this is done you can access various outputs such as the Vault URI, although in practice you’ll only really need the id attribute to refer to in the Secret data sources.
data "azurerm_key_vault" "otherworld-visitors" { name = "ET-and-friends" resource_group_name = "central-rg-01" } output "vault_uri" { value = data.azurerm_key_vault.central.vault_uri }
I’ll leave you to browse the official definition for the azurerm_key_vault data source for further information on outputs.
Create Key Vault Secret data sources for each of the secrets you require.
data "azurerm_key_vault_secret" "ufo-admin-login-password" { name = "area-51-admin-password" key_vault_id = data.azurerm_key_vault.otherworld-visitors.id } output "secret_value" { value = data.azurerm_key_vault_secret.ufo-admin-login-password.value }
There are again a number of outputs for the data source, including the Secret value, version and id attributes.
You can then reference the Secret’s value by using the respective Key Vault Secret data source value attribute wherever your module attributes require it.
resource "azurerm_sql_database" "area-51-db" { name = "LittleGreenPeople" administrator_login_password = "${data.azurerm_key_vault_secret.ufo-admin-login-password.value}" .... }
If you are using a centralised variables file within each module, which aligns with recommended best practice, this means only having to change the one file when introducing new secrets. Our variables file simply references the required Key Vault Secret data sources as below,
ufo_admin_login_password = "${data.azurerm_key_vault_secret.ufo-admin-login-password.value}"
and our module resource includes the variable reference.
resource "azurerm_sql_database" "area-51-db" { name = "LittleGreenPeople" administrator_login_password = "${var.ufo_admin_login_password}" .... }
As previously mentioned this has not involved any Azure DevOps elements and the Terraform won’t require additional input variables in order to work with the Key Vault Secrets.
The second approach uses a combination of DevOps variable groups and Terraform functionality to achieve the same end result.
The first step is to grab our secrets into DevOps variables for use within the pipeline. Variable groups can be linked to a Key Vault as below.
This then allows the mapping of Secrets to DevOps variables for use within the various tasks of our pipelines.
I’ll demonstrate two ways to work with these variables within our Terraform modules. I’m sure there are others of course, but these are ones that I’ve found simplest for DevOps – Terraform integration.
The Replace Tokens task can be used to replace delimited placeholders with secret values stored in variables. This does of course require that you adopt a standard for your placeholders that can be used across your modules. This approach can result in code that is disjointed to read, but it is a common practice with artifacts such as app.config files in the DotNet world. The advantage is that you can take a single approach to secret substitution across both of these areas of your code, be it Terraform IaC or DotNet.
The other technique I mention here is the use of the inbuilt support for variables with names that are prefixed ‘TF_VAR_’. Any environment variables with this naming convention will be mapped by design to Terraform variables within your modules. More information from Terraform docs is available at https://www.terraform.io/docs/commands/environment-variables.html.
We can pass DevOps variables that have been populated with Secrets values into the Terraform task as Environment Variables. You can then use standard variable substitution within your modules. So, ‘TF_VAR_my_secret’ will substitute for the ‘my_secret’ Terraform variable. Please note that all DevOps variables containing secret values should be marked as sensitive. This then obfuscates the variable values within the DevOps log.
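As a rough sketch (the variable and secret names here are illustrative), if you are running Terraform from a plain script step rather than a dedicated Terraform task, an inline PowerShell step can map a pipeline variable into the environment ahead of the Terraform commands like this:

# Inline pipeline script (sketch). $(my-kv-secret) is Azure DevOps macro syntax,
# substituted before the script runs; 'my_secret' is the Terraform variable name.
# The environment variable only lives for this step, so terraform runs in the same script.
$env:TF_VAR_my_secret = "$(my-kv-secret)"

terraform plan -out=tfplan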
Terraform and Azure DevOps allow more than one method for building pipelines that require secrets stored within Key Vault. For me, the Terraform ‘native’ approach of using Key Vault and Key Vault secrets data sources via the Azure RM Terraform provider is the simplest approach. There is no overhead of managing DevOps variables involved which keeps things nicely contained. You may of course prefer alternatives such as those others shown above or have another method, which I’d love to hear about.
I hope this post has provided some insight into using Terraform within Azure DevOps. These two technologies are a winning combination in addressing real-world Infrastructure as Code adoption within your organisation.
In the final post of this series I’ll be looking at best practices for managing your code using Terraform Modules.
In part 1 and part 2 of this series we looked at how the SQL generated by Power BI in combination with the data type of the partition key in the source data can impact the likelihood of partition elimination taking place. We also discussed how the SQL statements generated by M & DAX convert date/datetime data types:
Partition Key Data Type | Date Filter Expression Language | Partition Elimination Possible | Expression Language Data Type Conversion |
---|---|---|---|
date | M | No | datetime2(7) |
date | DAX | No | datetime |
datetime | M | No | datetime2(7) |
datetime | DAX | Yes | datetime |
datetime2(7) | M | Yes | datetime2(7) |
datetime2(7) | DAX | No | datetime |
datetime2(other) | M | No | datetime2(7) |
datetime2(other) | DAX | No | datetime |
int | M | Yes | int |
int | DAX | Yes | int |
A fairly common way to implement date partitioning on large fact tables in a data warehouse is to use an integer field for the partition key (e.g. 20191231). The good news is that both M and DAX will pass a predicate on the partition key through to SQL with no fundamental change to the data type, i.e. partition elimination takes place as expected. Joy!
The example below uses a simple measure against a DirectQuery data source, partitioned on an integer field.
create table dbo.MonthlyStuffInt
(
    MonthDateKey int not null,
    TotalAmount decimal(10, 5) not null,
    index ccix_MonthlyStuffInt clustered columnstore
) on psMonthlyInt(MonthDateKey);
In Power BI I have a slicer on the MonthDateKey column and a measure displayed in a card visual that sums TotalAmount.
Total Amount = SUM ( MonthlyStuff[TotalAmount] )
Using DAX Studio or Azure Data Studio to trace the above, I can capture the following SQL generated by the DAX expression:
SELECT SUM([t0].[Amount]) AS [a0]
FROM
(
    (select [_].[MonthDateKey] as [MonthDateKey],
            [_].[TotalAmount] as [Amount]
     from [dbo].[MonthlyStuffInt] as [_])
) AS [t0]
WHERE ([t0].[MonthDateKey] = 20160501)
If I execute the query in SSMS, I can see that partition elimination takes place.
It does indeed, but we’d generally use a date dimension table to improve the browsability of the data model, and when doing so we could lose the partition elimination functionality we see above.
In the below example I have set up a date dimension table, related to a fact table, both using DirectQuery as the storage mode. To improve the browsability of the data model I’ve used the MonthYear field from the date dimension in the slicer, something I suspect is much more representative of the real world.
create table dbo.DimDate
(
    DateKey int not null primary key clustered
    ,Date date not null
    ,MonthYear nchar(8) not null
)
SELECT SUM([t0].[Amount]) AS [a0]
FROM
(
    (select [_].[MonthDateKey] as [MonthDateKey],
            [_].[TotalAmount] as [Amount]
     from [dbo].[MonthlyStuffInt] as [_]) AS [t0]
    inner join
    (select [$Table].[DateKey] as [DateKey],
            [$Table].[Date] as [Date],
            [$Table].[MonthYear] as [MonthYear]
     from [dbo].[DimDate] as [$Table]) AS [t1]
        on ([t0].[MonthDateKey] = [t1].[DateKey])
)
WHERE ([t1].[MonthYear] = N'Jun-2016')
Unfortunately, the generated SQL statement does not use partition elimination when querying the large fact table.
One way to invoke partition elimination and improve the query performance of the underlying SQL is to introduce a cross-island relationship between the dimension table and the fact table in Power BI.
Put simply, by introducing the cross-island relationship Power BI will bypass the join to the Date dimension table in the underlying SQL and instead, the value/s of the relationship field (DimDate[DateKey]) will be passed directly to the fact table as part of the generated DirectQuery SQL.
SELECT SUM([t0].[Amount]) AS [a0]
FROM
(
    (select [_].[MonthDateKey] as [MonthDateKey],
            [_].[TotalAmount] as [Amount]
     from [dbo].[MonthlyStuffInt] as [_])
) AS [t0]
WHERE (
    [t0].[MonthDateKey] IN (
        20160612,20160626,20160613,20160627,20160614,20160601,20160615,20160628,
        20160602,20160629,20160603,20160616,20160630,20160617,20160604,20160618,
        20160605,20160619,20160606,20160620,20160607,20160621,20160608,20160609,
        20160622,20160623,20160610,20160624,20160611,20160625
    )
)
A somewhat controversial recommendation as it’s in direct conflict with the Power BI documentation for filter propagation performance. My experience, especially in the context of large partitioned fact tables, is that the improvement you see in the data retrieval times from large fact tables (as a direct result of the ‘fact table only’ SQL statement and subsequent partition elimination) far outweighs any additional work required elsewhere.
To introduce this relationship type, change the storage mode of the date dimension table to Import.
SELECT SUM([t0].[Amount]) AS [a0]
FROM
(
    (select [_].[MonthDateKey] as [MonthDateKey],
            [_].[TotalAmount] as [Amount]
     from [dbo].[MonthlyStuffInt] as [_]) AS [t0]
    inner join
    (select [$Table].[DateKey] as [DateKey],
            [$Table].[Date] as [Date],
            [$Table].[MonthYear] as [MonthYear]
     from [dbo].[DimDate] as [$Table]) AS [t1]
        on ([t0].[MonthDateKey] = [t1].[DateKey])
)
WHERE ([t1].[MonthYear] = N'Jun-2016')
Partition Elimination does not take place
SELECT SUM([t0].[Amount]) AS [a0]
FROM
(
    (select [_].[MonthDateKey] as [MonthDateKey],
            [_].[TotalAmount] as [Amount]
     from [dbo].[MonthlyStuffInt] as [_])
) AS [t0]
WHERE (
    [t0].[MonthDateKey] IN (
        20160612,20160626,20160613,20160627,20160614,20160601,20160615,20160628,
        20160602,20160629,20160603,20160616,20160630,20160617,20160604,20160618,
        20160605,20160619,20160606,20160620,20160607,20160621,20160608,20160609,
        20160622,20160623,20160610,20160624,20160611,20160625
    )
)
Partition Elimination takes place and we see a reduction in CPU and reads. The simplified query plan also came with the added benefit of aggregate pushdown on the columnstore index which would contribute to the improvement in performance.
If you are using a dimension table to improve the browsability of the data model, and you want to harness partition elimination to improve query performance in DirectQuery against very large fact tables then consider introducing a cross-island relationship between the dimension table and the fact table. This can be achieved by setting the dimension table to use Import mode. Familiarise yourself with the limitations before doing so and as always, profile your queries to ensure you’re obtaining optimal performance from your solution because one size never fits all.
In part 1 of this series we looked at how the SQL generated by Power BI, in combination with the data type of your partitioned columns in the source data, can impact the likelihood of partition elimination taking place. We also discussed how the SQL statements generated by M & DAX convert date/datetime data types:
Partition Key Data Type | Date Filter Expression Language | Partition Elimination Possible | Expression Language Data Type Conversion |
---|---|---|---|
date | M | No | datetime2(7) |
date | DAX | No | datetime |
datetime | M | No | datetime2(7) |
datetime | DAX | Yes | datetime |
datetime2(7) | M | Yes | datetime2(7) |
datetime2(7) | DAX | No | datetime |
datetime2(other) | M | No | datetime2(7) |
datetime2(other) | DAX | No | datetime |
int | M | Yes | int |
int | DAX | Yes | int |
In this post, I’ll dig a little further into the detail and demonstrate the issue at hand with the use of some code snippets when working with a date column.
I have a SQL table shoved with some test data that I’ve partitioned by date (monthly) as follows:
create partition function pfMonthly (date)
as range right for values
(
    '20190101','20190201'
    ,'20190301','20190401'
    ,'20190501','20190601'
);
create partition scheme psMonthly as partition pfMonthly all to ([PRIMARY]);
create table dbo.MonthlyStuff
(
    MonthDateKey date not null
    ,TotalAmount decimal(10, 5) not null
    ,index ccix_MonthlyStuff clustered columnstore
) on psMonthly(MonthDateKey);
Using the import storage mode, the following Power Query statement is used to filter a monthly slice of the data in Power BI.
let
    Source = Sql.Databases("mydbserver.database.windows.net"),
    mydb = Source{[Name="mydb"]}[Data],
    dbo_MonthlyStuff = mydb{[Schema="dbo",Item="MonthlyStuff"]}[Data],
    #"Filtered Rows" = Table.SelectRows(dbo_MonthlyStuff, each ([MonthDateKey] = #date(2019, 2, 1)))
in
    #"Filtered Rows"
Query folding takes place, producing the following native query. Note that the correct data type (date) has been detected by Power Query; however, the native SQL query converts the filter to a datetime2:

select [_].[MonthDateKey]
    ,[_].[TotalAmount]
from [dbo].[MonthlyStuff] as [_]
where [_].[MonthDateKey] = convert(datetime2, '2019-02-01 00:00:00')

If I plug the generated SQL into SQL Server Management Studio and view the execution plan, I can see that partition elimination does not take place against my partitioned table.
This is due to the misalignment of the data type in the predicate of the SQL statement (generated by M) with the data type of the partition key in the database. The source partition key is using the SQL date data type and the SQL generated by M is casting the field to a datetime2.
To demonstrate this further, if I change the query manually in SSMS to use date conversion, partition elimination takes place resulting in much lower CPU and I/O overhead.
select [_].[MonthDateKey]
    ,[_].[TotalAmount]
from [dbo].[MonthlyStuff] as [_]
where [_].[MonthDateKey] = convert(date, '2019-02-01 00:00:00')
Original query (datetime2 conversion): CPU: 4250000, Reads: 45684
Modified query (date conversion): CPU: 578000, Reads: 4274
Now let’s demonstrate the issue using DAX with DirectQuery. If I use DirectQuery storage mode in Power BI and run a calculation against the underlying partitioned table, filtered to a particular month, I see a similar issue.
In Power BI I have a slicer on the MonthDateKey column (set as a date datatype) and a measure, that sums TotalAmount, displayed in a card visual.
Total Amount = SUM ( MonthlyStuff[TotalAmount] )
The following SQL is generated by the DAX expression when using DAX Studio or Azure Data Studio to trace the above.
SELECT SUM([t0].[TotalAmount]) AS [a0]
FROM
(
    (select [$Table].[MonthDateKey] as [MonthDateKey],
            [$Table].[TotalAmount] as [TotalAmount]
     from [dbo].[MonthlyStuff] as [$Table])
) AS [t0]
WHERE ([t0].[MonthDateKey] = CAST('20190201 00:00:00' AS datetime))
As you can see from the below, once again we fail to harness the power of partition elimination. This is due to the misalignment of the data type in the predicate of the SQL statement (generated by DAX) and the data type of the source column. Partition elimination does not take place because the source column is using the SQL date datatype and the SQL generated by DAX is casting to a datetime data type.
To demonstrate this further, if I change the query manually in SSMS to use date conversion, partition elimination takes place and results in much lower CPU and I/O overhead.
SELECT SUM([t0].[TotalAmount]) AS [a0]
FROM
(
    (select [$Table].[MonthDateKey] as [MonthDateKey],
            [$Table].[TotalAmount] as [TotalAmount]
     from [dbo].[MonthlyStuff] as [$Table])
) AS [t0]
WHERE ([t0].[MonthDateKey] = CAST('20190201 00:00:00' AS date))
Original query (datetime cast): CPU: 1640000, Reads: 27681
Modified query (date cast): CPU: 266000, Reads: 4274
If you want to harness partition elimination to improve query performance in DirectQuery mode and reduce load times in Import mode, then it’s worth keeping in mind how DAX and M convert the various date/datetime data types in the underlying SQL. As always, profile your queries and ensure they are performant before you commit.
In part 3 of this series we’ll look at partitioned tables that use an integer as the date partition key, a common practice in data warehousing.
This series will focus on the use of partition elimination when using Power BI to query partitioned tables in a Microsoft SQL Database (On-Prem or Azure).
Partition elimination can significantly improve Power BI DirectQuery performance. Similarly, Power BI Import models that target specific partition slices across very large datasets can see a reduction in load duration by using partition elimination. Unsurprisingly, as with all things tech, there are a few gotchas to look out for along the way. With partition elimination, the big gotcha is sensitivity to data type alignment between the predicate of the SQL query and the partition key of the partitioned table.
In this post, you’ll find an overview of the data type alignment issues we’re facing. Part 2 will cover some code examples and query plans to demonstrate the problem in more detail. My final write-up of the series will take a closer look at Power BI DirectQuery against tables partitioned on an integer datekey, a common practice in data warehousing.
Before we start, let’s briefly run through some terminology for those less familiar with table partitioning.
I won’t go into the details on this but in short, it’s a way of splitting large tables into smaller chunks which the SQL optimiser can (if it so chooses) access very quickly using a technique called partition elimination.
Check out the official Microsoft documentation for full details, and peruse this excellent post written by Cathrine Wilhelmsen for a great introduction to the topic.
Copping out once again, I won’t be going into the details. In essence, if you write a query against your partitioned table which includes a filter on the partition key then partition elimination can take place, provided the optimiser deems it so :-). This detailed post by Kendra Little discusses the query performance we’re aiming to achieve via the dynamic SQL generated by Power BI.
….well, everything. As powerful as partition elimination can be, it’s a sensitive beast. You need to make sure the data types used in your query align exactly with the data type of your partition key in your database. An example of this sensitivity is covered in this great post by Kendra Little.
No big deal, right? Provided you have control over the SQL syntax then it’s no big deal. However, with Power BI we’re at the mercy of the application and the SQL that Power BI generates via DAX & M.
The SQL generated by a DAX expression against a DirectQuery data source will cast dates and datetimes as a SQL datetime data type, e.g. cast('20190909' as datetime). The SQL generated by an M expression will cast dates and datetimes as a SQL datetime2 data type with default precision, e.g. cast('20190909' as datetime2).
What’s the impact of this? Well, for example, if you’re running DirectQuery against large fact tables that are partitioned on a date data type and your Power BI Analysis includes a date filter on the partition key, partition elimination will not take place as the underlying dynamic SQL from Power BI will cast your date filter into a datetime or datetime2 data type, depending on the expression language used i.e. DAX or M.
The below table provides a summary of where partition elimination will and won’t take place.
Partition Key Data Type | Date Filter Expression Language | Partition Elimination Possible | Expression Language Data Type Conversion |
---|---|---|---|
date | M | No | datetime2(7) |
date | DAX | No | datetime |
datetime | M | No | datetime2(7) |
datetime | DAX | Yes | datetime |
datetime2(7) | M | Yes | datetime2(7) |
datetime2(7) | DAX | No | datetime |
datetime2(other) | M | No | datetime2(7) |
datetime2(other) | DAX | No | datetime |
int | M | Yes | int |
int | DAX | Yes | int |
This post discusses some initial tasks when starting to use Terraform for your CI/CD workflow with Azure DevOps.
The Terraform state file contains all information regarding the deployment of the resources to the target environment. This can include sensitive information such as storage keys and other secrets that may be output from Terraform. If you are working in a team environment, you will want to store this file where it can be used by the team, the obvious choice if you are working in Azure being a storage account. When running your Terraform deployments with Azure DevOps the only user account that should have access permissions granted to this storage account is that under which the DevOps release pipelines are being executed. It is critical to the security of your deployments to ensure this is in place before proceeding with Terraform as your resource management solution. You can find a great tutorial from Microsoft which walks you through setting this up at the following link.
https://docs.microsoft.com/en-gb/azure/terraform/terraform-backend
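For reference, a minimal sketch of standing up that backend storage with Az PowerShell is shown below; the resource names and location are placeholders, and the linked tutorial covers locking access down properly.

# Sketch: resource group, storage account and container to hold the Terraform state file.
# Names and location are placeholders.
New-AzResourceGroup -Name "terraform-state-rg" -Location "uksouth"

$storageAccount = New-AzStorageAccount -ResourceGroupName "terraform-state-rg" `
    -Name "tfstate$(Get-Random -Maximum 99999)" `
    -Location "uksouth" -SkuName "Standard_LRS" -Kind "StorageV2"

New-AzStorageContainer -Name "tfstate" -Context $storageAccount.Context -Permission Off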
For those already established in Azure, it is likely that some of your resources will not be part of your IaC code base, for whatever reason. In order to complete the picture for future deployment efforts you’ll want to bring these into Terraform resource definitions and ensure that the state file reflects the correct picture. Fail to keep the definitions and the state in step, for example by importing a resource into the state without a matching resource definition, and Terraform will tear it down, seeing it as an undesired leftover from previous deployment efforts. One area where Terraform lags significantly behind ARM Templates is in creating infrastructure definitions from existing resources.
With ARM, the majority of deployed resources within Azure can be exported from the portal to a template file which you can then amend, further parameterise if needed and store within your IaC repository.
Terraform’s ‘import’ command does allow something similar in spirit, but it requires a lot more manual work in order to reach the same result. You can read about this here.
https://www.terraform.io/docs/import/index.html
Let’s take a look at what we need to bring a resource under Terraform control.
In order to bring the resource in, we need to create a resource definition for the desired resource, which can be simply of the form <resourceType>.<resourceName>, e.g.
resource "azurerm_storage_account" "datalake-storage" { }
Pretty straight forward.
With our skeleton resource definition in place, running Terraform’s import command will populate the state file with the required resource attributes from the deployed resource. With our above resource we would execute the following, where the resource_id is the desired Azure resource id.
terraform import azurerm_storage_account.datalake-storage /subscriptions/<subscriptionId>/resourceGroups/<myresourcegroup>/providers/Microsoft.Storage/storageAccounts/<myaccount>
Okay, so now our state file reflects the resource we have imported. Terraform is now aware of its existence and won’t squish it like an unnoticed cherry tomato that strayed out of the fridge.
With the resource now tracked in the state file, we then need to extract the information required for the resource definition. This can be done with the terraform show command, as below:
terraform show -json
This will output the state file information in JSON format, giving you the information required for the definition.
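Piping that JSON into PowerShell can make it a little easier to pick out the attributes of the resource you have just imported; a rough sketch, using the storage account example above, follows.

# Rough sketch: list the recorded attribute values for the imported resource so they
# can be copied by hand into the .tf definition.
$state    = terraform show -json | ConvertFrom-Json
$resource = $state.values.root_module.resources |
    Where-Object { $_.address -eq 'azurerm_storage_account.datalake-storage' }

$resource.values | Format-List *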
Next comes the not so fun bit. At present there is no means to auto-populate the definition from the state info. This leaves a manual exercise to update the resource definition file to complete the Terraform definition. You’ll need to refer to the Terraform Azure resource definition attribute documentation as you go for this. This may take a while if you have a number of resources. There are thankfully plans to automatically populate the resource definition file as part of the import.
Due to Terraform’s state file centric view of the world, it will be blissfully unaware of any operations undertaken outside of its deployments. When using the IaC approach to resource deployment with Terraform this is something that shouldn’t really happen but sometimes things can slip through. This issue of ‘state drift’ is something that we can however correct for with the ‘Terraform refresh’ command, as described here.
https://www.terraform.io/docs/commands/refresh.html
Terraform refresh will update the resource information stored in the state file to match what is actually deployed. The state file must already exist, as the refresh command is not able to recreate the file itself completely. Again, as there is no auto-populating of the resource definitions, you’ll need to use terraform show to extract the resource information and another manual update to align things.
Setting up Terraform on Azure for DevOps is something that will really get you on the right CI/CD path to Azure estate IaC management. Bringing existing resources into the Terraform fold is not straight forward and you should plan a reasonable amount of time for it, though this will change in a future release. Once in, you’ll really start to see it paying dividends for all your DevOps activities. You may even start to wonder how you previously got by without it. In the final posts in this series we’ll be looking at integration with Release Pipelines, Secrets management and some other usability tips and tricks. See you soon for more Terraforming.
Terraform has some great support for Azure resources but there are some situations where you will need to fall back to calling ARM Templates for your resource deployments. In general this is when the Terraform Azure provider simply does not have a definition for the resource required. Some complex deployments within Azure are made available via ARM templates, such as when deploying an Azure Databricks workspace that has dependent VNet and public/private subnets. Doing this level of deployment from scratch is not for the faint-hearted and would probably not go smoothly on the first attempt. I guess that is why Azure came up with ARM Templates in the first place.
As those familiar with Terraform will know, the state of deployed resources and related attributes is tracked in the ‘state file’. This file, with the extension .tfstate, is the go-to de facto picture of what is deployed, and anything not contained within that state file simply doesn’t exist as far as Terraform is concerned.
When a resource is removed from the Terraform module definitions, upon issuing a Terraform Apply to redeploy the latest view of the infrastructure, the resource is torn down from Azure. Based on the dependency graph known to Terraform, the resources no longer required are gracefully deleted from the respective resource group(s). There are however a number of instances when the information in the state file does not provide what is required to manage this tear down process.
Null resources, by their nature, do not tie into any actual resource that Terraform will recognise, and so any removal of related resources will need to be explicitly managed via additional script execution to back out the originally deployed resources. Although something of a pain, it is taken as a given when adopting this approach to resource management.
ARM Templates will contain a reference in the state file to the actual template deployment but this has no information on what the template actually deployed. It is simply a reference to the Azure provider ARM Template resource.
You can find information about the Azure Resource Manager Template Deployment Terraform resource at:
https://www.terraform.io/docs/providers/azurerm/r/template_deployment.html
As mentioned, unless you want to tear down everything that is not in the template, you will want to use the ‘Incremental’ deployment mode for the template.
  parameters = {
    ...
  }

  deployment_mode = "Incremental"
}
In order to manage our resources created using Terraform and ARM Templates, we’re going to need some way of removing these resources when no longer required.
One recommended approach is to simply deploy all resources to a separate resource group, which can then be deleted, along with all contained resources. For any but the simplest of deployments however this is not really going to be desirable.
Terraform allows conditional statements within resource definitions that are based on the action being undertaken by Terraform. This allows us to run custom scripts that can explicitly tear down the resources that were deployed once we no longer want this ARM Template deployment, by specifying the related condition as ‘destroy’. These are then executed when we explicitly action ‘Terraform destroy’. If you only want to target the ARM Template Terraform resource for the destroy action, this should be specified using the ‘-target’ argument of the destroy command. Failure to specify this will result in all resources within the module definitions being torn down (probably not the desired effect).
In order to know what resources to remove, the respective ids of the resources contained within the ARM Template will need to be captured as outputs from the ARM Template Terraform deployment. These will be referenced in the explicit Azure resource delete statements executed.
There is a bug in older Terraform versions with using ARM Template outputs. They will not see the output definition on the initial Terraform Plan action, which will cause the plan to fail. In order to overcome this, you need to run a deployment with the ARM Template output defined but no references to it in the Terraform. This will update the state file accordingly. The ARM Template output can then be successfully referenced from this point forward in your Terraform resources.
To formalise the above, in order to destroy resources that are deployed via ARM Templates with Terraform, the following is required:
Putting this all together we have our ARM Template outputs (I’ve only included the one here for brevity),
{ "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", "contentVersion": "1.0.0.0", "parameters": { ... ... ... } }, "type": "Microsoft.Databricks/workspaces" }], "outputs": { "resourceID": { "type": "string", "value": "[resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName'))]" } } }
the deletion script elements within the Terraform resource definition for the ARM Template (I’ve used az commands but of course you can use PowerShell if you prefer),
resource "azurerm_template_deployment" "dbr-wsp-dataeng" { name = "${var.organisation}-${var.system}-${var.environment}-${var.dbr_workspace_name_prefix}-${var.dbr_name_prefix}-wsp-${var.location}" resource_group_name = "${azurerm_resource_group.datalake.name}" template_body = <<DEPLOY ${file("${path.module}/databricks-workspace.json")} DEPLOY # these key-value pairs are passed into the ARM Template's `parameters` block parameters = { ... ... } deployment_mode = "Incremental" provisioner "local-exec" { when = "destroy" command = "az login --service-principal -u ${var.clientId} -p ${var.clientPwd} --tenant ${var.tenantId} && az resource delete --ids ${self.outputs.resourceID}" } }
and the destroy action targeting only the Terraform resource desired for removal.
terraform destroy -target azurerm_template_deployment.dbr-wsp-dataeng
This approach does require that we capture all the required outputs and deletion commands in order to clear up the ARM Template deployment. This will require inspection of the ARM Template specifics to ensure all items are covered. It does however offer the granularity more often required for selective resource tear down that is not an option for the resource group level deletion approach. For me, gaining this desired level of control over resource management when falling back to ARM Templates makes the extra effort well worth it.
In order to determine the respective order of resource deployments and how it all fits together, we need to understand the dependencies that exist within our resources. This is a particular area of potential maintenance and complexity for our Infrastructure as Code (IaC) best efforts as we amend our deployment requirements over time. In this post we’ll look at how ARM templates and Terraform deal with the gnarly aspect of dependencies within resource deployments.
The ‘reference’ function can be used to reference properties from one resource for use within another. For example
"properties": { "originHostHeader": "[reference(variables('webAppName')).hostNames[0]]", ... }
The referenced resource must be the name of a resource defined within the same template. This creates an implicit dependency on the other resource. Passing Resource ID values in will not create an implicit dependency, as ARM will not be able to interpret this.
The use of the List* functions within your template, as described at https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/template-functions-resource#list, will also create an implicit dependency.
When calling a linked/nested template, ARM will create an implicit dependency on the deployment that is defined within the linked template, essentially conducting the linked deployment first. This is an ‘all or nothing’ dependency, with no ability to define more granular resource dependencies across templates.
For resources defined within the same template that do not use the ‘reference’ or ‘List*’ functions, dependencies between resources are defined explicitly using the ‘dependsOn’ attribute. An example of the use of the ‘dependsOn’ attribute is given below.
{ "type": "Microsoft.Compute/virtualMachineScaleSets", "name": "[variables('namingInfix')]", "location": "[variables('location')]", "apiVersion": "2016-03-30", "tags": { "displayName": "VMScaleSet" }, "dependsOn": [ "[variables('loadBalancerName')]", "[variables('virtualNetworkName')]", "storageLoop", ],
It is not possible to define dependencies across different templates in this manner. Managing these dependencies within each template is an element of the deployment definition that will need attention and can easily result in failed deployments when items are missed. This can add a considerable overhead to our efforts to get things managed via IaC for those deployments that have many resources.
For all resources defined within the Terraform modules, both root modules and any referenced, Terraform will determine the dependencies and execute the resource management actions as needed. There is no need for explicitly declaring dependencies between resources, and as items are added/removed, so dependencies are adjusted as needed. When one resource depends on another it will be apparent from any referenced resource ids. This is essentially the same behaviour as for the ARM template ‘reference’ function, with the main advantage of Terraform being that these referenced resources can exist within separate modules/files.
One area where dependencies will need to be explicitly defined is where Terraform resource providers cannot be used for all resource aspects of the deployment required. Two common instances of this within Azure are:
Terraform will be unaware of the resources actually deployed within an ARM template and as such will not be able to determine which other resources may depend on these. Likewise, actions undertaken within scripts and other executables will also present no known resource dependencies for Terraform to consider. In both these scenarios we need to state what our dependencies will be.
Terraform allows the explicit defining of dependencies using the ‘depends_on’ attribute for a resource. In the example below, the dependency resources must all be deployed prior to this resource.
resource "null_resource" "exec_notebook_job_mount_workspace_storage_checkpointing" { depends_on = { null_resource.create_databricks_data_eng_cluster, null_resource.add_workspace_notebook_storage_mount, azurerm_storage_container.datalake }
Terraform provides a ‘Null Resource’ for arbitrary actions, being basically a resource definition for a non-existent resource. The ‘local-exec’ provisioner within this allows for the execution of various script types such as PowerShell, Bash or Cmd. ‘Triggers’ control the point at which this can be executed, defining one or more prerequisite resources which must undergo change in order for the execution to take place. Using this approach we can execute any script or executable at a set point within the deployment. Note that these triggers act as logical ‘Or’ rather than ‘And’ triggers, so the resolution of any one of their resource conditions will allow the execution of the null resource.
resource "null_resource" "exec_notebook_job_mount_workspace_storage_checkpointing" { depends_on = { null_resource.create_databricks_data_eng_cluster, null_resource.add_workspace_notebook_storage_mount, azurerm_storage_container.datalake } triggers = { create_databricks_data_eng_cluster = "${null_resource.create_databricks_data_eng_cluster.id}_${uuid()}" add_workspace_notebook_workspace_storage = "${null_resource.add_workspace_notebook_storage_mount.id}_${uuid()}" datalake_storage = "${azurerm_storage_container.datalake.id}_${uuid()}" } provisioner "local-exec" {
Note that there are multiple triggers that could set this resource to execute the respective script. All three of the related resources must however be deployed prior to executing the script, due to internal dependencies that the script has on these resources. As such we have used the depends_on block to ensure that all is in place beforehand.
Terraform makes dependencies a lot simpler to define than ARM templates. The combination of implicit dependency detection plus explicit definition where required allows for much easier definition of resource deployments. Being able to implicitly determine dependencies across files with Terraform also provides for less ‘structural constraints’ when writing the IaC definitions. This results in increased productivity and reduced management overhead. Administrators need think less about the order of deployment and can get on with defining the infrastructure, leaving the rest for Terraform.