Data Lake Creation In Azure

Sarath Sasidharan · Mar 11

This blog focuses on how to create a data lake on Azure using a script.

This enhances developer productivity and enables quick deployment of data lakes using the concept of IaC (Infrastructure as Code).

This post covers:

- Creation of storage
- Creation of an orchestrator / workflow engine
- Creation of a relational database (OLTP)
- Creation of an ETL / compute engine
- Creation of a Massively Parallel Processing data warehouse (OLAP)

A single shell script, when executed, creates all of the above resources and bundles them in a resource group.

Prerequisites

- An IDE, in this case Visual Studio Code
- An Azure account

Create a new project

Create a new folder and open Visual Studio Code in it by typing:

```shell
code .
```

Create a folder inside this project named infra, and inside it create a script named data_lake.sh. This script will contain the steps required to set up the infrastructure needed for a data lake.

In this tutorial we use az commands to create the resources. There are other ways to create resources on Azure, including:

- ARM templates
- PowerShell
- Terraform
- Azure REST APIs

The first step in the data lake creation is to create a data lake store.

Creation of Storage

The qualities of the storage layer need to be:

- Easily scalable
- Cheap storage
- Easy integration with services (both open source and proprietary)
- POSIX-based folder restrictions to control access at the folder level
- An easy sink for all types of data sources (streaming as well as batch)

For this blog, we use Azure Data Lake Store Gen2 as the raw storage layer for all data flowing into our data lake.

An introduction can be found here:

Azure Data Lake Storage Gen2 Introduction
Provides an overview of Azure Data Lake Storage Gen2 (docs.microsoft.com).

This data lake supports POSIX-level file system permissions, which can be set using Active Directory entities and help control access in a very scalable and elegant manner.

Create a resource group, which is a logical grouping of the resources spun up on Azure; click to learn more about resource groups.

Copy the snippet below and paste it inside the data_lake.sh file, replacing the variable placeholders with the correct values.

```shell
#!/bin/bash
DATA_LAKE_RG='<REPLACE_WITH_DATA_LAKE_RESOURCE_GROUP_NAME>'
DATA_LAKE_LOCATION='<REPLACE_WITH_DATA_LAKE_REGION>'

# Create a new Resource Group
az group create --name $DATA_LAKE_RG --location $DATA_LAKE_LOCATION
```

This command uses the az CLI to request creation of a new resource group with the name given by the "--name" flag, and specifies the location of the resource group using the "--location" flag. The location is the region where the Azure data center resides; the resources will be spun up in that particular region during creation.
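Since the whole script depends on these variables being filled in, it can be useful to fail fast if a placeholder was left unreplaced. The following is a minimal sketch of such a guard; the `check_vars` helper and the example values are assumptions for illustration, not part of the blog's script.

```shell
#!/bin/bash
set -euo pipefail

# Fail fast if a variable is unset or still a '<REPLACE_...>' placeholder.
check_vars() {
  local name value
  for name in "$@"; do
    value="${!name:-}"          # indirect expansion: value of the named variable
    if [ -z "$value" ] || [[ "$value" == "<"*">" ]]; then
      echo "ERROR: $name is unset or still a placeholder" >&2
      return 1
    fi
  done
  return 0
}

DATA_LAKE_RG='my-datalake-rg'       # example values, not from the blog
DATA_LAKE_LOCATION='westeurope'
check_vars DATA_LAKE_RG DATA_LAKE_LOCATION && echo "all variables set"
```

Calling `check_vars` at the top of data_lake.sh turns a confusing az error into an immediate, readable one.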

To run this on Azure, you can:

1. Run it from your terminal by using az login and then executing the shell script (if you have a Linux machine).
2. Log in to the Azure portal (portal.azure.com) and use the Cloud Shell.

The Cloud Shell offers two options: a Bash shell and a PowerShell. This blog uses option 2 with the Bash shell, but feel free to play around with option 1 if you have a Linux machine. Go to the portal and click on the Bash shell to open a new window where you can execute bash commands. Follow the steps mentioned below to request a new Cloud Shell from the portal.

Azure Cloud Shell overview
Azure Cloud Shell is an interactive, browser-accessible shell for managing Azure resources. It provides the flexibility… (docs.microsoft.com).

Script Execution

Once you have the shell, upload the script data_lake.sh using the upload option in the Cloud Shell window.

If you use a Windows machine, run the dos2unix command and also grant execute rights on the script:

```shell
dos2unix data_lake.sh
chmod +x data_lake.sh
```

Run the shell script; this creates a new resource group with the specified name in the specified Azure location:

```shell
./data_lake.sh
```

After the script succeeds, execute the list command to check that the resource group is present:

```shell
az group list
```

This can also be viewed in the Azure portal at portal.azure.com.
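If dos2unix happens to be unavailable in your shell, the carriage returns can be stripped with tr instead. This is a generic fallback sketch; the file names and the simulated file contents below are for demonstration only.

```shell
#!/bin/bash
set -euo pipefail

# Simulate a script edited on Windows: CRLF line endings.
printf 'echo hello\r\n' > data_lake_crlf.sh

# tr -d '\r' removes every carriage return, producing Unix line endings.
tr -d '\r' < data_lake_crlf.sh > data_lake_unix.sh
chmod +x data_lake_unix.sh
./data_lake_unix.sh
```

Without this step, the stray `\r` characters make bash report errors like "command not found" for otherwise valid lines.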

Set up a storage account within this resource group; this will be the raw storage layer for the data lake. Add the next snippet to data_lake.sh, but before running it in the Cloud Shell, run the following command there:

```shell
az extension add --name storage-preview
```

This adds the Azure CLI extension needed to create ADLS Gen2 accounts. Copy this snippet into the data_lake.sh script, replacing the storage name with the right value:

```shell
DATA_LAKE_STORAGE='<REPLACE_WITH_DATA_LAKE_STORAGE_NAME>'

# Create a new storage (ADLS Gen2) account
az storage account create --name $DATA_LAKE_STORAGE --resource-group $DATA_LAKE_RG --location $DATA_LAKE_LOCATION --sku Standard_LRS --kind StorageV2 --hierarchical-namespace true
```

Repeat the "Script Execution" step from above; this should create a new storage account. View the resource using the following command:

```shell
az resource list -g <NAME_OF_RESOURCE_GROUP>
```

The new storage account should be visible here; it can also be viewed in the Azure portal under the Storage accounts blade on the left side of the portal.

Creation of an Orchestrator / Pipeline / Workflow Builder

Once the storage account is ready, the next step is to create a tool which can help load data into the storage layer. What we need to keep in mind here:

- Connectors for varying input sources, ranging from legacy systems like mainframes to newer sources like Snowflake
- Ease of usage
- Quick / rapid pipeline development
- Lineage
- Availability of triggers (time based / event based / file trigger / rolling window, etc.)
- Support for streaming and batch data flows
- Connectors to push data back to different sinks
- Integration with version control

For this blog we use Azure Data Factory V2. Read through the documentation before you proceed further to familiarize yourself with what ADF V2 can and cannot do; this will help during selection of the right tool.

Azure Data Factory Documentation – Tutorials, API Reference
Learn how to use Data Factory to set up automated data pipelines to and from on-premises and cloud sources (docs.microsoft.com).

Az CLI commands for creating Azure Data Factory are not yet available, so we will use an ARM template to create this resource. To learn more about ARM, see the documentation:

Resource Manager on Azure documentation – Tutorials, samples, reference, and resources
Learn how to use Resource Manager to deploy and manage solution resources as a group and to create templates (docs.microsoft.com).

comCreate a new folder inside the project named “arm” inside the infra folder.

The folder structure is shown below.

This folder holds the ARM template used to create the data factory instance.

Create a new file (json) inside arm folder and then copy the contents into the new file.

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "name": {
            "type": "string",
            "defaultValue": "<NAME_OF_DATA_FACTORY>"
        },
        "location": {
            "type": "string",
            "defaultValue": "<REPLACE_WITH_DATA_FACTORY_REGION>"
        },
        "apiVersion": {
            "type": "string",
            "defaultValue": "2018-06-01"
        }
    },
    "resources": [
        {
            "apiVersion": "[parameters('apiVersion')]",
            "name": "[parameters('name')]",
            "location": "[parameters('location')]",
            "type": "Microsoft.DataFactory/factories",
            "identity": {
                "type": "SystemAssigned"
            }
        }
    ]
}
```

For an explanation of this JSON and the elements in it, refer to the documentation link above. Replace the variable values in the JSON and save the file. This template also refers to a parameters file from which the values of the parameters are obtained.
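Before deploying, it is worth checking that the template and parameters files are at least valid JSON; python3's stdlib json.tool is enough for that. This is an optional sketch; the file created below is a trimmed stand-in for the real parameters file, using this blog's folder layout.

```shell
#!/bin/bash
set -euo pipefail

# Create a minimal stand-in parameters file to validate.
mkdir -p conf
cat > conf/data_factory_prop.json <<'EOF'
{ "parameters": { "name": { "value": "mydatafactory" } } }
EOF

# json.tool exits non-zero on malformed JSON, so bad files are caught
# locally instead of failing during 'az group deployment create'.
for f in conf/*.json; do
  python3 -m json.tool "$f" > /dev/null && echo "OK: $f"
done
```

A missing comma or quote in an ARM file otherwise only surfaces as a deployment error in the Cloud Shell, which is slower to iterate on.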

Create a new folder named conf at the root level and create a new file inside this folder (for example, data_factory_prop.json). Copy these contents into the parameters file:

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "name": {
            "value": "<NAME_OF_DATA_FACTORY>"
        },
        "location": {
            "value": "<REPLACE_WITH_DATA_FACTORY_REGION>"
        },
        "apiVersion": {
            "value": "2018-06-01"
        }
    }
}
```

These values are substituted at run time for the parameters in the main ARM template.

Once the templates are ready, add the deployment instruction to the script infra/data_lake.sh:

```shell
DATA_LAKE_WORKFLOW='datalakeorchestrator'
ARM_LOCATION='arm/data_factory.json'
ARM_PROPS_LOCATION='./conf/data_factory_prop.json'

# Create Data Factory Version 2
az group deployment create --name $DATA_LAKE_WORKFLOW --resource-group $DATA_LAKE_RG --template-file $ARM_LOCATION --parameters $ARM_PROPS_LOCATION
```

This command deploys an Azure Data Factory V2 instance using the template referred to by the "--template-file" argument; the parameters file is referred to by "--parameters". The best way to transport this to the Cloud Shell is to commit it to a git repository and clone that repository in your Azure Cloud Shell. Check whether the data factory has been created; this can also be viewed in the portal:

```shell
az resource list -g <NAME_OF_RESOURCE_GROUP>
```

Create a Relational Database

Source systems could be relational databases which release data to the data lake. Orchestrators / workflow engines can pull data from these systems into the data lake storage to create an enterprise-wide data sharing platform, aka an enterprise data lake.

For this part we use SQL DB (managed SQL Server) on Azure as our relational database.

Azure SQL Database Documentation – Tutorials, API Reference
Azure SQL Database is a general-purpose relational database-as-a-service (DBaaS) based on the latest stable version of… (docs.microsoft.com).

To create this, add the snippet below to the infra/data_lake.sh script under the infra folder, replacing the variable placeholders with appropriate names. Ideally these variables would be stored in a configuration file, and secrets stored and retrieved from Azure Key Vault.
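As an illustration of the configuration-file idea, non-secret settings can live in a small env file that data_lake.sh sources. The file name conf/settings.env and the values below are assumptions for this sketch, not part of the blog's repository; secrets would still come from Key Vault rather than such a file.

```shell
#!/bin/bash
set -euo pipefail

# Write an example settings file (in practice this is checked in, minus secrets).
mkdir -p conf
cat > conf/settings.env <<'EOF'
DATA_LAKE_SERVER_NAME='my-sql-server'
DATA_LAKE_DATABASE_NAME='my-sql-db'
EOF

# Sourcing the file defines the variables in the current shell,
# so the rest of data_lake.sh can use them unchanged.
# shellcheck source=/dev/null
source conf/settings.env
echo "server=$DATA_LAKE_SERVER_NAME db=$DATA_LAKE_DATABASE_NAME"
```

This keeps environment-specific values out of the script itself, so the same data_lake.sh can be promoted between environments by swapping the settings file.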

```shell
DATA_LAKE_SERVER_NAME='<SQL_SERVER_NAME>'
DATA_LAKE_DATABASE_NAME='<SQL_DB_NAME>'
ADMIN_USERNAME='<ENTER_USERNAME>'
ADMIN_PASSWD='<ENTER_PASSWORD>'

# Create a logical server in the resource group
az sql server create --name $DATA_LAKE_SERVER_NAME --resource-group $DATA_LAKE_RG --location $DATA_LAKE_LOCATION --admin-user $ADMIN_USERNAME --admin-password $ADMIN_PASSWD

# Create a database in the server
az sql db create --resource-group $DATA_LAKE_RG --server $DATA_LAKE_SERVER_NAME --name $DATA_LAKE_DATABASE_NAME --sample-name AdventureWorksLT --service-objective S0

# Configure a firewall rule for the server
az sql server firewall-rule create --resource-group $DATA_LAKE_RG --server $DATA_LAKE_SERVER_NAME -n AllowYourIp --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
```

Once this is ready, execute the script and list the resources:

```shell
az resource list -g <NAME_OF_RESOURCE_GROUP>
```

Create an ETL Engine

Once the storage and orchestrator have been defined, the next core component in the data lake is an ETL / ELT tool.

Crucial topics in choosing an ETL engine:

- Should be scalable / on demand
- Should provide flexibility to process both structured and unstructured data
- Should be easy to develop / script jobs
- Integration with source control systems
- Should be able to handle stream and batch flows of data

For this blog, we go with Azure Databricks. Follow the documentation:

Azure Databricks Documentation – Tutorials
Learn how to use Azure Databricks using the quickstart and other links provided here. Azure Databricks is an Apache… (docs.microsoft.com).

Similar to Azure Data Factory, we will use an ARM (Azure Resource Manager) template to create an Azure Databricks instance. Under the /infra/arm folder create a new file of your choice, for example databricks.json, and copy the following snippet into it.

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "workspaceName": {
            "type": "string",
            "metadata": {
                "description": "The name of the Azure Databricks workspace to create."
            }
        },
        "pricingTier": {
            "type": "string",
            "defaultValue": "premium",
            "allowedValues": [ "standard", "premium" ],
            "metadata": {
                "description": "The pricing tier of workspace."
            }
        },
        "location": {
            "type": "string",
            "defaultValue": "[resourceGroup().location]",
            "metadata": {
                "description": "Location for all resources."
            }
        }
    },
    "variables": {
        "managedResourceGroupName": "[concat('databricks-rg-', parameters('workspaceName'), '-', uniqueString(parameters('workspaceName'), resourceGroup().id))]"
    },
    "resources": [
        {
            "type": "Microsoft.Databricks/workspaces",
            "name": "[parameters('workspaceName')]",
            "location": "[parameters('location')]",
            "apiVersion": "2018-04-01",
            "sku": {
                "name": "[parameters('pricingTier')]"
            },
            "properties": {
                "ManagedResourceGroupId": "[concat(subscription().id, '/resourceGroups/', variables('managedResourceGroupName'))]"
            }
        }
    ],
    "outputs": {
        "workspace": {
            "type": "object",
            "value": "[reference(resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName')))]"
        }
    }
}
```

This is the code which requests a new Databricks workspace from Azure.

There is also a parameters file which needs to be added under the /conf/ directory. Create a new file, for example databricks_prop.json, and add the following snippet, substituting the name of the Databricks instance:

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "workspaceName": {
            "value": "<ENTER_NAME_OF_DATABRICKS_INSTANCE>"
        }
    }
}
```

Once this is done, add the deployment trigger to our /infra/data_lake.sh script. Copy this snippet and append it:

```shell
DATA_LAKE_WORKFLOW_DB='<ENTER_ANY_DEPLOYMENT_NAME>'
ARM_LOCATION_DB='arm/<NAME_OF_ARM_TEMPLATE>'
ARM_PROPS_LOCATION_DB='./conf/<NAME_OF_PROPERTY_FILE>'

# Create Azure Databricks
az group deployment create --name $DATA_LAKE_WORKFLOW_DB --resource-group $DATA_LAKE_RG --template-file $ARM_LOCATION_DB --parameters $ARM_PROPS_LOCATION_DB
```

If you use git, commit this code, pull it from the Cloud Shell, and execute the script.
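The Data Factory and Databricks deployments share the same command shape, so the script could factor them into one helper. A minimal sketch follows; the `deploy_arm` function is an assumption, and the `az` function below is a stand-in stub that only echoes the call so the sketch runs outside Azure (delete the stub to run it for real).

```shell
#!/bin/bash
set -euo pipefail

az() { echo "az $*"; }          # demonstration stub, NOT the real CLI

DATA_LAKE_RG='my-datalake-rg'   # example value

# One helper for every ARM deployment in the data lake script.
deploy_arm() {
  local name=$1 template=$2 params=$3
  az group deployment create --name "$name" \
    --resource-group "$DATA_LAKE_RG" \
    --template-file "$template" --parameters "$params"
}

deploy_arm datalakeorchestrator arm/data_factory.json conf/data_factory_prop.json
deploy_arm databricksdeploy     arm/databricks.json   conf/databricks_prop.json
```

Adding a third ARM-based resource then becomes one line instead of a repeated four-variable block.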

On successful execution, check the resources using:

```shell
az resource list -g <NAME_OF_RESOURCE_GROUP>
```

The newly created Databricks instance should be visible now; it can also be seen via the portal.

Creation of a Massively Parallel Processing Data Warehouse (OLAP)

Once the data is prepared and ready to be served, an MPP data warehouse needs to be provisioned to expose the data to visualization tools.

The core requirements would be:

- A highly scalable data warehouse
- Easy to tune performance up, and down when not needed, to avoid extra costs
- Easy connectors to raw storage
- A SQL-based query engine

For this part of the blog, SQL Data Warehouse is considered.

Azure SQL Data Warehouse Documentation – Tutorials, Concepts
Learn how to use an Azure SQL Data Warehouse, which combines SQL relational databases with massively parallel… (docs.microsoft.com).

This can be created using the az command; append the following snippet to the data_lake.sh script:

```shell
DATA_LAKE_DWH_NAME='<ENTER_SQL_DWH_NAME>'

# Create a SQL Data Warehouse
az sql dw create --resource-group $DATA_LAKE_RG --server $DATA_LAKE_SERVER_NAME --name $DATA_LAKE_DWH_NAME
```

Commit this back to git and execute the script again, then check the list of resources inside the resource group. It should include the newly created SQL Data Warehouse.

Adjust the script accordingly and do the required cleanup; a sample git repository used for this blog can be referenced:

sarathsasidharan/datalake
Data Lake template. Contribute to sarathsasidharan/datalake development by creating an account on GitHub (github.com).

The script /infra/data_lake.sh is the single-click deploy script for the data lake. It deploys a storage account, an Azure Data Factory instance, an Azure Databricks instance, a SQL Database, and a SQL Data Warehouse. This script can be included in a build pipeline (Azure DevOps), which enables automated rollouts to test / acceptance / production, or can serve as a template for teams within the organization to start building their own data lakes.
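Because everything the script creates lives in a single resource group, cleanup is likewise a single command. A sketch, with the same echo-only `az` stub as before so it runs outside Azure; remove the stub to execute it for real (and note that `--no-wait` returns before the deletion finishes).

```shell
#!/bin/bash
set -euo pipefail

az() { echo "az $*"; }          # demonstration stub, NOT the real CLI

DATA_LAKE_RG='my-datalake-rg'   # example value

# Deleting the resource group tears down the storage account, Data Factory,
# Databricks workspace, SQL Database, and SQL Data Warehouse in one go.
az group delete --name "$DATA_LAKE_RG" --yes --no-wait
```

Grouping the whole data lake in one resource group is what makes this one-command teardown (and cost tracking) possible.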

The next blog focuses on:

- Data loading / extraction
- Data preparation
- Data sink

The series of blogs to follow will build on this one with other factors like monitoring, controls, nomenclature, DevOps practices for the data lake, data management, etc. As a follow-up to this blog, you can go here for part 2 of this series.
