How to troubleshoot your Azure Data Science project in production

René Bremer, May 26

1. Introduction

Suppose a model is running in production and creates predictions that are not understood. Then troubleshooting is needed and at least the following shall be traced back:

- Version of the model deployed in production
- Metrics and statistics of the model
- Data and algorithm used in the model
- Person/team that created the model

In this blog, an Azure data science project is defined and it is then discussed how troubleshooting can be done.
In case you are interested in how the project is implemented in detail, refer to my previous blog.
Notice that this blog is standalone and it is not necessary to read the previous blog first.
2. Setup of Azure data science project

Azure offers a lot of possibilities to do data analytics and data science.
In this project, functionality of popular Azure services is combined to create a predictive endpoint.
The setup of the project can be depicted as follows:

[2. Setup of Azure data science project using popular Azure services]

The Azure services and their usage in this project can be described as follows: Azure Storage gen 2 is used for storing the data.
Data is made immutable and new data is added in separate folders with a timestamp such that it is clear when it is added.
Azure Databricks is used for interactive development and feature engineering of the model.
Databricks Spark Clusters are used by Azure ML Service as compute to train the model. Git is used for source control of the notebooks and is linked as a repository to Azure DevOps. In this project, the git repository in Azure DevOps is used, but GitHub could have been used as well.
Azure Machine Learning (ML) Service is used to keep track of the model and its metrics.
Azure ML Service SDK is used to create and deploy docker images of the model to Azure Container Instance and Azure Kubernetes Service.
For an overview of the deployment possibilities using Azure ML Service, see here.
Azure DevOps is used to setup a build/release pipeline.
In the build phase, a docker image is created using git, Databricks Spark Clusters and Azure ML Service SDK.
In the release phase, the docker image is deployed as Azure Container Instance for testing and then to Azure Kubernetes Service for production.
Azure Container Instance (ACI) is used as test endpoint, whereas Azure Kubernetes Service (AKS) is used for production.
AKS is scalable, secure and fault tolerant.
Using AKS with the Azure ML Service SDK, a lot of features come out of the box, e.g. logging and evaluating model data and monitoring fail rate and request rate.

The setup of this project can be seen as a “best of breed“ approach, which has the following advantages and disadvantages: an advantage of this setup is that teams can use the tools they like the most and have the most experience with.
For instance, use Azure DevOps that is already used for other IT projects.
Or use Azure Databricks that is already used for other big data projects within the company.
Disadvantage is that integration and troubleshooting can become more complex.
This will be dealt with in the next chapter.
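The docker image that ACI and AKS run wraps the model in Azure ML's scoring-script contract: an `init()` called once at container start and a `run()` called per request. A minimal sketch of such a script is shown below; the stub model and the `{"data": ...}` payload shape are assumptions, and a real script would load a serialized model (e.g. via `joblib.load(Model.get_model_path(...))`) instead.

```python
import json

# Minimal sketch of the scoring script the docker image wraps
# (Azure ML's init/run contract). The model is a stub standing in
# for a real deserialized model.
model = None

def init():
    # Called once when the container starts.
    global model
    model = lambda rows: [sum(row) for row in rows]  # hypothetical stub model

def run(raw_data):
    # Called for each scoring request with the raw JSON body.
    try:
        data = json.loads(raw_data)["data"]
        return json.dumps({"result": model(data)})
    except Exception as exc:
        # Returning the error keeps the endpoint debuggable from the client side.
        return json.dumps({"error": str(exc)})

init()
print(run(json.dumps({"data": [[1, 2], [3, 4]]})))  # → {"result": [3, 7]}
```

The try/except in `run()` matters for troubleshooting: malformed requests come back as an error payload instead of an opaque HTTP 500.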
3. Troubleshooting

Suppose that the HTTP endpoint creates predictions that are not understood. It then needs to be investigated what the predictions are based on. In the remainder of this chapter, this investigation is done along the topics introduced in the first part of this blog:

3a. Version of HTTP endpoint, image and model
3b. Metrics and statistics of model
3c. Data and source code used in model
3d. Person/team that created model

3a. Version of HTTP endpoint, image and model

In the data science project, the model was deployed to AKS production using the Azure ML Service SDK.
In order to find the model used, go to your Azure ML Service Workspace in the Azure portal and then to the deployment tab. Here you can see the IP address of the HTTP endpoint, the compute target (ACI or AKS) and the docker image used.
[Figure: HTTP endpoint, AKS and image]

Subsequently, the model version can be found in the model tab.

[Figure: Model used in docker image]

Notice the run id that can be found in the tags.
This can be used to retrieve the model metrics in the next part.
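Programmatically, the run id is just an entry in the registered model's tag dictionary (in the Azure ML SDK, available as `model.tags`). A small helper can make the lookup robust; note that the tag keys tried here are assumptions, so check what your own registration step writes.

```python
def extract_run_id(tags):
    """Recover the training run id from a registered model's tag dictionary.

    The exact tag key depends on how the model was registered; the keys
    tried below are assumptions, not a fixed Azure ML convention.
    """
    for key in ("run_id", "runId"):
        if key in tags:
            return tags[key]
    raise KeyError("no run id tag found; was the model registered with one?")

# Hypothetical tags as they might appear on the model tab:
print(extract_run_id({"run_id": "a1b2c3d4-0000", "area": "diabetes"}))  # → a1b2c3d4-0000
```

Registering models with a run id tag in the first place is what makes this trace possible at all.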
3b. Metrics and performance of model

Using the run_id from the previous part, you can look up the metrics and statistics of your model.
Go to Experiments and use the run_id in the filter to find the metrics of the model.
[Figure: Retrieve model metrics]

For this model, the roc, prc and false positives/negatives were logged. Additionally, you can also retrieve images that were created using, for example, matplotlib.
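In the SDK, the same metrics come back as a name-to-value dictionary (e.g. from `Run.get_metrics()` in Azure ML SDK v1), where metrics logged repeatedly appear as lists. A small helper, sketched below with hypothetical metric names and values, collapses these to their final value for a quick comparison against the portal:

```python
def summarize_metrics(metrics):
    """Collapse list-valued metrics (logged once per step) to their last
    value, mirroring the name -> value/list dict that a metrics lookup
    such as Run.get_metrics() in the Azure ML SDK returns."""
    return {name: (val[-1] if isinstance(val, list) else val)
            for name, val in metrics.items()}

# Hypothetical metrics for the model in this blog:
sample = {"roc_auc": 0.91, "prc_auc": 0.88, "false_positives": [14, 11, 9]}
print(summarize_metrics(sample))
```

This gives one flat dict per run, which is convenient when comparing the suspicious production model against earlier runs.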
[Figure: Display metrics of models]

3c. Data and source code used in model

In this part, the data and source code that were used to train the model will be identified.
All source code is stored in the repository of the Azure DevOps project, so the model identified in part 3a needs to be linked to the repository. First, look up the release pipeline that used the Azure ML Service API to deploy the image as a container.
In the project, a nightly build/release pipeline is created, so the last successful release can be looked up, see below.

[3c1. Last successful build/release in Azure DevOps]

In the logs of the release, the logging of the release steps can be found.
In the logging of the deploy step, the HTTP IP address and image version can be found, corresponding to the IP and version found in step 3a, see below.

[Figure: Logging of release]

Subsequently, download all input of the release by clicking on the artifact box. In this download, all source code can be found, see below.

[Figure: Download build artifact]

In this project, data is not part of version control and is not stored in the artifact.
Instead, data is stored on Azure Storage gen 2 and is immutable.
New data is added to a new folder with a timestamp.
Subsequently, the build pipeline logs which folders were used for training the model, so that it is clear what data was used.
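The timestamped-folder convention makes that bookkeeping trivial. A sketch of the two sides, assuming a hypothetical `raw/<UTC timestamp>` layout (the actual container layout may differ): landing new data in a fresh immutable folder, and deriving which folders a training run saw.

```python
from datetime import datetime, timezone

def new_data_folder(base="raw"):
    """Build an immutable, timestamped landing folder for new data,
    e.g. 'raw/2019-05-26T12-00-00Z'. The 'raw' prefix is an assumption."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"{base}/{stamp}"

def folders_used_for_training(folders, training_started):
    """The folders a training run should log: everything that landed before
    the run started. Lexicographic comparison works for this timestamp format."""
    return sorted(f for f in folders if f <= training_started)

folders = ["raw/2019-05-01T00-00-00Z", "raw/2019-05-15T00-00-00Z",
           "raw/2019-06-01T00-00-00Z"]
print(folders_used_for_training(folders, "raw/2019-05-26T00-00-00Z"))
```

Logging the returned list as a run tag or metric ties the exact training data to the model without ever putting data under version control.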
3d. Person/team that created model

In the previous section, the source code and data that were used to create the model were identified.
In this part, it is traced back which person/team changed the code and when this happened.
In the build artifact downloaded in step 3c, the commit id can also be found, in this case e670024. When you click on this commit, a code diff with the previous commit is shown. Also, the person and datetime of the change can be found, see below.
[Figure: Code compare with previous commit]

By clicking on the Parent 1 link, you can trace back commit by commit until a suspicious change is found.
Alternatively, you can also go directly to the commit tab, see an overview of the latest commits and start from there, see below.
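What the Azure DevOps commit view shows is plain git history, so the same trace works locally on a clone of the repository. The sketch below builds a throwaway two-commit repository (the file and messages are hypothetical) and runs the equivalents of the commit overview and the Parent 1 diff; it assumes `git` is installed.

```python
import subprocess
import tempfile

def git(args, cwd):
    """Run a git command in the given repository and return its stdout."""
    return subprocess.run(["git"] + args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

# Build a throwaway repository with two commits to demonstrate the trace.
repo = tempfile.mkdtemp()
git(["init", "-q"], repo)
git(["config", "user.email", "demo@example.com"], repo)
git(["config", "user.name", "Demo"], repo)
for content, message in [("model v1\n", "initial model"),
                         ("model v2\n", "tweak features")]:
    with open(f"{repo}/train.py", "w") as f:
        f.write(content)
    git(["add", "train.py"], repo)
    git(["commit", "-qm", message], repo)

# Who changed what, and when (the commit overview):
print(git(["log", "--pretty=%h %an %ad %s", "--date=short"], repo))
# Diff against the parent commit (the "Parent 1" link):
print(git(["diff", "HEAD^", "HEAD"], repo))
```

On a real clone you would start from the commit id found in the build artifact, e.g. `git show e670024` and `git diff e670024^ e670024`.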
[Figure: Commit overview]

Finally, notice that it is possible to connect your Azure git repository with Azure Databricks, which is used as the interactive development environment in this project, see the screenshot below and the explanation here.

[3d3. Azure Databricks integration with Azure DevOps]

4. Conclusion

In this blog, a setup for a data science project was described in which the functionality of popular Azure services is combined.
Since integration and troubleshooting can be challenging when multiple services are combined, it was subsequently discussed how error detection can be done.
Having good traceability of your model may help to bring your project to production, see also the architecture below.

[Figure: Setup of Azure Data Science project]