Managing Data Science Workflows the Uber Way
Jesus Rodriguez · Mar 22

Orchestrating workflows is one of the main challenges of machine learning solutions in the real world.
A machine learning solution involves more than just picking the right model and productizing it.
Data ingestion, training, deployment or optimization are common steps in any machine learning workflow.
Unfortunately, the technology stacks for building and managing coordinated actions across all those steps haven't developed at the same pace as the frameworks and libraries for creating the models.
Uber is one of the companies that have been innovating in this area.
Over the last few years, the Uber engineering team has regularly developed relevant building blocks for orchestrating and managing machine learning workflows at scale.
The challenge of orchestrating machine learning workflows is often lost in the grand vision of machine learning solutions.
It's more exciting to identify the right machine learning technique for a problem than to think about orchestrating data flows or deployments.
However, this oversight is the breaking point of many otherwise viable machine learning solutions, which never make their way into an operational state.
Initial attempts to address this problem involved adapting workflow management tools such as Apache Oozie, Apache Airflow, and Jenkins to machine learning workflows.
That approach yielded some positive results but proved very limited, as machine learning workflows are fundamentally different from other applications.
Recently, domain-specific solutions such as Cloudera's Data Science Workbench (DSW) have come onto the scene to address this same challenge.
While certainly a powerful stack, DSW hasn't really been validated in large-scale scenarios.
After experimenting with many of these alternatives, Uber decided to build its own workflow management framework optimized for machine learning workflows.
Piper
Piper was the result of years of experimentation with workflow management stacks at Uber.
Using systems such as Apache Airflow as inspiration, Uber built Piper as a multi-tenant, highly scalable framework for creating and executing workflows.
While Piper can be considered a general purpose workflow management framework, most of its applications have been related to machine learning workflows.
Conceptually, Piper was based on some key principles:

Workflows should be easy to author via code while also being expressive and supporting the ability to generate workflows dynamically.
Support a development process that engineers are accustomed to, including developing data workflows as code, and tracking via revision control.
Workflows that are easy to visualize and manage.
Logs that are easily accessible for viewing, both for past and present runs of a workflow.
The goal of Piper was to create and run highly performant workflows.
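The "workflows as code" principle above can be sketched with a toy example. Piper is not open source, so the `Workflow` class, its methods, and the task names below are illustrative assumptions modeled on Airflow-style DAG definitions, not Piper's actual API.

```python
# Hypothetical sketch of "workflows as code": a small DAG of named tasks
# with explicit dependencies, executed in topological order.
from collections import deque

class Workflow:
    def __init__(self, name):
        self.name = name
        self.tasks = {}   # task name -> callable
        self.deps = {}    # task name -> set of upstream task names

    def task(self, name, fn, depends_on=()):
        # Registering tasks programmatically is what makes workflows
        # easy to generate dynamically (e.g., one task per data partition).
        self.tasks[name] = fn
        self.deps[name] = set(depends_on)
        return self

    def run(self):
        """Execute tasks in dependency (topological) order."""
        indegree = {t: len(d) for t, d in self.deps.items()}
        ready = deque(t for t, n in indegree.items() if n == 0)
        order = []
        while ready:
            t = ready.popleft()
            self.tasks[t]()
            order.append(t)
            for other, upstream in self.deps.items():
                if t in upstream:
                    indegree[other] -= 1
                    if indegree[other] == 0:
                        ready.append(other)
        return order

wf = Workflow("daily_training")
wf.task("ingest", lambda: print("ingesting raw data"))
wf.task("etl", lambda: print("preparing features"), depends_on=["ingest"])
wf.task("train", lambda: print("training model"), depends_on=["etl"])
```

Because the definition lives in ordinary Python files, it can be tracked in revision control and reviewed like any other code, which is exactly the development process the principles call for.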
The initial architecture of Piper was very close to that of Apache Airflow, but it steadily incorporated several new components based on lessons learned in Uber's specific scenarios.
That evolution resulted in an extremely robust workflow management framework that is nonetheless based on a very simple architecture.
Piper's architecture, shown in the previous figure, includes the following components:

Web server: An application server that services HTTP requests, including those for UI endpoints as well as JSON API endpoints.
Scheduler: Responsible for scheduling workflows and tasks.
The scheduler takes into account various factors such as schedule interval, task dependencies, trigger rules and retries, and uses this information to calculate the next set of tasks to run.
Once resources are available, it queues the task for execution in the appropriate executor (in Piper's case, Celery).
Celery worker: The workers execute all workflow tasks.
Each worker pulls the next task for execution from the queue (in Piper's case, Redis) and executes the task locally.
Metadata database: Source of truth for all entities in the system such as workflows, tasks, connections, variables, and XCOMs, as well as execution status for the workflows.
Python workflows: Python files written by users to define workflows, tasks, and libraries.
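The scheduler/worker split described above can be sketched with standard-library stand-ins: a `queue.Queue` plays the role of the Redis-backed queue, and plain threads play the role of Celery workers. These substitutions are purely illustrative assumptions; Piper's real implementation uses Celery and Redis as noted.

```python
# Simplified sketch of the scheduler/worker split: a "scheduler" enqueues
# ready tasks, and a pool of workers pulls tasks off the queue and
# executes them locally.
import queue
import threading

task_queue = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        task = task_queue.get()
        if task is None:              # sentinel: shut this worker down
            task_queue.task_done()
            break
        name, fn = task
        out = fn()                    # execute the task locally
        with lock:
            results.append((name, out))
        task_queue.task_done()

# Start a small worker pool (stand-in for Celery workers).
workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

# The "scheduler" queues tasks whose dependencies are satisfied.
task_queue.put(("train", lambda: "model-v1"))
task_queue.put(("validate", lambda: "auc=0.91"))

# Drain the queue and stop the workers (one sentinel per worker).
task_queue.put(None)
task_queue.put(None)
task_queue.join()
for w in workers:
    w.join()
```

The key property this illustrates is that workers are interchangeable and stateless: scaling the system means adding workers, while the metadata database remains the single source of truth for execution status.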
Exploring Piper’s architecture, we can clearly see how many of those components are relevant to machine learning scenarios.
Uber has adapted Piper to power several of its mission-critical machine learning workflows.
Piper for Machine Learning Workflows
Using Piper for machine learning at Uber starts by integrating it with the Michelangelo platform, which is the heart of Uber's data science workflows.
In that context, almost every aspect of the lifecycle of machine learning solutions can be orchestrated through Piper.
For instance, the process of orchestrating a machine learning training workflow in Piper can be divided into three main workflows:

1) The first workflow ingests data into the Hadoop data lake.
2) The second workflow prepares the model data through extract, transform, and load (ETL).
3) The third workflow makes up the core of the ML tasks, typically consisting of four stages: model training, model performance validation, model deployment, and model performance monitoring.
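The four stages of that third workflow can be sketched as a chain of functions, where each stage consumes the previous stage's output. The function names, the toy mean-based "model," and the in-memory registry are illustrative assumptions, not Michelangelo or Piper APIs.

```python
# Toy sketch of the four stages: train -> validate -> deploy -> monitor.

def train_model(features):
    # Fit a trivial "model": the mean of the training values.
    return {"mean": sum(features) / len(features)}

def validate_model(model, holdout, tolerance=1.0):
    # Pass validation if the model's mean is close to the holdout mean.
    error = abs(model["mean"] - sum(holdout) / len(holdout))
    return error <= tolerance

def deploy_model(model, registry):
    # Promote the validated model to the production slot.
    registry["production"] = model
    return registry

def monitor_model(registry, live_values, tolerance=1.0):
    # Flag drift when live data diverges from the deployed model.
    drift = abs(registry["production"]["mean"]
                - sum(live_values) / len(live_values))
    return "ok" if drift <= tolerance else "drift"

features = [1.0, 2.0, 3.0]
model = train_model(features)
assert validate_model(model, holdout=[1.5, 2.5])
registry = deploy_model(model, registry={})
status = monitor_model(registry, live_values=[2.0, 2.2])
```

The point of the sketch is the dependency structure: deployment is gated on validation, and monitoring closes the loop by feeding drift signals back into the next training run.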
Similarly, a natural language processing (NLP) process in Piper has the following workflows:

1) The first workflow ingests raw data into the Hadoop data lake.
2) The second workflow updates the feature table with both structured data and free text that will be used in model training.
3) The third workflow starts with an Apache Spark job that tokenizes the free text, indexes some of the features, and embeds features for deep learning (DL) training.
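The tokenize-and-index step can be illustrated with a toy stand-in: tokenize free text and build a vocabulary index mapping each token to an integer id, the usual precursor to embedding features for DL training. In production this runs as an Apache Spark job; plain Python is used here only as an assumed, simplified illustration of the transformation.

```python
# Toy stand-in for the Spark tokenization/indexing step.

def tokenize(text):
    # Naive whitespace tokenizer; real pipelines use richer tokenization.
    return text.lower().split()

def build_index(documents):
    # Assign each distinct token the next available integer id.
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    # Map tokens to ids; unseen tokens fall back to -1.
    return [vocab.get(tok, -1) for tok in tokenize(text)]

docs = ["Driver rated five stars", "Rider rated three stars"]
vocab = build_index(docs)
ids = encode("five stars", vocab)
```

The resulting integer ids are what an embedding layer consumes during DL training, which is why this indexing step sits at the head of the third workflow.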
Piper is still at a very early stage and hasn't been open sourced, which limits its applicability to a wider spectrum of machine learning scenarios.
However, many of the architecture principles of Piper can be applied to build highly scalable machine learning workflows.
In the future, I expect many of these lessons to be adapted to more mainstream machine learning workflow tools.