Introducing Manifold
Uber’s Framework for Machine Learning Debugging and Interpretation
Jesus Rodriguez
Jan 21

Machine learning programs differ from traditional software applications in that their structure is constantly changing and evolving as the model builds more knowledge.
As a result, debugging and interpreting machine learning models is one of the most challenging aspects of real-world artificial intelligence (AI) solutions.
Debugging, interpretation and diagnosis are active areas of focus for organizations building machine learning solutions at scale.
Recently, Uber unveiled Manifold, a framework that utilizes visual analysis techniques to support interpretation, debugging, and comparison of machine learning models.
Manifold combines recent advances in machine learning interpretability to address some of the fundamental challenges of visually debugging machine learning models.
The challenge of debugging and interpreting machine learning models is nothing new and the industry has produced several tools and frameworks in this area.
However, most of the existing stacks focus on evaluating a candidate model using performance metrics such as log loss, area under the curve (AUC), and mean absolute error (MAE) which, although useful, offer little insight into the underlying reasons for the model’s performance.
Another common challenge is that most machine learning debugging tools are constrained to a specific type of model (e.g., regression or classification) and are very difficult to generalize across broader machine learning architectures.
Consequently, data scientists spend tremendous amounts of time trying different model configurations until they reach a target level of performance.
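To make the limitation of aggregate metrics concrete, here is a minimal sketch in plain Python. The labels and the two sets of predicted probabilities are made up for illustration and do not come from Uber; the helper `per_instance_log_loss` is a hypothetical name. The two models end up with almost the same mean log loss, yet one is uniformly mediocre while the other is excellent on three instances and badly wrong on one, which the single averaged number hides.

```python
import math

def per_instance_log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss of each instance (lower is better)."""
    losses = []
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return losses

# Hypothetical data: true labels and predicted P(label == 1) from two models.
y_true = [1, 1, 0, 0]
model_a = [0.70, 0.70, 0.30, 0.30]  # moderately confident everywhere
model_b = [0.99, 0.99, 0.01, 0.75]  # very good on three, wrong on the last

loss_a = per_instance_log_loss(y_true, model_a)
loss_b = per_instance_log_loss(y_true, model_b)
mean_a = sum(loss_a) / len(loss_a)
mean_b = sum(loss_b) / len(loss_b)

# The means are close; the per-instance losses are not.
print(f"mean log loss  A={mean_a:.3f}  B={mean_b:.3f}")
print("per-instance A:", [round(l, 3) for l in loss_a])
print("per-instance B:", [round(l, 3) for l in loss_b])
```

Looking at the per-instance values rather than the average is the kind of view that the rest of this article builds on.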
Entering Manifold

A company like Uber operates hundreds of machine learning models across dozens of teams.
As a result, debugging and interpretability of those models become a key aspect of the machine learning pipeline.
With Manifold, the Uber engineering team wanted to accomplish some very tangible goals:
· Debug code errors in a machine learning model.
· Understand the strengths and weaknesses of one model, both in isolation and in comparison with other models.
· Compare and ensemble different models.
· Incorporate insights gathered through inspection and performance analysis into model iterations.
To accomplish those goals, Manifold segments the machine learning analysis process into three main phases: Inspection, Explanation and Refinement.
· Inspection: In the first part of the analysis process, the user designs a model and attempts to investigate and compare the model outcome with other existing ones.
During this phase, the user compares typical performance metrics, such as accuracy, precision/recall, and the receiver operating characteristic (ROC) curve, to get a coarse-grained sense of whether the new model outperforms the existing ones.
· Explanation: This phase of the analysis process attempts to explain the different hypotheses formulated in the previous phase.
This phase relies on comparative analysis to explain some of the symptoms of the specific models.
· Refinement: In this phase, the user attempts to verify the explanations generated from the previous phase through encoding the knowledge extracted from the explanation into the model and testing the performance.
The three steps of the machine learning analysis process materialize in a simple user interface that streamlines the debugging of machine learning models.
The Manifold user interface consists of two main dialogs:
1) Performance Comparison View: Provides a visual comparison between model pairs using a small multiple design, and a local feature interpreter view.
2) Feature Attribution View: Reveals a feature-wise comparison between user defined subsets and provides a similarity measure of feature distributions.
Users can debug machine learning models in Manifold using three main steps:
1) Compare: First, given a dataset with the output from one or more ML model(s), Manifold compares and highlights performance differences across models or data subsets.
2) Slice: This step lets users select data subsets of interest based on model performance for further inspection.
3) Attribute: Manifold then highlights feature distribution differences between the selected data subsets, helping users find the reasons behind the performance outcomes.
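The three steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Manifold's implementation: the rows, feature names (`distance_km`, `hour`), model names and the loss threshold are all hypothetical, and the standardized difference in means used in the attribute step is a simple stand-in for whatever similarity measure Manifold actually applies to feature distributions.

```python
import math
from statistics import mean, pstdev

def log_loss(y, p, eps=1e-15):
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def compare(rows, model_keys):
    """Step 1: attach a per-instance loss for every model."""
    return [
        {**r, **{f"loss_{m}": log_loss(r["label"], r[m]) for m in model_keys}}
        for r in rows
    ]

def slice_rows(rows, loss_key, threshold):
    """Step 2: split instances into a high-loss subset and the rest."""
    hard = [r for r in rows if r[loss_key] > threshold]
    easy = [r for r in rows if r[loss_key] <= threshold]
    return hard, easy

def attribute(hard, easy, feature_keys):
    """Step 3: rank features by how differently they are distributed
    in the two subsets (standardized difference in means)."""
    if not hard or not easy:
        return []
    scores = {}
    for f in feature_keys:
        a = [r[f] for r in hard]
        b = [r[f] for r in easy]
        spread = pstdev(a + b) or 1.0  # guard against constant features
        scores[f] = abs(mean(a) - mean(b)) / spread
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical dataset: two features, a binary label, two models' probabilities.
rows = [
    {"distance_km": 1.0, "hour": 9,  "label": 1, "model_a": 0.9, "model_b": 0.85},
    {"distance_km": 1.5, "hour": 18, "label": 0, "model_a": 0.1, "model_b": 0.20},
    {"distance_km": 2.0, "hour": 12, "label": 1, "model_a": 0.8, "model_b": 0.70},
    {"distance_km": 9.0, "hour": 10, "label": 1, "model_a": 0.3, "model_b": 0.40},
    {"distance_km": 8.5, "hour": 17, "label": 0, "model_a": 0.8, "model_b": 0.70},
    {"distance_km": 1.2, "hour": 13, "label": 0, "model_a": 0.2, "model_b": 0.30},
]

scored = compare(rows, ["model_a", "model_b"])
hard, easy = slice_rows(scored, "loss_model_a", threshold=0.7)
ranking = attribute(hard, easy, ["distance_km", "hour"])
print(ranking)  # features ordered from most to least different across subsets
```

In this toy data the high-loss subset consists of the long-distance rows, so `distance_km` ranks above `hour`, which is the kind of hint the Attribute step is meant to surface.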
The Manifold Architecture

From an architecture standpoint, the Manifold workflow takes a group of machine learning models as input and produces different data segments based on feature engineering.
The feature segments are then processed by a group of encoders that produce a set of new features with intrinsic structures that were not captured by the original models, helping users iterate on new models and obtain better performance.
The workflow depicted above is implemented in a simple architecture that is based on three main components: data source, backend and frontend.
Functionally, the Manifold architecture is based on three main modules: Data transformer, a feature that adapts data formats from other internal services (e.
js which remove the need of expensive computation hardware.
For more computation-intensive processes, Manifold provides a Python interface built on Pandas and Scikit-Learn.
Manifold in Action at Uber

Uber has adopted Manifold across all its data science teams.
Recently, the Uber Eats team leveraged Manifold to evaluate a new model that predicts order delivery times.
During the implementation, the Uber team integrated an extra set of features that they expected to improve the performance of the existing model.
However, after the first tests, they noticed that the performance of the model was barely affected.
Were the data scientists wrong to incorporate the new features?
Using Manifold, the Uber team visualized the original model (green) and the model with the new features (orange).
As you can see in the following figure, the test dataset was automatically segmented into four clusters based on performance similarity among data points.
For Clusters 0, 1, and 2, the model with additional features provided no performance improvement.
However, the performance of the new model (the one with extra features) was slightly better in Cluster 3, as indicated by a log-loss distribution shifted to the left, that is, toward lower loss.
The results indicate that the extra features help in Cluster 3, which covers some very specific use cases that were hard to assess in the other clusters.
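The idea of segmenting a test set by performance similarity and then comparing models inside each segment can be sketched as follows. This is a toy illustration, not Uber's code: the per-instance losses are invented, the small deterministic k-means routine is a hypothetical stand-in for whatever clustering Manifold uses, and three clusters are used instead of the four in the Uber Eats example to keep the data short.

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic init (evenly spaced sorted points)."""
    ordered = sorted(points)
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centers = [ordered[round(i * step)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def summarize(points, labels, k):
    """Mean loss per model in each cluster, plus the improvement."""
    summary = {}
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            base = sum(p[0] for p in members) / len(members)
            new = sum(p[1] for p in members) / len(members)
            summary[c] = {"baseline": base, "new": new, "improvement": base - new}
    return summary

# Hypothetical per-instance log losses: (baseline model, model with new features).
losses = [
    (0.10, 0.10), (0.50, 0.50), (1.50, 1.10),
    (0.12, 0.11), (0.55, 0.54), (1.60, 1.15),
    (0.08, 0.09), (0.48, 0.49), (1.40, 1.05),
]

labels = kmeans(losses, k=3)
summary = summarize(losses, labels, k=3)
for cluster, stats in summary.items():
    print(cluster, {name: round(v, 3) for name, v in stats.items()})
```

With this made-up data, two clusters show essentially no difference between the models, while the high-loss cluster shows a clear drop in loss for the new model. An overall average would dilute that gain, which mirrors the situation described for Cluster 3.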
Manifold represents an important step towards improving the debuggability and interpretability of machine learning models.
Even if Uber doesn’t open source Manifold, some of the ideas outlined in the research paper can be incorporated into machine learning tools and frameworks in order to improve the lifecycle of machine learning solutions.