omega|ml: Deploying Data Pipelines & Machine Learning Models the Easy Way

Deploying data pipelines and machine learning models can take anywhere from weeks to months, involving engineers from many disciplines.
Our open source data science platform omega|ml, written in Python and leveraging MongoDB, accomplishes the task in a matter of seconds — all it takes is a single line of code.
Patrick Senti, Apr 8

Photo by Ashkan Forouzani on Unsplash

Deploying data & machine learning pipelines is hard; it should not be

When it comes to deploying data & machine learning pipelines, there are many options to choose from, and most of them are quite complex.
As of spring 2019, best practices range from building your very own application server that somehow consumes a serialized version of your trained machine learning model, through building Docker images using partially automated tools, to using a full-scale commercial data science platform that has it all.
Whatever the approach to deploying a machine learning model, it is usually not a core skill of data scientists.
To do it right requires passion for subjects that are closer to software engineering and distributed systems than they are to machine learning and statistics.
Besides, why build new infrastructure when it is readily available?

omega|ml: A single line of code is all you need

Enter the open-source package omega|ml, the production-ready data science framework that scales from laptop to cloud.
With omega|ml, deploying a machine learning model is as simple as this Python one-liner, available straight from your Jupyter notebook or from any other Python program:

    om.put(model, 'mymodel')  # model is e.g. a scikit-learn model

Without further ado, the model is immediately available from omega|ml’s REST API and can be used for prediction by any client, written in any language:

    GET http://host/v1/model/mymodel/predict/ + <json body>

Here <json body> is a dictionary with the names of the columns and corresponding data values.
A model is any scikit-learn model; other frameworks such as Keras, TensorFlow and PyTorch are easily added using the extensions API.
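As a minimal sketch of what a client call to the prediction endpoint might look like, the snippet below builds (but does not send) the request using only the Python standard library. The helper name `build_predict_request`, the host and the column names are illustrative assumptions, and the exact JSON shape the server expects should be checked against the omega|ml documentation.

```python
import json
from urllib.request import Request


def build_predict_request(host, model_name, row):
    """Build a GET request for the predict endpoint (hypothetical helper).

    `row` is a dictionary mapping column names to data values, as
    described in the text. The request is only constructed, not sent.
    """
    url = f"http://{host}/v1/model/{model_name}/predict/"
    body = json.dumps(row).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


# Column names and values are made up for illustration.
request = build_predict_request("host", "mymodel", {"x1": 1.5, "x2": 0.3})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, or of issuing the same call from an HTTP client in any other language.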
There’s more: Data ingestion, data munging, scheduling & report publishing

Model deployment does not start and stop with the model, but also requires data, model retraining and monitoring.
To this end omega|ml covers the full data science pipeline — from data ingestion, data munging, job scheduling, to model building, selection and validation, to online reporting & applications publishing.
All from an easy-to-use API that accomplishes most tasks with just a single line of code, directly from within Jupyter Notebook.
For example, to ingest new data from Python:

    om.put(data, 'mydata')  # data is a container object, e.g. list, dict or a pandas DataFrame

Similarly, data can be ingested using the REST API:

    PUT http://host/v1/dataset/mydata + <json body>

Data can also be queried from either the REST API or a Python client:

    # Python
    om.get('mydata', column=value)

    # REST API
    GET http://host/v1/dataset/mydata?column=value

Pandas-like MDataFrame: Larger-than-memory dataframes

In addition to the deployment of model & data pipelines, omega|ml provides a Pandas-like API to columnar datasets of any size, called MDataFrame (“M” stands for massive in general, and specifically for MongoDB).
While not a drop-in replacement for Pandas yet, MDataFrame provides many of the most useful parts of the Pandas API such as indexing, filtering, merging and row-wise functions for data munging, as well as descriptive statistics, correlation and covariance.
With MDataFrame, even larger-than-memory data sets are easily analysed, models trained and executed out-of-core using omega|ml’s integrated data and compute cluster.
By default, the cluster builds on MongoDB and on Celery, Python’s excellent distributed task framework, which integrates nicely with scikit-learn’s joblib. It can also readily utilize Dask Distributed or Apache Spark clusters (hosted and commercial editions).
Note that datasets stored in omega|ml are only limited by the physical disk space available to its storage layer, not by memory size.
Unlike Spark, omega|ml’s MDataFrame does not incur a pre-loading latency before processing can begin, as all the heavy lifting is done by MongoDB.
New data is persisted by default and readily available to other data scientists.
Moreover, multiple users can leverage the cluster at the same time, consuming many different datasets that are, individually or in combination, larger than the physical memory available in the cluster.
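To illustrate what it means for the heavy lifting to be done by the database, here is a toy sketch of a deferred dataframe. It records operations and translates them into a MongoDB aggregation pipeline instead of loading rows into memory. The class `DeferredFrame` and its methods are hypothetical and are not omega|ml's implementation; only the pipeline stages (`$match`, `$group`, `$avg`) are standard MongoDB operators.

```python
class DeferredFrame:
    """Toy stand-in for a lazily evaluated dataframe (hypothetical)."""

    def __init__(self, collection, stages=None):
        self.collection = collection
        self.stages = list(stages or [])

    def filter(self, **conditions):
        # Record an equality filter; nothing is executed yet.
        stage = {"$match": dict(conditions)}
        return DeferredFrame(self.collection, self.stages + [stage])

    def mean(self, column):
        # Record an average over one column, to be computed by the database.
        stage = {"$group": {"_id": None, "mean": {"$avg": f"${column}"}}}
        return DeferredFrame(self.collection, self.stages + [stage])

    @property
    def pipeline(self):
        # The list a driver such as PyMongo would pass to
        # collection.aggregate(); here it is only returned for inspection.
        return list(self.stages)


# Dataset, column and filter value are made up for illustration.
mdf = DeferredFrame("mydata")
pipeline = mdf.filter(region="north").mean("sales").pipeline
print(pipeline)
```

Because each call only appends a stage, the client never holds the rows themselves; only the small aggregated result would travel back from the database.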
Using MDataFrame, e.g. to summarize data,

    mdf = om.[…]
    […].mean()

to subset data by filter, index or by column,

    # by filter
    query = mdf['column'] == value
    mdf.loc[query]

    # by index
    mdf.loc[['value1', 'value2', ...]]

    # by column
    mdf['column']
    mdf['column1', 'column2']

to perform a deferred filter operation and calculation, executed by MongoDB,

    # query on sales
    mdf.[…].value

    # apply row-wise calculation
    mdf.apply(lambda v: v * 2).value

Merging dataframes,

    mdf1.merge(mdf2)

MDataFrames can also readily be used with scikit-learn models stored in omega|ml, e.g. when using the compute cluster.

    # mydata is an MDataFrame previously stored using om.[…]
    […].fit('mydata[^Y]', 'mydata[Y]')

The full documentation of omega|ml’s Python and REST API is available at https://omegaml.
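The bracket notation in the fit call is not explained in the article. Assuming that `name[Y]` selects column Y of a stored dataset and `name[^Y]` selects every column except Y (an interpretation that should be verified against the omega|ml documentation), such a specification could be parsed as in the following sketch. The functions `parse_spec` and `select_columns` are hypothetical helpers written for illustration only.

```python
import re

# name, optionally followed by a bracketed, comma-separated column list
SPEC = re.compile(r"^(?P<name>[^\[\]]+)(\[(?P<cols>[^\[\]]*)\])?$")


def parse_spec(spec):
    """Split 'name[cols]' into (name, included columns, excluded columns)."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"invalid dataset specification: {spec!r}")
    include, exclude = [], []
    for col in (match.group("cols") or "").split(","):
        col = col.strip()
        if not col:
            continue
        # Assumption: a leading ^ marks a column to leave out.
        if col.startswith("^"):
            exclude.append(col[1:])
        else:
            include.append(col)
    return match.group("name"), include, exclude


def select_columns(all_columns, include, exclude):
    """Apply the parsed selection to a list of available column names."""
    chosen = include or list(all_columns)
    return [col for col in chosen if col not in exclude]


columns = ["a", "b", "Y"]  # made-up column names
name_x, inc_x, exc_x = parse_spec("mydata[^Y]")
name_y, inc_y, exc_y = parse_spec("mydata[Y]")
features = select_columns(columns, inc_x, exc_x)
target = select_columns(columns, inc_y, exc_y)
print(name_x, features, target)
```

Under this reading, the first argument of the fit call names the feature columns and the second names the target column of the same dataset.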
Features & add-ons

omega|ml includes a range of features typically missing from a data scientist’s workspace, out of the box:

- out-of-core datasets using a Pandas-like API
- asynchronous and scheduled model training, optimization and validation, directly from Python and through the REST API
- integrated, scalable data cluster (based on MongoDB)
- integrated compute cluster utilizing either Celery, Dask Distributed or Spark

A number of add-ons make omega|ml viable for collaboration in teams and organizations:

- distributing Plotly Dash & Jupyter Notebooks to business users, straight from within Jupyter Notebook (add-on)
- a pure-Python mini-batch framework similar to Spark Streaming (add-on, comes without the complexity of a Scala/JVM/Hadoop setup)
- multi-user roles and security (add-on, provided in the hosted & enterprise editions)

An extensible & scalable architecture

While omega|ml works just fine on a laptop through an easy-to-use API, its architecture is built for cloud scalability and extensibility.
Integrating with scikit-learn and Spark MLlib out of the box, its core API enables developers to build extensions for any machine learning framework such as Keras, TensorFlow or PyTorch with only a few lines of code, while keeping a stable API both in Python and for the REST API.
The same is true for external data sources such as Amazon S3 or other object stores as well as databases such as MySQL or Oracle, which can easily be added as extensions.
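The general idea of keeping a stable put/get API while extensions add support for new object types can be sketched with a small backend registry. Everything in this example (`Store`, `PickleBackend`, `TextBackend` and their methods) is hypothetical and only illustrates the pattern; it does not reproduce omega|ml's actual extension interface.

```python
import pickle


class PickleBackend:
    """Fallback backend: serializes any picklable object (hypothetical)."""

    @classmethod
    def supports(cls, obj):
        return True

    def dumps(self, obj):
        return pickle.dumps(obj)

    def loads(self, blob):
        return pickle.loads(blob)


class TextBackend:
    """Example extension: stores plain strings as UTF-8 (hypothetical)."""

    @classmethod
    def supports(cls, obj):
        return isinstance(obj, str)

    def dumps(self, obj):
        return obj.encode("utf-8")

    def loads(self, blob):
        return blob.decode("utf-8")


class Store:
    """In-memory store whose put/get API stays the same for all backends."""

    def __init__(self):
        self.backends = []
        self.objects = {}

    def register(self, backend_cls):
        # Most recently registered backends are consulted first.
        self.backends.insert(0, backend_cls)

    def put(self, obj, name):
        for backend_cls in self.backends:
            if backend_cls.supports(obj):
                self.objects[name] = (backend_cls, backend_cls().dumps(obj))
                return name
        raise TypeError(f"no backend supports {type(obj).__name__}")

    def get(self, name):
        backend_cls, blob = self.objects[name]
        return backend_cls().loads(blob)


store = Store()
store.register(PickleBackend)
store.register(TextBackend)  # the "extension"
store.put({"a": 1}, "mydict")
store.put("hello", "mytext")
print(store.get("mydict"), store.get("mytext"))
```

Callers only ever use `put` and `get`; supporting a new framework or data source means registering one more backend class, which is why such a design can keep its public API stable.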
omega|ml architecture

Leveraging MongoDB as its storage layer, omega|ml scales horizontally to datasets of any size, distributed across any number of storage/compute nodes. It has neither the memory requirements nor the data-loading latency of an all-in-memory stack (e.g. Spark or Dask), effectively combining MongoDB’s high-performance hybrid architecture for in-memory processing and distributed storage.
Thanks to its integrated, pure-Python RabbitMQ/Celery compute cluster, it offers Python-native serverless functions, and it can also leverage other compute clusters such as Apache Spark or Dask Distributed.
Getting started with omega|ml

To get started, run omega|ml straight from Docker (this is the open source community edition):

    $ wget https://raw.[…]yml
    $ docker-compose up -d

Next, open your browser at http://localhost:8899 to open Jupyter Notebook.
Any notebook you create will automatically be stored within the omega|ml database, thus making it easy to work with colleagues.
The REST API is available at http://localhost:5000.
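With the local REST API running, the dataset endpoints shown earlier in the article could be addressed as in the following sketch, which only constructs the requests and does not send them. The helper names, the JSON body shape and the absence of authentication are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # REST API address from the text


def dataset_put_request(name, data):
    """Build a PUT request that would ingest `data` into dataset `name`."""
    url = f"{BASE_URL}/v1/dataset/{name}"
    return Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def dataset_query_request(name, **filters):
    """Build a GET request that would query dataset `name` by column value."""
    url = f"{BASE_URL}/v1/dataset/{name}"
    query = urlencode(filters)
    if query:
        url = f"{url}?{query}"
    return Request(url, method="GET")


# Column name and values are made up for illustration.
put_request = dataset_put_request("mydata", {"column": ["value", "other"]})
query_request = dataset_query_request("mydata", column="value")
print(put_request.get_method(), put_request.full_url)
print(query_request.get_method(), query_request.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the call against the running container.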
You can also use omega|ml as an add-on package to your existing Python distribution (e.g. […]). In this case you will also have to run MongoDB and RabbitMQ.

    pip install omegaml

Leveraging MongoDB’s high-performance aggregation, omega|ml scales horizontally to datasets of any size. Yet it has neither the memory requirements nor the data-loading latency of an all-in-memory stack such as Apache Spark or Dask.

Learn more

omega|ml (Apache License) is built on top of widely used Python packages like scikit-learn, Pandas and PyMongo.
Extensions for other machine learning frameworks such as TensorFlow are easy to achieve through a well-defined API.
omega|ml is provided as a ready-to-deploy Docker image for docker-compose and as software-as-a-service at https://omegaml.io (currently in beta).
An on-premise edition for deployment to Kubernetes, on private or public clouds, is available under a commercial license.
There is a getting started guide and a tutorial notebook to get you up and running.
About the author

Patrick Senti is a freelance senior Data Scientist and Fullstack Software Engineer with almost three decades’ worth of professional experience.
He originally built the core of omega|ml as the internal data science platform for his smart city & next-gen mobility startup launched in 2014, where the challenge was to collaborate on large out-of-core datasets between a distributed team of data scientists, and to deploy many hundreds of machine learning models for operation in the cloud and integration into a smartphone travel app.