Data Science for Startups: Containers

Building reproducible setups for machine learning

Ben Weber, May 1

One of the skills that is becoming more in demand for data scientists is the ability to reproduce analyses.

Having code and scripts that only work on your machine is no longer sustainable.

You need to be able to share your work and have other teams be able to repeat your results.

Some of the biggest impacts that I’ve seen in data science organizations come when other teams repurpose old code for new use cases.

This blog post is about encouraging the reuse of analyses via containers, which means that your work can transfer.

The idea of a container is that it is an isolated environment in which you can set up the dependencies that you need in order to perform a task.

The task can be performing ETL work, monitoring data quality, standing up APIs, or hosting interactive web applications.

The goal of a container framework is to provide isolation between instances with a lightweight footprint.

Containers are an alternative to virtual machines, which are a great solution to isolation, but require substantial overhead.

With a container framework, you specify the dependencies that your code needs, and let the framework handle the legwork of managing different execution environments.

Docker is the de facto standard for containers, and there is substantial tooling built around Docker.

In general, using Docker is going to take more work for a data scientist, versus a local deployment.

However, there are several benefits to Docker:

Reproducible Research: If you can deliver your analysis as a container, then other data scientists can rerun your work.

Explicit Dependencies: In order to set up your script as a container, you need to understand the dependencies of your code and any additional libraries that may be needed, and their versions.

Improved Engineering Collaboration: If you want to scale up a model you’ve built, providing a Dockerfile to your engineering team is going to go much further than handing off an R or Python script. It also calls out the dependencies that the code needs in order to execute.

Broader Skill Set: Being able to stand up infrastructure as code is a valuable skill set, and using Docker containers can help data scientists to start developing this skill set.
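As a small illustration of the "explicit dependencies" point, the sketch below records the versions of the libraries a script relies on so that they can be pinned when the container is built. The function name and the package list are assumptions made for this example; they do not come from the original repo.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report missing packages instead of guessing a version.
            lines.append(f"# {name} is not installed")
    return lines


# Hypothetical usage: the libraries that this post installs later on.
for line in pinned_requirements(["flask", "tensorflow", "keras", "pandas"]):
    print(line)
```

The printed lines can be saved to a requirements file and installed inside the container, so that a rebuild months later resolves the same versions.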

The ideal state for an effective data science organization is that any member of the team can reproduce prior research.

As a former academic, I’d like to take this recommendation further and encourage all submissions to arXiv to include reproducible environments.

It’d be great to establish a standard framework for research, and as a proof of concept I ported one of my prior research papers to a container environment: "Reproducible Research: StarCraft Mining" (towardsdatascience.com).

There are many ecosystems that have been built around container environments, such as Kubernetes and Elastic Container Service (ECS).

Instead of focusing on scale, as provided by these environments, we’ll focus on taking an existing script and wrapping it in a container.

The script we’ll wrap comes from my prior post, "Deploying Keras Deep Learning Models with Flask", which demonstrates how to set up an endpoint to serve predictions using a deep learning model built with Keras.

All of the code that is used in this post is available on GitHub.

When working with Docker, I encourage hosting all files in source control in order to ensure that your container can deploy to new environments.

In this post, I’ll cover Docker installation, wrapping a simple web app in Docker, and then hosting a deep learning model as a Docker container.

Installing Docker

The first step in using Docker is to set up Docker on a machine where you want to build and test images.

For this post, I spun up an EC2 instance with the new AWS AMI (Setup Instructions).

You can install and verify a Docker installation with the commands shown below:

    # python 3
    sudo yum install -y python3-pip python3 python3-setuptools

    # docker install
    sudo yum update -y
    sudo amazon-linux-extras install docker
    sudo service docker start

    # test docker setup
    sudo docker ps

For AWS, additional details are available here if you’re using a different instance type.

For all other environments, see the Docker installation instructions.

After running these steps, you can check which containers are running by running the following command: sudo docker ps.

[Figure: An empty Docker install]

While there are not any active containers yet, this output indicates that Docker is up and running on your instance.
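The same check can be scripted, which is handy when provisioning several machines. The sketch below is a hedged example: the function name and the three status labels are made up for illustration, and it calls docker without sudo, so a permissions problem is reported the same way as a stopped daemon.

```python
import shutil
import subprocess


def docker_status():
    """Return 'missing', 'unreachable', or 'running' for the local Docker setup."""
    if shutil.which("docker") is None:
        return "missing"
    try:
        # `docker ps` exits non-zero when the daemon is down or the
        # current user is not allowed to talk to it.
        result = subprocess.run(
            ["docker", "ps"], capture_output=True, text=True, timeout=30
        )
    except (subprocess.TimeoutExpired, OSError):
        return "unreachable"
    return "running" if result.returncode == 0 else "unreachable"


print(docker_status())
```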

We’re now ready to start hosting web apps and Python scripts as Docker containers!

An Echo Service

One of the most common tools for standing up web services in Python is Flask.

To start, we’ll stand up a simple echo web service, in which a passed-in message is returned to the caller.

This is a relatively simple environment.

We need to install Python 3, which we already did when installing Docker, and then install Flask as shown below:

    pip3 install --user Flask

Now we can write a Flask app to implement this echo service, where a param passed to the service is echoed back to the caller.

The app, saved as echo.py, is a simple web app that returns a payload with the msg param echoed in the web response.
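The listing for the echo app is not reproduced in this text, so what follows is a minimal sketch of what such a service could look like rather than the original echo.py. It assumes the /predict route, the msg parameter, and the response fields seen in the web call in this post; the function name and the JSON-body fallback are illustrative choices.

```python
import flask

app = flask.Flask(__name__)


@app.route("/predict", methods=["GET", "POST"])
def predict():
    """Echo the msg parameter back to the caller as JSON."""
    data = {"success": False}
    # Accept either a JSON body or query-string parameters.
    params = flask.request.get_json(silent=True)
    if not isinstance(params, dict):
        params = flask.request.args
    msg = params.get("msg")
    if msg is not None:
        data["response"] = msg
        data["success"] = True
    return flask.jsonify(data)


# To serve the app when the script is run directly, end the file with
# app.run(host="0.0.0.0"); Flask listens on port 5000 by default.
```

Binding to 0.0.0.0 rather than the default localhost address is what makes the service reachable from outside the instance, and later from outside the container.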

Since we are using Flask, we can deploy the application with a single command:

    python3 echo.py

The result is that we can post messages to the service:

    # web call
    http://ec2-3-88-9-61.[…].com:5000/predict?msg=HelloWorld

    # result
    {"response":"HelloWorld","success":true}

The majority of the work we’ve done so far is around setting up AWS to allow incoming connections, and installing Python 3 on an EC2 instance.
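The web call above can also be issued from Python, which is useful for scripted checks. This is a sketch using only the standard library; the helper names are invented for the example, and the localhost address is a placeholder for the public DNS name of your own instance.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, **params):
    """Attach query parameters to the /predict endpoint."""
    return f"{base_url}/predict?{urlencode(params)}"


def call_service(base_url, **params):
    """Send a GET request and decode the JSON payload."""
    with urlopen(build_url(base_url, **params), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical host; substitute the address of the machine running the service.
print(build_url("http://localhost:5000", msg="HelloWorld"))
```

With the service running, call_service("http://localhost:5000", msg="HelloWorld") should return the same payload as the browser request.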

Now we can focus on the containerization of services.

Echo Service as a Container

Since we got the echo service to work on a fresh EC2 instance, we’ve already gone through some of the process of setting up a reproducible environment.

We needed to set up Python 3 and Flask before we could execute our simple service.

With Docker, we need to do the same process, but in an automated way.

To specify how to construct an environment with Docker, you need to create a Dockerfile object in your project, which enumerates the details on setting up your environment.

A Dockerfile that reproduces the echo service app is shown below, and is on GitHub:

    FROM ubuntu:latest
    MAINTAINER Ben Weber

    RUN apt-get update \
      && apt-get install -y python3-pip python3-dev \
      && cd /usr/local/bin \
      && ln -s /usr/bin/python3 python \
      && pip3 install flask

    COPY echo.py echo.py

    ENTRYPOINT ["python3","echo.py"]

This Dockerfile provides a few entries:

FROM: Lists a base container to build upon.

RUN: Specifies commands to run when building the container.

COPY: Tells Docker to copy files from the EC2 instance to the container.

ENTRYPOINT: Specifies the script to run when the container is instantiated.

We’ll start with an Ubuntu environment, set up Python 3, copy our script into the container, and then launch the script when instantiating the container.

I tested out this container using the following script:

    # install git
    sudo yum -y install git

    # Clone the repo and build the docker image
    git clone https://github.com/bgweber/StartupDataScience
    cd StartupDataScience/containers/echo/
    sudo docker image build -t "echo_service" .

    # list the docker images
    sudo docker images

I installed git on the EC2 instance, cloned the code from my repo to the local machine, and then built the container.

Running the images command resulted in the following command line output:

[Figure: Docker images]

We now have a container that we can run!

To run it, we need to specify the image name and a port mapping that identifies the container port (5000) and the external port (80):

    sudo docker run -d -p 80:5000 echo_service
    sudo docker ps

More details on exposing EC2 ports are available here.
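The order of the two ports in the mapping is easy to get backwards, so here is a short sketch that assembles the same command from named arguments. The helper is hypothetical and only builds the argument list; it does not run Docker.

```python
import shlex


def docker_run_command(image, host_port, container_port, detached=True):
    """Build the argument list for `docker run` with one port mapping."""
    command = ["sudo", "docker", "run"]
    if detached:
        command.append("-d")  # run the container in the background
    # -p takes HOST:CONTAINER, so 80:5000 forwards host port 80 to Flask's 5000.
    command += ["-p", f"{host_port}:{container_port}", image]
    return command


print(shlex.join(docker_run_command("echo_service", 80, 5000)))
```

The resulting list can be passed to subprocess.run if you want to launch the container from a deployment script.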

When I ran the commands above, the output indicated that the echo service was now running as a container and exposed as an endpoint to the web.

The result is exactly the same as before, but instead of the port being exposed as a Flask app, the port is exposed as a mapped port to a Docker instance running a Flask app.

    # web call
    http://ec2-18-204-206-75.[…].com/predict?msg=Hi_from_docker

    # result
    {"response":"Hi_from_docker","success":true}

Functionally, the API call is similar between the initial and dockerized setups.

The key difference is that the dockerized setup uses container-scoped Python libraries, while the direct Flask setup relies on the system-scoped Python libraries.

It’s trivial to stand up this service on a new instance with the container approach, but may be non-trivial to reproduce on a new machine if not using Docker.

Hosting a Complex Model

The power of Docker is more apparent when using complicated libraries.

In this section, we’ll train a Keras model locally, and then deploy it as a container.

To train the model locally, we need to install a few libraries:

    # Deep Learning setup
    pip3 install --user tensorflow
    pip3 install --user keras
    pip3 install --user pandas

Next, we’ll train the model by running a Python script locally.

The output of this script is an h5 model that we want to host as an endpoint.

More details about the training code are available here.

Since we’ve installed the necessary libraries on our host EC2 instance, we can build the model file with the following command:

    python3 train_model.py

The result is a games.h5 model that we want to include in our container for predictions.

While we could wrap this step into our Docker setup, it’s easier to separate these steps when first setting up a Docker workflow.

Now that we have a model specification, we can host a deep learning model as a Flask app, managed as a Docker container.

The Flask app that serves this model, keras_app.py, is unmodified from my prior post on hosting deep learning models with Flask.

The next step is to specify a Dockerfile that takes in the code and model when building a container.

The Dockerfile below shows that we’ve added a few more libraries, and also copied a model file from the local machine to the Docker image, which means that it can be used when serving predictions:

    FROM ubuntu:latest
    MAINTAINER Ben Weber

    RUN apt-get update \
      && apt-get install -y python3-pip python3-dev \
      && cd /usr/local/bin \
      && ln -s /usr/bin/python3 python \
      && pip3 install tensorflow \
      && pip3 install keras \
      && pip3 install pandas \
      && pip3 install flask

    COPY games.h5 games.h5
    COPY keras_app.py keras_app.py

    ENTRYPOINT ["python3","keras_app.py"]

The command line instructions below show how to turn this Dockerfile into a container that we can use to host a deep learning model:

    # Clone the repo and build the docker image
    git clone https://github.com/bgweber/StartupDataScience
    cd StartupDataScience/containers/model/
    sudo docker image build -t "model_service" .

    # Expose a model endpoint
    sudo docker run -d -p 80:5000 model_service

The result of running this container is that we now have a deep learning model exposed as a Flask endpoint that we can pass parameters to in order to get a prediction.

The code block below shows how I tested this interface in order to get a prediction result.

    # web call
    http://ec2-18-204-206-75.[…].com/predict?g1=1&g2=0&g3=0&g4=0&g5=0&g6=0&g7=0&g8=0&g9=0&g10=0

    # result
    {"prediction":"2.160104e-16","success":true}

The result of all of this work was that we wrapped a Keras model in a Docker container, but maintained the Flask interface to expose the model as an endpoint on the web.

The key difference from my initial post on Flask is that the model is now defined within a container-scoped environment, rather than an EC2-scoped environment, and it’s trivial to set up this model on a new machine.
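The web call to the model endpoint passes ten parameters, g1 through g10. The serving code itself is not reproduced here, so the sketch below covers only the request-parsing step that any such endpoint needs: turning those parameters into one numeric row. The names are assumptions for illustration and are not taken from keras_app.py.

```python
FEATURE_NAMES = [f"g{i}" for i in range(1, 11)]  # g1 .. g10, as in the web call


def features_from_params(params):
    """Convert request parameters into one row of numeric features.

    Raises ValueError when a feature is missing or is not numeric.
    """
    row = []
    for name in FEATURE_NAMES:
        if name not in params:
            raise ValueError(f"missing feature: {name}")
        row.append(float(params[name]))
    return row


# Mirror the example request: g1=1 and every other feature set to 0.
params = {name: "0" for name in FEATURE_NAMES}
params["g1"] = "1"
print(features_from_params(params))
# A loaded Keras model could then score the row, for example with
# model.predict on a two-dimensional array that holds this single row.
```

Validating inputs before they reach the model keeps malformed requests from surfacing as opaque errors inside the container.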

In addition to designing models that can work in containers, eagerly targeting Docker and cloud tooling means that data science projects are easier to share and use across an organization.

Conclusion

Data scientists should be able to author model and data workflows that extend beyond their local workspaces.

Container environments such as Docker are one way of achieving this goal, and becoming familiar with these types of tools helps build your portfolio with skills such as specifying infrastructure as code.

This post showed how to stand up a Keras model as a web endpoint using Docker, but was only a glimpse into the capabilities of reproducible research enabled by these tools.

Ben Weber is a distinguished data scientist at Zynga and advisor at Mischief.

