Machine Learning with Big Data

If we run Dask on our laptop, it allows us to distribute our code to multiple cores at once, but it does not help us run the code on multiple systems at the same time.

We have run it locally.

Ideally, we want to run on a cloud provisioned cluster, and we’d like this cluster to be self-repairing — that is, we’d like our code to respond to failures and expand onto more machines if we need them.

We need a cluster manager.

Kubernetes is a cluster manager.

We can think of it like being an operating system for the cluster.

It provides service discovery, scaling, load-balancing, and is self-healing.

Kubernetes think of applications as stateless, and movable from one machine to another to enable better resource utilization.

There is a controlling master node on which the cluster operating system runs, and worker nodes which perform the bulk of the work.

If a node (computer associated with the cluster) loses connection or breaks, the master node will assign the work to someone new, just like your boss would if you stopped working.

The master and worker nodes consist of several pieces of software which allow it to perform its task.

It gets pretty complicated so I will quickly give a high-level overview.

Master Node:API server, communication between master node and user (using kubectl)Scheduler, assigns a worker node to each applicationController Manager, performs cluster level functions, such as replicating components, keeping track of worker nodes, handling node failuresetcd, a reliable distributed data store that persistently stores the cluster configuration (which worker node is doing what at a given time).

Worker Node:Docker, to run your containersPackage your app's components into 1 or more docker images, and push them to a registryKubelet, which talks to the API server and manages containers on its nodekube-proxy, which load-balances network traffic between application componentsThe configuration of a Kubernetes cluster.

Doing all of this is great, but it isn’t particularly helpful unless we have 100 computers at our disposal to make use of the power that Kubernetes and Dask afford us.

Enter the cloud.

Dask Cloud DeploymentIf you want to run Dask to speed up your machine learning code in Python, Kubernetes is the recommended cluster manager.

This can be done on your local machine using Minikube or on any of the 3 major cloud providers, Microsoft Azure, Google Compute Cloud, or Amazon Web Services.

You are probably familiar with cloud computing since it is pretty much everywhere these days.

It is now very common for companies to have all of their computing infrastructure on the cloud, since this reduces their capital expenditure on computing equipment and moves it to operational expenditure, requires less maintenance and also significantly reduces the running cost.

Unless you are working with classified information or have very strict regulatory requirements, you can probably get away with running things on the cloud.

Using the cloud allows you to leverage the collective performance of several machines to perform the same task.

For example, if you are performing hyperparameter optimization on a neural network and it will need to rerun the model 10,000 times to get the best parameter selection (a fairly common problem) then it would be nonsensical to run it on one computer if it will take 2 weeks.

If you can run this same model on 100 computers you will likely finish the task in a few hours.

I hope I have made a good case for why you should make use of the cloud, but be aware that it can get quite expensive if you use very powerful machines (especially if you do not turn them off after using them!)To set up the environment on the cloud, you must do the following:Set up a Kubernetes clusterSet up Helm (a package manager for Kubernetes, it is like a Homebrew for Kubernetes cluster)Install Dask.

First run the followinghelm repo updateand thenhelm install stable/daskSee https://docs.



html for all the details.

Deep Learning on the CloudThere are several useful tools which are available for building deep learning algorithms with Kubernetes and Dask.

For example, TensorFlow can be put on the cloud using tf.

distributed of kubeflow.

The parallelism can be trivially used during grid optimization since different models can be run on each worker node.

Examples can be found on the GitHub repository here.

What do you use?For my own research (I am an environmental scientist) and in my consulting work (machine learning consultant) I regularly use either JupyterHub, a Kubernetes cluster with Dask on Harvard’s supercomputer Odyssey, or I will run the same infrastructure on AWS (no real prejudice against Azure or the Google Cloud, I was just taught how to use AWS first).

Example Cloud Deployment on AWSIn this section, I will run through the setup of a Kubernetes Cluster running Dask on AWS.

The first thing you need to do is set up an account on AWS, you will not be able to run the following lines of code unless you already have an account.

First, we download the AWS command line interface and configure it with our private key supplied by AWS.

We then install Amazon’s Elastic Container Service (EKS) for Kubernetes using the brew commands.

pip install awscliaws configurebrew tap weaveworks/tapbrew install weaveworks/tap/eksctl Creating a Kubernetes cluster is now ludicrously simple, we only need to run one command, but you should specify the cluster name, the number of nodes, and the region you are in (in this case I am in Boston so I choose us-east-1 ) and then run the command.

eksctl create cluster –name=cluster-1 –nodes=4 –region=us-east-1Now we must configure the cluster with the following commands:kubectl get nodeskubectl –namespace kube-system create sa tillerkubectl create clusterrolebinding tiller –clusterrole cluster-admin –serviceaccount=kube-system:tillerNow we set up Helm and Dask on the clusterhelm init –service-account tillerWait two minutes for this to complete and then we can install Dask.

helm versionhelm repo updatehelm install stable/daskhelm status agile-newthelm listhelm upgrade agile-newt stable/dask -f config.

yamlhelm status agile-newtA few more Kubernetes commands.

kubectl get podskubectl get servicesFor more details and a shell, you will need a command like this.

Your exact pod names will be different.

kubectl get pod agile-newt-dask-jupyter-54f86bfdd7-jdb5pkubectl exec -it agile-newt-dask-jupyter-54f86bfdd7-jdb5p — /bin/bashOnce you are in the cluster, you can clone the GitHub repository and watch Dask go!Kaggle Rossman CompetitionI recommend that once you have got the Dask cloud deployment up and running you try running the rossman_kaggle.

ipynb .

This is example code from the Kaggle Rossman competition, which allowed users to use any data they wanted to try and predict pharmacy sales in Europe.

The competition was run in 2015.

The steps in this notebook run you through how to set up your coding environment for a multilayer perceptron in order to apply it to a parallel cluster and then perform hyperparameter optimization.

All of the steps in this code are split into functions which are then run in an sklearn pipeline (this is the recommended way to run large machine learning programs).

There are several other examples on the repository that you can run on the parallel cluster and play with.

Also, feel free to clone the repository and tinker with it as much as you like.

Where can I learn more?To learn more about Dask, check out the following links:dask/dask-tutorialDask tutorial.

Contribute to dask/dask-tutorial development by creating an account on GitHub.


comDask – Dask 1.


4 documentationInternally, Dask encodes algorithms in a simple format involving Python dicts, tuples, and functions.

This graph format…docs.


orgTo learn more about Dask with Kubernetes:dask/dask-kubernetesNative Kubernetes integration for dask.

Contribute to dask/dask-kubernetes development by creating an account on…github.

comDask Kubernetes – Dask Kubernetes 0.


0 documentationCurrently, it is designed to be run from a pod on a Kubernetes cluster that has permissions to launch other pods…kubernetes.


orgTo learn more about Helm:Kubernetes and Helm – Dask 1.


4 documentationIf this is not the case, then you might consider setting up a Kubernetes cluster on one of the common cloud providers…docs.


orgIf you are struggling to work through any of the above steps, there are multiple other walkthroughs that go through the specifics in more detail:Adding Dask and Jupyter to a Kubernetes ClusterIn this post, we're going to set up Dask and Jupyter on a Kubernetes cluster running on AWS.

If you don't have a…ramhiser.

comSetup private dask clusters in Kubernetes alongside JupyterHub on Jetstream | Andrea Zonca's blogIn this post we will leverage software made available by the Pangeo community to allow each user of a Jupyterhub…zonca.


ioFor setting up the cluster on the Google Cloud (sadly could not find one for Microsoft Azure) check these links out:ogrisel/docker-distributedExperimental docker-compose setup to bootstrap distributed on a docker-swarm cluster.

– ogrisel/docker-distributedgithub.

comhammerlab/dask-distributed-on-kubernetesDeploy dask-distributed on google container engine using kubernetes – hammerlab/dask-distributed-on-kubernetesgithub.

comNow you should have a working parallel cluster on which to perform machine learning on big data or for big compute tasks!Thanks for reading!.????.

. More details

Leave a Reply