Seamlessly Integrated Deep Learning Environment with Terraform, Google Cloud, GitLab and Docker

Alexander Mueller, Jan 21

When you start with some serious deep learning projects, you usually have the problem that you need a proper GPU.
Buying reasonable workstations which are suitable for deep learning workloads can easily become very expensive.
Luckily there are some options in the cloud.
One that I tried out was using the wonderful Google Compute Engine.
GPUs are available in the GCE as external accelerators of an instance.
Currently, these GPUs are available (prices for us-central1, per GPU per month):

- NVIDIA® Tesla® P4: $306.60
- NVIDIA® Tesla® V100: $1,267.28
- NVIDIA® Tesla® P100: $746.06
- NVIDIA® Tesla® K80: $229.95

Manual configuration usually isn't something you can scale up easily, so I investigated whether there are methods to ramp up my environment as seamlessly as possible and destroy it the same way.
Consequently, I found a solution that uses Terraform to set up the infrastructure on the Google Cloud Platform. The source code is deployed from Git, and a Docker container with all the necessary dependencies — such as TensorFlow, Keras and Jupyter — installed is started automatically.
In this blog post I will guide you through the individual steps on how you can set up your environment easily.
Some of the work is based on this git repository: https://github.com/Cheukting/GCP-GPU-Jupyter

What will you learn in this blog post?

- Setting up a GCE instance with GPUs in an automated way
- How to use Terraform with GCP
- How to deploy the code of a GitLab repository to a GCE instance

What we will build in this blog post.
As shown in the chart above, I will show you how to write a Terraform script that automatically spins up a Google Compute Engine virtual machine, installs CUDA, Docker, etc. on it, and finally starts a Docker container with the code from an external Git repository (in our case from GitLab).
This Docker container is running the jupyter notebook server which can be accessed from the outside via your browser.
In addition, I will show you how you can run longer tasks outside of a notebook inside your VM using Docker.
TL;DR: If you just want to try it, check out the TL;DR section at the end with all the necessary commands listed.
Create a GitLab repository for your Python code

If you already have a repository that you want to use, feel free to use it. Otherwise, it is now time to create a new one.
All your ml and data exploration code will go into this repository.
We will keep it separate from the infrastructure code.
Again, in this GitLab repository you could now add all your code and, for instance, create a simple train.py Python file that trains a neural network and saves the trained weights at the end. A very lean example of what is possible can be found here: https://gitlab.com/dice89/deep-learning-experiments

This repo simply contains a Jupyter notebook for some data exploration and a train.py to train an RNN LSTM text generation model.
Creating your own Docker image / using an already existing image

You can either take an existing Docker image or create your own. For the sake of simplicity, we will use a pre-built Docker image with a Python 3.5 environment and all the necessary libraries for a reasonable deep learning use case installed. It should contain everything you need for a first try: Python 3.x, tensorflow-gpu, numpy, pandas, sklearn, keras. If you're interested, you can check out the image here: https://hub.docker.com/r/dice89/ubuntu-gpu-python-dl/

Configuring and starting your instance with Terraform

Terraform is an infrastructure-as-code toolkit that allows you to define, create and destroy infrastructure by writing code.
This comes with a lot of advantages: for instance, you don't need to configure anything by means of a UI like the GCP console.
Furthermore, your entire infrastructure configuration is documented by default since it is readable code that is ideally versioned in a git repository.
I just discovered Terraform a year ago and already cannot imagine setting up infrastructure without it.
HashiCorp found a wonderful combination of codable and versionable infrastructure that can be understood with a coder's mindset.
You only need some terraform files and the terraform CLI to create your infrastructure.
Adding a new Google Compute Engine instance is as easy as this:

Example of a Google Compute Engine instance creation with Terraform.
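As a rough sketch of what such a configuration looks like — the resource layout below is an illustration, with an assumed machine type, image and network; the referenced repository remains the authoritative version:

```hcl
# Illustrative Terraform sketch of a GPU-backed GCE instance.
# Machine type, image and network here are assumptions, not the original gist.
provider "google" {
  credentials = file("credentials.json")
  project     = var.project_id
  region      = var.region
}

resource "google_compute_instance" "gpu-vm" {
  name         = "gpu-vm"
  machine_type = "n1-standard-4"
  zone         = "europe-west1-d"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-1604-lts"
    }
  }

  # Attach one Tesla K80 GPU to the instance
  guest_accelerator {
    type  = "nvidia-tesla-k80"
    count = 1
  }

  # GPU instances cannot live-migrate; they must terminate on maintenance
  scheduling {
    on_host_maintenance = "TERMINATE"
  }

  network_interface {
    network = "default"
    access_config {} # ephemeral external IP
  }

  # Runs start-up-script.sh on first boot (CUDA, Docker, container start)
  metadata_startup_script = file("start-up-script.sh")
}
```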
Before you can create an instance, you need to perform some steps upfront:

- Create a Google Cloud account (https://cloud.google.com/)
- Install the gcloud CLI tools as part of the Google Cloud SDK
- Log into gcloud from your terminal: `gcloud auth login`

In the following snippet, you will create a gcloud project, set the execution context to your current account, create a service account that is able to create new instances, and finally download the private key to use this service account in Terraform. Execute these commands in your CLI and replace <your> with the project name you desire. It can take a bit of time, mainly due to the activation of the Compute API.
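In outline — using the same project id and service account name as the TL;DR at the end of this post — the commands look like this:

```shell
# Create a new project and point gcloud at it
gcloud projects create <your>-dl --enable-cloud-apis
gcloud config set project <your>-dl

# Enable the Compute Engine API (this is the slow part)
gcloud services enable compute.googleapis.com

# Create a service account for Terraform and grant it access to the project
gcloud iam service-accounts create gcp-terraform-dl --display-name gcp-terraform-dl
gcloud projects add-iam-policy-binding <your>-dl \
  --member='serviceAccount:gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com' \
  --role='roles/owner'

# Download the service account key that Terraform will use
gcloud iam service-accounts keys create 'credentials.json' \
  --iam-account='gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com'
```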
Before we start the instance, let us look at the details of how to configure a Google Compute Engine instance with a GPU enabled. Unfortunately, you have to request a quota for the GPU (it worked for me without one until Dec 2018). Please follow the advice in this Stack Overflow article. The handling of this request can take up to 2 business days.
Google Compute Engine instance with a GPU

As you can see in line 19, we add a Tesla K80 GPU to this instance, and on start-up we perform some actions in a script (start-up-script.sh). This is shown below:

Setting up the Ubuntu VM to run with CUDA

In this script we install all the required libraries, add an SSH key for the user and run our Docker container, which exposes port 80 to the outside world. Therefore, we can reach the Jupyter notebook server.
Please notice that anyone who knows the IP could access your notebooks after creating this instance.
This should be changed even for a short-lived environment like this one.
Now create a new SSH key to be able to deploy our code from GitLab to the instance:

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

In order to make it work, you have to replace the placeholder "ADD YOUR SSH KEY HERE" with your generated private (!!!) SSH key in the start-up-script.sh.
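The key generation can also be done non-interactively; a small sketch (the file name `deploy_key` and the empty passphrase are illustrative choices, not from the original setup):

```shell
# Generate a 4096-bit RSA key pair without a passphrase
# (file name and comment are illustrative)
ssh-keygen -t rsa -b 4096 -N "" -C "your_email@example.com" -f ./deploy_key

# The private key (deploy_key) replaces the placeholder in start-up-script.sh;
# the public key (deploy_key.pub) goes into your GitLab account
cat ./deploy_key.pub
```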
Please note: do not share your key with anyone! Clone the full configuration from this repository and then change your SSH key: https://gitlab. (do not commit your private key to any git repository). Also, make sure that your credentials.json is in the root of this folder (don't commit the credentials.json to any git repository either). You also have to add this SSH key to your GitLab account in order to make the code from your GitLab repository deployable.
Now we're ready to create the machine, and this is possible with only 3 bash commands! Fill in your GCP project id, type yes, and your instances will be created (also consider the costs that this will cause; GPUs are not covered by the free tier).
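In outline, the three commands are (using the same variables as later in this post):

```shell
# Initialize Terraform (downloads the Google provider)
terraform init

# Create the infrastructure; type "yes" when prompted
terraform apply -var 'project_id=<your>-dl' -var 'region=europe-west1-d'

# Look up the public IP of the new instance
terraform show | grep assigned_nat_ip
```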
After the instance is created, you will see an IP address posted to your command line. This is the public IP under which your Jupyter instance will be available. It might take a couple of minutes until the start-up-script.sh is finished and everything is installed.
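While you wait, a small helper like the following can poll the server until Jupyter answers (the function name and the 2-second interval are illustrative, not part of the original setup):

```shell
# wait_for_http polls a URL until it responds or we give up
wait_for_http() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s -o /dev/null "$url"; then
      echo "server at $url is up"
      return 0
    fi
    sleep 2
    i=$((i + 1))
  done
  echo "gave up waiting for $url"
  return 1
}

# Illustrative usage, once terraform has printed your instance IP:
# wait_for_http "http://<your-instance-ip>"
```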
Let us take the time until the script is done to explore the instance a bit.
In order to do so, you have to ssh into it.
Luckily, Google provides a command for us:

gcloud compute --project "<your>-dl" ssh --zone "europe-west1-d" "gpu-vm"

The start-up-script.sh runs as the root user; therefore, you have to switch to a root shell to see what is happening:

sudo su
cd /var/log
tail -f syslog | grep startup-script

Now we're on the instance and can check, e.g., if a GPU is installed and already in use:

nvidia-smi -l 1

We could also install htop, since it comes in handy for monitoring the memory consumption of your running processes:

sudo apt-get install htop

After a while you can check whether a Docker container is already running:

docker ps

If you see your Docker container in this overview, you are ready to log into your Jupyter notebook under the IP shown. Also, if you go to the path ~/datascience/deep-learning-experiments you will see that it is automatically mounted into your Docker container under /root/project and contains the contents of your GitLab repository, like the train.py.
Running a training task outside of Jupyter

Jupyter is nice for some data exploration or experimental code.
However, training deep learning models takes a lot of time, and you cannot afford for your Jupyter session to crash and lose all the training progress.
Luckily, there is a remedy for this.
You can very easily train a new model by running a Docker container as a daemon that executes a Python script to train your model.
All you need to do, in our example, is type:

docker run --runtime=nvidia -d -v ~/datascience:/root/project dice89/ubuntu-gpu-python-dl python3 /root/project/deep-learning-experiments/train.py

If you now check docker ps, you will see something like this:

To see the logs of your training task, simply type:

docker logs <your_container_id_from_docker_ps>

Finally, you're training your model with code from a Git repository in a reproducible fashion.
When you're done and have saved and stored your weights, you can destroy the environment by simply typing:

terraform destroy -var 'project_id=<your>-dl' -var 'region=europe-west1-d'

If you need the same environment again, simply type:

terraform apply -var 'project_id=<your>-dl' -var 'region=europe-west1-d'

So, this is it for this little walkthrough on how to create an environment for deep learning using cloud resources.
Have fun trying it out and send me some feedback if you have any suggestions to improve the environment.
TL;DR

Here are the instructions to create the environment in a nutshell, with a predefined Docker container. Exchange <your> with some prefix you like.

1. Create a gcloud account.

2. Install the gcloud CLI (https://cloud.google.com/sdk/):

curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

3. Create the gcloud project (replace <your>):

gcloud projects create <your>-dl --enable-cloud-apis
gcloud config set project <your>-dl
gcloud services enable compute.googleapis.com

4. Install Terraform: https://www.terraform.io

5. brew install terraform

6. (Optional, if you want to deploy some code to it) Fork and git clone the deep learning experiments:

https://gitlab.com/dice89/deep-learning-experiments/forks/new
git clone git@gitlab.com:<your_user>/deep-learning-experiments.git

7. Git clone the code that defines the Google Compute Engine VM with a GPU:

git clone git@gitlab.com:<path-to-the-infrastructure-repository>.git

8. Create an SSH key, add the private key to the `start_up_script.sh` and add the public key to your GitLab account:

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

9. Create a GCP service account and get the credentials.json:

gcloud iam service-accounts create gcp-terraform-dl --display-name gcp-terraform-dl
gcloud projects add-iam-policy-binding <your>-dl --member='serviceAccount:gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com' --role='roles/owner'
gcloud iam service-accounts keys create 'credentials.json' --iam-account='gcp-terraform-dl@<your>-dl.iam.gserviceaccount.com'

10. Init your Terraform environment:

terraform init

11. Start the environment:

terraform apply -var 'project_id=<your>-dl' -var 'region=europe-west1-d'

Wait a bit (roughly 5-10 minutes) to see the IP address of your Jupyter notebook server:

terraform show | grep assigned_nat_ip

To ssh into your compute instance:

gcloud compute --project "<your>-dl" ssh --zone "europe-west1-d" "gpu-vm"

12. Destroy the environment:

terraform destroy -var 'project_id=<your>-dl' -var 'region=europe-west1-d'

If you find any problems with the tutorial, please report them to me! I'm very keen to keep it up to date.