And maybe even “What is Docker?” This article is intended to help other Data Scientists who are venturing into the Google Cloud Platform (“GCP”) stack, or possibly revisiting it while starting a new Kaggle competition.
Follow my #FullStackDS thread on Twitter for similar posts.
DataLab Notebooks are GCP’s product for connecting to a Google Compute Engine (“GCE”) instance (or VM) from a private environment.
It uses Jupyter Notebook’s infrastructure and Docker under the hood, and can link to other Kernels in R/Julia/Python/Bash, etc., if you want to get really fancy.
The tricks to setting up DataLab Notebooks are in configuring firewalls, IAM permissions, and in understanding some unique Linux commands.
I’ve yet to find an article which tackles these intricacies.
The spinning up part is super easy, once you get past this crux.
Let’s walk through a complete reproducible setup to get you going.
Setup Your Project
Before we get going, you’ll want to set up your account in GCP, if you haven’t already.
For first time users, there’s a $300 credit and 12-month free trial.
Otherwise, let’s get started at https://console.
From the “Hamburger Menu” in the upper left corner (it looks like a hamburger), we’ll select HOME from the main menu.
Next, let’s use “CREATE PROJECT” — highlighted below — from the home page.
Creating a project allows users to create multiple VMs in a shared environment.
It also allows for separated billing.
Next, from the New Project page, we’ll name our project “DistributedScraping” and click CREATE.
From the Home screen you should now see the project DistributedScraping in the dropdown.
We’ll also want to open the free Cloud Shell from the upper right part of the page, circled in green below — which isn’t yet connected to any GCE instances.
Note, the black command line Cloud Shell is a free Linux machine in itself, which we’ll use to set up our firewalls, IAM permissions, and create the DataLab Notebook from.
Setup the Firewall for Private Browser Access
Before launching the DataLab GCE instance, we’ll want to set up access to the localhost gateway, which is protected within the GCP project and linked via a secure SSH connection.
By default, DataLab uses both ports 8081 and 22.
The source range is set to “0.0.0.0/0”, which allows for linking to your notebook after each relaunch, since a new External IP is created each time the DataLab Docker instance is started and stopped.
In a more advanced setup, we could assign a fixed External IP and scope the source range to it.
We’ll establish the firewall rules in one command below.
From the Google Cloud Shell, enter the code below from this Github Gist in one long line.
The above code highlights both the Project Name and the firewall rule name.
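The gist itself isn’t reproduced in this extract; a command along these lines (the rule name “datalab-firewall” is my assumption) matches the ports and source range described:

```shell
# Reconstruction of the gist: open DataLab's notebook port and SSH
# to the project from any source address
gcloud compute firewall-rules create datalab-firewall \
    --project=distributedscraping \
    --allow=tcp:8081,tcp:22 \
    --source-ranges=0.0.0.0/0 \
    --description="DataLab Notebook and SSH access"
```

Entered as one long line (or with the backslash continuations shown), this runs directly in the Cloud Shell.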
Next, we’ll need to configure our current project from the open Cloud Shell.
Launch the DataLab Notebook
The list below shows the cost/sizes by machine-type name.
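For the record, the project-configuration step mentioned above is a single command, using the project name we chose earlier:

```shell
# Point the Cloud Shell at the project created earlier
gcloud config set project distributedscraping
```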
We’ll create an 8-core machine with low memory here, at roughly $0.23/hr.
We can get the lower Preemptible price by using another Linux flag, which I won’t go into here.
You can choose your own machine type by visiting this link.
In my next article, we’ll explore advanced scraping from DataLab — including parallel proxy and selenium scraping from Python — so we’ll want a larger instance than the standard 2-core machine.
Run the code below in the Cloud Shell to create your DataLab GCE instance, wrapped in a lightweight Docker container; it specifies an 8 CPU/core machine in the “us-west1-a” region, with the name “iptest”.
After running the above Linux line, we’ll see the below output.
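The creation command isn’t shown in this extract; a sketch consistent with the description (n1-highcpu-8 is my guess at the “8 core, low memory” machine type) is:

```shell
# Create the DataLab instance wrapped in its Docker container;
# machine type is an assumption for "8 core, low memory"
datalab create iptest \
    --machine-type=n1-highcpu-8 \
    --zone=us-west1-a
```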
Visit the enable API link circled in green by clicking on it within the Cloud Shell.
This will launch a new browser tab, and give access to the newly created VM based on your GCP protected environment.
After selecting the ENABLE button, enter the same command again in your Cloud Shell to create the machine.
Not sure why this takes two steps…?
The output should look like the below, which will propagate your SSH keys and set up your GCE DataLab instance.
Next, click the Web Preview button (circled in green below), and change the port from “8080” (default) to “8081” and then select the LAUNCH AND PREVIEW link (steps shown below).
You can also redirect from your browser without using the method above.
Simply paste the below link into your browser or find it in the output from the Cloud Shell text.
I’ve found that I had to try a couple of times for the secure redirect to take hold.
Stay patient! It will work: http://localhost:8081/
Explore Your DataLab Notebook
Select the +NOTEBOOK button (circled below in green) and launch a new “Untitled Notebook”.
We’ll run through a few Hello World examples below, where I’ll show you how to set up read / write access to your DataLab Storage (“Bucket”) — which is automatically created, with automated backups under a folder with the same name as your VM.
Run the below…
From the newly launched tab, we’ll need to select the Kernel dropdown in the upper right, choosing Python3 for the workbook.
Note: the current default is Python2, for some reason(?).
Now, use Shift + Enter on your keyboard from the first cell (code chunk) holding the print('hello world') function.
Voilà, it’s a success: “Hello World!”
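For reference, that first cell is nothing more than the following (the exact string in the original screenshot is an assumption):

```python
# The canonical first cell in a new DataLab notebook
print('hello world')
```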
Let’s set up read/write access to our newly created bucket ‘distributedscraping’, which you can access via Hamburger >> Storage >> Browser extension.
Here, we’ll need to access our ‘GCE email’ for binding as a “service account”.
Run the gcloud command below in your Cloud Shell, search the text output manually for serviceAccounts: email, and use it in the next Linux commands.
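The command in question isn’t reproduced here; one that yields that output (my guess at the gist’s content) is:

```shell
# Dump the instance metadata; look under "serviceAccounts:" for the "email:" line
gcloud compute instances describe iptest --zone=us-west1-a
```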
Now, we’ll bind the service account to the DataLab bucket, which was automatically created.
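A sketch of that binding step follows; the service-account email is a placeholder (substitute the serviceAccounts:email value found above), and objectAdmin is my assumption for the role:

```shell
# Grant the instance's service account read/write access to the auto-created bucket;
# the email below is a placeholder, and the role is an assumption
gsutil iam ch \
    serviceAccount:123456789-compute@developer.gserviceaccount.com:objectAdmin \
    gs://distributedscraping
```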
Now let’s up our game and try an advanced hello world, writing both a CSV and a TXT to our bucket, and reading them back into the DataLab Notebook in Python.
Note: a “!” at the start of a line runs a single-line bash script interspersed with the Python code (pretty cool!).
Enter the below code into your newly created Notebook in the browser.
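The cell’s code isn’t reproduced in this extract, so here is a minimal sketch of such a cell; the bucket path is an assumption, and the gsutil copy step is left as a comment since it only runs inside GCP:

```python
import csv

# Write a small table to CSV and a greeting to TXT
rows = [['name', 'score'], ['alice', '90'], ['bob', '85']]
with open('hello.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
with open('hello.txt', 'w') as f:
    f.write('Hello World!')

# In a DataLab cell, a leading "!" runs a one-line bash command, e.g.:
# !gsutil cp hello.csv hello.txt gs://distributedscraping/iptest/

# Read both files back in
with open('hello.csv', newline='') as f:
    rows_back = list(csv.reader(f))
with open('hello.txt') as f:
    txt = f.read()
print(rows_back[1], txt)
```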
Output:
If you want to extend this exploration, there’s a great YouTube video from Google [Datalab: Notebook in the Cloud (AI Adventures)], which goes into reading/writing/munging with BigQuery, Google’s distributed SQL data-warehouse product.
Note: the iptest buckets are automatically backed up hourly/daily/weekly.
These backups remain even if you delete your DataLab VM.
I’m not sure where we can control this feature; beware, there is a nominal cost for this data storage.
Checking VM Status and Shutting Down
From the upper left Hamburger icon on the home screen, navigate to Compute Engine >> VM Instances.
Here you’ll see iptest, our DataLab GCE Instance, running (note the green checkmark).
Even if you close the tabs, the GCE instance will keep billing.
Remember to check the box next to your machine (blue arrow), and then STOP the machine (from the upper right of the page) each time you’re finished using the instance!
Also note, you’ll have to reinstall your Linux packages and updates each time you restart, but you’ll retain all of your data, which is backed up in the distributedscraping/iptest bucket and can be accessed from the Hamburger icon via the Storage >> Browser extension.
Make sure to confirm that your instance is stopped (grayed out).
Reconnect After Stopping the Machine
From the Cloud Shell, after starting the machine, simply run the Linux command below to relaunch DataLab.
Again, you can access the notebooks via http://localhost:8081/ once the connection to iptest is made.
datalab connect iptest
If you need to SSH into the machine for any reason, use the below from the SDK command line on your local machine, or from the Cloud Shell in the browser.
Note, you won’t find any of your files here, as they are all linked via the project bucket.
This will also work for general GCE instances without DataLab Notebook Docker images installed — and allows for ls exploration in that context.
gcloud compute ssh 'iptest' --zone=us-west1-a --ssh-flag='-D' --ssh-flag='10000' --ssh-flag='-N'
You’re Now Free!
Photo by Jünior Rodríguez on Unsplash
Thanks for taking the time to complete this exploration.
This should get you started on the right foot with DataLab Notebooks.
I implore you to dive into the details of connecting to remote instances, to set up permissions for your particular needs, and to share your findings in a reproducible manner.
As usual, all content herein is to be used at your own risk.
In my next article, you’ll quickly become a web-scraping Ninja:
Headless, Distributed, and Virtual (Oh My!)
Advanced Web Scraping with Python in DataLab Notebooks.