The ‘Colaboratory’ Data Scientist: with GPUs and TPUs

These days, even when I’m at my research lab on a GPU-heavy machine, I still find myself firing up a Colab notebook.

It’s not just the ease of setting up and using a powerful environment; it’s also the ability to share a notebook and have someone take over later on.

1. IPython -> Jupyter Notebook -> Google Colab

If you are new to the Colab bandwagon, I think of it as a Jupyter Notebook that lives and works in the cloud.

If you are new to all that, you are a couple of years late, but hey, you’ll get caught up soon.

I started using Colab (or notebooks in general) to explore a new dataset or a library.

These days, I find myself doing most of the data science within a notebook: that includes data visualization with libraries like Seaborn or Bokeh, all of the machine learning, and even performance monitoring.

The strongest points (of a Colab workflow) for me have been:

- It lives in my Google Drive, so sharing code is easier.
- It’s kinda like a Google Doc, so you can easily get comments and work with others in collaboration.
- It works in the cloud. With GPU and TPU, and a sizeable chunk of memory and storage, it can work hard!
- You can set a ‘data_dir’ in Google Drive so your peers or your boss can drop new files and get results back, all within the same shared folder.

This all seems too good to be true, but that is the kinda world we live in! Now, let’s get to work!

Before

A few months earlier, whenever I got my hands on a juicy new dataset and had noted down my hypotheses, this is what I did:

$ conda create --name yetanother_project

Followed by a couple dozen of these:

$ conda install -n yetanother_project some_new_libraries some_existing_libraries

In theory, I could use my existing conda environments, but from experience I almost always know I’m going to use some new library which will not sit well with some existing one.

So if I update the library, my main project, the one that will one day bring me fame and money, will sadly break, which will lead to 5 more hours of scrounging Stack Overflow.

Finally:

$ mkdir deep_inside_my_workspace_with_mymac_running_out_of_space
$ cd deep_inside_my_workspace_with_mymac_running_out_of_space

After (with Colab)

Get started, then add some GPU by going to Runtime > Change runtime type.

(cause why the heck not?)
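If you want to double-check that a GPU actually got attached, here is a quick sanity check (a minimal sketch, assuming the TensorFlow build that Colab preinstalls):

import tensorflow as tf

# returns something like '/device:GPU:0' on a GPU runtime, or '' if none is attached
tf.test.gpu_device_name()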

2. Files (Google Drive!)

Now, since my university has pledged to give unlimited Google Drive storage, I connect my Google Drive, where I’ll keep all the data :p .

Thank you!

from google.colab import drive
drive.mount('/content/gdrive/')

You’ll get a link to allow authorization and you are good to go! On the left sidebar you can navigate your files, ask your co-worker to drop a new file, and collaborate.
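To confirm the mount worked, you can list the top-level Drive folder (a minimal check; ‘My Drive’ is the root folder Google Drive exposes under the mount point):

import os

# the mounted Drive appears under /content/gdrive/My Drive
os.listdir('/content/gdrive/My Drive')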

import os
import pandas as pd

data_dir = os.path.join('/content/gdrive', 'My Drive', 'workspace', 'colab_notebooks')
os.listdir(data_dir)

my_csv = pd.read_csv(os.path.join(data_dir, 'master.csv'))
my_csv.head()

Output of the above code.
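And since the notebook can write to the same folder, results can go straight back for your peers to pick up, closing the shared-folder loop from earlier. A minimal sketch (‘results.csv’ is just a hypothetical file name):

# write a quick summary back into the shared Drive folder
my_csv.describe().to_csv(os.path.join(data_dir, 'results.csv'))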

3. Imports and installs

Want to use a library? Just do an import.

import pandas as pd
import matplotlib.pyplot as plt
import PIL
import numpy as np

my_arr = np.arange(25)
my_arr

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])

Want to install a library? You can pip install most libraries and use them right away.

!pip install keras
import keras

Using TensorFlow backend.

Want to use a library without pip support? You can try apt-get install, or import a wheel and then install it. If you don’t know what the last part means, you will most likely be fine just using pip.

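For illustration, here is what those fallbacks look like in a Colab cell. This is only a sketch: graphviz and the wheel path are placeholders, not packages this post actually needs.

# a system package, via apt-get
!apt-get -qq install -y graphviz
# a wheel you've dropped into Drive (hypothetical path)
!pip install '/content/gdrive/My Drive/workspace/wheels/some_library-1.0-py3-none-any.whl'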

Data Analysis

So now comes the fun part. Let’s see what the data is made of! Follow along in Colab here. You can find the data here.

Use the pandas describe function to get quick insights, and the info function to see which columns are complete and which are not.

m_data = pd.read_csv(os.path.join(data_dir, 'master.csv'))
m_data.describe()
m_data.info()
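A related one-liner worth knowing alongside describe and info (a small extra, not part of the walkthrough above): counting the missing values per column directly.

# number of missing (NaN) entries in each column
m_data.isnull().sum()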

4. Sharing

So, Google Colab is like a Google Doc that you can execute, so sharing and working together is pretty straightforward.

Now that is very cool, but what I found amazing is ‘Code Snippets’.

Usually when I have an idea of a plot in mind, I don’t know how to actually plot it, so I Stack Overflow it, which is generally quicker than looking through the documentation.

Now, in Colab, I can directly search for working code snippets and alter them as I want.

This saves a ton of time.

In the next part, I’ll show you how to think about the data and get some insightful visualizations using Pandas and Seaborn.

Stay tuned.

Comments? Suggestions? Feel free to follow me on Medium, Twitter, and LinkedIn. Here are some of my other posts:

Should AI explain itself? or should we design Explainable AI so that it doesn’t have to (towardsdatascience.com)

Will Artificial Intelligence take shortcuts? or is it just us? (towardsdatascience.com)

Bitcoin and the missing ‘bit’ of change (bit.ly)

lyReferences:Google Colab.
