Data Science with Optimus.
Part 2: Setting your DataOps Environment.
Breaking down data science with Python, Spark and Optimus.
Today: Data Operations for Data Science.
Here we’ll learn to set up Git, Travis CI and DVC for our project.
Favio Vázquez, Apr 8
Welcome back to the Data Science with Optimus series.
In the first part (Data Science with Optimus. Part 1: Intro.) we started this journey talking about Optimus, Spark and creating our environment.
For that we are using MatrixDS, a community for data scientists by data scientists (matrixds.com).
To get access to the repo, open the project on MatrixDS (MatrixDS | The Data Project Workbench) and click on Forklift.
There’s also a repo on GitHub: FavioVazquez/ds-optimus (How to do data science with Optimus, Spark and Python). You just have to clone it.
DataOps
From the great people at DataKitchen:
“DataOps can accelerate the ability of data-analytics teams to create and publish new analytics to users. It requires an Agile mindset and must also be supported by an automated platform which incorporates existing tools into a DataOps development pipeline. DataOps spans the entire analytic process, from data acquisition to insight delivery.”
So DataOps (from Data Operations), for me, can be thought of as the intersection of these fields:
[Figure: the fields that intersect in DataOps]
And its functional components will be:
[Figure: the functional components of DataOps]
You can read more about some of these topics in my friend Andreas Kretz’s publication, Plumbers Of Data Science (the Engineering and Big Data community behind Data Science), on medium.com.
Setting up the environment on MatrixDS
https://matrixds.com/
We will create a simple (but robust) DataOps environment in the platform using these tools: Travis CI, DVC, Git and GitHub.
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
In data science, Git is like our internal manager with a great memory.
It will remember everything you have done, how you did it and the history of every file in the repository.
Git is installed by default in MatrixDS but we will need to set up two configurations.
First, let’s open a new terminal and type:
    git config --global user.name "FIRST_NAME LAST_NAME"
to set your name, and then run git config --global user.email followed by your email address in quotes to set your email.
I recommend that the email you put there is the same one you have on GitHub.
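If you want to double-check those two settings from Python, here is a minimal sketch. It assumes git is on your PATH, and the helper names (read_git_identity, missing_identity_keys) are mine for illustration; they are not part of Git or MatrixDS.

```python
import shutil
import subprocess


def read_git_identity():
    """Return the global user.name and user.email as a dict ("" if unset)."""
    identity = {}
    for key in ("user.name", "user.email"):
        if shutil.which("git") is None:
            # git is not installed or not on PATH: report the key as unset
            identity[key] = ""
            continue
        result = subprocess.run(
            ["git", "config", "--global", "--get", key],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=False,  # a non-zero exit code just means the key is not set
        )
        identity[key] = result.stdout.strip()
    return identity


def missing_identity_keys(identity):
    """List the keys that still need a `git config --global ...` call."""
    return [key for key, value in identity.items() if not value]


if __name__ == "__main__":
    missing = missing_identity_keys(read_git_identity())
    print("Missing settings:", missing if missing else "none")
```

If the script prints a key as missing, run the matching git config command again.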
To start working with this repository, just clone it from GitHub (FavioVazquez/ds-optimus):
    git clone https://github.[...]git
And then go to the directory:
    cd ds-optimus
Because this is already a Git repo you don’t need to initialize it, but if you are starting from scratch, you’ll need to type:
    git init
in the folder where you want your repository.
DVC, or Data Version Control, is an open-source version control system for machine learning projects, and data science projects too.
Because we are using Python, we will install DVC with:
    pip install --user dvc
As the documentation says, in order to start using DVC you first need to initialize it in your project’s directory.
DVC doesn’t require Git and can work without any source control management system, but for the best experience we recommend using DVC on top of Git repositories.
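To see where a folder stands before and after initializing, here is a small sketch. It only looks for the .git and .dvc entries that Git and DVC create in the project root; the function name dvc_readiness is hypothetical, and this is a convenience check, not something DVC itself provides.

```python
from pathlib import Path


def dvc_readiness(project_dir):
    """Report whether a folder is a Git repo and whether DVC was initialized in it."""
    root = Path(project_dir)
    return {
        # `git init` and `git clone` create a .git entry in the project root
        "git_repo": (root / ".git").exists(),
        # `dvc init` creates a .dvc directory in the project root
        "dvc_initialized": (root / ".dvc").is_dir(),
    }


if __name__ == "__main__":
    status = dvc_readiness(".")
    print("Git repository:", status["git_repo"])
    print("DVC initialized:", status["dvc_initialized"])
```

Run it from the project folder: if the first value is False you are probably in the wrong directory, and if only the second is False you still need to initialize DVC.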
IMPORTANT COMMENT:
There are errors right now with some configurations in MatrixDS for DVC, so in order to run DVC you’ll have to do it in a different folder, not in /home/matrix.
For that do this (I’m assuming you have the original project in the default folder):
    cd /home/
    sudo mkdir project
    cd project
    cp -r .[...]
    cd ds-optimus
So to start using DVC with our repo we just type:
    dvc init
If, for some reason, that doesn’t work for you on MatrixDS, install DVC for Linux:
    wget https://dvc.[...]list
    sudo cp dvc.[...]d/
    sudo apt-get update
    sudo apt-get install dvc
If for some reason you get the error:
    W: chown to root:adm of file /var/log/apt/term.log failed - OpenLog (1: Operation not permitted)
do a
    sudo su
and then type:
    apt-get install dvc
OK, so if you ran dvc init on this repo you’ll see a series of “Adding ... to ...” messages for paths such as .dvc/state, .dvc/lock, .dvc/updater, .dvc/state-journal, .dvc/state-wal and .dvc/cache, ending with:
    You can now commit the changes to git.
    DVC has enabled anonymous aggregate usage analytics.
    Read the analytics documentation (and how to opt-out) here:
    https://dvc.org/doc/user-guide/analytics

    What's next?
    ------------
    - Check out the documentation: https://dvc.org/doc
    - Get help and share ideas: https://dvc.org/chat
    - Star us on GitHub: https://github.com/iterative/dvc
Then commit your work (if you change the folder you may need to configure Git again):
    git add .
    git commit -m "Add DVC to project"
Travis CI:
https://travis-ci.org/
Travis CI (Continuous Integration) is my favorite CI tool.
Continuous Integration is the practice of merging in small code changes frequently, rather than merging in a large change at the end of a development cycle.
The goal is to build healthier software by developing and testing in smaller increments.
The hidden concept here is automatic testing of what you are doing.
When we are programming we are doing a lot of stuff all the time: testing new things, trying new libraries and more, and it’s not uncommon to mess things up.
CI helps you with that because you will do your work, commit a little bit of it with Git, and have the necessary tests in place to see whether the new piece of code or analysis affects your project (in a good or bad way).
There’s a lot more to know about Travis and CI tools, but the plan here is to use it; you’ll learn along the way.
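To make “the necessary tests” a bit more concrete, here is a minimal sketch of the kind of test a CI build can run. The helper normalize_column_name is a made-up example for illustration (it is not part of Optimus or of this project); the point is only the shape of a pytest-style test.

```python
def normalize_column_name(name):
    """Hypothetical helper: trim, lowercase and snake_case a column name."""
    return "_".join(name.strip().lower().split())


def test_normalize_column_name_handles_spaces_and_case():
    # Surrounding spaces are removed and inner spaces become underscores
    assert normalize_column_name("  First Name ") == "first_name"


def test_normalize_column_name_keeps_clean_names():
    # A name that is already clean should come back unchanged
    assert normalize_column_name("age") == "age"
```

Saved in a file whose name starts with test_ (for example a hypothetical test_columns.py), pytest will discover and run the two test functions, and a failing assert makes the build fail.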
So the first thing you have to do is go to travis-ci.org (Travis CI - Test and Deploy Your Code with Confidence) and create an account with your GitHub profile.
Then (I’m assuming here that you have successfully forked the repo from GitHub at this point) go to https://travis-ci.org/account/repositories, choose ds-optimus and activate the repo.
If everything went well you’ll see something like this:
[Screenshot: the ds-optimus repository page on Travis CI]
OK, so right now this is empty because we don’t have anything to test yet.
That’s fine, we’ll get to that in the following articles.
But right now we need to build the basic file that will trigger “travis builds”.
For that we need a .travis.yml file, and this is the basic content it should have:
    language: python
    python:
      - "3.6"
    # Upgrade pip and install pytest first
    before_install:
      - pip install --upgrade pip
      - pip install pytest
    # command to install dependencies
    install:
      - pip install -r requirements.txt
    # command to run tests
    #script: pytest
As you can see we also need a requirements.txt file; in our case it will only contain optimus for now.
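If you want to see what the install step will pick up, here is a rough sketch that lists the entries in a requirements file. The function name parse_requirements is my own, and it handles only the simple case of one package per line plus full-line comments; pip itself understands much more than this.

```python
def parse_requirements(text):
    """Return the package entries of a requirements file, skipping blanks and comment lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


if __name__ == "__main__":
    # Example content matching what this article describes: only optimus for now
    example = "# project dependencies\noptimus\n"
    print(parse_requirements(example))
```

To check your own file, read it with open("requirements.txt").read() and pass the text to the function.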
If you have a fork of the project on GitHub, make sure to add my master as an upstream, because the files will already be there.
If you don’t know how to add an upstream, here’s how: “How to Keep a Downstream git Repository Current with Upstream Repository Changes” (medium.com).
Then we have to push the commit that adds “.travis.yml” to the project.
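For the upstream step mentioned above, here is a minimal sketch of the usual git commands, built from Python so you can look at them before running anything. The URL in the example is a placeholder, not the real address, and the function names are mine; it also assumes you are already on the branch you want to update.

```python
import subprocess


def upstream_sync_commands(upstream_url, branch="master"):
    """Build the git commands that add an upstream remote and merge its branch."""
    return [
        ["git", "remote", "add", "upstream", upstream_url],
        ["git", "fetch", "upstream"],
        ["git", "merge", "upstream/" + branch],
    ]


def run_commands(commands, cwd="."):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder URL: replace it with the address of the repository you forked from
    for command in upstream_sync_commands("https://example.com/original/repo.git"):
        print(" ".join(command))
```

The script only prints the commands; call run_commands(...) inside your fork when you are happy with them, and then push as usual.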
Then when you go to Travis again you will see your first build:
[Screenshot: the first build on Travis CI]
For now it will give us an error because we haven’t created any tests to run:
[Screenshot: the failed build]
But don’t worry, we will get to that later.
Thanks for reading this update; now start setting up your environment for this project.
If you have any questions please write me on LinkedIn: Favio Vazquez, Founder / Chief Data Scientist, Ciencia y Datos.