Introduction to Github for Data Scientists

Master the basics of Github

Rebecca Vickery
Jun 26

Photo by Hack Capital on Unsplash

Experience with version control is fast becoming a requirement for all data scientists.
Version control can help data scientists work better as a team, facilitating collaboration on projects, sharing work and helping other data scientists to repeat the same or similar processes.
Even if you are a data scientist working in isolation, it is useful to be able to roll back changes, or to make changes on a branch first and test that they don’t break anything before merging them into the current project.
In this post I am going to cover the following:

- What is Github?
- Why do data scientists need to use it?
- How to create and clone a repository
- Branching
- Pull requests

What is Github?

Github is one of the most well known and widely used platforms for version control.
Github uses an application known as Git to apply version control to your code.
Files for a project are stored in a central remote location known as a repository.
Every time you make a change locally on your machine and push it to Github, the remote version is updated and a record of that commit is stored.
If you want to roll back to a previous version of your project, this record of commits allows you to do so.
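To make the idea of rolling back concrete, here is a minimal sketch using a throwaway local repository. The file name analysis.txt, the commit messages and the placeholder identity are made up for illustration. git revert is only one of several ways Git offers to undo a change; I use it here because it keeps the full history intact.

```shell
set -eu

# Throwaway repository in a temporary directory (nothing here touches Github).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # name the first branch master, as in this article
git config user.name "Example User"        # placeholder identity, only for this demo repo
git config user.email "user@example.com"

# First commit: a version we are happy with.
echo "version 1" > analysis.txt
git add analysis.txt
git commit -q -m "first version"

# Second commit: a change we later want to undo.
echo "version 2" > analysis.txt
git commit -q -am "second version"

# List the recorded commits, newest first.
git log --oneline

# Roll back the latest commit by recording a new commit that reverses it.
git revert --no-edit HEAD

# The file is back to its first version, and the history keeps all three commits.
cat analysis.txt
git log --oneline
```

Because the revert is itself a commit, it can be pushed and shared like any other change.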
Additionally, because the project files are stored remotely anyone else with access can download the repo and make changes to the project.
Branching, which in essence means making a temporary and completely separate copy of the project, lets you make changes there first without fear of breaking anything.
This is especially important if you are working on a project where there is a feature in production that is reliant on the code working.
This page covers the meaning of all the key terms I am using in this article such as commit, branch and repository.
Why do data scientists need to use it?

Data scientists need to use Github for much the same reasons that software engineers do: for collaboration, for ‘safely’ making changes to projects, and for being able to track and roll back changes over time.
Traditionally, data scientists have not necessarily had to use Github, as the process of putting models into production (where version control becomes of paramount importance) was often handed over to software or data engineering teams.
However, there is a growing trend towards systems that make it much more accessible for data scientists to write their own code to put models into production; see tools such as H2O.ai and Google Cloud AI Platform.
It is, therefore, becoming more and more important that data scientists are proficient in the use of version control.
Creating a repository

I am going to give a brief introduction to using Github and Git to perform the most common operations from the command line.
If you don’t already have an account you will need to sign up for one (it is completely free!) here.
To create a repository from scratch, go to https://github.com/ and click the new button.
On the following page, you need to type a name for your project and select whether you want to make this public or private.
Next, check the box "initialise with a README.md" and click "create repository".
You are now ready to add and make changes to files in your repository.
To do this from the command line you will first need to download and install Git following the instructions here.
To work on the project locally you first need to clone the repository.
You would also follow this step if you want to clone somebody else's project to work on.
    cd my-directory
    git clone https://github.[...]git

You can find the URL for the repository by clicking the clone or download button.
A new directory will now appear in your current working directory with the same name as the repository.
This is now your local version of the project.
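The clone step can be tried without a network connection. In the sketch below, a small local repository stands in for the one on Github. The names remote, my-project and my-directory are placeholders of mine, and with a real project you would pass the URL copied from Github instead of a local path.

```shell
set -eu

work="$(mktemp -d)"
mkdir -p "$work/remote/my-project" "$work/my-directory"

# Stand-in for the remote: a small repository containing only a README.md.
cd "$work/remote/my-project"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"        # placeholder identity for the demo commit
git config user.email "user@example.com"
echo "# my-project" > README.md
git add README.md
git commit -q -m "Initial commit"

# Clone it, just as you would with a URL copied from Github.
cd "$work/my-directory"
git clone -q "$work/remote/my-project"

# A new directory named after the repository now exists.
ls

# Inside it, the remote called origin records where the clone came from.
cd my-project
git remote -v
git log --oneline
```

The origin remote created by the clone is what later push and pull commands talk to by default.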
Branching

Branching allows you to make a copy of your repository, make changes there and test that they work correctly before merging into the master copy.
It is best practice to always make changes on a branch rather than work on the master.
Before creating a branch it is best to check that your local project is up to date with the remote repository.
You can check the status by typing:

    git status

If you are not up to date you simply type git pull.
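One detail worth knowing is that git status compares your branch with the remote state Git last downloaded, so running git fetch first gives a more reliable answer. The following is a rough sketch of that sequence, using two local directories (remote and local, both names invented for this example) in place of Github and your machine.

```shell
set -eu

work="$(mktemp -d)"

# Stand-in for the remote repository, with one commit on master.
mkdir -p "$work/remote"
cd "$work/remote"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"        # placeholder identity for the demo commits
git config user.email "user@example.com"
echo "first line" > README.md
git add README.md
git commit -q -m "Initial commit"

# Local clone of that repository.
git clone -q "$work/remote" "$work/local"

# Meanwhile, a collaborator adds a commit to the remote.
echo "second line" >> README.md
git commit -q -am "Update README"

# Back in the local clone: refresh what Git knows about the remote, then check.
cd "$work/local"
git fetch -q
git status -sb        # the first line reports how master compares with origin/master

# Bring the local branch up to date and check again.
git pull -q
git status -sb
```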
To create and check out a branch type the following.
    git branch my-branch
    git checkout my-branch

You can now make changes on this branch, and they will not affect the master branch until you merge them.
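The two commands can also be combined into one with git checkout -b. As an illustration of how a branch isolates your work, the sketch below commits a change on my-branch in a throwaway repository and then shows that master is untouched. The file contents and commit messages are placeholders.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"        # placeholder identity for the demo commits
git config user.email "user@example.com"
echo "original text" > README.md
git add README.md
git commit -q -m "Initial commit"

# Create the branch and switch to it in one step
# (equivalent to git branch my-branch followed by git checkout my-branch).
git checkout -q -b my-branch

# The asterisk in the listing marks the branch you are on.
git branch

# Commit a change on the branch.
echo "line added on my-branch" >> README.md
git commit -q -am "change to README.md"

# Switch back to master: the file there is unchanged.
git checkout -q master
cat README.md

# The only difference between the two branches is the commit made above.
git log --oneline master..my-branch
```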
Let’s make a change to the README.md file and work through the process of committing and merging a change.
Open the README file in your preferred text editor and make any change.
I’m using Sublime Text and just adding one line to the file.
Pull Requests

Best practice when working on a collaborative project is to use pull requests, so we are going to merge our change using this process.
A pull request is a process that allows you or somebody else to review the changes you are making before merging them into the master version.
Before opening a pull request you need to add and commit your changes.
    git add .
    git commit -m "change to README.md"
    git push --set-upstream origin my-branch

You will only need to add the argument --set-upstream origin my-branch the first time you push from a new branch.
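If you would like to rehearse these commands before pushing to a real project, the sketch below runs them against a bare local repository that plays the role of Github. The directory names seed, remote.git and my-project, the README contents and the second commit are all assumptions made for the example.

```shell
set -eu

work="$(mktemp -d)"

# Placeholder identity for every commit made in this demo.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Seed repository, then a bare copy of it that plays the role of Github.
mkdir -p "$work/seed"
cd "$work/seed"
git init -q
git symbolic-ref HEAD refs/heads/master
echo "# my-project" > README.md
git add README.md
git commit -q -m "Initial commit"
git clone -q --bare "$work/seed" "$work/remote.git"

# Local working clone with a new branch.
git clone -q "$work/remote.git" "$work/my-project"
cd "$work/my-project"
git checkout -q -b my-branch

# Edit, stage and commit.
echo "One extra line" >> README.md
git add .
git commit -q -m "change to README.md"

# First push from the new branch: create it on the remote and link the two.
git push -q --set-upstream origin my-branch

# Later pushes from the same branch need no extra arguments.
echo "Another line" >> README.md
git commit -q -am "second change to README.md"
git push -q

# The remote now lists my-branch alongside master.
git ls-remote --heads origin
```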
You will now see this message in your remote repository.
Click "compare and pull request" and then click "create pull request".
At this point, if you were collaborating with somebody else or a team on the project you might ask someone to review your changes.
They can add comments, and when everyone is happy with the changes you can merge the pull request.
Your changes will now be merged into the master branch.
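Merging on Github updates the remote master only, so your local copy needs a pull before it contains the merged change. The sketch below is an offline approximation. A second clone named reviewer performs the merge that the Github button would normally do, and all directory names and file contents are placeholders. The last three git commands in the main clone are the part you would run yourself.

```shell
set -eu

work="$(mktemp -d)"

# Placeholder identity for every commit made in this demo.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Setup: a bare stand-in for Github plus a working clone with a pushed branch.
mkdir -p "$work/seed"
cd "$work/seed"
git init -q
git symbolic-ref HEAD refs/heads/master
echo "# my-project" > README.md
git add README.md
git commit -q -m "Initial commit"
git clone -q --bare "$work/seed" "$work/remote.git"
git clone -q "$work/remote.git" "$work/my-project"
cd "$work/my-project"
git checkout -q -b my-branch
echo "One extra line" >> README.md
git commit -q -am "change to README.md"
git push -q --set-upstream origin my-branch

# Stand-in for merging the pull request on Github: merge the branch into
# master from a second clone and push the result.
git clone -q "$work/remote.git" "$work/reviewer"
cd "$work/reviewer"
git merge -q --no-ff -m "Merge my-branch" origin/my-branch
git push -q origin master

# What you would run locally once the pull request has been merged.
cd "$work/my-project"
git checkout -q master
git pull -q                  # master now contains the merged change
git branch -d my-branch      # remove the local branch once it is no longer needed
git branch
```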
If you have finished with the branch, it is best practice to delete it by hitting the delete branch button.

Using Github can get a lot more complex; however, I wanted to give a gentle introduction here.
For a more thorough overview, Github has produced a set of guides which can be found here.