Good coding practices for Data Science
Ganesh N · Apr 21

While Data Science is not a field that stems directly from computer science, the rise of Machine Learning and AI has made coding an integral part of Data Science.
While Data Science projects are more experimental and not as well defined as software projects, following some good coding principles helps increase efficiency and scale up data science projects.
Here are some good coding practices that I try to follow.

Code Organization
I ensure that all my code is not put into a single file but is split across multiple files.
I usually separate my code into the following four major parts.
Specification Files
In most of my projects, I tend to have a specification file, either as a YAML or a JSON, where I can specify the various parameters needed to run the code.
Having these specifications allows me to use the code in different ways with no code changes.
For example, let’s say I built a model for one country. Instead of hardcoding the country variable in my code, if I specify a country variable in the spec file, I can use this code for a different country by just changing the specification.
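As an illustrative sketch (the file contents and key names here are made up), a JSON spec and the code that reads it might look like this:

```python
import json

# Hypothetical spec contents; in a real project this would live in a
# file such as spec.json and be read with json.load(open("spec.json")).
spec_text = '{"country": "DE", "model": {"type": "linear", "alpha": 0.1}}'
spec = json.loads(spec_text)

# The code never hardcodes the country; changing the spec file is enough.
country = spec["country"]
alpha = spec["model"]["alpha"]
```

Running the same code for a different country is then a one-line change in the spec file, not in the code.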
Utilities
If there are parts of my code that are generic enough to be used across multiple projects, I put them in separate files called ‘utilities’ so that I can reuse them across projects and minimize rework.
Core Functionality
When it comes to the code for the core logic, I again try not to put it all into a single file but instead spread it across multiple files. Every Data Science project will have a data extraction piece, a data exploration piece, a modeling piece and so on.
I ensure that these pieces are separated out across multiple files.
Main Executable
Finally, I have a separate file (usually called main.py); running it executes the entire code.
I try to have minimal logic in this file.
The objective of this file is to help someone understand the interdependencies between the different parts of the code and the overall flow, not the detailed logic.
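A minimal sketch of such a main.py might look like the following. The module and function names are hypothetical placeholders; here they are stubbed inline so the sketch runs standalone, but in a real project they would be imports from the separate files described above.

```python
# main.py — orchestration only: the flow is visible at a glance.
# In a real project these would be imports from separate files, e.g.
#   from extraction import extract_data
#   from modeling import train_model

def extract_data(spec):
    """Stand-in for the data-extraction module."""
    return list(range(1, 6))

def train_model(data, spec):
    """Stand-in for the modeling module."""
    return sum(data) / len(data)

def main():
    spec = {"country": "US"}        # would normally be read from a spec file
    data = extract_data(spec)
    score = train_model(data, spec)
    return score

if __name__ == "__main__":
    print(main())
```

The point is that main() contains no business logic of its own: a reader can see what runs, in what order, and with what inputs.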
Documentation
In all my projects, I try to maintain a Readme page that is regularly updated as my code changes. The Readme provides the objective of the code, instructions to install and use it, the code architecture and a high-level file structure.
While the Readme is documentation focused on the code, I maintain a separate document explaining the statistics and machine learning logic.
The objective of this is to help other Data Scientists understand my logic and algorithm.
While documentation seems like a mundane task, more than helping others, it helps me gain better clarity of my code.
Commenting
I usually add high-level comments about the code at the top of every file.
In addition to giving the reader an overview of the file, this helps me organize my files better.
For every method that I have, I write comments about the objective of the method, the arguments it takes and what it returns.
This again helps me split my code into appropriate methods.
In addition to these high-level comments, if I have any complicated logic in my code, I try to write some high-level comments about it.
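As an illustrative sketch, a method comment of this kind might look like the following (the function and its quantile-clipping logic are made up for the example):

```python
def winsorize(values, lower=0.05, upper=0.95):
    """Clip extreme values to reduce the influence of outliers.

    Args:
        values: list of numeric observations.
        lower: lower quantile used as the clipping floor.
        upper: upper quantile used as the clipping ceiling.

    Returns:
        A new list with values clipped to the [lower, upper] quantile range.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower * (n - 1))]
    hi = ordered[int(upper * (n - 1))]
    return [min(max(v, lo), hi) for v in values]
```

Writing down the objective, arguments and return value first often reveals when a method is doing too many things and should be split.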
Naming convention
The majority of the Data Science code that I have seen has variables and functions named x, y, z, etc. This probably stems from the fact that most Data Scientists come from math and statistics backgrounds.
These naming conventions make the code very abstruse for anyone trying to understand it.
When I write my code, I take some time to think of the most intuitive names to give to my methods, classes, variables etc.
I also ensure that there is consistency in part of speech and the letter case that I use for these code elements.
For example, I try to use lowercase verbs for methods, camel-case nouns for classes, etc.
In fact, good naming conventions reduce the need for comments and documentation.
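A small, made-up contrast of the two styles (the class and method names are purely illustrative):

```python
# Abstruse style, common in Data Science code:
#   x = load("data.csv"); y = f(x, 0.5)

# Intuitive, consistent style: camel-case nouns for classes,
# lowercase verb phrases for methods.
class ChurnModel:
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def predict_churn(self, probability):
        """Return True when the churn probability crosses the threshold."""
        return probability >= self.threshold

model = ChurnModel()
is_churned = model.predict_churn(0.7)
```

A reader can follow the second version without any comments at all, which is exactly the point.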
Version control
One of the best practices of good coding is to maintain version control.
There are tons of benefits from maintaining a version control system.
You can seamlessly collaborate across multiple people, switch back to an older version of the code, add new changes/features to the code without affecting an older version etc.
If you have not used Git before, create a GitHub account and start by uploading your projects there.
It’s free and will give you a good understanding of version control systems.
Given the fact that Data Science projects involve continuous experimentation and edits, version control becomes all the more important.
Let’s say you developed a model and later added a new feature into the model.
If you save these as two versions of the code, you can easily compare the performance of the two models.
Automated Testing
While Data Science may not demand exhaustive automated test cases to validate the sanity of code the way the software world does, having test cases that validate the nuances of the data is a very good practice.
In my projects, I use the unittest package to write automated test cases that validate the functionality of the different parts of the code and, more importantly, check the handling of potential data anomalies like null values, missing values, outliers, etc.
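As a sketch of this kind of test, the helper below (a made-up cleaning function) is checked for its handling of null values using the standard unittest package:

```python
import unittest

def clean_column(values):
    """Drop nulls and coerce the remaining values to float (illustrative)."""
    return [float(v) for v in values if v is not None]

class TestCleanColumn(unittest.TestCase):
    # Validate handling of a data anomaly: null values in the input.
    def test_nulls_are_dropped(self):
        self.assertEqual(clean_column([1, None, 2]), [1.0, 2.0])

    def test_all_null_input_yields_empty_list(self):
        self.assertEqual(clean_column([None, None]), [])

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCleanColumn)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these are cheap to write and catch data surprises long before they show up as silently wrong model outputs.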
While good coding practices ensure robustness of the module, writing good code takes a lot of time and effort.
Not all projects justify this.
If you are doing an analysis to be delivered in a span of a week, you do not have the luxury to beautify your code.
For this reason, I tend to think of every project in three phases, POC, MVP and Production, as explained below:
POC is where I want to get a solution that proves the feasibility of modeling,
MVP is where the solution is robust enough to be used (additional features and model tuning), and
Production is where the solution is fully automated and deployed.
If I’m in POC phase, I don’t tend to spend too much time on cleaning up my code but as soon as I go to the MVP stage, I make sure that my code follows good coding practices.
If there are any other practices that you follow in your data science work, feel free to comment below! Cheers.