Boost Productivity for End-to-End Data Science Projects
Tools for data scientists from a non-engineering background
Sherry Wang
Mar 23
Photo by Vitor Santos on Unsplash

I started my career as a business analyst doing analysis with Jupyter Notebook.
As I became a data scientist, I got more involved in coding, but I still only used notebooks (and no, no text editors).
However, the more complex the projects became, the more limitations I experienced with using only Jupyter Notebook.
Thus, I decided to explore tools to boost my productivity for end-to-end data science projects.
If you are in the same boat, you may also find these tools useful.
In this article, I’ll go over my end-to-end project workflow, room for productivity improvements, and introduce tools to boost productivity for data science projects.
My End-to-End Project Workflow

1. Data understanding and exploratory analysis

2. Model development

This step involves creating and testing different feature engineering ideas, experimenting with multiple algorithms, and searching for optimal model parameters.
3. Extract functions and scripts to create an end-to-end training job

Next, I put together functions and scripts to enable end-to-end training. The goal is to query the data, create features, search parameters, and train the final model with one command.
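As a minimal sketch of what "one command" means here, the entry point below chains the pipeline steps behind a single CLI call. The helper names (`load_data`, `build_features`, `train_model`) and their dummy bodies are hypothetical placeholders, not code from the original project:

```python
"""Hypothetical end-to-end training entry point: one command runs the pipeline."""
import argparse
import random


def load_data(query):
    # Placeholder for a real database query; returns (value, label) rows.
    random.seed(0)
    return [(random.random(), random.randint(0, 1)) for _ in range(100)]


def build_features(rows):
    # Placeholder feature engineering: derive a second feature from the first.
    return [((x, x ** 2), y) for x, y in rows]


def train_model(data, n_trials):
    # Placeholder parameter search: keep the best of n_trials dummy scores.
    best = max(random.random() for _ in range(n_trials))
    return {"best_score": best, "n_rows": len(data)}


def main(argv=None):
    parser = argparse.ArgumentParser(description="End-to-end training job")
    parser.add_argument("--query", default="SELECT * FROM events")
    parser.add_argument("--n-trials", type=int, default=10)
    args = parser.parse_args(argv)

    rows = load_data(args.query)
    data = build_features(rows)
    return train_model(data, args.n_trials)


if __name__ == "__main__":
    print(main())
```

The payoff is that retraining becomes `python train.py --n-trials 50` instead of re-running a stack of notebook cells in the right order.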
4. Create a self-contained end-to-end scoring job

This is the production code that scores new data with the trained model object. It usually requires creating a self-contained environment as well, using tools like Docker, conda, etc.
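With conda, the self-contained environment can be pinned declaratively. A minimal sketch of such an environment file follows; the package list and versions are illustrative, not the actual dependencies of any particular project:

```yaml
# environment.yml — illustrative pinned environment for the scoring job
name: scoring-job
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - scikit-learn
  - pip
  - pip:
      - mlflow
```

Recreating the environment anywhere is then `conda env create -f environment.yml`, which keeps the scoring job reproducible outside your own machine.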
Room for Productivity Improvement

Repeated manual scaffolding

I stick to a certain folder structure for all my projects, as recommended, but I had been manually creating all the needed files and folders over and over.
Insane experiment tracking

I used notebooks as documentation for experiment tracking, but I can't remember how many times I accidentally overwrote experiment results by running the wrong cell. A notebook also gets long and messy pretty fast if you run a lot of different experiments in it.
"Why not save experiment results into a separate file?" you may ask. Well, that defeats the purpose of using the notebook as documentation, but it's what I actually ended up doing. However, managing those files of experiment results also takes quite some effort.
Repeated work on the same piece of code

Jupyter Notebook is perfect for data understanding and exploratory analysis, but it's not the optimal choice for code development. I end up reworking most of the codebase when moving code from notebooks to scripts.
Constant switching between tabs

This may not be a problem for everyone, but I prefer an immersive environment where I don't have to constantly switch between tabs while editing and navigating multiple scripts.
Tools to Boost Productivity for Data Science Projects

Scaffold with cookiecutter

Cookiecutter is flexible enough to let you create your own template. For me, one of the existing templates (https://github.com/kragniz/cookiecutter-pypackage-minimal) works perfectly. It not only saves me manual work but also ensures that my projects follow the same standards, which makes them easier to manage and easier for colleagues to take over.
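To make the "repeated manual scaffolding" point concrete, here is a sketch of the kind of skeleton a cookiecutter template stamps out automatically. The layout below is a generic example I made up, not the exact output of cookiecutter-pypackage-minimal:

```python
"""Sketch: the project skeleton a cookiecutter-style template automates.

The folder layout is a generic example, not the exact output of
cookiecutter-pypackage-minimal.
"""
from pathlib import Path


def scaffold(root, name="my_project"):
    """Create a minimal project skeleton under `root` and return its path."""
    base = Path(root) / name
    # Package code, tests, and exploratory notebooks each get a folder.
    for d in [base / name, base / "tests", base / "notebooks"]:
        d.mkdir(parents=True, exist_ok=True)
    (base / name / "__init__.py").touch()
    (base / "README.md").write_text(f"# {name}\n")
    (base / "setup.py").write_text("# packaging config goes here\n")
    return base


if __name__ == "__main__":
    import tempfile
    print(scaffold(tempfile.mkdtemp()))
```

With cookiecutter, the equivalent is a single `cookiecutter <template-url>` call that prompts for the project name and fills in the template, so every project starts from the same standard layout.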
Track Experiment Results with MLflow

MLflow has made my life so much easier by providing a quick and easy way to track experiments. Its Python API is really intuitive and is also powerful for comparing results. It's still in beta, and a lot of its functionality seems to be designed for the Databricks platform, but it's more than enough for simple experiment tracking.
Vim + tmux/screen

Incorporating a text editor is probably the biggest change in my workflow. I now turn to Vim as soon as I'm done with exploratory analysis, and start developing functions and scripts in .py files directly. It has prevented a lot of repeated coding work, and the immersive, mouse-free coding experience is addictive! If you also find yourself writing more and more code and you've never tried coding in a text editor, it's for sure worth a shot.
Manage jobs with MLflow Projects

MLflow Projects uses conda environments under the hood to make Python jobs executable through standardized commands. It hides all the messy details and provides a simple interface so other people can run my code without knowing much about it.
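A minimal sketch of an MLproject file; the entry-point parameters, script name, and environment file name here are illustrative:

```yaml
# MLproject — illustrative project definition
name: my_project
conda_env: environment.yml

entry_points:
  main:
    parameters:
      n_trials: {type: int, default: 10}
    command: "python train.py --n-trials {n_trials}"
```

With this in place, someone can run `mlflow run . -P n_trials=20` and MLflow will build the conda environment and execute the job, without them needing to know anything about the code inside.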
The more I learn about data science, the more I realize how much more there is to learn. That's why my workflow and toolset will always be evolving, and there will always be better things out there: better techniques, better tools, etc.
So if you have suggestions, feel free to connect with me via LinkedIn or send them my way at sherry.com!