Best practices for organizing data science projects

Which are the main considerations when starting a data science project?

Florencia Mangini
Jun 1

Photo by Siriwan Srisuwan on Unsplash

Data science projects usually involve many data artifacts (documents, Excel files, data from websites, R files, Python files), and they require repeating and improving each step while keeping track of the logic behind each decision.
1) Objectives for a data organization

There are several objectives to achieve:

Optimization of time: we need to save time by minimizing lost files, problems reproducing code, and problems explaining the reasons behind decisions.

Reproducibility: data science projects involve a lot of repetition, so it helps if the organization system makes it easy to recreate any part of your code (or the entire project), now and at some moment in the future (6 months, 1 year, 2 years …).

Improve the quality of the projects: organized projects usually mean detailed explanations along the process. While documenting, the need to explain the reason behind each step makes it more likely that you will find bugs and inconsistencies.
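For the reproducibility objective, one possible habit is to save a small record of each run next to its outputs. The sketch below is only an illustration: the function name `start_run`, the fields in the record, and the example seed and notes are all assumptions, not something the workflow above prescribes.

```python
import json
import platform
import random
import sys
from datetime import datetime, timezone


def start_run(seed, notes):
    """Seed Python's random generator and return a record describing this run.

    Hypothetical helper: the field names are illustrative only.
    """
    random.seed(seed)
    return {
        "started_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "command": sys.argv,  # how the script was invoked
        "seed": seed,         # lets the same random draws be repeated later
        "notes": notes,       # free text: why this run was made
    }


# Example values are placeholders.
record = start_run(seed=42, notes="first exploratory run")
print(json.dumps(record, indent=2))
```

Saving this record as a JSON file alongside the results gives a future reader (possibly you, two years later) the seed and the environment needed to try to recreate the run.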
2) Starting a new project: the beginning

It is good practice to organize a data science project well from the very beginning. Instead of considering that a waste of time, we can see it as a savvy approach that saves time in different ways.
Also, because we work with others in an organization, it is important to understand that everyone has different workflows and ways of working.
For a shared project, it is a good idea to reach a real consensus not only about the folder structure but also about the expected content of each folder.
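One way to make that consensus explicit is to keep the agreed layout in a short script that creates the folders and writes a note in each one describing its expected content. This is a minimal sketch: the folder names, the descriptions, and the `create_layout` helper are hypothetical examples for a team to replace with whatever it agrees on.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: every folder name and description here is an
# example, not a recommended standard.
LAYOUT = {
    "data/raw": "Original files exactly as received; never edited by hand.",
    "data/processed": "Intermediate and final datasets produced by code.",
    "docs": "Reference documents and notes on decisions.",
    "src": "R and Python scripts.",
    "reports": "Deliverables shared outside the team.",
}


def create_layout(root, layout=LAYOUT):
    """Create each folder under root with a README.txt stating its purpose."""
    root = Path(root)
    created = []
    for folder, purpose in layout.items():
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        (path / "README.txt").write_text(purpose + "\n")
        created.append(path)
    return created


# Usage: build the layout in a temporary directory and list what was made.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_layout(tmp):
        print(path.relative_to(tmp))
```

Keeping the layout in one dictionary means a change to the agreement is a one-line edit that everyone can review.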
3) Use version control

Why is it necessary to use version control? To delegate basic tasks like:

Having an automated backup system for the work. That alone makes the effort of setting it up worthwhile.

Handling changes to the files throughout the project, and returning to previous versions in order to check something. Version control systems solve the problem of reviewing and retrieving previous changes, and they let you keep a single file instead of duplicated copies.

Facilitating work with others by making it easy to share files and keep working on them.
Some of the most popular tools are Git and Subversion (SVN). Whatever the final choice, the important thing is to implement one.
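To illustrate returning to a previous version without keeping duplicated files, here is a rough sketch that drives Git from Python. It assumes the `git` command-line tool is installed; the helper names (`git`, `snapshot`, `read_version`, `demo`), the file `clean_data.py`, and the placeholder identity are all hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Placeholder identity passed on the command line so the sketch does not
# depend on any global Git configuration.
GIT_OPTIONS = [
    "-c", "user.name=Example Analyst",
    "-c", "user.email=analyst@example.com",
    "-c", "commit.gpgsign=false",
]


def git(project_dir, *args):
    """Run one git command inside project_dir and return its output."""
    result = subprocess.run(
        ["git", *GIT_OPTIONS, *args],
        cwd=project_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def snapshot(project_dir, message):
    """Commit the current state of every file and return the commit id."""
    if not (Path(project_dir) / ".git").exists():
        git(project_dir, "init")
    git(project_dir, "add", "--all")
    git(project_dir, "commit", "-m", message)
    return git(project_dir, "rev-parse", "HEAD")


def read_version(project_dir, commit_id, relative_path):
    """Return the content a file had at a given commit."""
    return git(project_dir, "show", f"{commit_id}:{relative_path}")


def demo():
    """Return (old, current) content of one file, or None if git is missing."""
    if shutil.which("git") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "clean_data.py"
        script.write_text("threshold = 10\n")
        first = snapshot(tmp, "First version of the cleaning script")
        script.write_text("threshold = 25\n")
        snapshot(tmp, "Raise the threshold")
        return read_version(tmp, first, "clean_data.py"), script.read_text().strip()


if __name__ == "__main__":
    print(demo())
```

There is only ever one `clean_data.py` on disk, yet the earlier content can still be retrieved from the first commit.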
4) Document everything

When we speak about documentation, we are referring to:

documents included for analysis
intermediate datasets
intermediate versions of your code

The most challenging decision is determining how much time to invest in documentation: too much, and it becomes a waste of time; too little, and the documentation will be incomplete and useless.
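A lightweight way to document intermediate datasets without spending much time is to append one entry per file to a manifest. The sketch below is one possible approach, not an established tool: the helpers `describe_dataset` and `append_to_manifest`, the manifest fields, and the example file names are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from datetime import date
from pathlib import Path


def describe_dataset(path, description, produced_by):
    """Build a manifest entry for one dataset file (hypothetical fields)."""
    path = Path(path)
    content = path.read_bytes()
    return {
        "file": path.name,
        # The checksum lets you confirm later that the file has not changed.
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "description": description,
        "produced_by": produced_by,
        "documented_on": date.today().isoformat(),
    }


def append_to_manifest(manifest_path, entry):
    """Add an entry to a JSON manifest, creating the file if needed."""
    manifest_path = Path(manifest_path)
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entries


# Usage with placeholder file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "customers_clean.csv"
    dataset.write_bytes(b"id,value\n1,10\n")
    entry = describe_dataset(dataset, "Rows with missing ids removed", "clean_data.py")
    append_to_manifest(Path(tmp) / "manifest.json", entry)
    print(json.dumps(entry, indent=2))
```

Each entry takes a minute to write, which keeps the cost of documentation low while still recording what a file is and which script produced it.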
5) Improve the process

The fundamental idea is to evaluate the process and improve the workflow.
When you finish a project or deliver something, it is a good idea to evaluate whether there is something to improve: a better organization for the files, or a better way to document. In the end, the idea is to understand that any process is in constant movement and you need to keep improving it.
Conclusions

Managing the organization of a data project means evaluating the objectives of the organization system, how to structure the data, the best way to establish a backup system and version control, and finally how to document all the processes.