Best Practice for Data Science Projects
Murat Yalcin, Mar 15
https://www.com/blog/analytics-and-beyond
Due to the advancements in computer processing power, and lower processor and memory costs, it is now practical to analyze large volumes of data using data science techniques.
This has led to serious demand for data scientists in industry.
To satisfy this employment demand, people from very diverse backgrounds have started to change their career paths to data science.
Unfortunately, many of the people eager to enter this field lack the proper skills and an understanding of the data science life cycle.
This shortcoming has started to cause problems in the industry.
Data science projects have several steps to follow, and for continuity these steps should flow together.
Without a systematic workflow, it is very easy to get lost in one of these steps.
In industry, when people think they have finished a project, they often struggle to bring it to full operational support because they failed to consider these life cycle steps.
This is a common but serious problem, caused by people who do not know or appreciate the meaning of the term “production ready.”
In my work, I have encountered engineers who claimed 92% accuracy when testing actually yielded 83%.
I have also seen deployed projects that omitted several critical steps, which were still sitting in the engineer’s notebook.
A month later, the product manager requested bi-weekly training of the model, which came as a shock to the engineers.
The goal of this article is to develop a better understanding of the life cycle and workflow required when conducting data science projects.
The nature of data science projects requires many tests at each step of the project.
For this reason, a very common practice for data science projects is using notebooks.
At the end of the project, it is very likely that excess code spanning multiple notebooks will not be used in production.
When engineers finish one step, they frequently take the output of the completed step, leaving everything in their notebooks unorganized, and immediately continue to the next step.
When the code runs in the final step, it is a mistake to assume that the code is production ready and the project is finished.
This misunderstanding can cause the loss of time and money.
What is production ready?
Production ready contains multiple checks:
1. Does it run?
2. Does it satisfy the project requirements?
3. Is the system stable?
4. Is it maintainable?
5. Is it scalable?
6. Is it documented?
Let’s delve into each one of them.
Does it run?
Meaning: does the code run properly?
Code should be able to run smoothly, without any intervention or modification at any step.
If you have a good reason, you can do some steps manually; however, good practice is to have a continuous flow from data acquisition to prediction, something that will most likely be required in production.
A very common but bad practice is writing the output of each step to disk and then reading it back in the next step.
You want to avoid writing to and reading from the disk at every step.
Keeping information in memory and passing it to the next step without writing it to a file will increase the performance of the software.
If the data exceeds the available memory, receive and pass the data in batches.
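To make the idea of a continuous, in-memory flow more concrete, here is a minimal sketch in Python. The step names (`clean`, `add_features`, `predict`), the record layout, and the threshold rule standing in for a model are all hypothetical; the point is only that each batch is handed from one step to the next without touching the disk.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive batches so the full data set never sits in memory at once."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def clean(batch: List[dict]) -> List[dict]:
    """Hypothetical cleaning step: drop records with a missing value."""
    return [row for row in batch if row.get("value") is not None]


def add_features(batch: List[dict]) -> List[dict]:
    """Hypothetical feature step: add a squared feature to each record."""
    return [{**row, "value_sq": row["value"] ** 2} for row in batch]


def predict(batch: List[dict]) -> List[int]:
    """Stand-in for a trained model: a simple threshold rule."""
    return [1 if row["value_sq"] > 10 else 0 for row in batch]


def run_pipeline(records: Iterable[dict], batch_size: int = 2) -> List[int]:
    """Pass each batch from step to step in memory, with no intermediate files."""
    predictions: List[int] = []
    for batch in read_batches(records, batch_size):
        predictions.extend(predict(add_features(clean(batch))))
    return predictions


# Toy input; in a real project this would come from the data acquisition step.
raw = [{"value": 1}, {"value": None}, {"value": 4}, {"value": 5}, {"value": 2}]
print(run_pipeline(raw))
```

Because `read_batches` accepts any iterable, the same pipeline could be fed by a database cursor or a file reader that yields one record at a time, so only one batch needs to be held in memory.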
Does it satisfy the project requirements?
Software should meet the requirements of the project.
When engineers finish working on the last step of the project, they frequently run to the product manager to demonstrate the metrics of the model (usually accuracy).
It is a mistake for the product manager to assume on this basis that the project requirements are met.
At this point, if the product manager does not have the technical knowledge to understand and assess the model, the engineers should encourage the manager to assign a subject matter expert to ask these questions:
1. How was the data collected/sampled? Engineers can, intentionally or unintentionally, introduce data bias. Data scientists can take the same data and present the same results as favorable or unfavorable, so the model assessment should be based on the metrics that you actually care about.
2. How is the data split into train/validation/test groups? An inappropriate split of the data may result in significant differences in production results. Applying an 80/20 split regardless of the size of the data set and not stratifying skewed data sets are very common mistakes.
3. Does the test data represent the data that the model will be used on? Engineers and product managers should always consider how fast the data flows and that the data may change over time.
4. What percentage of the data is changing in a given amount of time? A basic check is to ensure that the data received after completion of the project still represents the business needs.
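As an illustration of the question about how the data is split, here is a minimal pure-Python sketch of a stratified split. The function name `stratified_split`, the toy labels, and the 20% test fraction are assumptions made for the example. In practice, scikit-learn's `train_test_split` offers a `stratify` argument for the same purpose.

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one example of every class goes to the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Skewed toy labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

With a purely random split of such skewed data, the test set could easily end up with very few positive examples, or none at all, which makes the reported metric unreliable. Splitting within each class keeps the 90/10 ratio in both groups.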
After getting satisfactory answers to these questions, engineers and product managers can say that the software meets the requirements of the project.
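One simple way to check whether incoming data still resembles the data the model was built on is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the total variation distance between category shares. The feature values, the sample sizes, and the 0.1 threshold are assumptions chosen for illustration, not recommended settings.

```python
from collections import Counter
from typing import Hashable, Sequence


def category_shares(values: Sequence[Hashable]) -> dict:
    """Return the share of each category in a non-empty sample."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(
    reference: Sequence[Hashable], current: Sequence[Hashable]
) -> float:
    """Half the sum of absolute share differences: 0 is identical, 1 is disjoint."""
    ref_shares = category_shares(reference)
    cur_shares = category_shares(current)
    categories = set(ref_shares) | set(cur_shares)
    return 0.5 * sum(
        abs(ref_shares.get(c, 0.0) - cur_shares.get(c, 0.0)) for c in categories
    )


# Hypothetical feature values seen at training time and some time after deployment.
training_sample = ["web"] * 70 + ["mobile"] * 30
production_sample = ["web"] * 40 + ["mobile"] * 60

drift = total_variation_distance(training_sample, production_sample)
DRIFT_THRESHOLD = 0.1  # assumed tolerance; choose one that fits the business case
print(f"drift={drift:.2f}, retrain_needed={drift > DRIFT_THRESHOLD}")
```

Running a check like this on a schedule gives engineers and product managers an early signal that retraining may be needed, instead of a surprise request a month after deployment.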
Is the system stable?
A robust architecture and minimizing software defects at the beginning of project development will help the engineer develop a stable system that will not require drastic changes down the road.
Is it maintainable?
Engineers can understand every piece of the project and every line of code because they created it; however, good documentation will allow new hires to quickly come up to speed.
It should not take too much time for new people to understand existing work.
Good, maintainable software should not be too complex to understand.
Engineers should keep in mind that it is the computer that runs the code, but humans who read and support it.
Writing complex code is not a talent; make it simple, readable, and understandable.
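As a small illustration of the point about readability, compare a dense one-liner with an equivalent function that names its steps. Both functions are hypothetical examples written for this article; they return the same result, but only the second tells a new reader what it does.

```python
from typing import List


# Dense version: correct, but the intent is hard to see at a glance.
def f(x): return [(v - min(x)) / (max(x) - min(x)) for v in x] if max(x) != min(x) else [0.0] * len(x)


# Readable version: a descriptive name, a docstring, and named intermediate values.
def min_max_scale(values: List[float]) -> List[float]:
    """Scale values to the range [0, 1]; constant input maps to all zeros."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        return [0.0] * len(values)
    return [(value - lowest) / value_range for value in values]


print(min_max_scale([1, 2, 3]))
```

The readable version is a few lines longer, but it costs the computer nothing and saves the next engineer from having to decode it.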
Is it scalable?
Data scientists must ensure that the software will be able to handle increased workloads.
Scalability will decrease the total amount of time spent on the project in the future.
If the model works, will it scale with a 1,000,000x increase in data?
Is it documented?
Documentation is a critical part of the project; without it, the project should not be considered complete.
Most of the time, documentation is created at the end of the project.
This is poor practice: each step in a data science project requires detailed information to be recorded for adequate documentation, and recalling these details from memory at the end of the project is not reliable.
Start building the documentation while developing the project.
Make sure to note major points for each step, to avoid leaving out key steps.
Following these steps increases the chances of having a robust project.
After setting up a project structure, creating a Docker image and working from this image is a safe and efficient way of developing a software project.
Since a Docker image already includes an operating system and all the dependencies of the programming language, it is easy to move in different directions even during the development process.
Working this way removes the need to create a new image for each code change; one only needs to create a new image when updating the Dockerfile.
Also, the Docker image makes the product independent of the host operating system.
You can deploy this Docker image on any operating system that runs Docker.
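A minimal Dockerfile for this kind of workflow might look like the sketch below. The base image tag, the `requirements.txt` file, and the `main.py` entry point are assumptions made for illustration; adjust them to your own project structure.

```dockerfile
# Base image with an operating system layer and a Python runtime.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is rebuilt only when they change.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Hypothetical entry point; during development the code is mounted at run time.
CMD ["python", "main.py"]
```

After building the image once with `docker build -t ds-project .` (the image name is again hypothetical), you can mount the working directory into the container with `docker run --rm -v "$(pwd)":/app ds-project`. Code changes are then picked up without rebuilding, and a new image is needed only when the Dockerfile or the dependencies change. For deployment, you would typically also copy the code into the image so that it is self-contained.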
Make sure you have the following things at the end of the project:
1. Everything has been scripted and all code from the notebooks implemented.
2. A production-ready project as described in the six checks mentioned above.
3. The system demonstrates a smooth flow from beginning (first step) to end (last step).
4. A Docker image has been generated.