What do you need to know to be a successful Data Scientist?

Viacheslav Dubrov · Jan 24

Photo: “Black Framed Eyeglasses in Front of Laptop Computer” by Kevin Ku, licensed under the Pexels License

Disclaimer: This is my very subjective point of view on what you need to know to be successful as a Data Scientist.
All statements presented here are based on my personal experience with data science workflows.
The day-to-day work of a Data Scientist in industry requires more than knowing the theoretical aspects of machine learning algorithms well enough to make correct model choices and calling Scikit-Learn’s “fit” and “predict” methods to build a model.
In Applied Data Science, the main goal is delivering a service and providing results to your customers.
To make this possible, a Data Scientist has to be aware of other topics that help deliver the required results.
Below is the list of topics necessary to build successful data science services.
Image: only a small part of the technologies a Data Scientist uses.
First of all, we can divide the data science workflow into 4 main stages.
The 4 stages of the data science workflow:

- ETL pipeline
- ML pipeline
- Productization
- CI/CD

Every stage requires a specific stack of knowledge.
The ML pipeline stage is the primary one that Data Scientists have to know well, since it is their main area of work.
They must at least be aware of all the other stages and keep closing their knowledge gaps there.
ML pipeline stage:

- Statistics
- Machine Learning theory
- Programming knowledge and code writing skills

The first two components require little explanation, as they are classic DS/ML topics.
Therefore, the discussion will focus more on the third part.
Programming knowledge depends on the type of position, but in general it means knowing the basics of two programming paradigms (object-oriented and functional), data structures, and Python (you could choose the dark R path, of course), along with the most important and popular libraries for manipulating data and building models (pandas, NumPy, scikit-learn).
Knowledge of an additional programming language (Java, C++, Scala) will only help you.
By code writing skills, we mostly refer to the ability to write clean, readable, maintainable code.
This makes it much easier for the team to collaborate with a Data Scientist and maintain the code together.
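As a minimal illustration of what “clean, readable, maintainable” can mean in practice: small pure functions, type hints, docstrings, and no magic numbers. The functions and data below are invented for this sketch.

```python
# A sketch of clean, readable preprocessing code: each step is a
# small, documented, testable function.
from __future__ import annotations

from statistics import mean


def impute_missing(values: list[float | None]) -> list[float]:
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def normalize(values: list[float]) -> list[float]:
    """Scale values to the [0, 1] range (min-max normalization)."""
    lo, hi = min(values), max(values)
    if lo == hi:  # avoid division by zero on constant input
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = [10.0, None, 30.0]
cleaned = normalize(impute_missing(raw))
print(cleaned)  # [0.0, 0.5, 1.0]
```

Compare this with a single anonymous script doing everything inline: the named functions document intent and can be unit-tested in isolation, which is exactly what makes the code easy for teammates to support.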
ETL pipeline stage:

- ETL basics
- Relational databases / SQL
- NoSQL / MapReduce / Spark
- Schedulers (Airflow)

To build reliable, scalable, and maintainable systems, a DS specialist has to know which aspects deserve attention when accessing, manipulating, and loading data.
They also need to know the general principles of relational database design and SQL syntax for writing effective queries, and to understand the foundations of NoSQL databases, the MapReduce paradigm, and the Apache Spark framework.
As for schedulers, we have put them into this stage, but they also intersect with the other stages (Productization, CI/CD).
The most popular and flexible one is Airflow.
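To make the SQL point concrete, here is a toy aggregation query using Python’s built-in sqlite3 module; the table and column names are made up for illustration, but the pattern (load, then aggregate with GROUP BY) is the bread and butter of ETL work.

```python
# A toy ETL-style step: load rows into an in-memory SQLite
# database (stdlib) and aggregate them with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 10.0), ("alice", 20.0)],
)

# Revenue per customer, largest first.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer
    ORDER BY revenue DESC
    """
).fetchall()
conn.close()
print(rows)  # [('alice', 50.0), ('bob', 10.0)]
```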
Productization stage:

- Production code / Testing
- Microservice Architecture / Docker
- Hosting (REST API basics / Flask / TensorFlow Serving)

As with any software product, if the code goes into production, it should be readable and maintainable.
To provide this, a Data Scientist should know (at least in general) basic testing principles (unit tests, acceptance tests, system tests) and tools (unittest, pytest).
After finishing the project, a Data Scientist has to turn the model and code into a service.
Microservice architecture is everywhere right now, and without knowledge of Docker containers you cannot deliver applications.
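As a hedged sketch of what packaging such a service might look like, here is a minimal Dockerfile; the file names and the app:app module path are hypothetical, and the exact base image and server command depend on your stack.

```dockerfile
# Hypothetical Dockerfile for a small model service.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
# Serve the (hypothetical) Flask app object in app.py with Gunicorn.
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]
```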
The model should not be just a script; it should expose endpoints, which is why hosting is on this list.
This includes REST API basics, Flask (alternatively you can choose Django or Tornado), and a production WSGI server (Gunicorn).
In addition, there is TensorFlow Serving.
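As a dependency-free sketch of the hosting idea, the example below exposes a toy “model” behind a POST /predict endpoint using only Python’s standard library. In a real service you would use Flask (or Django/Tornado) behind Gunicorn; the predict function here is a stand-in for an actual model, and the endpoint path and payload format are invented.

```python
# Minimal REST-style prediction endpoint with the stdlib only.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a real model: just sums the feature values."""
    return {"prediction": sum(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


# Port 0 lets the OS pick any free port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Call the endpoint like a client would.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.5, 0.5]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)  # {'prediction': 4.0}
```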
CI/CD stage:

- Bash
- Makefile
- Version-control system (git)
- Deploy tools / GitLab CI
- DVC

Bash is an integral part of the entire data science workflow, but we have added it to this stage because it matters most in CI/CD.
Bash and Makefiles are a very important part of the local deployment and automation process.
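A small, hypothetical Makefile illustrates the kind of automation meant here; the target names, paths, and image name are invented, and note that Make recipe lines must be indented with tabs.

```makefile
# Hypothetical Makefile for a DS project (recipes indented with tabs).
.PHONY: install test lint build

install:
	pip install -r requirements.txt

test:
	pytest tests/

lint:
	flake8 src/

build:
	docker build -t my-model-service .
```

With this in place, `make test` or `make build` gives every team member (and the CI server) the same one-command entry points.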
Of course, you can’t participate in any project these days without VCS knowledge.
But automation tools such as GitLab CI may also come in handy during development.
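As a hedged sketch of what a .gitlab-ci.yml for such a project might contain (the stage names, job names, images, and commands are all invented for illustration):

```yaml
# Hypothetical .gitlab-ci.yml: run the tests, then build the image.
stages:
  - test
  - build

run_tests:
  stage: test
  image: python:3.11-slim
  script:
    - pip install -r requirements.txt
    - pytest tests/

build_image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker build -t my-model-service .
  only:
    - master
```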
Since a Data Scientist works with data, using a data VCS like DVC can greatly simplify the workflow.
Bonus: cross-cutting knowledge

- Cloud Services (AWS / Azure / Google Cloud)

With the rising popularity of “cloud” and “serverless” architectures, companies integrate their solutions with cloud providers.
We didn’t put Cloud Services into a specific category, because every service provider (Amazon, Microsoft, Google) has its own solution for every stage.
For example, AWS (which we’re working with) has SageMaker and a recommender tool on the ML side, and you can always build your own solution with EC2, Auto Scaling, Load Balancer, VPC, and other technologies and tools that improve reliability, scalability, and maintainability.
On the ETL side, AWS offers a lot as well: SQS (queues), EMR (Spark, Hadoop), Glue (ETL jobs), Athena, etc.
We recommend that every Data Scientist spend some time familiarising themselves with one of the cloud services.
This is by no means an exhaustive list, and it can always be adapted to your specific domain.
However, it reflects my own experience with the areas and tools that you need to be at least familiar with, if not proficient in.
Have fun! A journey of a thousand miles begins with just a single step.