Data Scientist Knowledge and Skills
A data scientist creates knowledge from data and has skills in statistics, programming, and the domain under study.
Erik Markhauser, Mar 16
A data scientist creates knowledge from data through quantitative and programming methods and knowledge of the domain under study.
Data science is the field in which data scientists work.
A data scientist should have skills and knowledge in the following areas:
Data, statistics, mathematics, or other quantitative methods.
Programming, computer science, or computer systems engineering.
The domain under investigation.
These areas reinforce one another to make a well-rounded data scientist.
Being good at statistics does not necessarily make one a good data scientist without the programming skills to run advanced machine learning and deploy production models, or the domain knowledge to interpret results.
Having knowledge and skills in these areas does not necessarily mean the data scientist is a deep expert in each of them. That unreasonable expectation is often called a unicorn data scientist.
Rather, the more reasonable expectation is that the data scientist is well-rounded enough in all of these areas to be effective in data science.
It is the combination of broad (but not expert) knowledge and skills in these areas that makes a data scientist.
Photo of Group of People in a Meeting is licensed under Free to use.
Data, Statistics, or other Quantitative Methods
At the heart of data science is the transformation of data into knowledge.
This knowledge could include a categorization or estimation of things.
Categorization, or classification, is the prediction of discrete values (i.e. integer values or categories), such as grouping emails into spam or not-spam.
Estimation, or regression, is the prediction of a continuous variable, for example the future revenue of a customer.
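The difference between the two kinds of prediction can be shown with a toy sketch. Everything here is an assumption made for illustration: the keyword list, the revenue figures, and the function names `classify_email` and `fit_line` are hypothetical, and a real spam filter or revenue model would be far more involved.

```python
# Toy illustration only: data, keyword list, and function names are made up.

SPAM_WORDS = {"free", "winner", "prize"}  # hypothetical keyword list


def classify_email(text):
    """Classification: predict a discrete label (spam or not-spam)."""
    words = set(text.lower().split())
    # Flag the email when at least two spam keywords appear.
    return "spam" if len(words & SPAM_WORDS) >= 2 else "not-spam"


def fit_line(xs, ys):
    """Regression: fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Discrete output: one of two categories.
label = classify_email("You are a winner claim your free prize")

# Continuous output: a revenue estimate for a month not yet observed.
months = [1, 2, 3, 4]
revenue = [100.0, 120.0, 140.0, 160.0]  # made-up customer revenue
intercept, slope = fit_line(months, revenue)
predicted_month_5 = intercept + slope * 5

print(label, predicted_month_5)
```

The classifier can only ever return one of a fixed set of labels, while the fitted line can return any real number.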
Data is created based on what has been observed in the world.
It is almost always a sample of reality because of the impossibility of observing all of reality.
The sample of data comes from a population of data — the fully observed universe.
To create knowledge, data scientists should understand both descriptive and inferential statistics.
Descriptive statistics characterize a sample of reality and include measures of centre (e.g. mean, median), dispersion (i.e. how spread out the observations are), and shape (e.g. skewness of the distribution).
If more than one variable is measured, they also measure dependence between variables.
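A minimal sketch of these descriptive measures, using only Python's standard `statistics` module. The `ages` and `incomes` values are made up, and the helpers `skewness` and `pearson_r` are written out by hand here as illustrations rather than taken from any library.

```python
import statistics


def skewness(values):
    """Shape: population skewness, the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def pearson_r(xs, ys):
    """Dependence: Pearson correlation between two variables."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


ages = [23, 25, 31, 35, 61]      # made-up sample
incomes = [30, 34, 45, 50, 90]   # made-up sample, in thousands

summary = {
    "mean": statistics.fmean(ages),        # centre
    "median": statistics.median(ages),     # centre, robust to the outlier
    "stdev": statistics.stdev(ages),       # dispersion (sample standard deviation)
    "skewness": skewness(ages),            # shape
    "age_income_r": pearson_r(ages, incomes),  # dependence between variables
}
print(summary)
```

In this made-up sample the single large value (61) pulls the mean above the median and gives a positive skewness, which is the kind of pattern these measures are meant to surface.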
Inferential statistics draws conclusions about the population based on the description of the sample data.
Data scientists need to understand advanced inferential techniques such as machine learning: techniques that create new knowledge from observations and are judged by measured performance on the task at hand.
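The step from sample to population can be illustrated with a confidence interval for a population mean. This is a sketch under stated assumptions: the sample values are made up, `mean_confidence_interval` is a hypothetical helper, and it uses a normal approximation for simplicity (for a sample this small, a t-distribution would usually be the more appropriate choice).

```python
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error: sample standard deviation scaled by sqrt(n).
    standard_error = statistics.stdev(sample) / n ** 0.5
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


sample = [52, 48, 55, 60, 47, 51, 58, 49, 53, 50]  # made-up observations
low, high = mean_confidence_interval(sample)
print(f"sample mean {statistics.fmean(sample):.1f}, 95% interval ({low:.1f}, {high:.1f})")
```

The sample mean is a description of the data in hand, while the interval is a statement about the unobserved population, which is the distinction between descriptive and inferential statistics.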
Data scientists may also have knowledge of other quantitative methods including forecasting.
One example is forecasting future sales in clothing stores, which depend on the season.
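One of the simplest ways to account for seasonality is a seasonal naive forecast, which predicts each future period with the value observed in the same season one cycle earlier. The sketch below assumes made-up quarterly sales figures, and `seasonal_naive_forecast` is a hypothetical helper. It serves as a baseline, not as a recommendation of how clothing sales should be forecast.

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Repeat the most recent full season to forecast `horizon` future periods."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


# Made-up quarterly sales for two years (Q1..Q4, Q1..Q4).
quarterly_sales = [120, 80, 95, 150, 130, 85, 100, 160]

# Forecast the next four quarters from the most recent year.
forecast = seasonal_naive_forecast(quarterly_sales, season_length=4, horizon=4)
print(forecast)
```

More elaborate forecasting methods are usually compared against a baseline like this one to check that their extra complexity pays off.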
Data scientists follow data analysis processes to create knowledge.
One common process is the Cross-Industry Standard Process for Data Mining (CRISP-DM), which includes the following six steps:
Business understanding: the domain knowledge described in the Domain Knowledge section below.
Data understanding: descriptive statistics and the assessment of data quality.
Data preparation: data cleaning, constructing new variables, and merging data sets.
Modelling: A model is a description of the assumed structure of the sample of data observations.
Modelling includes the selection of techniques (machine learning has many algorithms that construct models) and running them.
Evaluation: the evaluation of how well the chosen model meets business objectives.
Deployment: deploying the model so that users can use it with future data as well as developing plans for maintenance.
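The middle steps of the process above (data preparation, modelling, and evaluation) can be sketched on a toy churn data set. All of the following is assumed for illustration: the records, the one-variable threshold model, the function names, and the 0.8 accuracy target standing in for a business objective.

```python
# Made-up records: (monthly_visits, monthly_spend, churned).
# None marks a missing value.
raw = [
    (12, 240.0, False), (1, 15.0, True), (8, None, False), (2, 30.0, True),
    (10, 180.0, False), (3, 20.0, True), (9, 200.0, False), (1, 10.0, True),
]

# Data preparation: drop incomplete rows and construct a new variable
# (spend per visit). The toy model below does not use the new variable;
# it is carried along only to show the step.
prepared = [
    (visits, spend / visits, churned)
    for visits, spend, churned in raw
    if spend is not None
]

# Hold out the last rows so evaluation uses data the model has not seen.
train, test = prepared[:5], prepared[5:]


def fit_threshold(rows):
    """Modelling: place a cut-off midway between the class means of visits."""
    churn_visits = [v for v, _, churned in rows if churned]
    stay_visits = [v for v, _, churned in rows if not churned]
    churn_mean = sum(churn_visits) / len(churn_visits)
    stay_mean = sum(stay_visits) / len(stay_visits)
    return (churn_mean + stay_mean) / 2


def predict_churn(threshold, visits):
    """Predict churn for customers who visit less often than the cut-off."""
    return visits < threshold


threshold = fit_threshold(train)

# Evaluation: accuracy on the held-out rows against a hypothetical target.
correct = sum(predict_churn(threshold, v) == churned for v, _, churned in test)
accuracy = correct / len(test)
meets_objective = accuracy >= 0.8

print(threshold, accuracy, meets_objective)
```

With only two held-out rows the accuracy figure says very little; a real evaluation would need far more data and a metric chosen with the business objective in mind.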
Data scientists need a good understanding of data collection and, more generally, of data management methods.
They also need to use proper data visualizations to convey the findings from the data.
These visualizations include pie charts, bar charts, and line graphs.
Person Using Laptop Computer on Brown Wooden Table is licensed under Free to use.
Programming, Computer Science or Computer Systems Engineering
Programming is the process of building a computer program that performs a task.
Programming is typically central to fields such as computer science and computer systems engineering.
Data scientists need advanced programming skills for the manipulation of data, the calculation of complex metrics, and for advanced machine learning.
These programs need to be well structured for maintainability and performance — skills and knowledge from computer science or computer systems engineering.
Programming languages and statistical packages used in data science include Python, R, SAS, and SPSS.
Data scientists need to have some understanding of data storage techniques including databases, data warehouses, and data lakes.
Data scientists do not necessarily need to be qualified computer scientists or computer systems engineers, but they do need to be knowledgeable enough in the techniques in these fields to do data science effectively.
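As one concrete instance of the database knowledge mentioned above, the sketch below stores a few records in an in-memory SQLite database and aggregates them with SQL. The `orders` table, its columns, and the values are made up for illustration. Data warehouses and data lakes work at a very different scale, but querying them often looks similar from the analyst's side.

```python
import sqlite3

# An in-memory database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

# Made-up order records.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 14.5)],
)

# Aggregate in the database rather than pulling every row into Python.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(rows)
```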
Seated Woman Typing on Apple Mighty Keyboard in Front of Turned-on Silver Imac by negativespace.co is licensed under the CC0 License.
Domain Knowledge
Data scientists also need a good understanding of the domain's existing knowledge base in order to contribute valuable new knowledge to it.
Domain area knowledge also helps to better define the problem, determine what is already known, and accurately interpret the results.
Domain knowledge acts as a shortcut: the data scientist builds on pre-existing knowledge to create new knowledge, and narrows the scope of the study to what is not already known in the field, which avoids repeating earlier studies.
Low Angle Photography of Buildings Under Blue and White Sky by Jimmy Chan is licensed under Free to use.
The combination of skills adds value
Data scientists do not necessarily have to be experts in any one of these three areas.
However, they do need good cross-disciplinary knowledge to create valuable domain knowledge from the data.