Yes, a pragmatic one.
Instead of a precise definition of what makes a data science practitioner in a business, I believe a better way is to consider three fundamental pillars:
- Business expertise
- Technical know-how
- Personality traits

Business Expertise

Often, business understanding and experience are overlooked, simply assumed, or only briefly mentioned in advice on becoming a data scientist, yet they are a big part of what makes an effective practitioner.
Data science for business exists to solve real problems where data is integral to the discovery and/or solutions.
There are three aspects to this expertise:
- Understanding of the business strategy, economics, and models
- Business insight and intuition specific to the individual firm and its industry
- Ability to navigate the firm to source projects, communicate results, and implement recommendations

This is a tough ask of anyone, especially when working in large firms with hierarchical or specialized structures.
While the business ABCs can be quickly learned on the job and in school, the deeper intuition comes with experience.
There are many benefits to such learning, from not having the executives waste time explaining the basics to the ability to identify and organize relevant data, spot and solve problems that the business truly cares about, and translate technical language into business speak.
In addition, the business perspective helps with prioritization, where sometimes getting to 90% of the result with 10% of the effort may be sufficient before moving on to the next project.
There is more discussion about this in The Third Wave Data Scientist article.
“…I would encourage you to think of data science not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise.”
Jake VanderPlas, Python Data Science Handbook

However, without the next (technical) pillar, business will fail to appreciate what can and cannot be accomplished with data science, what value can be extracted, and how to move towards (almost) objective, anecdote-neutralizing, and data-driven reality.

Technical Know-How

This pillar is about being able to implement data-enabled solutions needed by the business, one early and prominent example being the one described in Moneyball.
Chances are, if you search online along the lines of “how to become a data scientist” or “what is a data scientist”, you will mostly get articles focused on the technical side of the role.
And there appears to be an immense spectrum to learn.
To help sort things out, I would split the technical side into two perspectives: an “underfit” one (core knowledge that will apply to most if not all data science jobs in businesses, but may not address all of a specific company’s needs, especially if the company is advanced in data science applications) and an “overfit” one (adapted to the specific firm’s needs and the data science team setup, hence hard to predict before being employed at the firm).
Here is what I believe the “underfit” perspective contains:

Statistical / machine learning: a combination of relevant statistical / math knowledge with its implementation in modern programming languages.
Rather than being learned separately, statistics and programming can be learned in tandem; to me, they are two sides of the same coin in the data science context.
Luckily, outstanding courses (Statistical Learning by Trevor Hastie and Rob Tibshirani, Stanford University is one of them) and books (An Introduction to Statistical Learning and, for those more mathematically inclined, The Elements of Statistical Learning) are available for the unbeatable price of free as long as you stick with the online and PDF versions.
I personally also follow Andrew Ng’s educational content, in part due to his ability to eloquently explain complex concepts and his increasing focus on business applications.
There is a lot of world-class content available on the topic.

Data project workflow: ethics, project design, data collection, processing, modelling, and deriving conclusions / predictions / explanations.
Usually, programming (most often Python and/or R, also SQL for relational databases) would be at the forefront of this work, but we would be remiss to ignore the evolution of “drag-and-drop”, i.e. “code-free” or nearly codeless, data science platforms, as well as automated machine learning solutions.
I would still view programming as essential, because existing data science libraries already provide a lot of convenience and sufficient abstraction without distraction (allowing one to stay close to the data and what is “under the hood” of the model, i.e. actually understand what is happening with each line of the code), but over time this might change.
Code-free does not mean that the “science” part of “data science” gets any simpler, yet I am sure the temptation will be there to forget this.
While available content choices — through MOOCs, bootcamps, books, and academia — are vast, I would highlight R for Data Science and Python Data Science Handbook as excellent, free starting points.
In applied data science (not research), both coding and data science platforms are now accessible enough to serve as ultimate equalizers between those with and without a STEM degree; however, I suspect some bias towards the STEM degree will continue, because many current data scientists have such a background (and look to hire based on what they know) and because of a dearth of non-STEM candidates.

Visualization: while typically a part of the data workflow, I would single it out, because I believe one could make an entire career of knowing how to tell a story through visuals.
Through visualization, one could, for example, make previously user-unfriendly, “dry” data appealing, difficult models easier to understand, and big data clearly summarized.
In addition, I expect that at least some of the visualization software providers will expand their presence across the entire data workflow (while other data science software companies will improve their visualization component).
Again, there are many options to learn, from Edward R. Tufte’s classics to applied books, MOOCs, as well as plentiful examples by visualization libraries and software providers.
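
As a minimal sketch of the data project workflow item above (collection, processing, modelling, and deriving predictions), here is a toy example that uses only the Python standard library. Everything in it is an assumption made for illustration: the synthetic ad_spend / sales data, the function names (collect, clean, split, fit_line, r_squared), and the choice of a simple linear regression. A real project would more likely lean on existing data science libraries, and the ethics and project design steps do not reduce to code.

```python
import random


def collect(n=200, seed=0):
    """Stand-in for data collection: synthetic (ad_spend, sales) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        spend = rng.uniform(0, 100)
        sales = 50 + 3 * spend + rng.gauss(0, 10)
        # Roughly 5% of rows lose their target value, to mimic messy data.
        rows.append((spend, None if rng.random() < 0.05 else sales))
    return rows


def clean(rows):
    """Processing: drop rows with a missing target."""
    return [(x, y) for x, y in rows if y is not None]


def split(rows, test_fraction=0.25, seed=0):
    """Hold out part of the data so the model is judged on unseen rows."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def fit_line(rows):
    """Modelling: ordinary least squares for y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(rows, intercept, slope):
    """Evaluation: share of the variance in y explained on the given rows."""
    mean_y = sum(y for _, y in rows) / len(rows)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    ss_tot = sum((y - mean_y) ** 2 for _, y in rows)
    return 1 - ss_res / ss_tot


def run_workflow():
    """Chain the steps: collect, clean, split, fit, evaluate on held-out rows."""
    train, test = split(clean(collect()))
    intercept, slope = fit_line(train)
    return intercept, slope, r_squared(test, intercept, slope)


if __name__ == "__main__":
    intercept, slope, r2 = run_workflow()
    print(f"sales ~ {intercept:.1f} + {slope:.2f} * ad_spend (held-out R^2 = {r2:.2f})")
```

The held-out split is the step I would not skip even in a toy like this: judging the model only on rows it has not seen is a small part of what keeps the “science” in the workflow.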
One might say that despite being “underfit” this still seems like a lot.
I agree — data science is an interdisciplinary and rather complex field — but I think this knowledge is very achievable.
The training resources are plentiful and often affordable, especially at the beginner to intermediate levels — one only needs time and desire to study and practice.
In addition, peer support (e.g. Stack Overflow and Stack Exchange) and specialized websites (links to a few of them are in this article) make it possible to get answers to a variety of questions that may have been too specific to be covered in whatever training one may have undertaken.
I believe that despite being “underfit”, the technical know-how coupled with the business expertise is an outstanding background for any data scientist.
OK, so what goes under the “overfit” perspective? The answer is everything else (probability weighted) that could be linked to the broadest data science understanding, and that’s where I think a lot of variability in data scientist definitions sets in.
For example, one may or may not need to:
- Know (or intuit) more math (esp. calculus, linear algebra, probability, statistics), but unlikely to need advanced math, because much of the hard work is done for us in respective data science libraries
- Deploy models in production mode, or ask IT / programmers to help
- Use cloud technologies
- Expand and apply knowledge of Artificial Intelligence
- Work with big data / Apache Hadoop / Apache Spark
- Use NoSQL databases / Apache Cassandra
- Manage teams
- Work with non-tabular data, including images, spatial, speech, text, and web
- Worry about latency, scalability, security, storage, and real-time data (e.g. Apache Kafka)
- Utilize programming languages other than Python, R, and SQL, and various operating systems
- Specialize and dig deep in a narrow subject
- Maintain knowledge map: be aware of the broad scope of data science related areas and stay up to date on the latest developments, whether e.g. there is a new algorithm, cloud technology solution, or a deep learning research breakthrough.

Over time, some of the above may no longer be a “maybe”.
Needless to say, learning and continuously practicing all of them (and more) in addition to the business and technical fundamentals is hardly realistic for a single person despite being aspirational.
Having personal interest in some of the areas or anticipating demand from one’s business could be a good way to start chipping away at this broad spectrum of knowledge.
In some ways, working in data science is akin to being in graduate school: you know there is always something more to learn, you feel like you are always behind, but time is limited, so you have to pick your battles and just deal with it.
However, despite the time and effort required to advance this skill set, it is important to not lose sight of the first — Business — pillar.
Otherwise, there is a risk of the team being perceived as a niche data-cruncher that always needs to be told what to do and what business problems to solve.

Select Personality Traits

You can be a successful business person and have a solid technical background, but there are some personality traits that glue it all together and are likely to further amplify your success as a data scientist:

If you were in The Matrix, you would take the red pill every time to pursue knowledge instead of the propagation of “folk wisdom”, convenient truths, assumptions, biases, or The Matrix itself.
Martial arts are optional.
If you were a detective, you would be a real-life Hercule Poirot, applying iron-clad methodology and logic to solve complex data mysteries and being able to walk the listener through the entire process in a clear manner despite the underlying sophistication.
You make the complex sound as simple as possible, but not any simpler.
If you were a Reinforcement Learning Agent, you would balance exploitation of the knowledge you already have with exploration in pursuit of the new.
You take on novel challenges, stay creative and intellectually curious.
If you were a runner, you would usually run long distances: grit is key in many data science projects.
In conclusion, while it is hard to singularly define a data scientist role or path for the reasons mentioned above, the focus on the business and “underfit” technical background could be a great foundation for any career in the field — recognizing that whether it is viewed as sufficient by different firms will vary.
Last but not least, considering that data science is evolving so rapidly and undergoing something of a paradigm shift, I would highly recommend reading Thomas Kuhn’s The Structure of Scientific Revolutions.
This is a book that never gets old.