Machine Learning Engineers, Data Scientists and their respective roles.
Karl Schriek, Mar 5
Over the past decade, terms such as “Data Science”, “Big Data”, “Data Lake”, “Machine Learning”, “AI” and so forth have risen to the forefront (and sometimes fallen back again) of the everyday vocabulary used in a wide variety of industries.
However, regardless of their wide use (or perhaps exactly because of it!) there appears to be little consensus on what many of these terms mean.
I do not wish to engage in an extended argument on consistent nomenclature, but there are two frequently used terms that are of particular interest to me: “Data Scientist” and “Machine Learning Engineer”.
In the broadest possible sense, both of these terms could be understood as referring to “technically skilled people who build machine learning solutions”.
“Data Scientist” is a term that over the years has become associated with a sort of generalist mathematician or statistician who can also code a bit and knows how to interpret and visualise data.
More recently, the term “Machine Learning Engineer” has become associated with software developers who have picked up some mathematics along the way.
While there is certainly some truth in these interpretations, I find neither of them particularly useful.
Therefore, at the risk of adding to the confusion, I wish to illustrate my interpretation of what these roles entail by way of a little parable of two intrepid adventurers (one a scientist, one an engineer) who head out into an unknown and possibly endless desert in search of oil…
According to their tasks, the engineer and the scientist are differently equipped.
The scientist travels lightly.
He has a rucksack, a compass, a spade and some simple but precise measuring equipment.
He makes a few forays into the desert, never very deeply, but deep enough to assess which direction looks the most promising.
Within a few days of performing measurements and digging some holes with his spade, he has an idea of where there might be oil.
The engineer on the other hand brings heavy and sophisticated machinery with him.
For now, he is not concerned with where oil might be found.
Yet once it has been found, he will have to transport it, so he spends the first few days designing a blueprint for a pipeline.
Collaboration allows the two explorers to combine their strengths and become more effective.
After a few days the explorers confer and the scientist declares that he’ll make deeper forays for the next few days, as he is now much more certain about where to find the oil.
As the scientist sets off, the engineer starts up his machinery and — following the scientist’s footsteps — proceeds to build the first section of pipeline.
Eventually he catches up with the scientist, who has discovered a small well!
Together they set up a drill at the location and connect it with the pipeline.
This pipeline makes the work of the scientist much more efficient, and although they haven’t found a truly major well yet, this smaller well can already deliver some profits, and the engineer can use it to test his pipeline.
After some months of smooth cooperation, both explorers are ready for the first big breakthrough.
In preparation for the next phase of their adventure, the engineer constructs a lightweight oil drill and instructs the scientist on how to use it.
Armed with this specialised drill, the scientist is now able to open up wells without the engineer being present.
For a while, the scientist continues working on finding new wells while the engineer continues connecting them to the pipeline.
Next, the engineer also designs lightweight pipes that the scientist can connect to the pipeline without the engineer’s help.
This allows the scientist to greatly speed up his explorations, and simultaneously frees up the engineer to redesign big parts of the pipeline, thereby making them more stable and efficient.
Similarly, the scientist devises a new standardised set of measurements for the engineer to build into the measurement protocols of the main hub, so that the scientist no longer needs to travel to the drilling sites to monitor their performance.
After months of hard work, they reach the big oil well they were looking for.
Together they attach the well to the pipeline and install sophisticated measuring equipment.
By now the routine is well settled.
Measure, drill, attach, pump, measure.
In no time the big oil well is gushing.
After a while their paths separate — the engineer stays on-site, while the scientist contributes to the project off-site.
Soon after, the scientist packs up and heads home.
He is no longer needed on-site, but he’ll still be available to analyse the measurements produced by the well.
The engineer stays a little longer.
He is not yet happy that the entire system is working perfectly.
A crew of operators arrives, and together they straighten out all the remaining kinks.
Then the engineer hands the system over to the operators and heads home, promising to return if problems arise.
Reunion of the scientist and the engineer.
Our two explorers meet up again soon.
They reflect on their little desert adventure and start making plans for their next escapade together: apparently, there might be oil in the Arctic, and they are ready to find out!
Data Scientists and Machine Learning Engineers have different roles.
Perhaps you are able to recognise from the story above some of the elements that go into a successful machine learning project.
I chose this little parable of two explorers because I believe that machine learning — even though it has seen huge advances over the last five to ten years — remains a field that largely requires forays into uncharted territory.
I believe that a spirit of exploration (quick prototyping) is as relevant as it was five years ago, but it is also becoming more and more important to start laying the groundwork for productive systems (quick scaling) right at the start of the project.
For my two explorers I chose a (data) scientist and a (machine learning) engineer to illustrate the importance of their collaboration in delivering value quickly and effectively.
There are doubtless many different definitions for these two roles, but for me the key defining characteristic is that the data scientist is someone who asks, “what is the best algorithm to solve this problem?” and tries to answer that question by quickly testing various hypotheses (looking for oil wells).
The machine learning engineer on the other hand asks, “what is the best system to help us solve the problem?” and tries to answer that question by building an automated process (constructing an oil pipeline) that can be used to accelerate the testing of hypotheses.
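To make the idea of an automated process that accelerates hypothesis testing a little more concrete, here is a minimal sketch in plain Python. It is an illustration only, not a description of any particular tooling: the names `evaluate` and `run_experiments`, the toy data and the three candidate "models" are all assumptions made up for this example. The point is simply that every hypothesis gets scored in the same way, so a new idea can be tried by adding one entry to a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# In this sketch a "hypothesis" is simply a candidate prediction function.
Model = Callable[[float], float]


@dataclass
class Result:
    name: str
    mean_abs_error: float


def evaluate(name: str, model: Model, data: Sequence[Tuple[float, float]]) -> Result:
    """Score one candidate on (input, target) pairs using mean absolute error."""
    errors = [abs(model(x) - y) for x, y in data]
    return Result(name, sum(errors) / len(errors))


def run_experiments(
    candidates: Dict[str, Model], data: Sequence[Tuple[float, float]]
) -> List[Result]:
    """Evaluate every candidate in the same way and rank them, best first."""
    results = [evaluate(name, model, data) for name, model in candidates.items()]
    return sorted(results, key=lambda r: r.mean_abs_error)


# Toy data following y = 2x + 1 (made up for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# Hypothetical candidates the data scientist wants to compare.
candidates = {
    "constant": lambda x: 4.0,
    "identity": lambda x: x,
    "linear": lambda x: 2 * x + 1,
}

ranking = run_experiments(candidates, data)
for r in ranking:
    print(f"{r.name}: {r.mean_abs_error:.2f}")
```

In a real project the engineer's "pipeline" would of course also cover data access, training and deployment; this sketch only shows the shared evaluation step that lets the scientist compare ideas quickly.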
Although the Data Scientist and the Machine Learning Engineer have different roles, their collaboration is crucial to machine learning projects.
Whether or not those two definitions are 100% accurate, what is important is the collaboration that ensues when two talented professionals with those types of mindsets get together.
In my example the data scientist looks for new model architectures to try, new ways of measuring performance, new data sources to include and so on, while the machine learning engineer looks for ways to integrate the data scientist’s work into a scalable system.
As the system scales the data scientist becomes more efficient because he has better tools to work with.
The machine learning engineer becomes more effective because the tools he builds are used to deliver more and more valuable results.
Machine Learning Engineers write software while Data Scientists write scripts.
Differences between both professions can also be seen in the code they write.
Of course, both actors have to be good programmers.
It is not only the machine learning engineers that have to be able to write high quality code.
Sure, a data scientist does not have to be able to build a complex software system, but he or she must be able to write clearly structured, well-documented and easily maintained code!
That being said, for the data scientist the focus will be on seeing quick results rather than on building sustainable software.
As such, many data scientists will (since the whole thing typically plays out in the Python world) make extensive use of tools such as Jupyter Notebooks, which actively support an exploratory way of working; the end product of such work is typically a set of prototype scripts.
The machine learning engineer however writes software.
I cannot stress this enough.
(If you are excellent at machine learning and are building clearly structured, well-documented and easily maintained scripts, then you are the best data scientist we at Alexander Thamm could ask for on a project — not a machine learning engineer!).
The machine learning engineer focusses on building software that can be quickly scaled.
A lot of this will also play out in the Python world, meaning that IDEs such as PyCharm will typically be the tool of choice.
(It does however go over and above Python since scaling also means knowing how to use tools that orchestrate resources for model training and that serve trained models to end-user systems, but that is a discussion for another time).
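The difference between a script and software can be sketched with a deliberately tiny example. The data, the outlier rule and the function name `find_outliers` are all hypothetical and chosen only for illustration. The first part is the kind of code that might live in a notebook cell; the second part wraps the same logic in a documented, validated function that other code can import and test.

```python
from statistics import mean
from typing import List, Sequence

# Script style (typical of a notebook cell): quick, linear, tied to one dataset.
readings = [3.0, 4.0, 5.0, 100.0]
threshold = 2 * mean(readings)
outliers_script = [r for r in readings if r > threshold]


# Software style: the same logic as a reusable, validated, testable function.
def find_outliers(values: Sequence[float], factor: float = 2.0) -> List[float]:
    """Return the values greater than `factor` times the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    if factor <= 0:
        raise ValueError("factor must be positive")
    limit = factor * mean(values)
    return [v for v in values if v > limit]


print(outliers_script)
print(find_outliers(readings))
```

Both versions give the same answer on this data. What changes is that the function states its contract, rejects bad input and can be reused on the next dataset without copying and editing a cell.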
Different phases of the project require different involvement of Machine Learning Engineers and Data Scientists.
The last thing to talk about is the phases of the project and who is involved when.
If we consider a project progression that goes through the phases “hypothesis”, “proof of concept”, “prototype” and “production”, then it is tempting to say that the first two are the domain of the data scientists while the last two are the domain of the machine learning engineers.
I would only partly agree.
The scientist and the engineer standing on the edge of the desert, peering out into the distance and wondering if there might be oil out there: that is the “hypothesis” phase.
The scientist running around in the desert digging lots of small holes with a spade to see the first signs of oil while the engineer draws the blueprints for his pipeline is what I would call a “proof of concept”.
A fully-fledged pipeline being built, and the first small well being tapped is what I would define as a “prototype”.
And the large-scale fully-refined and highly efficient system maintained by a team of operators I would regard as “production”.
The Machine Learning Engineer and the Data Scientist are of vital importance throughout the project.
So perhaps the right way to look at it is to recognise that each of the phases will require input from both roles, but that what that input looks like will change.
In the beginning in particular, the machine learning engineer will be very dependent on the exploration done by the data scientist and later on the data scientist will be very dependent on the tools that the machine learning engineer builds.
But each will stay relevant — vital in fact — throughout.
And at the very end, both will move on to the next thing, which is likely to be yet another journey of exploration. To the Arctic this time?