To answer that, we need to start talking about Machine Learning, which depends heavily on data and is one of the principal components of AI.
To understand what machine learning exactly is, let’s first explain what an algorithm is.
Usually, we define an algorithm the same way we would define a cooking recipe: a list of steps to follow which will lead you to a defined goal.
For example, if you’re having heart problems, you might go to a doctor and get tested for a few things.
When getting the results of your tests, the doctor will follow an algorithm to decide whether you have a heart disease or not.
Meaning that if your heart rate is above a certain threshold, you also have a history of heart problems in your family, and you experience a few issues when exercising on top, then the doctor might decide either that you should get tested further to verify whether your heart is at risk, or that you are fine and healthy.
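To make the idea concrete, here is a minimal sketch of such a fixed "recipe" in Python. Every threshold and rule below is hypothetical, purely to illustrate what a hand-written algorithm looks like:

```python
# A hand-written diagnostic "recipe": every threshold and rule here is
# hypothetical, purely to illustrate a fixed list of steps.
def doctor_rule(heart_rate, family_history, exercise_issues):
    """Follow a fixed list of steps and return a decision."""
    if heart_rate > 100 and family_history and exercise_issues:
        return "test further"
    return "healthy"

print(doctor_rule(heart_rate=120, family_history=True, exercise_issues=True))
# → test further
print(doctor_rule(heart_rate=70, family_history=False, exercise_issues=False))
# → healthy
```

The key point is that a human wrote every rule and threshold by hand; nothing here was learned from data.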
Well, machine learning algorithms (or models) are a different type of algorithm.
The idea is mainly to use the result of the recipe, the dish, to work our way back and actually learn the whole recipe automatically.
In our case, we want to learn the best algorithm automatically from a lot of patients’ data.
So, for a specific patient, we would know their heart rate, their blood pressure, and a few other things; but also if they had a heart problem or not.
Using this information, we want to train a machine to learn how to recognize heart problems.
In a sense, we want to copy how humans learn, which is part of why machine learning models perform well.
The algorithm will usually learn what the important factors are (heart rate, etc.) directly from the data, which is where the term “machine learning” comes from.
There is a multitude of algorithms falling into this category, but the process is roughly the same each time:

1. Pick an algorithm to use.
2. Train it using some data, rewarding it when it gets the answer right and punishing it when it gets it wrong.
3. Test it on new data by measuring how often it gets the answer right or wrong.

If this all sounds a bit vague, no worries, we have an example ready for you down the road!

But what do they do?

The job of a Data Scientist is hard to pinpoint, as it is rather new.
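Those three steps can be sketched on toy data. Here the "model" is just a single heart-rate threshold and all numbers are made up for illustration; real algorithms adjust many parameters, but the loop is the same:

```python
# Toy training data: (heart_rate, sick?) pairs, all made up for illustration.
data = [(120, 1), (95, 1), (130, 1), (70, 0), (65, 0), (80, 0)]

# 1. Pick an algorithm: predict "sick" (1) whenever heart_rate > threshold.
threshold = 0.0

# 2. Train: nudge the single parameter after each wrong answer.
for _ in range(100):
    for rate, sick in data:
        guess = 1 if rate > threshold else 0
        if guess != sick:  # "punished": move the threshold a little
            threshold += 1.0 if guess == 1 else -1.0

# 3. Test: measure how often we get the answer right on new data.
new_data = [(110, 1), (75, 0)]
correct = sum((1 if r > threshold else 0) == s for r, s in new_data)
print(f"accuracy on new data: {correct / len(new_data):.0%}")
# → accuracy on new data: 100%
```

On this tiny separable example the threshold settles between the healthy and sick heart rates; real data is never this clean.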
It revolves around data, statistics, programming and communication.
Usually, they are given a specific task such as: “find the best way to diagnose a heart disease given the records of these patients“.
But sometimes, they could also be given some data with no particular instruction, for example: “leverage the data in order to create value/make money”; yes, that’s it.
The job description is so broad that it also encompasses some tasks such as: “build a conversational AI capable of scheduling a full trip for our clients, including flight or train tickets, and hotel reservations”.
However, today we’ll focus on one specific task: what if we had to diagnose a heart disease in a new patient, given a lot of data about past patients? How would a data scientist go about it?

Well, for starters, one would try to get a feel for the data, as the algorithm will only be as good as the data provided.
By that, we mean that if you have a dataset composed mostly of healthy patients, you are usually going to have a hard time building an algorithm capable of finding the sick ones.
Once that is done and the data is clean enough to be exploited, the best part of the job comes in: building a model.
The data scientist will train a model to recognize a sick patient from a healthy one, which is loads of fun.
We will go further in depth on the subject during the next section.
Last but not least, when a good model has been built, another crucial part is to explain the results or predictions to the person in charge, be it a doctor or someone from management: usually, not a statistics-savvy person.
Communication is a huge part of the work, since you have to argue for a solution which could impact the lives of millions.
Enough chit-chat about definitions.
Let us show you what data scientists do.
Detecting heart diseases

A multitude of datasets are available online, published by universities and other organizations.
A dataset consists of two components, the first one being the samples and the second one being the features.
It’s not as complicated as it sounds, for example in our case, the samples are the patients, and the features are the pieces of information provided about each of them.
You can find the data here, provided by UCI.
All of the subsequent work is supported by source code which you can find here.
Alright, let’s dig into it.
In this particular case we are presented with patient data comprising a lot of information collected by doctors and researchers.
We will follow roughly what a Data Scientist does so that you can understand how they draw conclusions and predict things using data.
So the first thing we’re going to do is to understand what the data is all about.
Here is an excerpt of what we have:

+-------+------+------------+--------+---------+---------+
|  age  | sex  | chest_pain |  chol  | restecg | num_bin |
+-------+------+------------+--------+---------+---------+
| 63.0  | 1.0  | 1.0        | 233.0  | 150.0   | 0       |
| 67.0  | 1.0  | 4.0        | 286.0  | 108.0   | 1       |
| 67.0  | 1.0  | 4.0        | 229.0  | 129.0   | 1       |
+-------+------+------------+--------+---------+---------+

So, for each patient, we know a lot of things: age, sex (1 for man, 0 for woman), the amount of chest pain they experience (from 1 to 4), level of cholesterol, their resting electrocardiography results, and a few other metrics: 13 of them in total.
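In code, such a dataset is simply a table: one row per sample (patient) and one column per feature. A minimal sketch reproducing the excerpt above (the column names and values come from the table; everything else is plain Python):

```python
# Each sample (patient) is a row; each feature is a column.
columns = ["age", "sex", "chest_pain", "chol", "restecg", "num_bin"]
rows = [
    [63.0, 1.0, 1.0, 233.0, 150.0, 0],
    [67.0, 1.0, 4.0, 286.0, 108.0, 1],
    [67.0, 1.0, 4.0, 229.0, 129.0, 1],
]
patients = [dict(zip(columns, row)) for row in rows]

print(patients[0]["age"])                 # feature "age" of the first sample
# → 63.0
print([p["num_bin"] for p in patients])   # the label we want to predict
# → [0, 1, 1]
```

In practice a library such as pandas would load the whole file at once, but the structure is the same: samples down, features across.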
The last column, dubbed “num_bin”, represents whether they had a heart problem or not: 1 meaning yes, 0 meaning no.
All of this information is usually queried by a doctor, and they use it to decide whether a patient is at risk or not.
The job of the data scientist here could be to build a mathematical model capable of predicting these heart problems.
So, unlike a doctor, who knows a few giveaways and general health rules, the model will try to learn them on its own by processing the data (well, it’s a bit more complicated than that, but that’s the gist of it).
Do not forget that we’re talking about maths here and not magic.
The idea is quite simple, actually: first, we train the model to recognize the disease from the data; then, we test it on new, never-seen data and evaluate its performance.
What do we mean by training?

Training is a specific process which consists in improving the model’s performance.
We start by showing the algorithm a patient picked at random, without telling it whether the patient has a heart disease, and we ask it to predict whether that is the case.
After each guess, we mathematically account for the error or the correct prediction by adjusting a few parameters, which helps the algorithm become better and better with each example seen.
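This example-by-example adjustment can be sketched with a perceptron-style model holding one weight per feature. The data, learning rate, and feature values below are all made up for illustration (and, unlike the text, we cycle through the patients in order rather than at random, for reproducibility):

```python
# Made-up patients: (feature values, sick?). Features are pre-scaled to [0, 1].
patients = [([0.9, 0.8], 1), ([0.2, 0.1], 0), ([0.8, 0.7], 1), ([0.1, 0.3], 0)]
weights, bias = [0.0, 0.0], 0.0

def predict(x):
    # Guess "sick" when the weighted sum of features crosses zero.
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0

for _ in range(50):                        # show the patients many times over
    for x, sick in patients:
        error = sick - predict(x)          # 0 if right, +1 or -1 if wrong
        # Adjust the parameters a little after each guess.
        weights = [w + 0.1 * error * v for w, v in zip(weights, x)]
        bias += 0.1 * error

print([predict(x) for x, _ in patients])
# → [1, 0, 1, 0]
```

After enough passes the weights separate the two groups; this tiny update rule is the ancestor of how modern neural networks learn.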
What do we mean by testing?

As you would expect, testing is simply the act of measuring the model’s performance.
We do it all the time for humans using quizzes or tests: it is simply a way to measure someone’s aptitude on a subject.
By design, testing a model is quite close to what we do with humans.
In our specific case, we want to see how often the algorithm gets the answer right or wrong.
More importantly, we want to make sure that the model will not miss sick patients, as it’s quite obvious that missing sick patients is worse than announcing to a healthy one that they are sick.
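To sketch why overall accuracy alone is not enough, here is how both numbers can be computed on a hypothetical set of test predictions (all values below are made up):

```python
# Hypothetical test results: 1 = sick, 0 = healthy.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Accuracy: how often the model is right overall.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall: what fraction of the truly sick patients the model found.
sick_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = sick_found / sum(actual)

print(f"accuracy: {accuracy:.0%}, sick patients found: {recall:.0%}")
# → accuracy: 80%, sick patients found: 50%
```

A model can look good at 80% accuracy while still missing half of the sick patients, which is exactly the trap described above.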
Predictions

Using the aforementioned dataset, we trained two models, each stemming from a different algorithm, and each with its own strengths and weaknesses.
The first one is Random Forests, a highly interpretable model, which means that when we make predictions using it, we can understand how the model arrived at its decision.
This could actually be a huge help to doctors, since they can draw information from the model, such as which set of features mattered most in reaching its decisions.
The Random Forests model is based on decision trees and makes successive decisions based on the features until it can draw a conclusion on whether or not the patient is sick.
Once the training is done, you can retrieve one of the decision trees and see for yourself how the model decided to classify the samples.
We trained the model ourselves, and here is one of the resulting trees:

[Figure: One decision tree from the Random Forest]

As we can see, at each step, the tree makes decisions based on a few factors, and at the end, on the leaves of the tree down below, we will see the prediction.
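As an illustration, here is a minimal sketch of how such a model can be trained and one of its trees inspected, using scikit-learn (our choice of library; the article does not say which tools were used). The first three rows come from the excerpt above, the rest are made up, and only five of the thirteen features are kept for brevity:

```python
# Sketch only: train a Random Forest and print one of its decision trees.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# First three rows from the excerpt; the last three rows are made up.
X = [[63, 1, 1, 233, 150], [67, 1, 4, 286, 108], [67, 1, 4, 229, 129],
     [41, 0, 2, 204, 172], [56, 1, 2, 236, 178], [62, 0, 4, 268, 160]]
y = [0, 1, 1, 0, 0, 1]   # num_bin: 1 = heart problem, 0 = none

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

# Retrieve one tree of the forest and print its learned if/else rules.
print(export_text(model.estimators_[0],
                  feature_names=["age", "sex", "chest_pain", "chol", "restecg"]))
```

The printed rules are what makes the model interpretable: a doctor can read them directly, one threshold at a time.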
Imagine how awesome it could be for a doctor to include this kind of model in their diagnostics.
When we tested it, the Random Forest was right at a rate of about 80%, which means that it found out whether the patient was healthy or sick without making a mistake on 8 out of 10 patients.
But what is more important is the percentage of all sick patients that we found, which is 61.5%: in other terms, we miss a sick patient about 4 times out of 10.
So, as we can see, it is highly interpretable, but its performance is quite lacking when it comes to predictions; in fact, we are missing more than 38% of all sick patients!

Alright, let’s check out the second model, which is a Multi-Layer Perceptron, one of the simplest Neural Networks.
Neural Networks are mathematical models loosely based on the human brain, more particularly its synapses.
They were in fact designed to make decisions in a way similar to how humans would.
Compared to the previous model, this one will not be as interpretable, but should make better predictions in theory.
In fact, when we tested it, the model was right 86.7% of the time, which is higher than before; but the real change is on the sick patients it found: regarding them, it was right 84.6% of the time, way higher than before.
We “only” miss a bit more than 15% of them now.
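A sketch of the second model, again using scikit-learn as our (assumed) library, on the same partly made-up sample as before. One detail worth showing: neural networks are sensitive to feature scale, so the features are standardized first:

```python
# Sketch only: train a small Multi-Layer Perceptron on the toy sample.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# First three rows from the excerpt; the last three rows are made up.
X = [[63, 1, 1, 233, 150], [67, 1, 4, 286, 108], [67, 1, 4, 229, 129],
     [41, 0, 2, 204, 172], [56, 1, 2, 236, 178], [62, 0, 4, 268, 160]]
y = [0, 1, 1, 0, 0, 1]

# Standardize the features: networks train poorly on raw, unscaled columns.
X_scaled = StandardScaler().fit_transform(X)

model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X_scaled, y)
print(model.predict(X_scaled))
```

Unlike the forest, there is no tree of rules to print here: the knowledge is spread across the network's weights, which is exactly the interpretability tradeoff discussed next.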
As we can see, there is a clear tradeoff between the two models, one is more accurate, but the other is interpretable, and depending on the application we might want to use one over the other.
For example, if we want to help doctors understand and trust the diagnosis, the interpretable model will be better.
However, if we simply want the most accurate predictions, we will pick the other one.
Results communication

Now that we have results to show, we need to communicate them clearly to the person in charge, whether that person is a doctor, a manager, a scientist, or in any other kind of role.
So the Data Scientist will need to tailor their explanations to match the technical depth of their audience.
But sometimes machine learning does not work for the kind of data provided, and one will need to explain why it didn’t work, and how to fix it.
This is why they usually tend to craft two different explanations: a high-level one and a low-level one.
The first is targeted at business-related people, and the other one at a more technical audience.
Voilà!

In any case, Data Scientists have quite a cryptic job, so, most probably, this quick description might not stay accurate for long before the role shifts again.
AI is not black magic, and does not solve all problems, but it could help change our lives.
This article was co-written by Pierre Fouché, Matthias Leroy, and Romain Choukroun.
Again, you can find the code and science behind the results of the analysis here.
Help us make this article better by correcting our mistakes on the science or the writing below! Do not hesitate to contact any of us with questions, as we will be happy to answer: pierre.