Or better, what assumptions are you trying to prove wrong? You could spend all day debating these.
But best to start with something simple, prove it wrong and add complexity as required.
Making your first Kaggle submission
You’ve been learning data science and machine learning online.
You’ve heard of Kaggle.
You’ve read the articles saying how valuable it is to practice your skills on their problems.
Despite all the good things you’ve heard about Kaggle, you haven’t made a submission yet.
This was me.
Until I put my newly acquired EDA skills to work.
You decide it’s time to enter a competition of your own.
You’re on the Kaggle website.
You go to the ‘Start Here’ section.
There’s a dataset containing information about passengers on the Titanic.
You download it and load up a Jupyter Notebook.
What do you do?
What question are you trying to solve?
‘Can I predict survival rates of passengers on the Titanic, based on data from other passengers?’
This seems like a good guiding light.
An EDA checklist
Every morning, I consult with my personal assistant on what I have to do for the day.
My personal assistant doesn’t talk much.
Because my personal assistant is a notepad.
I write down a checklist.
If a checklist is good enough for pilots to use every flight, it’s good enough for data scientists to use with every dataset.
My morning lists are non-exhaustive, other things come up during the day which have to be done.
But having it creates a little order in the chaos.
It’s the same with the EDA checklist below.
An EDA checklist
1. What question(s) are you trying to solve (or prove wrong)?
2. What kind of data do you have?
3. What’s missing from the data?
4. Where are the outliers?
5. How can you add, change or remove features to get more out of your data?
We’ll go through each of these.
Response opportunity: What would you add to the list?
What question(s) are you trying to solve?
I put an (s) in the subtitle.
Start with one.
Don’t worry, more will come along as you go.
For our Titanic dataset example it’s:
Can we predict survivors on the Titanic based on data from other passengers?
Too many questions will clutter your thought space.
Humans aren’t good at computing multiple things at once.
We’ll leave that to the machines.
Sometimes a model isn’t required to make a prediction.
Before we go further, if you’re reading this on a computer, I encourage you to open this Jupyter Notebook and try to connect the dots with topics in this post.
If you’re reading on a phone, don’t fear, the notebook isn’t going away.
I’ve written this article so that you shouldn’t need the notebook, but if you’re like me, you learn best seeing things in practice.
What kind of data do you have and how to treat different types?
You’ve imported the Titanic training dataset.
Let’s check it out.
.head() shows the top five rows of a dataframe.
The rows you’re seeing are from the Kaggle Titanic Training Dataset.
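If you don’t have the notebook open, here’s a minimal sketch of that step. The rows below are made-up stand-ins that only copy the column layout of the Titanic training data, and the variable name training and the file name train.csv are assumptions on my part.

```python
import pandas as pd

# Hypothetical stand-in rows that mimic the Titanic column layout.
# With the real data you would instead run something like:
#   training = pd.read_csv("train.csv")  # assumed file name
training = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Anna",
             "Poe, Mr. Alan", "Lee, Mrs. Ruth", "Kay, Mr. Sam"],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "SibSp": [1, 1, 0, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 2, 0],
    "Ticket": ["A/1 0001", "PC 0002", "0003", "0004", "0005", "S/2 0006"],
    "Fare": [7.25, 71.28, 7.92, 13.00, 53.10, 8.05],
    "Cabin": [None, "C10", None, None, "C20", None],
    "Embarked": ["S", "C", "S", "Q", "S", "S"],
})

# head() returns the first five rows by default
first_rows = training.head()
print(first_rows)
```

Passing a number, such as training.head(15), returns that many rows instead.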
Column by column, there’s: numbers, numbers, numbers, words, words, numbers, numbers, numbers, letters and numbers, numbers, letters and numbers and NaNs, letters.
Similar to Johnny’s toes.
Let’s separate the features out into three buckets: numerical, categorical and not sure.
Columns of different information are often referred to as features.
When you hear a data scientist talk about different features, they’re probably talking about different columns.
In the numerical bucket we have PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare.
The categorical bucket contains Sex and Embarked.
And in not sure we have Name, Ticket and Cabin.
Now we’ve broken the columns down into separate buckets, let’s examine each one.
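A rough first cut of these buckets can also be made in code. This is only a sketch on a few made-up rows: pandas can split number columns from text columns by dtype, but deciding which text columns are categorical and which are ‘not sure’ is still a judgment call you make by looking at them.

```python
import pandas as pd

# Hypothetical rows using a handful of the Titanic column names
training = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Anna"],
})

# Columns stored as numbers (integers or floats)
numeric_cols = training.select_dtypes(include="number").columns.tolist()

# Everything else (text here); these still need sorting by hand
other_cols = training.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```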
The Numerical Bucket
Remember our question?
‘Can we predict survivors on the Titanic based on data from other passengers?’
From this, can you figure out which column we’re trying to predict?
We’re trying to predict the Survived column using data from the other columns.
And because it’s the column we’re trying to predict, we’ll take it out of the numerical bucket and leave it for the time being.
What’s left?
PassengerId, Pclass, Age, SibSp, Parch and Fare.
Think for a second.
If you were trying to predict whether someone survived on the Titanic, do you think their unique PassengerId would really help with your cause?
Probably not.
So we’ll leave this column to the side for now too.
EDA doesn’t always have to be done with code, you can use your model of the world to begin with and use code to see if it’s right later.
How about Pclass, SibSp and Parch?
These are numbers but there’s something different about them.
Can you pick it up?
What do Pclass, SibSp and Parch even mean? Maybe we should’ve read the docs more before trying to build a model so quickly.
A quick search for ‘Kaggle Titanic Dataset’ finds the answers.
Pclass is the ticket class, 1 = 1st class, 2 = 2nd class and 3 = 3rd class.
SibSp is the number of siblings or spouses a passenger had on board.
And Parch is the number of parents or children a passenger had on board.
This information was pretty easy to find.
But what if you had a dataset you’d never seen before?
What if a real estate agent wanted help predicting house prices in their city?
You check out their data and find a bunch of columns which you don’t understand.
You email the client.
‘What does Tnum mean?’
They respond.
‘Tnum is the number of toilets in a property.’
Good to know.
When you’re dealing with a new dataset, you won’t always have information available about it like Kaggle provides.
This is where you’ll want to seek the knowledge of an SME.
SME stands for subject matter expert.
If you’re working on a project dealing with real estate data, part of your EDA might involve talking with and asking questions of a real estate agent.
Not only could this save you time, but it could also influence future questions you ask of the data.
Since no one from the Titanic is alive anymore (RIP (rest in peace) Millvina Dean, the last survivor), we’ll have to become our own SMEs.
There’s something else unique about Pclass, SibSp and Parch.
Even though they’re all numbers, they’re also categories.
How so?Think about it like this.
If you can group data in your head fairly easily, there’s a chance it’s part of a category.
The Pclass column could be labelled First, Second and Third and it would maintain the same meaning as 1, 2 and 3.
But since Pclass, SibSp and Parch are already all in numerical form, we’ll leave them how they are.
The same goes for Age.
That wasn’t too hard.
The Categorical Bucket
In our categorical bucket, we have Sex and Embarked.
These are categorical variables because you could easily isolate passengers who were female from those who were male.
Or those who embarked at C from those who embarked at S.
To train a machine learning model, we’ll need a way of converting these to numbers.
How would you do it?
Remember Pclass? 1st = 1, 2nd = 2, 3rd = 3.
How would you do this for Sex and Embarked?
Perhaps you could do something similar for Sex.
Female = 1 and male = 2.
As for Embarked, S = 1 and C = 2.
We can change these using the LabelEncoder() class from the sklearn library and its fit_transform() method.
Wait? Why does C = 0 and S = 2 now? Where’s 1?
Hint: There’s an extra category, Q, this takes the number 1.
See the data description page on Kaggle for more.
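To see where those numbers come from, here’s a small sketch of label encoding on a made-up Embarked column (the values and variable names are mine, not the notebook’s). LabelEncoder sorts the labels it finds and numbers them from 0 in that order, which is why Q lands between C and S.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Embarked values with no missing entries.
# Missing values would need handling before encoding.
embarked = pd.Series(["S", "C", "S", "Q", "C"], name="Embarked")

encoder = LabelEncoder()
embarked_encoded = encoder.fit_transform(embarked)

# classes_ holds the sorted labels; a label's position is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

print(list(embarked_encoded))  # [2, 0, 2, 1, 0]
print(mapping)                 # {'C': 0, 'Q': 1, 'S': 2}
```

The same call works for Sex. Keep in mind the codes come from alphabetical order, so they don’t carry any real ranking between the categories.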
We’ve made some good progress towards turning our categorical data into all numbers but what about the rest of the columns?
Challenge: Now you know Pclass could easily be a categorical variable, how would you turn Age into a categorical variable?
The Not Sure Bucket
Name, Ticket and Cabin are left.
If you were on the Titanic, do you think your name would’ve influenced your chance of survival?
It’s unlikely.
But what other information could you extract from someone's name?
What if you gave each person a number depending on whether their title was Mr., Mrs. or Miss.?
You could create another column called Title.
In this column, those with Mr. = 1, Mrs. = 2 and Miss. = 3.
What you’ve done is created a new feature out of an existing feature.
This is called feature engineering.
Converting titles to numbers is a relatively simple feature to create.
And depending on the data you have, feature engineering can get as extravagant as you like.
How does this new feature affect the model down the line? This will be something you’ll have to investigate.
For now, we won’t worry about the Name column to make a prediction.
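If you do want to try the Title idea later, here’s one way it could look. It’s a sketch that assumes names follow a ‘Surname, Title. Given names’ layout. The example names are made up, and giving every other title the code 0 is my own choice, not something from the dataset.

```python
import pandas as pd

# Hypothetical names in an assumed "Surname, Title. Given names" layout
names = pd.Series([
    "Doe, Mr. John",
    "Doe, Mrs. Jane (Mary Smith)",
    "Roe, Miss. Anna",
    "Poe, Dr. Edgar",
], name="Name")

# Capture the text between the comma and the first full stop
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

# Mr = 1, Mrs = 2, Miss = 3; any other title becomes 0 (an assumption)
title_codes = titles.map({"Mr": 1, "Mrs": 2, "Miss": 3}).fillna(0).astype(int)

print(titles.tolist())       # ['Mr', 'Mrs', 'Miss', 'Dr']
print(title_codes.tolist())  # [1, 2, 3, 0]
```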
What about Ticket?
The first few examples don’t look very consistent at all.
What else is there?
training.Ticket.head(15)
The first 15 entries of the Ticket column.
These aren’t very consistent either.
But think again.
Do you think the ticket number would provide much insight as to whether someone survived?
Maybe if the ticket number related to what class the person was riding in, it would have an effect but we already have that information in Pclass.
To save time, we’ll forget the Ticket column for now.
Your first pass of EDA on a dataset should have the goal of not only raising more questions about the data but also getting a model built using the least amount of information possible, so you have a baseline to work from.
Now, what do we do with Cabin?
You know, since I’ve already seen the data, my spidey-senses are telling me it’s a perfect example for the next section.
Challenge: I’ve only listed a couple of examples of numerical and categorical types of data here.
Are there any other types of data? How do they differ from these?
What’s missing from the data and how do you deal with it?
missingno.matrix(train, figsize = (30,10))
The missingno library is a quick way to visually check for holes in your data. It detects where NaN values (or no values) appear and highlights them.
White lines indicate missing values.
The Cabin column looks like Johnny’s shoes.
There are a fair few missing values in Age too.
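If you’d rather have numbers than a picture, plain pandas can count the holes too. This is a sketch on a few made-up rows, with the dataframe name train assumed. It isn’t a replacement for the plot, just another way to look at the same thing.

```python
import pandas as pd

# Hypothetical rows with gaps in Age, Cabin and Embarked
train = pd.DataFrame({
    "Age": [22.0, None, 26.0, None, 35.0],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", None, "S", "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Number of missing values in each column
missing_counts = train.isnull().sum()

# Fraction of each column that is missing (0.0 to 1.0)
missing_share = train.isnull().mean()

print(missing_counts)
print(missing_share)
```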
How do you predict something when there’s no data?
I don’t know either.
So what are our options when dealing with missing data?
The quickest and easiest way would be to remove every row with missing values.
Or remove the Cabin and Age columns entirely.
But there’s a problem here.
Machine learning models like more data.
Removing large amounts of data will likely decrease the ability of our model to predict whether a passenger survived or not.
What’s next?
Imputing values.
In other words, filling up the missing data with values calculated from other data.
How would you do this for the Age column?
When we called .head(), the Age column had no missing values.
But when we look at the whole column, there are plenty of holes.
Could you fill missing values with average age?
There are drawbacks to this kind of value filling.
Imagine you had 1000 total rows, 500 of which are missing values.
You decide to fill the 500 missing rows with the average age of 36.
What happens?
Your data becomes heavily stacked with the age of 36.
How would that influence predictions on people who are 36 years old?
Maybe for every person with a missing age value, you could find other similar people in the dataset and use their age.
But this is time-consuming and also has drawbacks.
There are far more advanced methods for filling missing data, but they’re out of scope for this post.
It should be noted, there is no perfect way to fill missing values.
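Here’s a tiny sketch of the average-fill idea and the stacking it causes. The ages are made up, and filling with the mean is just the simplest option, not a recommendation.

```python
import pandas as pd

# Hypothetical Age column where half the values are missing
age = pd.Series([22.0, None, 40.0, None, 28.0, None], name="Age")

# mean() skips missing values: (22 + 40 + 28) / 3 = 30.0
mean_age = age.mean()

# Replace every missing value with that single number
age_filled = age.fillna(mean_age)

# Share of rows now sitting exactly on the mean
share_at_mean = (age_filled == mean_age).mean()

print(age_filled.tolist())  # [22.0, 30.0, 40.0, 30.0, 28.0, 30.0]
print(share_at_mean)        # 0.5
```

Half of the column now holds the same value, which is the stacking problem described above in miniature.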
If the missing values in the Age column are a leaky drain pipe, the Cabin column is a cracked dam.
For your first model, this is a feature you’d leave out.
Challenge: The Embarked column has a couple of missing values.
How would you deal with these? Is the amount low enough to remove them?
Where are the outliers and why you should be paying attention to them?
‘Did you check the distribution?’ Athon asked.
‘I did with the first set of data but not the second set…’
It hit me.
There it was.
The rest of the data was being shaped to match the outlier.
If you look at the number of occurrences of unique values within a dataset, one of the most common patterns you’ll find is Zipf’s law.
It looks like this.
Zipf’s law: The highest occurring variable will have double the number of occurrences of the second highest occurring variable, triple the amount of the third and so on.
Remembering Zipf’s law can help to think about outliers (values towards the end of the tail don’t occur often and are potential outliers).
The definition of an outlier will be different for every dataset.
As a general rule of thumb, you could consider anything more than three standard deviations away from the mean an outlier.
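The three standard deviations rule takes only a few lines to check. The fares below are made up, with one deliberately extreme value. I’ve used 20 rows because in a very small sample no single point can sit that far from the mean.

```python
import pandas as pd

# Hypothetical fares: 19 ordinary values and one extreme one
fare = pd.Series([
    7.25, 8.05, 7.90, 13.00, 26.00, 10.50, 7.75, 21.00, 9.50, 15.50,
    8.66, 12.35, 30.00, 7.23, 14.45, 11.13, 27.90, 16.10, 9.00, 512.33,
], name="Fare")

# Distance of each value from the mean, measured in standard deviations
z_scores = (fare - fare.mean()) / fare.std()

# Flag anything more than three standard deviations from the mean
outliers = fare[z_scores.abs() > 3]

print(outliers.tolist())  # [512.33]
```

Whether you then drop, cap or keep a flagged value is a separate decision that depends on the dataset.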
Or from another perspective.
Outliers from the perspective of an (x, y) plot.
How do you find outliers?
Distribution. Distribution. Distribution. Distribution.
Four times is enough.
During your first pass of EDA, you should be checking what the distribution of each of your features is.
A distribution plot will help represent the spread of the different values you have across a feature.
And more importantly, help to identify potential outliers.
hist()
Histogram plot of the Age column in the training dataset.
Are there any outliers here? Would you remove any age values or keep them all?
Why should you care about outliers?
Keeping outliers in your dataset may result in your model overfitting (fitting the training data too closely to do well on new data).
Removing all the outliers may result in your model being too generalised (it doesn’t do well on anything out of the ordinary).
As always, best to experiment iteratively to find the best way to deal with outliers.
Challenge: Other than figuring out outliers with the general rule of thumb above, are there any other ways you could identify outliers? If you’re confused about a certain data point, is there someone you could talk to? Hint: the acronym contains the letters M E S.
Getting more out of your data with feature engineering
The Titanic dataset only has 10 features.
But what if your dataset has hundreds?