We are all set.
Let’s do some real work now.
Step 1 — Understand your Data

Once you sign up for the competition, you can find the data on the competition's homepage.
To load and perform basic manipulation of the data, I am using pandas, a data-manipulation library for Python.
If you're not familiar with it, I suggest going through the 10-minute guide to get yourself up to speed.
In machine learning, the data is mainly divided into two parts: training and testing (there is a third split, validation, but you don't have to care about that right now).
Training data is for training our algorithm, and testing data is for checking how well our algorithm performs.
The split ratio between train and test data is usually around 70-30.
Hence, here we have a total of 891 entries for training and 418 entries for testing.
Loading up the data gives you 12 columns, as shown below.
We will call these columns features.
Nothing new, just a fancy name.
I encourage you to go through the data at least one time before moving forward.
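As a minimal sketch of loading and inspecting the data with pandas (the file names are the ones Kaggle provides; the small frame below is a hypothetical two-row stand-in so the snippet runs on its own):

```python
import pandas as pd

# On Kaggle, the competition's Data page provides train.csv and test.csv:
# train = pd.read_csv("train.csv")   # 891 rows x 12 columns
# test = pd.read_csv("test.csv")

# A tiny stand-in frame with the same 12 columns, just to illustrate the shape:
train = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Fare": [7.25, 71.2833],
    "Cabin": [None, "C85"],
    "Embarked": ["S", "C"],
})
print(train.shape)    # (rows, columns)
print(train.dtypes)   # data type of each feature
```

`train.head()` and `train.info()` are also worth a look once you have the real files.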
PassengerId : int : Id
Survived : int : Survival (0 = No; 1 = Yes)
Pclass : int : Passenger Class
Name : object : Name
Sex : object : Sex
Age : float : Age
SibSp : int : Number of Siblings/Spouses Aboard
Parch : int : Number of Parents/Children Aboard
Ticket : object : Ticket Number
Fare : float : Passenger Fare
Cabin : object : Cabin
Embarked : object : Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Also, an understanding of the data type of each feature is important.
Now that we’ve loaded our data and understood what it looks like, we will move forward to feature engineering.
In other words, measuring the impact of each feature on our output: whether a passenger survived or not.
Step 2 — Feature Engineering

As we discussed, feature engineering is measuring the impact of each feature on the output.
But more importantly, it is not just about using existing features; it is about creating new ones that can significantly improve our output.
Andrew Ng said, “Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.”
We will go through each feature we are using so that you can understand how to use existing features and how to create new ones.
1 — Passenger Class

It seems intuitive that a passenger's class is related to their survival rate.
If a person was considered more important than others, they'd be helped out of the disaster first.
And our data tells the same story.
63% of Class 1 passengers survived.
Therefore, this feature is definitely impactful.
The data in the Pclass column is complete, so there is no need to manipulate it.
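A sketch of how such a per-class survival rate can be computed with a pandas groupby (toy numbers below, not the real 891 rows; on the real data, Class 1 comes out around 63%):

```python
import pandas as pd

# Hypothetical stand-in rows for the real training data:
train = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Because Survived is 0/1, its mean per class is the survival rate per class.
rate = train[["Pclass", "Survived"]].groupby("Pclass", as_index=False).mean()
print(rate)
```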
2 — Sex

Sex is again important and strongly related to survival rate; women and children were saved first during this tragedy.
We can see that 74% of all females were saved and only 18% of all males were saved.
Again, this will impact our outcome.
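The same groupby trick works here (toy values again; the real data gives roughly 74% for females and 18% for males):

```python
import pandas as pd

# Hypothetical stand-in rows:
train = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Survival rate by sex (mean of the 0/1 Survived column).
rate = train.groupby("Sex")["Survived"].mean()
print(rate)
```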
3 — Family Size

The next two columns are SibSp and Parch, which on their own are not directly related to whether a person survived.
That is where the idea of creating a new feature comes in.
For each row/passenger, we will determine his/her family size as SibSp + Parch + 1 (him/her self).
Family size ranges from a minimum of 1 to a maximum of 11, with a family size of 4 having the highest survival rate, at 72%.
It seems to have a good effect on our prediction, but let's go further and categorize passengers by whether they were alone on the ship or not.
After looking through that, too, it seems to have a considerable impact on our output.
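The two derived columns can be built in a couple of lines (toy SibSp/Parch values below; `FamilySize` and `IsAlone` are the names I'm using for the new features):

```python
import pandas as pd

# Hypothetical stand-in rows:
train = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size = siblings/spouses + parents/children + the passenger themselves.
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1

# A passenger with a family size of 1 travelled alone.
train["IsAlone"] = (train["FamilySize"] == 1).astype(int)
print(train)
```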
4 — Embarked

The port a passenger embarked from can have something to do with survival (though not always), so let's take a look.
This column contains plenty of NAs.
To deal with them, we will replace NAs with ‘S’, because it is the most frequently occurring value.
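A sketch of that fill, using `mode()` to recover the most frequent value rather than hard-coding it (toy values below; on the real data the mode is also ‘S’):

```python
import pandas as pd

# Hypothetical stand-in column with some missing ports:
train = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() ignores NaN and returns the most frequent value(s).
most_common = train["Embarked"].mode()[0]
train["Embarked"] = train["Embarked"].fillna(most_common)
print(train["Embarked"].tolist())
```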
5 — Fare

There is missing data in this column as well.
We cannot deal with every feature in the same way; to fix the issue here, we will fill the gaps with the median value of the entire column.
When you bin with qcut, the bin edges are chosen so that each bin contains roughly the same number of records (equal parts).
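A sketch of the median fill followed by quantile binning (toy fares, hypothetical `FareBand` column name; 4 bins is a common choice for this feature):

```python
import pandas as pd

# Hypothetical stand-in fares with one missing value:
train = pd.DataFrame({"Fare": [7.25, 71.28, None, 8.05, 53.1, 13.0, 30.0]})

# Fill the gap with the median of the column.
train["Fare"] = train["Fare"].fillna(train["Fare"].median())

# qcut picks edges so each of the 4 bins holds roughly the same count.
train["FareBand"] = pd.qcut(train["Fare"], 4)
print(train["FareBand"].value_counts())
```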
Looking through the output, its impact is considerable.
6 — Age

Age has some missing values.
We will fill them with random numbers between (mean age minus standard deviation) and (mean age plus standard deviation).
After that, we will group the ages into 5 bands.
It has a good impact as well.
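A sketch of that random fill plus equal-width banding (toy ages, seeded RNG so the result is reproducible; `AgeBand` is a hypothetical column name, and note this randomness means repeated runs without a seed give slightly different fills):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility

# Hypothetical stand-in ages with two missing values:
train = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, None, 54.0, 2.0, 27.0]})

mean, std = train["Age"].mean(), train["Age"].std()
n_missing = train["Age"].isna().sum()

# Draw replacements uniformly from [mean - std, mean + std].
fill = rng.uniform(mean - std, mean + std, n_missing)
train.loc[train["Age"].isna(), "Age"] = fill

# cut (unlike qcut) makes 5 equal-width age bands.
train["AgeBand"] = pd.cut(train["Age"], 5)
print(train)
```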
7 — Name

This one is a little tricky.
From the name, we have to retrieve the title associated with it, e.g., Mr or Captain.
To do that, we use Python's regular expression library (see the Regular Expression HOWTO).
First, we extract the title from each name and store them in a new list called title.
After that, we clean the list by narrowing it down to common titles.
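A sketch of both steps: a regex pulls out the word ending in a period after the surname, and rare titles are collapsed into a catch-all bucket (toy names below; the exact set of "common" titles is my assumption, a frequent simplification for this dataset):

```python
import pandas as pd

# Hypothetical stand-in names:
train = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]})

# The title is the first word followed by a period, e.g. " Mr.".
train["Title"] = train["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Collapse anything outside the common titles into "Rare".
common = {"Mr", "Mrs", "Miss", "Master"}
train["Title"] = train["Title"].where(train["Title"].isin(common), "Rare")
print(train["Title"].tolist())
```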
We have cleaned our features and they are now ready to use.
However, there is one more step before we feed our data to the ML algorithm.
The thing about ML algorithms is that they only take numerical values, not strings.
So, we have to map our data to numerical values and convert the columns to the integer data type.
Step 3 — Mapping Data

Mapping data is easy; by looking through the code, you'll get the idea of how it works.
Once done, now we have to select which features to use.
Feature selection is as important as feature creation.
We will drop unnecessary columns so that they don't affect our final outcome.
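A sketch of both steps, mapping string categories to small integers and dropping columns we won't use (toy rows below; which columns to drop is a judgment call, Ticket and Cabin being common choices):

```python
import pandas as pd

# Hypothetical stand-in rows:
train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2."],
    "Cabin": [None, "C85", None],
    "Survived": [0, 1, 1],
})

# Map string categories onto integers so the model can consume them.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1}).astype(int)
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)

# Drop the columns we decided not to feed to the model.
train = train.drop(columns=["Ticket", "Cabin"])
print(train)
```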
Final data that we will feed to the ML algorithm

That is it.
You have completed the hard part.
Look at your data, it looks so beautiful.
Now, we only have to predict our outcome, which is the easy part.
Or at least I’ll make it easy for you to understand.
Jack, come on buddy, we're almost there

Step 4 — Prediction

As we discussed, we require training and testing data.
Yeah, Dhrumil, we have it. What now? Okay, perfect.
Now we need to train our model.
To do that, we need to provide data in two parts — X and Y.
X : X_train : Contains all the features
Y : Y_train : Contains the actual output (Survived)

To elaborate further, we need to tell our model that we are looking for this output.
So, it will train that way.
For instance, suppose your friend is out shopping and you want goggles; you send a photo of goggles to your friend saying you want the same.
You are training them to bring similar goggles by explaining the features (Aviator, Wayfarer) and providing the exact output (the picture of the goggles).
With the data separated, we call our classifier, fit the data (training) with the help of the .fit method of the scikit-learn library, and predict the output on the testing data with .predict.
Note — As this tutorial is for beginners, I am not including other classifiers but the process remains the same.
Call classifier, fit data, predict.
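That three-step recipe can be sketched as follows (the tiny X/Y frames below are hypothetical stand-ins; the real ones come from the cleaned data):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in training data (already numeric, as in Step 3):
X_train = pd.DataFrame({"Pclass": [1, 3, 3, 2, 1], "Sex": [1, 0, 0, 1, 0]})
Y_train = pd.Series([1, 0, 0, 1, 0])
X_test = pd.DataFrame({"Pclass": [1, 3], "Sex": [1, 0]})

clf = DecisionTreeClassifier(random_state=0)  # call classifier
clf.fit(X_train, Y_train)                     # fit data (training)
Y_pred = clf.predict(X_test)                  # predict on unseen rows
print(Y_pred)
```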
Just in case you want to explore further.
There are several other classifiers, but I used a Decision Tree because, to my knowledge, it works best with this dataset.
To know more about decision trees, refer to this article.
Yes, Tony, that's great for the first time

Step 5 — Your First Submission

And finally, submitting our output.
Our output .csv file should only have two columns, PassengerId and Survived, as mentioned on the competition page.
Creating that file and submitting it on the competition page, my submission scored 0.79425, which was in the top 25% at the time of writing this article.
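Building that two-column file is a one-liner with pandas (the IDs and predictions below are hypothetical placeholders; the real ones come from the test set and your classifier):

```python
import pandas as pd

# Hypothetical IDs and predictions for illustration:
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False keeps the file to exactly the two required columns.
csv_text = submission.to_csv(index=False)
print(csv_text)
# submission.to_csv("submission.csv", index=False)  # write the actual file
```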
The position I got on the leaderboard. Where do you sit?

I encourage you to explore different features to improve your model's accuracy and your rank in this competition as well.
I’d love to hear from you that you’ve made it to the Top 5% or even better, Top 1%.
You will find the entire code on my GitHub repository.
Endnotes

I hope this article has answered your primary question, “How do I start with Kaggle?” Adequate knowledge, good resources, and a willingness to learn new things are all you need to move ahead.
You don't have to be a master from the beginning.
It all comes with persistence.
If you are reading this, you have all the energy to fulfill your goals, just don’t stop, no matter what.
If you have doubts regarding this article, reach me through email, Twitter, or even LinkedIn.
And even if you don't have any doubts, I'd still love to see you in my inbox with your valuable feedback or suggestions, if any.