My Machine Learning Journey and First Kaggle Competition
How I started Data Science, and my first experience.
Volkan · Jan 22
Image taken from Pexels, by Gavin Tracy
Beginning of the Journey
After working as an Electronic Engineer, I decided to change my career path to Data Scientist.
To reach my Data Science career goal, I started reviewing MOOCs in this field. Here is the list I found helpful in my journey:
Intro to Machine Learning: https://www.udacity.com/course/intro-to-machine-learning--ud120
Machine Learning A-Z: https://www.udemy.com/machinelearning/
Machine Learning: https://www.coursera.org/learn/machine-learning
All these courses explain the core machine learning algorithms. In Coursera's Machine Learning course, Andrew Ng also explains the mathematical background of these algorithms. If you want to learn what Machine Learning is and how you can use it, I strongly suggest you take all three courses.
Once I had learned the fundamentals of Machine Learning, I started searching for a platform to test my knowledge. This is when I met Kaggle (https://www.kaggle.com). There are lots of datasets on Kaggle where you can test your knowledge. You can download the data, or you can use a Kaggle Kernel to write and test your code.
The best part of the Kaggle platform is that it is completely FREE! I started to do some EDA, draw plots, and run some basic Machine Learning algorithms (linear regression, logistic regression, random forest, etc.). But all this work became routine, and I needed a motivational push to continue my journey. That was when I decided to enter a competition.
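To give a feel for the kind of basic workflow I mean, here is a minimal sketch: load a dataset, look at it, and fit one of those baseline models. The toy data and column names below are made up for illustration, not from any real Kaggle dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for a downloaded Kaggle CSV.
df = pd.DataFrame({
    "feature_a": [0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.8, 0.6],
    "feature_b": [1, 0, 1, 0, 1, 0, 1, 1],
    "target":    [0, 0, 1, 0, 1, 0, 1, 1],
})

# Quick EDA: shape and summary statistics.
print(df.shape)
print(df.describe())

# A baseline logistic regression on a train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["target"],
    test_size=0.25, random_state=0,
)
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Swapping `LogisticRegression` for `RandomForestClassifier` or `LinearRegression` follows the same fit/score pattern.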
Let the Game Begin…
In all the time I had spent on Kaggle, I had seen the competitions but never dared to enter one. I was prejudiced that to enter a competition you had to be at least an 8 out of 10, which is nonsense. My first entry into a competition was accidental: while reading the rules of a competition, I clicked the "I agree" button and bam, I was in.
Kaggle Time
The competition I entered was Microsoft Malware Prediction (https://www.kaggle.com/c/microsoft-malware-prediction/), which is about predicting the probability of a machine being infected by malware. There are train and test files in the data section, each nearly 5 GB in size. The columns include a machine ID and lots of features about that machine. And of course, the last column of the train data indicates whether the machine was infected or not.
After the first moments of panic, I started to think about how to address the problem. Every competition on Kaggle has a discussion page where people talk about alternative ways to solve the problem. There is also a Kernels page where people share their code to inspire others. I read every discussion thread, and this was extremely helpful. As I am new to this field, reading experienced people's approaches helped me figure out where to start. My advice to all newbies: first read about other people's experiences so you don't repeat the same mistakes.
Then I started to code using a Kaggle Kernel. This is when I faced my first real-life problem: the Kernel's compute power was not enough to read the train data. This issue was discussed on the discussion page, and some solutions were offered. But even after loading the data, I would hit the same computation-power problem when running the Machine Learning algorithms. So I decided to download the data and work on my personal computer. I loaded the data, chose the most popular Machine Learning algorithm from the discussion page, and pressed run. And there was my second real-life problem: my computer's power was also not enough to run the algorithm.
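One memory-saving workaround that circulated in such discussion threads (not necessarily the exact one I used) is to tell pandas which dtypes to use up front, instead of its 64-bit defaults. The column names and dtypes below are illustrative, not the competition's full schema.

```python
import numpy as np
import pandas as pd

# Smaller dtypes declared at read time keep a ~5 GB CSV manageable.
dtypes = {
    "MachineIdentifier": "category",        # high-cardinality string id
    "EngineVersion": "category",
    "Census_TotalPhysicalRAM": "float32",
    "HasDetections": "int8",                # the 0/1 target
}
# train = pd.read_csv("train.csv", dtype=dtypes)  # hypothetical file name

# Demonstration of the saving on a small frame:
df64 = pd.DataFrame({"HasDetections": np.zeros(1_000_000, dtype="int64")})
df8 = df64.astype({"HasDetections": "int8"})
print(df64.memory_usage(deep=True)["HasDetections"])  # 8 bytes per value
print(df8.memory_usage(deep=True)["HasDetections"])   # 1 byte per value
```

An int8 column takes one eighth of the memory of the default int64, so downcasting every numeric column adds up quickly on data this size.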
Since both the number of features (columns) and the number of observations (rows) are huge, Feature Engineering is definitely needed. There are also lots of features with missing values, which is not so common in Kaggle's datasets. But this is real life, and in real life there are always missing values. I read the data page in more detail, tried to reduce the number of features, separated the categorical and non-categorical data, and did Feature Engineering on each subset separately.
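A sketch of that split-and-clean step on toy data: separate categorical from numeric columns, then fill missing values differently for each. The column names and fill strategies here are illustrative assumptions, not my exact pipeline.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "os_version": ["10", "10", None, "8.1"],   # categorical, with a gap
    "ram_gb": [8.0, np.nan, 16.0, 4.0],        # numeric, with a gap
})

# Separate the two kinds of columns.
cat_cols = df.select_dtypes(include="object").columns
num_cols = df.select_dtypes(include="number").columns

# Categorical: fill with an explicit "missing" token.
df[cat_cols] = df[cat_cols].fillna("missing")
# Numeric: fill with the column median.
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

print(df)
```

Treating "missing" as its own category keeps the information that a value was absent, which can itself be predictive.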
After reducing the number of features, I fit the data to a Machine Learning algorithm and checked the confusion matrix. Tadaa!! I had a 54% ROC AUC score on the test set (better than flipping a coin, yes!).
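For context on that 54% figure: ROC AUC compares the model's predicted probabilities against the true labels, where 0.5 is coin-flip level. A tiny example with made-up values:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities of infection.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.3, 0.6, 0.4, 0.8, 0.7, 0.2]

# Fraction of (positive, negative) pairs ranked correctly: 8 of 9 here.
score = roc_auc_score(y_true, y_prob)
print(score)
```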
It was time to run the model on the test set to create the submission file. When I tried to run the model on the test data, I had my third real-life experience. Until then, I had always worked with a single dataset and split it into train and test data for the model. But in real life, you have the train and test data separately, and if you do some feature engineering (deleting features, creating new features, changing data types, etc.), you have to do the same thing to the test data. I learned this the hard way: I tried to run my trained model on the test set and got errors after a two-hour wait. Then I figured out the situation I mentioned above: I had forgotten to do some of the feature engineering on the test data. This was my first real-life problem solved, which means a lot to me.
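That lesson in code: derive any fill values or category lists from the train set only, then apply the identical transformation to the test set. The columns and values below are toy assumptions for illustration.

```python
import pandas as pd

train = pd.DataFrame({"os": ["win10", "win8", "win10"], "ram": [8.0, None, 16.0]})
test = pd.DataFrame({"os": ["win8", None], "ram": [None, 4.0]})

# Fit the statistics on the TRAIN set only...
ram_median = train["ram"].median()
os_categories = train["os"].dropna().unique().tolist()

# ...then apply the SAME steps to both frames.
for df in (train, test):
    df["ram"] = df["ram"].fillna(ram_median)
    df["os"] = df["os"].where(df["os"].isin(os_categories), "missing")

print(test)
```

If the test set is transformed with its own statistics, or some steps are skipped, the columns no longer match what the model was trained on, and you get exactly the kind of late-stage errors described above.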
Then I tried to improve my solution and started playing with the features. So far I have improved my score to 60%, and there is still more to do.
Conclusion
In conclusion, if you want to work in the Data Science field, never be pessimistic and never discourage yourself. Always stay motivated, stay hungry for new information, and read quality content. I hope to see you in my next machine learning experience. Also, if you have any suggestions for my journey, please write a response to this story.