Let us analyse.
We took the median value and applied it to the prediction file, hoping every other player would fall into the try-hard category.
How confident we are, huh!

Welcome to Overfitting
The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain.
This allows us to make predictions in the future on data the model has never seen.
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
In our case, by seeing the median value we assumed it would be applicable to all other values in the test data. That is overfitting.
How to tackle overfitting?
A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project.
After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.
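The hold-out idea above can be sketched as follows. This is a minimal illustration, not the author's actual code: the tiny DataFrame stands in for the real PUBG `train.csv`, and the column names are taken from the dataset described later in the post.

```python
# Hedged sketch: hold back part of the training data as a validation set.
# The toy frame below is a stand-in for the real PUBG train.csv.
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.DataFrame({
    "walkDistance": [100, 2500, 800, 3200, 50, 1500, 2700, 400],
    "boosts":       [0, 4, 1, 6, 0, 2, 5, 1],
    "winPlacePerc": [0.1, 0.8, 0.3, 0.95, 0.0, 0.5, 0.9, 0.2],
})

# Keep 25% of rows back; the model never sees these until final evaluation.
train_part, valid_part = train_test_split(train, test_size=0.25, random_state=42)
print(len(train_part), len(valid_part))  # 6 2
```

Tuning happens only on `train_part`; scoring on `valid_part` then gives an honest estimate of performance on unseen data.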
EDA and Feature Engineering
Hmm, ideally speaking, walk distance should be a critical factor in winning. As the running speed is almost static, you cannot do much about that, so if you want to finish in the top places you have to keep moving. The walking distance is directly proportional to the chance of winning.
Boosts should also be an important factor. If you want to survive longer, it is highly likely that you will use one or more boosts.
We also see the expected clumps in the number of groups denoting the game mode, e.g. Squad, Duo and Solo modes.
The values less than 10 probably warrant further investigation as it is most likely these are custom games or disconnect errors.
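The game-mode clumps and the sub-10 anomalies could be inspected roughly as below. The thresholds here are hypothetical, chosen only to illustrate the idea of bucketing `numGroups` into modes; they are not from the original analysis.

```python
import pandas as pd

# Toy matches; thresholds below are illustrative assumptions, not real rules
matches = pd.DataFrame({"matchId": range(6),
                        "numGroups": [27, 48, 95, 26, 3, 97]})

def guess_mode(n):
    if n < 10:
        return "custom/err"  # likely custom games or disconnect errors
    if n <= 35:
        return "squad"
    if n <= 60:
        return "duo"
    return "solo"

matches["mode"] = matches["numGroups"].map(guess_mode)
print(matches["mode"].tolist())
```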
Correlation
Let's have a look at the correlation between the variables. This may give us a good start on which variables to take seriously in the long run.
Positively correlated — walkDistance, weaponsAcquired, boosts
Negatively correlated — killPlace

killPlace — ranking in match of number of enemy players killed. The lower the rank number, the higher the chance of winning the game.
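Computing such a correlation table is a one-liner in pandas. The toy data below is fabricated to mirror the relationships described above (walkDistance positive, killPlace negative against winPlacePerc); it is not the real training data.

```python
import pandas as pd

# Toy stand-in mirroring the reported correlation directions
df = pd.DataFrame({
    "walkDistance": [100, 2500, 800, 3200, 50, 1500],
    "killPlace":    [90, 20, 60, 5, 95, 40],
    "winPlacePerc": [0.1, 0.8, 0.3, 0.95, 0.0, 0.5],
})

# Pearson correlation of every column against the target
corr = df.corr()["winPlacePerc"]
print(corr)
```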
NumGroups — Number of groups we have data for in the match.
It's interesting that where numGroups is 1, we see only values of 0. Let's keep this in mind and make the corresponding changes in the test data.
Walk Distance — Total distance traveled on foot measured in meters.
It seems that the data has a lot of outliers (or cheaters) in the training data. It is quite impractical to travel at more than 20 km/hour, so we will treat such records as cheating and move forward.
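The 20 km/h cheater flag can be sketched as below. This assumes the dataset's `matchDuration` column (in seconds) is used to turn `walkDistance` (metres) into a speed; the exact rule in the original analysis may differ.

```python
import pandas as pd

players = pd.DataFrame({
    "walkDistance":  [1800.0, 15000.0, 500.0],  # metres
    "matchDuration": [1800,   1800,    1800],   # seconds (assumed column)
})

# km/h = metres / seconds * 3.6
players["speed_kmh"] = players["walkDistance"] / players["matchDuration"] * 3.6
players["cheater_flag"] = players["speed_kmh"] > 20

print(players["cheater_flag"].tolist())  # [False, True, False]
```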
3 — Kills
Here I am creating a flag for those IDs who have more than 40 kills.
4 — HeadShot Rates
Headshot kills may have a lot to tell about how good a player is.
I have created a variable headshot_rate to understand the rate of head shot kills from the total kills.
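Both engineered features above can be sketched together. The column names `kills` and `headshotKills` follow the PUBG dataset; the zero-kill case is mapped to a rate of 0, which is one reasonable convention, not necessarily the author's.

```python
import pandas as pd

stats = pd.DataFrame({
    "kills":         [0, 5, 45, 10],
    "headshotKills": [0, 2, 10, 10],
})

# Flag for IDs with more than 40 kills
stats["kills_flag"] = stats["kills"] > 40

# headshot_rate: headshot kills out of total kills (0 when no kills)
stats["headshot_rate"] = (stats["headshotKills"] / stats["kills"]).fillna(0)

print(stats["kills_flag"].tolist())  # [False, False, True, False]
```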
Here, we have created test1 as a validation set to evaluate on before touching the real test dataset.
Zombies

Problem statement
There are mainly two types of machine learning problems:
Classification — the output variable takes class labels.
Regression — the output variable takes continuous values.
Here, as the target variable (winPlacePerc) is continuous, the problem comes under regression.
Linear regression is the first go-to method when it comes to regression.
Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X.
The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that, we can use this formula to estimate the value of the response Y, when only the predictors (Xs) values are known.
Using our PUBG data, suppose we wish to model the linear relationship between assists, kills and winPlacePerc.
Y = β1 + β2X + ϵ
where β1 is the intercept and β2 is the slope.
Collectively, they are called regression coefficients.
ϵ is the error term, the part of Y the regression model is unable to explain.
Linear it is!
In the background, lm, which stands for "linear model", produces the best-fit linear relationship by minimizing the least squares criterion.
For initial assessment of our model we can use summary.
This provides us with a host of information about our model, which we’ll walk through.
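The article's workflow uses R's `lm` and `summary`; a rough Python analogue with scikit-learn is sketched below on synthetic data, purely to illustrate fitting and reading off the regression coefficients. The planted coefficients (0.05, 0.08) are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the columns assists, kills -> winPlacePerc
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))  # columns: assists, kills
y = 0.05 * X[:, 0] + 0.08 * X[:, 1] + rng.normal(0, 0.01, 200)

# Least-squares fit, as lm does in R
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # ≈ 0, [0.05, 0.08]
```

In R, `summary(model)` would additionally report standard errors, t-values and R-squared for each coefficient.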
StepAIC
In stepwise regression, the selection procedure is automatically performed by statistical packages.
The criteria for variable selection include adjusted R-square, Akaike information criterion (AIC), Bayesian information criterion (BIC), Mallows’s Cp, PRESS, or false discovery rate (1,2).
The main approaches to stepwise selection are forward selection, backward elimination, and a combination of the two.

MAE
Mean Absolute Error (MAE) is another loss function used for regression models.
MAE is the sum of absolute differences between our target and predicted variables.
So it measures the average magnitude of errors in a set of predictions, without considering their directions.
(If we consider directions also, that would be called Mean Bias Error (MBE), which is the mean of the signed residuals/errors.)
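The two metrics can be written out directly; note how opposite-signed errors cancel in MBE but not in MAE:

```python
import numpy as np

def mae(y_true, y_pred):
    # mean of absolute differences; direction of each error is ignored
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mbe(y_true, y_pred):
    # mean bias error: signed residuals, so over/under-predictions cancel
    return np.mean(np.asarray(y_pred) - np.asarray(y_true))

y_true = [0.2, 0.5, 0.9]
y_pred = [0.3, 0.4, 0.9]
print(mae(y_true, y_pred))  # two 0.1 errors over 3 points -> ~0.0667
print(mbe(y_true, y_pred))  # +0.1 and -0.1 cancel -> 0.0
```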
H2O — The library
H2O is an open-source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and makes it easy to productionize those models in an enterprise environment. In addition, it uses in-memory compression to handle large datasets even with a small cluster, and it includes provisions for parallel distributed network training.
Deep Learning

Hope you guys learnt something. Constructive feedback is always welcome.
Follow me on Kaggle: https://www.kaggle.com/arjundas