Because of a simple truth in machine learning: better data beats fancier algorithms.
In other words… garbage in gets you garbage out.
Even if you forget everything else from this course, please remember this point.
In fact, if you have a properly cleaned dataset, even simple algorithms can learn impressive insights from the data!
Obviously, different types of data will require different types of cleaning.
However, the systematic approach laid out in this article can always serve as a good starting point.
Remove unwanted observations
The first step to data cleaning is removing unwanted observations from your dataset.
This includes duplicate or irrelevant observations.
Our dataset contains quite a few duplicate entries which will be removed.
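As a minimal sketch of this step, assuming the data is loaded into a pandas DataFrame (the column names and values below are hypothetical stand-ins, not the real dataset), exact duplicate rows can be counted and dropped like this:

```python
import pandas as pd

# Toy stand-in for the real dataset; column names are hypothetical.
df = pd.DataFrame({
    "search_id": [1, 1, 2, 3, 3],
    "job_id": [10, 10, 11, 12, 12],
    "apply": [0, 0, 1, 0, 0],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first copy of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

If only some columns define what counts as a duplicate, `drop_duplicates` also accepts a `subset` argument.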
Handle missing values
Missing data is a deceptively tricky issue in applied machine learning.
First, just to be clear, you cannot simply ignore missing values in your dataset.
You must handle them in some way for the very practical reason that most algorithms do not accept missing values.
“Common sense” is not sensible here
The following are the most commonly recommended ways of dealing with missing data:
- Dropping observations that have missing values
- Imputing the missing values based on other observations
- Interpolation and extrapolation
- Using KNN
- Mean/median imputation
- Regression imputation
- Stochastic regression imputation
- Hot-deck imputation
If you want to know about them in greater detail, please refer to this article.
In our case, two of the features (title proximity tfidf and description proximity tfidf) contain mostly 0s, so I replace their missing values with 0.
For the city match feature, the 1s and 0s are almost equally distributed.
Here I have the option of either dropping the rows that contain nulls or substituting the mean.
I have dropped those rows in this case.
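The two choices described above could look roughly like the following in pandas. This is a sketch under assumptions: the underscore-style column names and the toy values are guesses for illustration, not taken from the actual notebook.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names are assumed spellings of the features.
df = pd.DataFrame({
    "title_proximity_tfidf": [0.0, np.nan, 0.0, 1.2],
    "description_proximity_tfidf": [np.nan, 0.0, 0.0, 0.4],
    "city_match": [1.0, 0.0, np.nan, 1.0],
})

# The two tf-idf features are mostly 0, so fill their gaps with 0.
tfidf_cols = ["title_proximity_tfidf", "description_proximity_tfidf"]
df[tfidf_cols] = df[tfidf_cols].fillna(0)

# city_match is roughly balanced, so drop rows where it is missing.
cleaned = df.dropna(subset=["city_match"]).reset_index(drop=True)

print(cleaned)
```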
Feature Engineering
Feature engineering is about creating new input features from your existing ones.
In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.
This is often one of the most valuable tasks a data scientist can do to improve model performance, for 3 big reasons:
- You can isolate and highlight key information, which helps your algorithms “focus” on what’s important.
- You can bring in your own domain expertise.
- Most importantly, once you understand the “vocabulary” of feature engineering, you can bring in other people’s domain expertise!
Below are some of the ways we can perform feature engineering. Please note that this is not an exhaustive list, because there are limitless possibilities for this step.
The good news is that this skill will naturally improve as you gain more experience.
- Infuse domain knowledge
- Create interaction features
- Combine sparse classes
- Add dummy variables
- Remove unused features
In our case, since there is not much domain knowledge about the dataset, we are restricted in our application of feature engineering.
The only feature engineering that I have applied is multiplying two correlated features (title_proximity_tfidf and main_query_tfidf) to create a new column named main title tfidf.
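A minimal sketch of that interaction feature, assuming a pandas DataFrame and assuming the column spellings `title_proximity_tfidf`, `main_query_tfidf`, and `main_title_tfidf` (the values are made up):

```python
import pandas as pd

# Toy values; column names are assumed spellings of the features in the text.
df = pd.DataFrame({
    "title_proximity_tfidf": [0.0, 0.5, 2.0],
    "main_query_tfidf": [1.0, 4.0, 0.5],
})

# Interaction feature: element-wise product of the two correlated columns.
df["main_title_tfidf"] = df["title_proximity_tfidf"] * df["main_query_tfidf"]

print(df["main_title_tfidf"].tolist())
```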
Algorithm Selection
Some of the factors affecting the choice of a model are:
- Whether the model meets the business goals
- How much pre-processing the model needs
- How accurate the model is
- How explainable the model is
- How fast the model is: how long it takes to build the model, and how long the model takes to make predictions
- How scalable the model is
An important criterion affecting the choice of algorithm is model complexity.
Generally speaking, a model is more complex if:
- It relies on more features to learn and predict (e.g., using two features vs. ten features to predict a target)
- It relies on more complex feature engineering (e.g., using polynomial terms, interactions, or principal components)
- It has more computational overhead (e.g., a single decision tree vs. a random forest of 100 trees)
Besides this, the same machine learning algorithm can be made more complex based on the number of parameters or the choice of some hyperparameters.
For example:
- A regression model can have more features, or polynomial terms and interaction terms.
- A decision tree can have more or less depth.
Making the same algorithm more complex increases the chance of overfitting.
Commonly used Machine Learning algorithms for classification
Logistic Regression
Logistic regression models fit a “straight line” (a linear decision boundary).
In practice, they rarely perform well.
We actually recommend skipping them for most machine learning problems.
Their main advantage is that they are easy to interpret and understand.
However, our goal is not to study the data and write a research report.
Our goal is to build a model that can make accurate predictions.
In this regard, logistic regression suffers from two major flaws:
- It’s prone to overfit with many input features.
- It cannot easily express non-linear relationships.
Regularization
As mentioned above, logistic regression suffers from overfitting and difficulty in handling non-linear relationships.
Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients.
It can discourage large coefficients (by dampening them).
It can also remove features entirely (by setting their coefficients to 0).
The “strength” of the penalty is tunable.
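In symbols, the penalty is an extra term added to the training loss. The notation here is an assumption introduced for illustration, not taken from the original: $L(\beta)$ is the unpenalized loss, $\beta_j$ are the model coefficients, $\lambda \ge 0$ is the tunable penalty strength, and $\alpha \in [0, 1]$ mixes the two penalty types.

```latex
\begin{aligned}
\text{absolute-value penalty:}\quad & \min_{\beta}\; L(\beta) + \lambda \sum_{j} |\beta_j| \\
\text{squared penalty:}\quad & \min_{\beta}\; L(\beta) + \lambda \sum_{j} \beta_j^{2} \\
\text{mixture of the two:}\quad & \min_{\beta}\; L(\beta) + \lambda \Big( \alpha \sum_{j} |\beta_j| + (1-\alpha) \sum_{j} \beta_j^{2} \Big)
\end{aligned}
```

The absolute-value form can push some coefficients exactly to 0, while the squared form shrinks coefficients toward 0 without usually eliminating them. A larger $\lambda$ means a stronger penalty in every case.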
(More on this tomorrow…)
Types of regularization are lasso (L1), ridge (L2), and elastic net (a compromise between ridge and lasso).
Decision Trees
Decision trees model data as a “tree” of hierarchical branches.
They make branches until they reach “leaves” that represent predictions.
Due to their branching structure, decision trees can easily model nonlinear relationships.
Unfortunately, decision trees suffer from a major flaw as well.
If you allow them to grow limitlessly, they can completely “memorize” the training data, just from creating more and more and more branches.
As a result, individual unconstrained decision trees are very prone to overfitting.
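The memorization effect can be seen on synthetic data. The sketch below assumes scikit-learn is available and uses made-up noisy data rather than the article's dataset. It compares an unconstrained tree with one capped at depth 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends weakly on one feature plus heavy noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Unconstrained tree: keeps splitting until every training leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Constrained tree: at most two levels of splits.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

deep_train_acc = deep.score(X_train, y_train)
deep_test_acc = deep.score(X_test, y_test)
shallow_train_acc = shallow.score(X_train, y_train)
shallow_test_acc = shallow.score(X_test, y_test)

print("deep:    train", deep_train_acc, "test", deep_test_acc)
print("shallow: train", shallow_train_acc, "test", shallow_test_acc)
```

The unconstrained tree fits the training half perfectly, so its training accuracy says nothing about how it will do on the held-out half.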
So, how can we take advantage of the flexibility of decision trees while preventing them from overfitting the training data?
Tree Ensembles
Ensembles are machine learning methods for combining predictions from multiple separate models.
There are a few different methods for ensembling, but the two most common are:
1. Bagging: attempts to reduce the chance of overfitting complex models.
   It trains a large number of “strong” learners in parallel.
   A strong learner is a model that’s relatively unconstrained.
   Bagging then combines all the strong learners together in order to “smooth out” their predictions.
   A commonly used technique is Random Forest.
2. Boosting: attempts to improve the predictive flexibility of simple models.
   It trains a large number of “weak” learners in sequence.
   A weak learner is a constrained model (i.e., you could limit the max depth of each decision tree).
   Each one in the sequence focuses on learning from the mistakes of the one before it.
   Boosting then combines all the weak learners into a single strong learner.
   Commonly used techniques are XGBoost and LightGBM.
3. LightGBM: LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
   LightGBM grows trees vertically while other algorithms grow trees horizontally, meaning that LightGBM grows trees leaf-wise while other algorithms grow level-wise.
   It chooses the leaf with the maximum delta loss to grow.
   For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.
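The contrast between bagging and boosting described in the list above can be sketched with scikit-learn. This is an illustration on synthetic data, not the article's models: `GradientBoostingClassifier` stands in for XGBoost and LightGBM, which are separate packages, and the hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Bagging: many unconstrained ("strong") trees trained independently,
# with their predictions averaged.
bagged = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)

# Boosting: many shallow ("weak") trees trained in sequence, each one
# fit to the errors of the ensemble built so far.
boosted = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)

aucs = {}
for name, model in [("bagging", bagged), ("boosting", boosted)]:
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of class 1
    aucs[name] = roc_auc_score(y_test, scores)

print(aucs)
```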
[Figure: How LightGBM works]
[Figure: How other boosting algorithms work]
There are many other algorithms as well, such as support vector machines and neural networks, but we won't be covering them here.
For our case, I will be using XGBoost, Random Forest and LightGBM.
Model training and tuning
Training Model: between 01/21/2018 and 01/26/2018
Testing Model: 01/27/2018
The metric that we will be using is AUC.
Initial AUC values that we got:
- XGBoost: 0.5846
- Random Forest: 0.5806
Hyperparameter tuning using Bayesian Optimization
Searching for the parameters of machine learning models that result in the best cross-validation performance is necessary in almost all practical cases to get a model with the best generalization estimate.
A standard approach in scikit-learn is to use the GridSearchCV class, which takes a set of values to try for every parameter and simply enumerates all combinations of parameter values.
The complexity of such search grows exponentially with the addition of new parameters.
A more scalable approach is using RandomizedSearchCV, which however does not take advantage of the structure of a search space.
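To make the difference in cost concrete, the sketch below uses only the Python standard library. It counts the combinations a full grid would evaluate and contrasts that with a fixed budget of random draws. The search space is hypothetical and is not the one used in the article.

```python
import itertools
import random

# Hypothetical search space for a tree-based model.
space = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "n_estimators": [100, 300, 500],
    "subsample": [0.6, 0.8, 1.0],
}

# Grid search: every combination, so the count is the product of the
# list sizes and multiplies with each parameter that is added.
grid = [dict(zip(space, combo)) for combo in itertools.product(*space.values())]

# Random search: a fixed budget of independent draws, whatever the grid size.
rng = random.Random(0)
budget = 10
random_draws = [
    {name: rng.choice(values) for name, values in space.items()}
    for _ in range(budget)
]

print(len(grid), len(random_draws))
```

Adding a fifth parameter with four candidate values would quadruple the grid to 576 combinations, while the random-search budget would stay at 10.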
Scikit-optimize provides a drop-in replacement for GridSearchCV that uses Bayesian optimization: a predictive model, referred to as a “surrogate”, models the search space and is used to arrive at a good combination of parameter values as quickly as possible.
Bayesian optimization, a model-based method for finding the minimum of a function, has recently been applied to machine learning hyperparameter tuning, with results suggesting this approach can achieve better performance on the test set while requiring fewer iterations than random search.
After applying Bayesian optimization along with cross-validation, the AUC values were:
- XGBoost: 0.5849
- Random Forest: 0.5810
Although the improvement is not significant, the Bayesian optimizer was able to perform the tuning operation with greater speed.
Insights
Such a low AUC score of 0.5849 may be attributed to the fact that we don't have many features in the dataset, which makes it difficult for the algorithm to classify the target variable correctly.
We also did not have much domain knowledge, which limited how much feature engineering we could perform.
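To put 0.5849 in context, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A value of 0.5 therefore corresponds to uninformative scores and 1.0 to a perfect ranking. The sketch below uses made-up labels and scores. The `auc` helper is illustrative and not the author's code; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
perfect = auc(labels, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = auc(labels, [0.5, 0.5, 0.5, 0.5])  # all scores tied
partial = auc(labels, [0.1, 0.8, 0.2, 0.9])        # one of four pairs misordered

print(perfect, uninformative, partial)
```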
Things TODO:
- We can use stacking of the above three algorithms, which can further improve the AUC
- Include the last column (class_id) to improve the results
If you are interested in the code, you can find my notebook here.