Let’s try our logistic regression again with the balanced training data.
Our recall score increased, but F1 is much lower than with either our baseline logistic regression or the random forest from above.
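To make the recall-versus-F1 comparison concrete, here is a minimal sketch of scoring a logistic regression with both metrics. The synthetic dataset is an assumption standing in for the article's fraud data; the metric calls are scikit-learn's real API.

```python
# Illustrative only: make_classification stands in for the article's dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Build a 95/5 imbalanced problem, roughly mimicking fraud data.
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Recall: share of actual positives caught; F1: balance of precision and recall.
print("recall:", recall_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))
```

Looking at both numbers side by side is what lets you notice the pattern described here: a resampling method can raise recall while dragging F1 down.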
Let’s see if undersampling might perform better here.
Resampling techniques — Undersample majority class

Undersampling can be defined as removing some observations of the majority class.
Undersampling can be a good choice when you have a ton of data (think millions of rows).
But a drawback is that we are removing information that may be valuable.
This could lead to underfitting and poor generalization to the test set.
We will again use the resample utility from scikit-learn (sklearn.utils.resample) to randomly remove samples from the majority class.
Again, we have an equal ratio of fraud to not fraud data points, but in this case a much smaller quantity of data to train the model on.
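A minimal sketch of that undersampling step, using sklearn.utils.resample. The synthetic DataFrame and its "Class" label column (1 = fraud) are assumptions standing in for the article's training data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Illustrative stand-in for the article's training DataFrame.
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=42
)
train = pd.DataFrame(X)
train["Class"] = y  # assumed label column: 1 = fraud, 0 = not fraud

fraud = train[train["Class"] == 1]
not_fraud = train[train["Class"] == 0]

# Downsample the majority class to the size of the minority class.
not_fraud_down = resample(
    not_fraud,
    replace=False,           # sample without replacement
    n_samples=len(fraud),    # match the minority class count
    random_state=27,
)

downsampled = pd.concat([not_fraud_down, fraud])
print(downsampled["Class"].value_counts())
```

Note that the balanced result is much smaller than the original training set, which is exactly the information-loss trade-off discussed above.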
Let’s again apply our logistic regression.
Undersampling underperformed oversampling in this case.
Let’s try one more method for handling imbalanced data.
Generate synthetic samples

A technique similar to upsampling is to create synthetic samples.
Here we will use imblearn’s SMOTE, or Synthetic Minority Oversampling Technique.
SMOTE uses a nearest neighbors algorithm to generate new and synthetic data we can use for training our model.
Again, it’s important to generate the new samples only in the training set to ensure our model generalizes well to unseen data.
After generating our synthetic data points, let’s see how our logistic regression performs.
Our F1 score increased, and recall is similar to that of the upsampled model above; for our data, SMOTE outperforms undersampling.
Conclusion

We explored 5 different methods for dealing with imbalanced datasets:

- Change the performance metric
- Change the algorithm
- Oversample minority class
- Undersample majority class
- Generate synthetic samples

It appears that for this particular dataset, random forest and SMOTE are among the best of the options we tried here.
These are just some of the many possible methods for dealing with imbalanced datasets; the list is by no means exhaustive.
Some other methods to consider are collecting more data or choosing different resampling ratios: you don't have to have exactly a 1:1 ratio! You should always try several approaches and then decide which is best for your problem.