A Deep Dive Into Imbalanced Data: Over-Sampling

To illustrate my point, I’ve put together a fictional data set:

As you can see, there are way more triangles than squares in this fictional data set.
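The plot of the fictional data set is not reproduced here; as a stand-in, here is a minimal sketch that builds and plots a comparably imbalanced toy data set, where the roughly 90/10 class split, the use of sklearn’s make_classification, and the triangle/square markers are assumptions made purely for illustration:

from collections import Counter

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Two features so the data set can be plotted; class 0 ("triangles") is the
# majority class, class 1 ("squares") the minority class.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

plt.scatter(X[y == 0, 0], X[y == 0, 1], marker="^", label="majority (triangles)")
plt.scatter(X[y == 1, 0], X[y == 1, 1], marker="s", label="minority (squares)")
plt.legend()
plt.show()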

Now, in order to train a more accurate classifier, we would like to utilize SMOTE to oversample the squares.

First, SMOTE finds the k-nearest-neighbors of each member of the minority class.

Let’s visualize that for one of the squares and assume that k equals three:

In the visualization above we have identified the three nearest neighbors of the orange square.

Now, depending on how much oversampling is desired, one or more of these nearest neighbors are going to be used to create new observations.

For the purpose of this explanation, let us assume that we are going to use two of the three nearest neighbors to create new observations.

The next and final step is to create new observations by randomly choosing a point on the line connecting the observation with its nearest neighbor:

The dashed lines represent the connection between the orange square and its green nearest neighbors.

The two red squares denote the new observations added to the data set by SMOTE.
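To make the interpolation step concrete, here is a minimal sketch of the idea in plain NumPy; the coordinates of the square and its neighbor are made-up values, and this illustrates the interpolation formula rather than imblearn’s actual implementation:

import numpy as np

rng = np.random.default_rng(42)

# A minority observation (the orange square) and one of its nearest neighbors
# (hypothetical coordinates chosen for illustration).
x = np.array([1.0, 2.0])
neighbor = np.array([2.0, 3.5])

# SMOTE-style interpolation: pick a random point on the segment between them.
lam = rng.uniform(0.0, 1.0)
x_synthetic = x + lam * (neighbor - x)
print(x_synthetic)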

SMOTE’s main advantage compared to traditional random naive over-sampling is that by creating synthetic observations instead of reusing existing observations, your classifier is less likely to overfit.

At the same time, you should always make sure that the observations created by SMOTE are realistic: the synthetic observations only help if they could plausibly have been observed in reality.

After going through the theory, it is time to implement SMOTE in imblearn. Running the code will output a class count showing that both classes now contain an equal number of observations.
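Since the original code chunk is not reproduced here, the following is a minimal sketch of how SMOTE can be applied with imblearn; the toy data set created with sklearn’s make_classification and its roughly 90/10 class split are assumptions made purely for illustration:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a fictional, imbalanced data set (roughly 90% majority, 10% minority).
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

# Over-sample the minority class with SMOTE (k_neighbors defaults to 5).
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("After SMOTE: ", Counter(y_resampled))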

Adaptive Synthetic (ADASYN)

Adaptive Synthetic (ADASYN) provides an alternative to using SMOTE.

Let us walk through the algorithm step by step.

First, ADASYN calculates the ratio of minority to majority observations:

d = mₛ / mₗ

where mₛ and mₗ denote the number of minority and majority observations, respectively. Next, ADASYN computes the total number of synthetic minority observations to generate:

G = (mₗ − mₛ) × β

Here, G is the total number of synthetic minority observations to generate and β denotes the desired ratio of minority to majority observations after resampling.

Thus, β = 1 would mean that there are equally as many observations in both classes after using ADASYN.

Third, ADASYN finds the k-nearest neighbors of each minority observation and computes an rᵢ value:

rᵢ = Δᵢ / k

where Δᵢ is the number of those k nearest neighbors that belong to the majority class. The rᵢ value measures the dominance of the majority class in the neighborhood.

The higher rᵢ, the more dominant the majority class and the more difficult the neighborhood is to learn for your classifier.

Let us calculate rᵢ for some fictional minority observation:

Three of the highlighted minority observation’s five nearest neighbors belong to the majority class.

Thus, rᵢ for this observation is equal to 3/5.
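To make this step concrete in code, here is a minimal sketch that computes rᵢ for each minority observation using sklearn’s NearestNeighbors; the toy coordinates and labels are made up for illustration:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data: label 1 marks the minority class, label 0 the majority class.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9],
              [0.1, 0.2], [0.9, 1.1], [0.5, 0.5], [0.4, 0.6]])
y = np.array([1, 0, 0, 0, 1, 0, 0, 0])

k = 5
# Ask for k + 1 neighbors because the closest "neighbor" of each query
# point is the point itself, which is dropped below.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X[y == 1])

# r_i = share of majority observations among the k nearest neighbors.
r = np.array([(y[neighbors[1:]] == 0).sum() / k for neighbors in idx])
print(r)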

Before actually computing the number of synthetic observations to create, ADASYN normalizes the rᵢ values of all minority observations so that they sum to one; the normalized values are denoted r̂ᵢ.

Next, ADASYN computes the number of synthetic observations to generate in each neighborhood:

Gᵢ = r̂ᵢ × G

Since Gᵢ is calculated using the respective normalized rᵢ value, ADASYN will create more synthetic observations in neighborhoods with a greater ratio of majority to minority observations.

Therefore, the classifier will have more observations to learn from in these difficult areas.
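As a small illustration of the last two steps, here is a sketch that normalizes some made-up rᵢ values and then allocates G synthetic observations across the neighborhoods; the numbers are assumptions, not output of ADASYN itself:

import numpy as np

# Hypothetical r values for five minority neighborhoods.
r = np.array([0.6, 0.2, 1.0, 0.4, 0.8])

# Normalize the r values so that they sum to one.
r_hat = r / r.sum()

# Suppose G = 50 synthetic observations should be generated in total.
G = 50
g = np.rint(r_hat * G).astype(int)

print(r_hat)  # normalized r values per neighborhood
print(g)      # synthetic observations to create per neighborhood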

Finally, ADASYN generates the synthetic observations:

One could do so using a linear combination of the observation of interest and one of its neighbors, similar to SMOTE, or utilize more advanced techniques such as drawing a plane between three minority observations and randomly selecting a point on that plane.

ADASYN’s main advantage lies in its adaptive nature: by basing the number of synthetic observations on the ratio of majority to minority observations, ADASYN places a higher emphasis on more challenging regions of the data.

As with SMOTE, after discussing the theory, it is time to look at the code. As you can see in the snippet below, imblearn’s syntax is easy to memorize and resembles sklearn’s syntax.
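The original snippet is likewise not reproduced here, so this is a minimal sketch of ADASYN in imblearn under the same illustrative assumptions (a toy data set from make_classification with a roughly 90/10 class split):

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

# Fictional, imbalanced data set (assumed purely for illustration).
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
print("Before ADASYN:", Counter(y))

# ADASYN mirrors the sklearn-style interface: instantiate, then fit_resample.
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X, y)
print("After ADASYN: ", Counter(y_resampled))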

Thus, if you are used to sklearn’s syntax, you won’t have to put in a lot of effort to include imblearn in your machine learning workflow.

SMOTE Extensions

As with most algorithms, there are several extensions of SMOTE.

These aim to improve SMOTE by adding to its functionality or lessening its weaknesses.

Examples of SMOTE extensions that can be found in imblearn include:

BorderlineSMOTE: Instead of oversampling between all minority observations, BorderlineSMOTE aims to increase the number of minority observations that border majority observations.

The goal here is to allow the classifier to be able to distinguish between these borderline observations more clearly.

SVMSMOTE: SVMSMOTE, as its name implies, utilizes the Support Vector Machine algorithm to generate new minority observations close to the border between the majority and minority classes.

Here is exemplary code for BorderlineSMOTE in imblearn:
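The original snippet is not reproduced here either, so this is a minimal sketch under the same illustrative assumptions; the SVMSMOTE lines at the end are included only to show that it follows the same interface:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE

# Fictional, imbalanced data set (assumed purely for illustration).
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)

# BorderlineSMOTE focuses the over-sampling on minority observations
# that lie close to majority observations.
bsmote = BorderlineSMOTE(random_state=42)
X_borderline, y_borderline = bsmote.fit_resample(X, y)
print("After BorderlineSMOTE:", Counter(y_borderline))

# SVMSMOTE uses the same interface.
svmsmote = SVMSMOTE(random_state=42)
X_svm, y_svm = svmsmote.fit_resample(X, y)
print("After SVMSMOTE:", Counter(y_svm))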

Conclusion

Dealing with imbalanced data can be extremely challenging. However, imblearn provides a neat way to incorporate techniques that combat imbalance into your sklearn-based machine learning workflow.

Once you understand these techniques and start utilizing them, imbalanced data should become a lot less intimidating.

Besides over-sampling, there are several other ways to attack class imbalance, such as under-sampling or combinations of the two.

In the next post of this deep-dive, I am going to tackle under-sampling in a similar fashion.

References:

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, pp. 321–357, 2002.

[2] H. He, Y. Bai, E. A. Garcia, S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328, 2008.

[3] H. Han, W. Wen-Yuan, M. Bing-Huan, “Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning,” Advances in Intelligent Computing, pp. 878–887, 2005.

[4] H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline over-sampling for imbalanced data classification,” International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), pp. 4–21, 2009.

[5] imbalanced-learn documentation.
