The main issue with identifying Financial Fraud using Machine Learning (and how to address it)

Strategies for dealing with imbalanced data

Gustavo Chávez, Mar 6

The sheer number of financial transactions that payment processors deal with on a daily basis is staggering, and only increasing: on the order of 70 million credit card transactions per day in 2012, with losses in the billions of dollars in 2017.

Simply because of this volume, determining whether a transaction is legitimate or fraudulent is a job that only a computer system can do.

The traditional machine learning approach is to build a classifier that supports the human in the loop: its goal is to reduce the number of transactions that a human has to investigate.

The challenge for machine learning classifiers is that the percentage of fraudulent transactions is on the order of 1–2%, which means that classifiers have to cope with a severe imbalance in the training data.
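To make the problem concrete, here is a minimal sketch of why plain accuracy is misleading at this level of imbalance. The function name and the counts (10,000 transactions, 100 of them fraudulent) are illustrative assumptions, not figures from a real dataset.

```python
def majority_baseline_metrics(n_total, n_fraud):
    """Accuracy and recall of a classifier that labels every transaction as legitimate."""
    y_true = [1] * n_fraud + [0] * (n_total - n_fraud)  # 1 = fraud, 0 = legitimate
    y_pred = [0] * n_total                              # always predict "legitimate"
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    recall = caught / n_fraud if n_fraud else 0.0
    return accuracy, recall


accuracy, recall = majority_baseline_metrics(n_total=10_000, n_fraud=100)
print(accuracy, recall)  # 0.99 0.0
```

A model that never flags anything looks 99% accurate while catching no fraud at all, which is why the rest of this article focuses on rebalancing the training data.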

[Video: the challenges that machine learning engineers have to go through while systematically detecting fraudulent transactions.]

Even synthetic datasets for financial fraud are skewed; see, for instance, the imbalance in this Kaggle dataset for predicting fraud in financial payment services. The data set contains more than 6 million transactions and 11 features.

The data imbalance issue is not exclusive to machine learning applications in finance; it also appears in applications such as detecting images from patients with rare diseases, classifying restricted objects in X-ray images at ports of entry, recognizing oil spills in images, etc.

[Photos courtesy of Unsplash]

In this article, I will describe the techniques that can be used to alleviate imbalance of data, with the goal of training binary classifiers with a balanced dataset.

For ease of exposition, we will create a synthetic dataset in two dimensions, although a typical financial dataset has more features; the previous Kaggle dataset, for instance, has 28 features.

In general, this is what an imbalanced dataset looks like:

[Figure: Dataset depicting a 99:1 class imbalance ratio.]

There are 1,000 samples in the yellow class (majority), and 10 samples in the orange class (minority).
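A dataset with these proportions can be sketched with the standard library alone. The blob centers, spreads, and the function name below are assumptions chosen for illustration; they are not taken from the notebook behind the figures.

```python
import random


def make_imbalanced_dataset(n_majority=1000, n_minority=10, seed=0):
    """Return (points, labels) for two Gaussian blobs in 2D.

    Label 0 is the majority class, label 1 the minority class.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n_majority):
        points.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(0)
    for _ in range(n_minority):
        # Hypothetical location: a tight blob on the edge of the majority cloud
        points.append((rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)))
        labels.append(1)
    return points, labels


points, labels = make_imbalanced_dataset()
print(labels.count(0), labels.count(1))  # 1000 10
```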

There are two general strategies to alleviate the imbalance: to reduce the majority class (undersample), or to generate synthetic data from the minority class (oversample).

We will discuss both.

Strategy 1. Undersampling the majority class

In this strategy, the idea is to reduce the number of samples of the majority class.


1.1 Random undersampling

The first strategy is to ignore data samples from the majority class.
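As a minimal sketch of this idea, assuming points are stored as a list of (x, y) tuples with a parallel list of labels (a representation chosen here for illustration), random undersampling keeps only as many majority samples as there are samples in the other class:

```python
import random


def random_undersample(points, labels, majority=0, seed=0):
    """Randomly drop majority samples until both classes have the same size."""
    rng = random.Random(seed)
    maj_idx = [i for i, lab in enumerate(labels) if lab == majority]
    other_idx = [i for i, lab in enumerate(labels) if lab != majority]
    # Keep a random subset of the majority class, as large as the rest of the data
    keep = rng.sample(maj_idx, min(len(other_idx), len(maj_idx)))
    kept = sorted(keep + other_idx)  # preserve the original ordering
    return [points[i] for i in kept], [labels[i] for i in kept]
```

In imbalanced-learn, the corresponding class is `RandomUnderSampler`.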

The following animation disregards random samples from the majority class until reaching a balanced dataset.

[Animation: Random undersampling]

1.2 Clustering undersampling

Another strategy is to reduce the majority samples to k samples, which correspond to the k centroids of the majority class.

These centroids are computed by the unsupervised k-means clustering algorithm.

The number of centroids is usually set to the number of samples in the minority class so that the whole dataset is balanced.
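The following is a dependency-free sketch of this procedure, assuming 2D points as tuples and at least one minority sample. The k-means here is a plain Lloyd iteration with a fixed number of passes, written for readability rather than speed; the helper names are illustrative.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans_centroids(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on 2D points; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids


def cluster_undersample(points, labels, majority=0, seed=0):
    """Replace the majority class by k centroids, k = size of the rest of the data."""
    maj = [p for p, lab in zip(points, labels) if lab == majority]
    rest = [(p, lab) for p, lab in zip(points, labels) if lab != majority]
    k = min(len(rest), len(maj))
    centroids = kmeans_centroids(maj, k, seed=seed)
    new_points = centroids + [p for p, _ in rest]
    new_labels = [majority] * k + [lab for _, lab in rest]
    return new_points, new_labels
```

Note that the returned majority points are centroids, so in general they are not actual transactions from the original data. In imbalanced-learn, the corresponding class is `ClusterCentroids`.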

[Animation: Clustering undersampling]

1.3 Tomek links

Another strategy for undersampling the majority class is to remove the so-called “Tomek links”.

A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors; the majority class sample of each such pair is the point that gets removed.

By removing these points, one gives more weight to the minority class samples by “decluttering” their surrounding space of points from the majority class.
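A brute-force sketch of Tomek link removal for a binary problem is shown below. It assumes at least two points, compares every pair of points, and is therefore only suitable for small datasets; the function names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def nearest_neighbor_index(i, points):
    """Index of the point closest to points[i], ignoring class labels."""
    candidates = (j for j in range(len(points)) if j != i)
    return min(candidates, key=lambda j: sq_dist(points[i], points[j]))


def remove_tomek_links(points, labels, majority=0):
    """Drop the majority member of every Tomek link (binary labels assumed)."""
    nn = [nearest_neighbor_index(i, points) for i in range(len(points))]
    drop = set()
    for i, j in enumerate(nn):
        # (i, j) is a Tomek link: mutual nearest neighbours with different labels
        if nn[j] == i and labels[i] != labels[j]:
            drop.update(idx for idx in (i, j) if labels[idx] == majority)
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


points = [(1.0, 0.0), (1.1, 0.0), (3.0, 0.0), (3.2, 0.0), (10.0, 10.0)]
labels = [1, 0, 0, 0, 1]
print(remove_tomek_links(points, labels)[1])  # [1, 0, 0, 1]
```

In this small example only the majority point at (1.1, 0.0) is removed, because it and the minority point at (1.0, 0.0) are each other's nearest neighbors. Unlike the two previous techniques, this one does not balance the classes by itself; it only cleans the boundary. In imbalanced-learn, the corresponding class is `TomekLinks`.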

[Animation: Tomek links undersampling]

Strategy 2. Oversampling the minority class

In this strategy, the idea is to augment the number of samples of the minority class.


2.1 Oversampling by random duplication of the minority class

In the first strategy, also the simplest to implement, one chooses random samples from the minority class and duplicates them.

Although straightforward, this might lead to overfitting.
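A minimal sketch of random duplication, assuming list-of-tuples points, a parallel label list, and at least one minority sample:

```python
import random


def random_oversample(points, labels, minority=1, seed=0):
    """Duplicate random minority samples until both classes have the same size."""
    rng = random.Random(seed)
    min_pts = [p for p, lab in zip(points, labels) if lab == minority]
    n_other = len(labels) - len(min_pts)
    n_extra = max(n_other - len(min_pts), 0)
    # Sampling with replacement: the same minority point may be copied many times
    extra = [rng.choice(min_pts) for _ in range(n_extra)]
    return list(points) + extra, list(labels) + [minority] * n_extra
```

Because every added point is an exact copy of an existing one, a classifier can end up memorizing those few locations, which is where the overfitting risk comes from. In imbalanced-learn, the corresponding class is `RandomOverSampler`.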

[Animation: Random oversampling]

2.2 Synthetic Minority Over-sampling (SMOTE)

The idea behind the SMOTE algorithm is to create “synthetic” data points along the vector between two samples from the minority class, where the second sample is chosen among the nearest neighbors of the first.



The new points are set at a random length along this vector.
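The interpolation step can be sketched as follows. This is a simplified variant that picks the base sample at random for every new point; the function name, the default of k=5 neighbors, and the requirement of at least two minority samples are assumptions of this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def smote(minority_points, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    n = len(minority_points)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)
        base = minority_points[i]
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbor_idx = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(base, minority_points[j]),
        )[:k]
        neighbor = minority_points[rng.choice(neighbor_idx)]
        gap = rng.random()  # random position along the segment, in [0, 1)
        synthetic.append((
            base[0] + gap * (neighbor[0] - base[0]),
            base[1] + gap * (neighbor[1] - base[1]),
        ))
    return synthetic
```

Every synthetic point lies on a segment joining two existing minority samples, so the new data stays inside the region already covered by the minority class. In imbalanced-learn, the corresponding class is `SMOTE`.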

See the following animation for an example on a simplified dataset:

[Animation: SMOTE interpolates over nearest neighbors.]

The following animation is applied to our previous example for illustration:

[Animation: Synthetic Minority Over-sampling (SMOTE) example.]


2.3 Adaptive Synthetic (ADASYN)

Similar to SMOTE, the ADASYN algorithm also generates new synthetic points, but it applies different weights to the minority samples to compensate for the skewed distributions.

In the SMOTE algorithm, an equal number of synthetic samples is generated for each minority sample; ADASYN instead generates more synthetic samples for minority samples that have more majority class points among their nearest neighbors.
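The weighting step alone can be sketched as below: each minority sample receives a share of the new points proportional to the fraction of majority samples among its k nearest neighbors. The function name is illustrative, and the equal-share fallback for the case where no minority sample has a majority neighbor is an assumption of this sketch, not part of ADASYN's definition. The synthetic points themselves would then be generated by interpolation, as in SMOTE.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def adasyn_allocation(points, labels, n_new, minority=1, k=5):
    """Return {minority sample index: number of synthetic points to generate}."""
    min_idx = [i for i, lab in enumerate(labels) if lab == minority]
    ratios = []
    for i in min_idx:
        # k nearest neighbours of sample i among ALL points, both classes
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sq_dist(points[i], points[j]),
        )[:k]
        n_majority = sum(1 for j in neighbors if labels[j] != minority)
        ratios.append(n_majority / len(neighbors))
    total = sum(ratios)
    if total == 0:
        # Assumption of this sketch: fall back to equal shares
        return {i: n_new // len(min_idx) for i in min_idx}
    # Rounding means the counts may not add up to exactly n_new
    return {i: round(n_new * r / total) for i, r in zip(min_idx, ratios)}
```

Minority samples deep inside their own cluster get few or no synthetic neighbors, while samples surrounded by the majority class get the most. In imbalanced-learn, the corresponding class is `ADASYN`.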

[Animation: Adaptive Synthetic (ADASYN)]

3. Oversampling and undersampling (SMOTE + Tomek links)

Lastly, a combination of oversampling (e.g. via SMOTE) and undersampling (e.g. via Tomek links) is perhaps the ideal path for dealing with imbalanced data.

The oversampling algorithm creates new instances to balance the classes, and the undersampling procedure removes points from the majority class that would otherwise take weight away from the precious samples of the minority class.

An example of such a strategy can be seen in the following animation:

[Animation: SMOTE + Tomek links]

Summary

The selection of a strategy will be problem dependent, and will ultimately be guided by what the business decides to favor: precision or recall.
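As a reminder of what that trade-off means, here is a small sketch that computes both metrics from true and predicted labels. The two prediction vectors are made-up examples: a cautious model that flags a single transaction and an aggressive one that flags six.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
aggressive = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(precision_recall(y_true, aggressive))  # (0.5, 1.0)
```

The cautious model never raises a false alarm but misses two of the three frauds (precision 1.0, recall 1/3); the aggressive one catches every fraud at the cost of investigating three legitimate transactions.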

Fortunately, experimenting with the performance of these techniques is straightforward via the Python library imbalanced-learn.
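A rough sketch of such an experiment is given below. It assumes imbalanced-learn is installed and that X is a 2D array of features with y the binary labels; the helper name `compare_resamplers` and the dictionary keys are illustrative. With default settings, SMOTE and ADASYN look at 5 nearest neighbors, so the minority class needs more than 5 samples.

```python
from collections import Counter


def compare_resamplers(X, y, random_state=0):
    """Return {technique name: class counts after resampling}."""
    # Imported inside the function so this module loads even where
    # imbalanced-learn is not installed.
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
    from imblearn.under_sampling import (
        ClusterCentroids,
        RandomUnderSampler,
        TomekLinks,
    )

    samplers = {
        "random undersampling": RandomUnderSampler(random_state=random_state),
        "cluster centroids": ClusterCentroids(random_state=random_state),
        "Tomek links": TomekLinks(),
        "random oversampling": RandomOverSampler(random_state=random_state),
        "SMOTE": SMOTE(random_state=random_state),
        "ADASYN": ADASYN(random_state=random_state),
        "SMOTE + Tomek links": SMOTETomek(random_state=random_state),
    }
    results = {}
    for name, sampler in samplers.items():
        # Every sampler exposes the same fit_resample(X, y) interface
        _, y_resampled = sampler.fit_resample(X, y)
        results[name] = Counter(y_resampled)
    return results
```

Calling `compare_resamplers(X, y)` on an imbalanced dataset gives a quick view of how each technique changes the class counts. Resampling should be applied to the training split only, so that the evaluation data keeps the original class proportions.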

Further reading:

Learning from imbalanced classes by Tom Fawcett
Resampling strategies for imbalanced datasets by Rafael Alencar

Source code:

In this repository you can find the Python code that was used to generate the experiments in this article, in the form of a Jupyter Notebook: https://github.com/gchavez2/code_machine_learning_algorithms

I am a postdoctoral fellow at the Lawrence Berkeley National Laboratory, where I work at the intersection of machine learning and high-performance computing.

If you find this article interesting, feel free to say hello over LinkedIn; I’m always happy to connect with other professionals in the field.

And as always: comments, questions and shares are highly appreciated!
