Overcoming challenges when designing a fraud detection system

A word on how oversampling our data and choosing the right model and metric can improve our prediction systems

Juan De Dios Santos, Mar 23

The recent advances in the field of AI have opened the doors to a plethora of smart and personalized methods of fraud detection.

What was once an area that required a great deal of manual labor is now one of the many that have experienced the progress of machine learning.

Nevertheless, it is important to note that this technology, like any other, is not perfect and is susceptible to problems. With so many individuals, enterprises, and services offering solutions based on predictive techniques to detect fraudulent transactions, we have to take a step back and consider the possible challenges that might arise.

I believe it is safe to say that most of the transactions are non-fraudulent.

While this is excellent, of course, it creates the most significant and prevalent issue in the area of fraud detection: imbalanced data.

Data is the most critical component of a machine learning predictive model, and an imbalanced dataset — which in this case is a dataset mostly made of non-fraudulent records — might result in a prediction system that won’t be able to learn about the fraudulent transactions properly.

In such a case, the straightforward solution would be to obtain more data; however, in practice, this is either expensive, time-consuming or borderline impossible.

Fortunately for us, some procedures and algorithms help with the problem of an imbalanced dataset.

The techniques of over-sampling and under-sampling allow us to modify the class distribution of a dataset.

As the name implies, over-sampling is a procedure used to create synthetic data that resembles the original dataset, while the goal of under-sampling is the opposite: the removal of data. In practice, over-sampling is more common than under-sampling, which is why you have probably heard more about the former than the latter.

The most popular over-sampling algorithm is the Synthetic Minority Over-sampling Technique, or SMOTE.

SMOTE is an algorithm that relies on the concept of nearest neighbors to create its synthetic data.

For example, in the image below we have an imbalanced dataset made of 12 observations and two continuous features, x and y.

Of these 12 observations, ten belong to the class X, and the remaining two to Y.

What SMOTE does is select a sample from the under-represented class and, for each point, compute its K nearest neighbors (for the sake of simplicity, let’s assume K=1).

Then, it takes the vector that goes from the data point to its nearest neighbor and multiplies it by a random value between 0 and 1, resulting in a new vector “a.”

Adding this vector to the original data point gives us our new synthetic data point.

The nearest neighbor of point number 1 is point number 2 and vice versa; thus, the new synthetic data point will be created somewhere along the dotted line.

Dataset enhanced with synthetic data

The following gist shows how to perform SMOTE in R.
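As a complement to the R gist, here is a minimal Python sketch of the interpolation step described above. It is an illustration under simplifying assumptions, not the author’s code or a library implementation: the function name smote_sketch, its parameters, and the two example points are all hypothetical, and only the minority class is passed in.

```python
import math
import random


def smote_sketch(minority, n_new, k=1, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    Hypothetical helper for illustration only; real projects would normally
    rely on an existing library rather than this sketch.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random point from the under-represented class.
        base = rng.choice(minority)
        # Find its k nearest neighbors among the other minority points.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Scale the vector from base to neighbor by a random value in [0, 1)
        # and add it to base, which places the new point on the segment
        # joining the two.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Two hypothetical minority observations with features (x, y).
minority = [(1.0, 1.0), (2.0, 3.0)]
new_points = smote_sketch(minority, n_new=8)
for point in new_points:
    print(point)
```

With K=1 and only two minority points, every synthetic point lands on the line segment between them, which is the dotted line in the example above.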

A second concept I’d like to bring to the table is the choice of the predictive system.

At first, we might think that a supervised learning system that classifies an action into fraud or not would be the most appropriate approach for this kind of problem.

While this might sound attractive, it is important to note that sometimes this is not enough and that other areas such as anomaly detection and unsupervised learning could help to find those noisy and anomalous points that might represent fraud.

Anomaly detection is the field of data mining that deals with the discovery of abnormal and rare events (also known as outliers) that diverge from what is considered to be normal.

At a basic level, these detection approaches are more statistically intensive, as they deal closely with distributions and with how much a data point deviates from them.

For example, the upper and lower inner fences, defined as UF = Q3 + (1.5 * IQR) and LF = Q1 - (1.5 * IQR), where IQR is the interquartile range, are one of the many techniques used to create a barrier that divides the data into what is considered normal and what is considered an outlier.
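To make the fences concrete, here is a small Python sketch that computes them with the standard library. The amounts are made-up values for illustration, and the quartiles come from statistics.quantiles with its default method; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

# Hypothetical transaction amounts; the last one is deliberately extreme.
amounts = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(amounts, n=4)
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr  # UF = Q3 + (1.5 * IQR)
lower_fence = q1 - 1.5 * iqr  # LF = Q1 - (1.5 * IQR)

# Anything outside the two fences is treated as an outlier.
outliers = [a for a in amounts if a < lower_fence or a > upper_fence]
print(lower_fence, upper_fence, outliers)
```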

Cute outlier. Image by Hans Braxmeier from Pixabay

Let’s explain this with a realistic albeit silly example.

Imagine there’s this guy, John, an early bird who regularly wakes up around 5:30 am (a real champ). However, last night, while celebrating his 32nd birthday, John had a little too much fun and woke up the next morning at 6 am (and felt terrible about it).

This “6 am” represents an anomaly in our dataset.

The following graph represents John’s last ten wake-up times, and we can see that the last one seems to be out of place.

More formally, we could say that this point falls far from the distribution of the other nine times; the three-sigma rule (three standard deviations from the mean), which is commonly used for detecting outliers, shows this.

Without the 10th time, the mean and standard deviation would be 318.89 and 2.09 (if we compute the time as minutes after midnight).

However, the 10th time (6 am, or 360 minutes after midnight) is way beyond the mean plus three times the standard deviation (318.89 + (2.09 * 3) = 325.16), indicating that this time is indeed an anomaly.
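The same check can be sketched in a few lines of Python. The nine wake-up times below are hypothetical: they are not the author’s data, and were chosen only so that their mean and sample standard deviation round to the 318.89 and 2.09 quoted above.

```python
from statistics import mean, stdev

# Hypothetical wake-up times in minutes after midnight (not the author's data).
usual_times = [316, 316, 318, 318, 319, 320, 320, 321, 322]
new_time = 6 * 60  # 6 am expressed as minutes after midnight

mu = mean(usual_times)
sigma = stdev(usual_times)  # sample standard deviation

# Three-sigma rule: flag anything more than 3 standard deviations from the mean.
lower_limit = mu - 3 * sigma
upper_limit = mu + 3 * sigma
is_anomaly = not (lower_limit <= new_time <= upper_limit)

print(round(mu, 2), round(sigma, 2), is_anomaly)
```

Note that the mean and standard deviation are computed without the suspicious point; including it would inflate the standard deviation and make the outlier harder to flag.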

On the 10th day, John had a rough wake up

Unsupervised learning, particularly clustering, can also be used to detect anomalous and noisy content in a dataset.

This kind of machine learning algorithm works under the assumption that similar observations tend to be grouped in the same clusters, while the noisy ones won’t.

Going back to John, and his fantastic sleeping pattern, if we cluster his data using k-means with k=2 we can observe how the anomalous point falls into a cluster of its own.
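A rough sketch of this idea in Python, using a tiny one-dimensional k-means written from scratch. The function kmeans_1d, its deterministic initialization, and the wake-up times are assumptions made for illustration; they are not taken from the author’s appendix.

```python
from statistics import mean


def kmeans_1d(values, k=2, iterations=20):
    """Very small 1-D k-means (Lloyd's algorithm); assumes k >= 2."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the centroids evenly between min and max.
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its closest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical wake-up times in minutes after midnight; the last one is 6 am.
wake_up_times = [316, 316, 318, 318, 319, 320, 320, 321, 322, 360]
centroids, clusters = kmeans_1d(wake_up_times, k=2)
print(clusters)
```

On this data, the nine usual times end up together and the late wake-up sits alone in the second cluster. A cluster with a single member is only a hint of an anomaly, and the result depends on the choice of k and on how the centroids are initialized.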

The lonely red point on the right represents an anomaly, a point that is dissimilar to the rest of the dataset

Lastly, the choice of performance metric plays a significant role when training our system.

A standard score like accuracy won’t be of much use if the dataset is imbalanced.

For example, suppose that only 2% of the records in the test dataset are actual fraudulent transactions.

If the model classifies all of the cases as non-fraudulent, the accuracy would be 98%; a good number, but a meaningless one in this case, since not a single fraud was caught.

Some metrics that would be more appropriate for this use case are precision, recall, and Cohen’s Kappa coefficient.
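The 2% scenario can be worked through in Python to see how these metrics react. This is a minimal sketch using hand-written formulas; the helper name evaluate and the choice to report an undefined precision as 0.0 are assumptions made for illustration.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    # Precision is undefined when nothing is flagged; report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Cohen's kappa compares observed agreement with agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return accuracy, precision, recall, kappa


# 100 transactions, 2 of them fraudulent (label 1), as in the 2% example.
y_true = [1, 1] + [0] * 98
# A model that labels every transaction as non-fraudulent.
y_pred = [0] * 100

accuracy, precision, recall, kappa = evaluate(y_true, y_pred)
print(accuracy, precision, recall, kappa)
```

Accuracy comes out at 0.98, while recall and Cohen’s Kappa are both 0, which reflects that the model has learned nothing about the fraudulent class.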

Moreover, the type of error (false positives or false negatives) that we are optimizing for is also of great importance.

There will be cases in which we should accept a higher false positive rate in exchange for a lower false negative rate, and the other way around.
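One common way this trade-off shows up is in the decision threshold applied to a model’s fraud score. The Python sketch below uses invented scores and labels, and the helper count_errors is hypothetical; it only illustrates how moving the threshold shifts errors from one type to the other.

```python
def count_errors(scores, labels, threshold):
    """Count false positives and false negatives at a given score threshold."""
    flagged = [s >= threshold for s in scores]
    false_positives = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    false_negatives = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    return false_positives, false_negatives


# Invented fraud scores from some model, with the true labels (1 = fraud).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

# A low threshold flags more transactions: more false alarms, fewer missed frauds.
low = count_errors(scores, labels, threshold=0.3)
# A high threshold flags fewer: fewer false alarms, more missed frauds.
high = count_errors(scores, labels, threshold=0.7)
print(low, high)
```

Which side to favor depends on the relative cost of a missed fraud versus a legitimate transaction that is wrongly blocked.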

The demand to combat fraud, scam and spam activities will always be there.

Even with all the recent advances and breakthroughs in the area of AI, we will still encounter difficulties during our problem-solving quests.

In this article, I talked about three of these difficulties: the lack of a balanced dataset, the choice of predictive system and the selection of an appropriate evaluation metric, and offered some pointers and options to consider with the goal of improving the quality of our predictions and detections.

An appendix with the code used to generate the images is available on my GitHub.

juandes/fraud-challenges-appendix (github.com)

Thanks for reading.

Juan De Dios Santos (@jdiossantos) | Twitter