Extremely Imbalanced data — Fraud detection

This is because we can predict all the isFraud=0 cases perfectly, but none of the isFraud=1 cases.

So out of the two classes, we can predict only one, which gives us an ROC AUC no better than random guessing.
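To see why a model that never flags fraud can look accurate and still score poorly on ROC AUC, here is a small sketch. It assumes the metric is computed from hard 0/1 predictions (as with model.predict), in which case the area under the curve reduces to the average of the true positive rate and the true negative rate. The toy labels and the helper name hard_label_roc_auc are made up for illustration.

```python
def hard_label_roc_auc(y_true, y_pred):
    """ROC AUC when the 'scores' are just 0/1 predictions.

    The ROC curve then has a single interior point (FPR, TPR), and the
    area under it works out to (TPR + TNR) / 2.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (tp / n_pos + tn / n_neg) / 2


# Toy data (made up): 95 clean rows and 5 fraud rows.
y_true = [0] * 95 + [1] * 5
always_clean = [0] * 100  # a "model" that never predicts fraud

accuracy = sum(t == p for t, p in zip(y_true, always_clean)) / len(y_true)
auc_always_clean = hard_label_roc_auc(y_true, always_clean)
print("accuracy:", accuracy)
print("ROC AUC:", auc_always_clean)
```

Accuracy rewards the majority class, while the ROC AUC gives both classes equal say, which is why it exposes this kind of model.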


To level the playing field for our models, we can over-sample the fraud transactions, or under-sample the clean ones.

We can do this using the imbalanced-learn library.

    from imblearn.under_sampling import RandomUnderSampler

    # Keep only the numeric feature columns; isFraud is the target.
    X = df.drop(['isFraud', 'type', 'nameOrig', 'nameDest'], axis=1)
    y = df.isFraud

    # Keep every fraud row and randomly drop clean rows.
    rus = RandomUnderSampler(sampling_strategy=0.8)
    X_res, y_res = rus.fit_resample(X, y)
    print(X_res.shape, y_res.shape)
    print(pd.Series(y_res).value_counts())

The sampling_strategy for the RandomUnderSampler is set to 0.8 here. This is just to show what happens when we do this.

It allows us to specify the ratio of minority class samples to majority class samples.
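As a rough illustration of what that ratio means, here is a plain-Python sketch of random under-sampling. It is not the imbalanced-learn implementation; the helper random_undersample, the toy label counts, and the choice to round the majority count down are all assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(labels, ratio, seed=0):
    """Return indices to keep so that minority/majority is about `ratio`.

    Assumes exactly two classes and that the majority class is large
    enough to sample from.
    """
    counts = Counter(labels)
    minority, majority = sorted(counts, key=counts.get)
    # Assumption: round the number of majority rows to keep down.
    n_majority_keep = int(counts[minority] / ratio)
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    # Every minority row is kept; majority rows are sampled without replacement.
    return sorted(rng.sample(majority_idx, n_majority_keep) + minority_idx)


# Toy labels (made up): 50 clean rows (0) and 8 fraud rows (1).
labels = [0] * 50 + [1] * 8
kept = random_undersample(labels, ratio=0.8)
resampled_counts = Counter(labels[i] for i in kept)
print(resampled_counts)
```

With a ratio of 1.0 the same sketch would keep as many clean rows as fraud rows, which is the fully balanced case.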

It gives us 18479 rows of data. Let’s see what our table looks like after the resampling and dropping those columns:

    cols_numeric = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
                    'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud']
    df_rus = pd.DataFrame(X_res, columns=cols_numeric)
    df_rus.head()

Now let’s split the dataset into 3 parts: training, validation, and test datasets.

The validation dataset we can use again and again with different models.

Once we think we’ve got the best model, we will use our testing dataset.

The reason we do it this way is that our model should not just give us good results with a part of the training dataset, but should also provide good results with data that we have never seen before.

This is what will happen in real life.

By keeping aside the test dataset to be used only once, we force ourselves to not overfit on the validation dataset.

This is what Kaggle competitions do as well.

The train_size/val_size/test_size parameters give the fraction of the total dataset that should be reserved for training/validation/testing.
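One detail worth spelling out: carving three parts out of a dataset with a two-way splitter takes two passes, and the fraction used in the second pass has to be rescaled because it applies to what is left after the first. The sketch below only does that arithmetic; the helper name two_stage_fractions is hypothetical.

```python
def two_stage_fractions(train_size=0.8, val_size=0.1, test_size=0.1):
    """Fractions to pass to two successive two-way splits."""
    # Pass 1: hold out the test fraction from the full dataset.
    stage1_test = test_size
    # Pass 2: the validation fraction is relative to the remaining rows.
    stage2_val = val_size / (train_size + val_size)
    return stage1_test, stage2_val


stage1_test, stage2_val = two_stage_fractions()
remaining = 1 - stage1_test
val_share_of_total = remaining * stage2_val
print("second-pass fraction:", stage2_val)
print("validation share of the full dataset:", val_share_of_total)
```

So an 80/10/10 split means holding out 10% first, then one ninth of the remaining 90%.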

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
    from sklearn.model_selection import train_test_split

    def train_validation_test_split(
            X, y, train_size=0.8, val_size=0.1, test_size=0.1,
            random_state=None, shuffle=True):
        # The three fractions must add up to 1.
        assert int(train_size + val_size + test_size + 1e-7) == 1
        # First split off the test set ...
        X_train_val, X_test, y_train_val, y_test = train_test_split(
            X, y, test_size=test_size,
            random_state=random_state, shuffle=shuffle)
        # ... then split the remainder into training and validation sets.
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_val, y_train_val,
            test_size=val_size / (train_size + val_size),
            random_state=random_state, shuffle=shuffle)
        return X_train, X_val, X_test, y_train, y_val, y_test

    X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(
        X_res, y_res, train_size=0.8, val_size=0.1, test_size=0.1, random_state=1)

    class_weight = {0: 4, 1: 5}
    model = LogisticRegression(class_weight=class_weight)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_val)
    print(classification_report(y_val, y_pred))
    print('accuracy', accuracy_score(y_val, y_pred))
    roc_auc_score(y_val, y_pred)

Notice the class_weight parameter.

We put that there because the under-sampled data has roughly 10000 rows for isFraud=0 and 8000 for isFraud=1. We want to weight them so they are balanced. The ratio that does that is 4:5, which are the class weights used here. If we had under-sampled without sampling_strategy=0.8, we would have balanced classes and would not need the class_weight parameter. If we do get a fresh dataset that has slightly imbalanced classes, we could use LogisticRegression with class weights to balance it, without resampling.
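The 4:5 weights can be checked with a little arithmetic, using the rounded counts quoted above. For comparison, the sketch also computes the weights that scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * count); that option is not used in this post, and the helper name balanced_class_weights is made up.

```python
def balanced_class_weights(class_counts):
    """Weights following the n_samples / (n_classes * count) heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Rounded row counts after under-sampling, as quoted in the text.
counts = {0: 10000, 1: 8000}

# Manual weights from the post: each class ends up with the same total weight.
manual = {0: 4, 1: 5}
weighted_totals = {label: counts[label] * manual[label] for label in counts}
print(weighted_totals)

# The heuristic gives different numbers but the same 4:5 ratio.
balanced = balanced_class_weights(counts)
print(balanced, balanced[0] / balanced[1])
```

Only the ratio between the two weights matters for balancing the classes, so {0: 4, 1: 5} and the heuristic's values push the model in the same direction.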

Now we get an accuracy score of 0.90, which is a good score. Our ROC AUC score is also good.


Now let’s try our model on the test dataset:

    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    print('Accuracy', accuracy_score(y_test, y_pred))
    print('ROC AUC score:', roc_auc_score(y_test, y_pred))

Again a good score.


Looks like the RandomUnderSampler has done a good job.

We still have to apply our model to the full (unsampled) dataset. Let’s do that next.

    y_pred = model.predict(X)
    print(classification_report(y, y_pred))
    print('Accuracy:', accuracy_score(y, y_pred))
    print('ROC AUC score:', roc_auc_score(y, y_pred))

This is more interesting.

The Accuracy and ROC AUC scores are really good, as is the precision/recall/f1-score for isFraud=0.

The problem is that the precision for isFraud=1 is very low. Since the f1-score is the harmonic mean of precision and recall, it is also low.


Maybe we do not have enough data here to do well with Logistic Regression.

Or maybe we should have oversampled instead of undersampled.
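Part of the drop in precision is a base-rate effect rather than a modelling failure, and a short calculation makes that visible. All the numbers below are hypothetical: the 0.9 true positive rate, the 0.1 false positive rate, and the size of the full dataset are chosen only to show the mechanism, and are not results from this post.

```python
def precision_recall_f1(n_fraud, n_clean, tpr, fpr):
    """Precision, recall and F1 for given class sizes and error rates."""
    tp = tpr * n_fraud   # fraud rows correctly flagged
    fp = fpr * n_clean   # clean rows wrongly flagged
    precision = tp / (tp + fp)
    recall = tpr
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Same hypothetical classifier, two different class mixes.
resampled = precision_recall_f1(n_fraud=8000, n_clean=10000, tpr=0.9, fpr=0.1)
full = precision_recall_f1(n_fraud=8000, n_clean=6_000_000, tpr=0.9, fpr=0.1)

print("resampled (precision, recall, f1):", resampled)
print("full data (precision, recall, f1):", full)
```

Recall does not move, because it only depends on the fraud rows. Precision falls sharply, because even a modest false positive rate applied to millions of clean rows swamps the true positives.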

We will continue with this dataset, applying many techniques to it in future blog posts.

Many thanks to Ryan Herr, an instructor at Lambda School for providing the train_validation_test_split() function.

