Bank Loan Default Prediction

Tianhao Wu, Mar 14

Introduction

Loans are usually one of the most important products of a bank.

Acting as a provider of loans is one of the main activities of financial institutions such as banks.

Practically, the funds of a bank are mainly used for lending activities.

On the other hand, a bank will face a huge loss when a loan turns default.

Therefore, banks pay close attention to detecting and predicting default behavior among their customers.


Terminology explanation

(a). Bank loan

A bank loan is the lending of money by a bank to other individuals, organizations, corporations, etc.

The recipient (i.e., the borrower) incurs a debt, and is usually liable to pay interest on that debt until it is repaid, and also to repay the principal amount borrowed.

Bank loans are good for financing investment in fixed assets (such as plant & machinery, land, and buildings).

The interest rate can be either fixed or variable.


(b). Default

The term default means failing to meet the legal obligations (or conditions) of a loan, for example when a home buyer fails to make a mortgage payment, or when a corporation or government fails to pay a bond that has reached maturity.


Project objective

To prevent loans from defaulting, banks need to figure out how to make predictions based on customers’ behavior.

Machine learning models appear to be one of the most effective solutions for predicting loan default.

Therefore, the objective of this project is to build supervised models for loan default prediction and to explore how customer behavioral factors affect those predictions.


Project Workflow

Data Preprocessing

1. Understanding the dataset

The data set consists of eight tables related to the clients and their accounts.

The column “status” in the table “loan” is the target variable, which shows the current status of a loan.

It has been classified into four categories:

A. Contract completed: loan paid and closed
B. Contract completed: loan not paid
C. Running contract: customer making regular payments
D. Running contract: customer in debt

Each account has both static characteristics (e.g., the date of creation, the address of the branches) given in “account” and dynamic characteristics (e.g., payments debited or credited, balances) given in “permanent order” and “transaction”.

One client can have several accounts, and several clients can operate a single account; clients and accounts are linked through the relation “disposition”.

“loan” and “credit card” describe some services which the bank offers to its clients; several credit cards can be issued to an account, but at most one loan can be granted per account.

“demographic data” gives some publicly available information about the districts (e.g., the unemployment rate); additional information about the clients can be deduced from this.


2. Loading data into MySQL database

These eight tables are loaded into a MySQL database separately.

The data in these tables are not clean enough for modeling.

Almost all the date columns are in the wrong format, and some columns contain unnecessary punctuation.

For further exploration, we need to cleanse the data.

Data Exploration

After cleansing the data in MySQL Workbench, we use Python to connect to the MySQL server and load the data into Pandas DataFrames for exploration and visualization.

Because the objective is to make predictions on default, the loan table which has loan status should be the main table.

Therefore, we need to join all the other tables to the loan table based on the common account IDs.

Then we explore the whole data set to examine the relationship between loan status and the other variables.
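The join described above can be sketched in Pandas as follows. This is a minimal illustration on toy data, not the project's actual code: the column names (`account_id`, `district_id`, `order_amount`) and the choice to aggregate the one-to-many order table before joining are assumptions.

```python
import pandas as pd

# Toy stand-ins for three of the eight tables; all column names are assumed.
loan = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "account_id": [10, 20, 30],
    "amount": [80000, 30000, 120000],
    "status": ["A", "B", "C"],
})
account = pd.DataFrame({
    "account_id": [10, 20, 30, 40],
    "district_id": [1, 2, 1, 3],
})
order = pd.DataFrame({
    "account_id": [10, 10, 20, 40],
    "order_amount": [1500.0, 700.0, 2500.0, 900.0],
})

# Collapse the one-to-many table to one row per account first,
# so that the join cannot duplicate loans.
order_per_account = order.groupby("account_id", as_index=False)["order_amount"].sum()

# Left joins keep every loan, even when a side table has no matching row.
merged = (
    loan.merge(account, on="account_id", how="left", validate="one_to_one")
        .merge(order_per_account, on="account_id", how="left", validate="one_to_one")
)
print(merged)
```

Keeping the loan table on the left side of every join means the result still has exactly one row per loan, which is what a per-loan label needs.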


Labeling the dataset

Before making the comparison, we need to verify which classes appear in loan status. The distribution of loan amount for each status class is:

[Figure: Loan Amount Distribution - Status Level]

Where:

1. “A” stands for finished contracts, no problems.
2. “B” stands for finished contracts, loan not paid.
3. “C” stands for running contracts, OK so far.
4. “D” stands for running contracts, clients in debt.

Rather than building a multiclass supervised model, a binary model is more suitable for predicting whether a loan will default.

As a result, we label the two classes “A” and “C” as “0” which means the loans do not default; and label the other two classes “B” and “D” as “1” which represents the defaulted loans.

The default rate of the dataset is around 11.14%:

[Figure: Loan Status Proportion]

Now that the labels are converted, the next step is to explore the relationships between variables.
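The relabeling described above can be sketched as a simple mapping. The status values below are made up for illustration, so the resulting rate is not the 11.14% reported for the real data.

```python
import pandas as pd

# Made-up loan statuses, for illustration only.
status = pd.Series(["A", "B", "C", "D", "C", "A", "C", "C", "A"], name="status")

# "A" and "C" are good loans (0); "B" and "D" are defaulted loans (1).
label_map = {"A": 0, "C": 0, "B": 1, "D": 1}
default = status.map(label_map)

# The mean of a 0/1 label is the default rate.
default_rate = default.mean()
print(f"default rate: {default_rate:.2%}")
```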


Variables exploration

(1).

(a). Loan Monthly Payments vs. Loan Amount

We plot the distribution of monthly loan payments against loan amount for each status.

The good and defaulted loans differ markedly, so these two columns could be strong predictors for a machine learning model.

[Figure: Monthly Loan Payment vs. Loan Amount]

(b). Approved Year vs. Loan Amount

We compare the approval year of loans with loan amount, splitting each year into two sections, one for each status.

The numbers of good and defaulted loans differ considerably in every year except 1996.

The default rate trends downward from 1993 to 1998, although it is roughly flat between 1994 and 1997.

[Figure: Approved Year vs. Loan Amount - Status Level]

(c). Loan Duration vs. Loan Amount

Similar to the previous plot, we compare loan duration with loan amount.

Loans with a 12-month duration have the lowest default rate (around 8.40%), whereas loans with a 24-month duration have the highest (a little over 12%).

[Figure: Loan Duration vs. Loan Amount - Status Level]

(2). Gender

Sometimes gender can be useful when making predictions, so we plot the gender distribution of good and defaulted loans, as well as the default proportion for males and females.

Females hold a slightly larger share of the loans than males, while their share of the defaulted loans is lower than that of males.

However, the default rates of the two genders are similar (about 13.37% for males and about 11.73% for females).

Unfortunately, this suggests that gender might not be very helpful for default prediction.
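A per-group default rate like the ones quoted above can be computed with a group-by over the binary label. The sketch below uses made-up rows and assumed column names (`gender`, `default`); the same pattern would apply to duration, region, or district.

```python
import pandas as pd

# Made-up rows; "default" is the 0/1 label, "gender" an assumed column name.
loans = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "default": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Default rate per group = mean of the 0/1 label within each group.
rate_by_gender = loans.groupby("gender")["default"].mean()
print(rate_by_gender)
```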


(a). Loan Amount vs. Default Amount

[Figure: Loan Amount vs. Default Amount - Gender]

(b). Default Proportion

[Figure: Default Proportion - Gender]

(3). Order Amount

Order amount refers to the permanent orders made by each debit account.

It reflects how active an account is.

Good and defaulted loans differ considerably in order amount, so it might be a good predictor.

[Figure: Order Amount]

(4). Transaction Amount vs. Transaction Balance

The transaction table records most of the activity on an account.

It may therefore hold the most important information for default prediction.

We take two columns from the transaction table and plot their distribution for defaulted and non-defaulted loans.

The transaction amount column records the amount of each transaction, while the transaction balance is the account balance after a transaction.

The heat map for defaulted loans contains a region with no instances, which suggests that loans whose transaction amount and balance fall within that range are unlikely to default.

Consequently, these two variables could be very useful.

[Figure: Transaction Amount vs. Transaction Balance]

(5). Geography

Geographical data often reveal patterns, so they are worth exploring.

As an illustration, we take the region and district columns and plot the default rate for each.


(a). Region

Three major default-rate ranges can be seen at the region level.

“north Bohemia” has the lowest rate (approximately 1.64%), with a large gap to the second lowest.

[Figure: Default Rate - Region]

(b). District

The default rate varies even more widely between districts than between regions; some districts have a default rate of 0.

[Figure: Default Rate - District]

(6). Demographic

Demographic data also matter for analysis and prediction.

Some demographic variables can be strong predictors of default.

As an illustration, we take a few logically related columns from the demographic table and plot the distributions of good and defaulted loans with respect to these variables.


(a). The ratio of Urban Inhabitants vs. Number of Inhabitants

The distributions of the two loan statuses do not differ much with respect to these two variables.

However, defaulted loans appear to be fewer where the ratio of urban inhabitants and the number of inhabitants are higher.

[Figure: The ratio of Urban Inhabitants vs. Number of Inhabitants]

(b). The ratio of Urban Inhabitants vs. Average Salary

Similar to above, observations with a higher ratio of urban inhabitants and a higher average salary seem to have fewer defaulted loans.

[Figure: The ratio of Urban Inhabitants vs. Average Salary]

(c). The ratio of Urban Inhabitants vs. Number of Entrepreneurs per 1000 Inhabitants

[Figure: The ratio of Urban Inhabitants vs. Number of Entrepreneurs per 1000 Inhabitants]

(d). Entrepreneurs per 1000 Inhabitants vs. Average Salary

[Figure: Entrepreneurs per 1000 Inhabitants vs. Average Salary]

Machine Learning

Based on our understanding and exploration of the dataset, we will build a supervised machine learning model using Python and Scikit-learn.


1. Categorical values to binary variables

Note that after joining the tables, some columns hold categorical values, which need to be converted to binary variables.
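One common way to do this conversion in Pandas is one-hot encoding with `pd.get_dummies`. The sketch below assumes this approach and uses hypothetical column names and values; the article does not say which encoding function was actually used.

```python
import pandas as pd

# Hypothetical joined table with two categorical columns.
df = pd.DataFrame({
    "amount": [80000, 30000, 120000],
    "frequency": ["monthly", "weekly", "monthly"],
    "gender": ["F", "M", "F"],
})

# Each categorical column becomes one 0/1 column per distinct value,
# named "<column>_<value>"; numeric columns are left untouched.
encoded = pd.get_dummies(df, columns=["frequency", "gender"], dtype=int)
print(encoded)
```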


2. Model selection with cross-validation

In this section, we try a few supervised models with 10-fold cross-validation, using all the feature columns and the default hyper-parameter settings.

The purpose is to select the model that fits the dataset best.
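A comparison of this kind can be sketched with Scikit-learn's `cross_val_score`. The data here are synthetic (`make_classification` with roughly 11% positives), the F1 scoring is an assumption, and `max_iter` and `n_estimators` are set away from their defaults only to keep the example fast and warning-free, so the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the joined loan table.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.89, 0.11], random_state=0,
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the share of defaults similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: mean F1 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean makes it easier to see whether one model is consistently better or only better on some folds.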


Logistic regression

Logistic regression performs poorly, and other linear models may perform similarly.

Therefore, we try a tree-based model next.

Random forest classifier

The random forest classifier performs a lot better than logistic regression.

It seems we are on the right track.

However, the model is still overfitted, and we need to tune the hyper-parameters later.

Next, we will try a gradient boosting classifier as a comparison to this random forest classifier.

Gradient boosting classifier

The gradient boosting classifier performs even better than the random forest classifier.

Therefore, we will tune a gradient boosting classifier to build a model for default predictions.


ROC curve comparison

The AUC scores of the random forest classifier and the gradient boosting classifier are almost 100%, much higher than that of logistic regression.
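ROC curves and AUC scores of this kind can be computed from predicted probabilities on a held-out set. The following is a minimal sketch on synthetic data with an assumed 75/25 split; it only shows the mechanics and does not reproduce the article's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and an assumed 75/25 stratified split.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

roc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class (label 1).
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    roc[name] = {"fpr": fpr, "tpr": tpr, "auc": roc_auc_score(y_test, proba)}
    print(f"{name}: AUC = {roc[name]['auc']:.3f}")
```

The `fpr` and `tpr` arrays are what would be passed to a plotting library to draw each curve.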

[Figure: ROC curve]

3. Grid search with cross-validation

Now that we have selected a model for supervised learning, the next step is to perform a grid search to find the best hyper-parameter setting.

Five-fold cross-validation is also applied to reduce the risk of overfitting.
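A grid search with five-fold cross-validation can be sketched with Scikit-learn's `GridSearchCV`. The parameter grid below is hypothetical (the article's actual grid is not reproduced here), the data are synthetic, and F1 scoring is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Hypothetical grid; the parameters and ranges tuned in the project may differ.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,          # five-fold cross-validation for every combination
    scoring="f1",
)
search.fit(X, y)

best_params = search.best_params_
# Mean and spread of the F1 score across folds for the winning combination.
mean_f1 = search.cv_results_["mean_test_score"][search.best_index_]
std_f1 = search.cv_results_["std_test_score"][search.best_index_]
print(best_params, round(mean_f1, 3), round(std_f1, 3))
```

Looking at both the mean and the standard deviation across folds is what allows the before-and-after comparison of average score and variability discussed in this section.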

The parameters we will be tuning are:

The best hyper-parameter values found by searching through the previous settings are:

4. Classification report

The classification report with the best parameter setting:

The F1 score on the training set is slightly lower than the score before applying grid search, but the performance on the test set stays the same.

This is probably because the model with the default settings was overfitted.

If we compare the average F1 score and the average standard deviation of the models before and after the search, we can see that the average F1 score has increased and the average standard deviation has decreased.

As a result, the model with the best hyper-parameter setting has a lower variance and a higher bias than the model with the default setting.

Hence, the model using the best settings found by searching through the grids is more robust.


Variable importance

We have now trained a decent model, but we still do not know how much each variable contributes, so we plot a bar chart to visualize the variable importance.

As an illustration, we only plot the top twenty variables of the models.

The variables are sorted by importance scores generated by the best model found by grid search.
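With a fitted tree ensemble in Scikit-learn, importance scores are exposed as `feature_importances_`, and a sorted top-k list like the one described above can be built as follows. The data and the feature names (`feature_0`, ...) are synthetic placeholders, not the project's variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)

top = importance.head(5)  # the article plots the top twenty
print(top)
# A bar chart could then be drawn with, e.g., top.plot.barh() if matplotlib is installed.
```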

Notice that in the variable importance chart, the first variable, “tra_account_SANKC. UROK”, carries significantly more weight than the second.

[Figure: Variable Importance]

(3). Decision boundaries

To get a basic idea of how the variables help make predictions, we take a few variables from the variable importance chart and visualize the decision boundaries.

The variables are taken in pairs, since two are needed to plot each set of boundaries.

Each point in the figures represents one loan instance.

The blue points represent the defaulted loans, whereas the white points represent the good loans.

Points that fall in the blue area are predicted as defaulted loans, while points that fall in the red area are predicted as good loans.
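A two-variable decision boundary of this kind is typically drawn by predicting on a dense grid over the two features. The sketch below assumes a model trained on just the chosen pair, uses synthetic data, and stops short of plotting so that it only needs NumPy and Scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for one pair of variables.
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Build a 100 x 100 grid covering the range of both variables.
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))

# Predict a class for every grid point; regions of equal class form the colored areas.
grid = np.c_[xx.ravel(), yy.ravel()]
zz = model.predict(grid).reshape(xx.shape)

# With matplotlib, the figure could be drawn as:
#   plt.contourf(xx, yy, zz, alpha=0.3)
#   plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
```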

[Figure: Decision boundaries]

Summary

In this project, we built a supervised machine learning model from scratch for predicting loan default.

We used MySQL to load and cleanse the dataset, transformed the data into Pandas DataFrames using Python, preprocessed the data for exploration and modeling, and trained a gradient boosting classifier for default prediction using Scikit-learn.

The model, with the best hyper-parameter setting found by grid search, performs excellently, with a 1.00 F1 score on both labels (default and not default).


1. Linear models such as logistic regression did not fit the dataset well, whereas tree-based models such as the random forest classifier and the gradient boosting classifier performed well.

[Figure: ROC curve]

2. The gradient boosting classifier seems to fit the dataset best, so we used it to build the machine learning model. By applying grid search over the selected hyper-parameter ranges, the model overcame overfitting and became more robust, with lower variance (a lower standard deviation) at the cost of relatively higher bias.

[Figure: F1 Score Comparison]

3. The variable “tra_account_SANKC. UROK” showed a significant ability to help predict defaulted loans.

[Figure: Variable Importance]

4. The decision boundaries of “tra_account_SANKC. UROK” and other variables with relatively high importance scores are quite clear.

[Figure: Decision Boundary]
