Machine learning for Customer Analytics - 1

Customer response prediction using Logistic Regression

Vivek Vijayaraghavan
Apr 28

Context

Personal loans are a major revenue generating mechanism for banks and all banks reach out to potential customers to campaign for their loan offerings.
Most of these campaigns reach out to a random database of customers and hence end up as annoying telemarketing calls rather than an efficient means of lead conversion.
In this article, we will see how we could harness the power of machine learning to target the campaigns towards the right set of customers, thereby increasing conversion propensity.
We will be using past data available on the demography, bank details, and transaction patterns of customers who have responded and not responded to a personal loan campaign, as training data to predict the probability that a customer will respond to the campaign.
In machine learning terminology, this is a classification problem and there are several classification algorithms available to build a prediction model out of which we will be using Logistic regression.
About Logistic Regression

Logistic Regression is a popular and powerful supervised machine learning technique used to build a model relating the independent predictors (x variables) with the response variable (y) that is categorical in nature.
Where the class is known already, it can help find factors distinguishing between records in different classes in terms of the predictor variables in the dataset.
When the outcome variable has just two classes (e.g. Pass/Fail; Fraudulent/Not Fraudulent; Default/No Default), binomial logistic regression is applied; multinomial logistic regression is applied when there are more than two classes (e.g. Buy/Sell/Hold).
Logistic regression is a statistical technique and provides a detailed statistical summary in terms of statistical significance of the predictor variables and how each predictor variable impacts the probability of the classes of the Y variable.
These unique qualities make this algorithm highly relevant to the Banking & Finance domain to provide detailed and numerical interpretation of the predictor variables.
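For reference, the binomial model described above can be written compactly. The symbols used here ($p$, $\beta_j$, $x_j$, $k$) are notation chosen for this sketch, not taken from the article's code:

```latex
\[
p = P(y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\qquad\Longleftrightarrow\qquad
\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]
```

Each coefficient $\beta_j$ is therefore the change in the log-odds of the response for a one-unit change in $x_j$, with the other predictors held fixed, which is what makes the numerical interpretation of individual predictors possible.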
Dataset

We have a dataset that provides details from a bank about a ‘Personal Loan’ campaign that was executed by the bank.
20,000 customers were targeted with an offer of personal loan at 14% interest rate, out of which 2512 customers responded positively.
The dataset and the data dictionary can be downloaded here.
High level approach

The flow chart below depicts the high-level approach we will follow to solve this classification problem using Logistic Regression, from defining the problem statement to calculating the accuracy of the built model in classifying new customers.

[Figure: High Level Approach]

Problem Definition

We are going to use Logistic Regression to build a model which will predict propensity (probability) of customers responding to a personal loan campaign.
The probabilities from the model will be used to classify the outcome and identify variables that influence the response.
The goal is to build a model that identifies customers who are most likely to accept the loan offer in future personal loan campaigns.
Data Clean-up and Preparation

As the first step, we set up the working directory and read the dataset in csv format into R:

    ## Setting working directory and reading the csv file
    setwd("F:/DataScience/BLOG/LogRegproject1")
    loanorg <- read.csv("….csv")

Next, we view the dataset using basic commands in R to understand the columns and their data types, and number of records present.
    ## View the dataset using View, str, names in R
    View(loanorg)   # (output not shown here)
    dim(loanorg)
    ## 20000 40
    names(loanorg)
    ##  "CUST_ID" "TARGET"
    ##  "AGE" "GENDER"
    ##  "BALANCE" "OCCUPATION"
    ##  "AGE_BKT" "SCR"
    ##  "HOLDING_PERIOD" "ACC_TYPE"
    ##  "ACC_OP_DATE" "LEN_OF_RLTN_IN_MNTH"
    ##  "NO_OF_L_CR_TXNS" "NO_OF_L_DR_TXNS"
    ##  "TOT_NO_OF_L_TXNS" "NO_OF_BR_CSH_WDL_DR_TXNS"
    ##  "NO_OF_ATM_DR_TXNS" "NO_OF_NET_DR_TXNS"
    ##  "NO_OF_MOB_DR_TXNS" "NO_OF_CHQ_DR_TXNS"
    ##  "FLG_HAS_CC" "AMT_ATM_DR"
    ##  "AMT_BR_CSH_WDL_DR" "AMT_CHQ_DR"
    ##  "AMT_NET_DR" "AMT_MOB_DR"
    ##  "AMT_L_DR" "FLG_HAS_ANY_CHGS"
    ##  "AMT_OTH_BK_ATM_USG_CHGS" "AMT_MIN_BAL_NMC_CHGS"
    ##  "NO_OF_IW_CHQ_BNC_TXNS" "NO_OF_OW_CHQ_BNC_TXNS"
    ##  "AVG_AMT_PER_ATM_TXN" "AVG_AMT_PER_CSH_WDL_TXN"
    ##  "AVG_AMT_PER_CHQ_TXN" "AVG_AMT_PER_NET_TXN"
    ##  "AVG_AMT_PER_MOB_TXN" "FLG_HAS_NOMINEE"
    ##  "FLG_HAS_OLD_LOAN" "random"

Upon viewing and exploring the dataset, we infer:

- There are 20,000 observations and 40 variables.
- We have a mix of integer, numeric, and factor variables.
- The categorical response variable, representing if a customer responded to the campaign or not, is the variable called ‘TARGET’ [0 = did not respond / 1 = responded].

Based on data exploration, we also note that the following actions are required to prepare the dataset for further analysis:

- The columns CUST_ID and “random” are not required, as these columns only denote the survey participant with some IDs (they are neither x nor y variables).
    ## remove unwanted columns CUST_ID and random
    loancamp <- loanorg[, -c(1, 40)]

- There are variables for which the same data is represented by other variables in the dataset. These are AGE_BKT (represented by AGE) and ACC_OP_DATE (represented by ‘LEN_OF_RLTN_IN_MNTH’). Therefore, AGE_BKT and ACC_OP_DATE can be removed from the dataset.

    ## remove columns AGE_BKT and ACC_OP_DATE
    loancamp$AGE_BKT <- NULL
    loancamp$ACC_OP_DATE <- NULL

- The categorical variables TARGET, FLG_HAS_CC, FLG_HAS_ANY_CHGS, FLG_HAS_NOMINEE, FLG_HAS_OLD_LOAN are stored as integer data types. These are converted to categorical type (factor) in R.
    ## Convert variables into correct datatypes (FLG_HAS_CC, FLG_HAS_ANY_CHGS, FLG_HAS_NOMINEE, FLG_HAS_OLD_LOAN, TARGET should be categorical)
    loancamp$TARGET <- as.factor(loancamp$TARGET)
    loancamp$FLG_HAS_CC <- as.factor(loancamp$FLG_HAS_CC)
    loancamp$FLG_HAS_ANY_CHGS <- as.factor(loancamp$FLG_HAS_ANY_CHGS)
    loancamp$FLG_HAS_NOMINEE <- as.factor(loancamp$FLG_HAS_NOMINEE)
    loancamp$FLG_HAS_OLD_LOAN <- as.factor(loancamp$FLG_HAS_OLD_LOAN)
    str(loancamp)

The four unwanted columns have been removed and data types have been corrected for the five variables listed above.
Let’s check for any missing values in the dataset:

    ## Check for missing values in any of the columns
    colSums(is.na(loancamp))

There are no missing values in the dataset.
Now that the dataset is ready for modelling, let’s check the baseline customer response rate for the observations in the dataset:

    ## Calculate baseline conversion rate
    respons_rate <- round(prop.table(table(loancamp$TARGET)), 2)
    respons_rate
    ##    0    1
    ## 0.87 0.13

We can see that the dataset is clearly imbalanced: only 13% of the customer records have a response (class 1) against no response from the remaining 87% of records (class 0).
This will have an impact on model performance measures that we will see in detail in the later part of this analysis.
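To see why the imbalance matters, here is a quick calculation using the counts quoted in the Dataset section (20,000 customers targeted, 2512 responded). It shows that a model predicting "no response" for everyone would already look accurate while finding no responders at all. The variable names are illustrative and not part of the original code.

```r
# Counts quoted in the Dataset section of this article
n_total     <- 20000
n_responded <- 2512

response_rate     <- n_responded / n_total  # share of class 1
naive_accuracy    <- 1 - response_rate      # accuracy of always predicting class 0
naive_sensitivity <- 0                      # such a model catches no responders

cat("Response rate:    ", round(response_rate, 4), "\n")
cat("Naive accuracy:   ", round(naive_accuracy, 4), "\n")
cat("Naive sensitivity:", naive_sensitivity, "\n")
```

This is why accuracy alone is a weak yardstick for this dataset, and why sensitivity gets attention later in the analysis.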
Exploratory Data Analysis

In this section, I will explore the data set further in the form of data visualization using ggplot2 in R.
This will help provide initial insights on the distribution of the numeric variables as well as important features impacting the response variable.
Univariate analysis

Histogram plots for important independent numeric variables

    ## EDA - histogram plot for important numeric variables
    library(ggplot2)
    library(cowplot)
    ggp1 = ggplot(data = loancamp, aes(x = AGE)) +
      geom_histogram(fill = "lightblue", binwidth = 5, colour = "black") +
      geom_vline(aes(xintercept = median(AGE)), linetype = "dashed")

[Figure: Histogram of Numeric Variables]

From the histograms we can see that:

- The frequency distribution for AGE shows that the targeted customers are highest in the age group between 26–30.
- The Holding Period (ability to hold money in the account) and length of relationship with the bank are more or less evenly distributed.
- The customers had most of their credit transactions in the range between 0–15 and debit transactions less than 10.
Boxplots for important independent numeric variables

[Figure: Box plot of Numeric Variables]

From the box plots, we can visualize and infer:

- The box plots show the following median values for the numeric variables: age around 38 years; holding period (i.e. ability to hold money in the account) of 15 months; length of relationship with the bank of 125 months; number of credit transactions 10; and number of debit transactions 5.
- There are many outliers for the number of credit transactions and the number of debit transactions.
Bar plot for important categorical variables

[Figure: Bar plot of Categorical Variables]

We can infer from the bar plots that:

- Nearly three-quarters of the customers targeted in the loan campaign were male.
- The Salaried and Professional classes form the majority of the targeted customers.
- A quarter of the customers had a credit card.
- Roughly equal proportions of customers in the dataset have and do not have an old loan.
Bivariate analysis

Boxplot for numeric variables vs TARGET (y variable)

    # bivariate analysis of AGE, LEN_OF_RLTN_IN_MNTH, HOLDING_PERIOD against TARGET
    ggbi1 = ggplot(data = loancamp, aes(x = TARGET, y = AGE, fill = TARGET)) + geom_boxplot()
    ggbi2 = ggplot(data = loancamp, aes(x = TARGET, y = LEN_OF_RLTN_IN_MNTH, fill = TARGET)) + geom_boxplot()
    ggbi3 = ggplot(data = loancamp, aes(x = TARGET, y = HOLDING_PERIOD, fill = TARGET)) + geom_boxplot()
    plot_grid(ggbi1, ggbi2, ggbi3, labels = "AUTO")

[Figure: Box plot for numeric variables vs TARGET]

When we look at the data visualization from bivariate analysis of numeric variables against the categorical target variable (TARGET: responded to loan campaign = 1; did not respond to loan campaign = 0), we get the following insights:

- AGE vs TARGET: the median age of customers who responded to the campaign is slightly higher than that of those who didn’t respond. There is not much differentiation between the two classes based on age, though, an inference we also draw from Length of Relationship with Bank in Months vs TARGET.
- HOLDING_PERIOD vs TARGET: customers who responded to the personal loan campaign had a lower median holding period (ability to hold money in the account), of around 10 months.
Stacked Bar plot for Categorical x variables vs TARGET (y variable)

[Figure: Stacked Bar plot for Categorical x variables vs TARGET]

Bivariate analysis of categorical x variables against the TARGET y variable using stacked bar plots helps us visualize the following:

- More male customers (~25%) responded to the campaign than female customers (the total count of customers under the “Other” category is too small for comparison).
- Self-employed customers were more interested in availing personal loans when compared with the salaried class of customers.
- 17% of the contacted customers who had a current account with the bank expressed interest, compared to 11% of customers having a savings account.
- Whether or not customers had pending charges to be paid to the bank does not seem to have made any difference to responding to the campaign. We see a similar pattern for customers having an old loan with the bank.
- Customers who held a credit card were more interested in availing the loan when compared with customers who didn’t have a credit card.
Split dataset into development (train) and holdout (validation or test) sets

Next, we will split the dataset into training and test datasets.
We will build the model based on the training dataset and test the model performance using the test dataset.
    ## split the dataset into training and test samples at 70:30 ratio
    library(caTools)
    set.seed(123)
    split = sample.split(loancamp$TARGET, SplitRatio = 0.7)
    devdata = subset(loancamp, split == TRUE)
    holddata = subset(loancamp, split == FALSE)
    ## Check if distribution of partition data is correct for the development and holdout datasets
    prop.table(table(devdata$TARGET))
    prop.table(table(holddata$TARGET))

The prop.table output above confirms that the class imbalance we saw in the original dataset is maintained at the same proportions in the development and holdout samples as well.
The training dataset is now ready to build the model.
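The idea of a split that preserves the class ratio can be sketched in base R as follows. This is an illustration on a simulated TARGET vector (870 zeros, 130 ones, an assumption chosen to mimic the 87:13 imbalance), not the implementation used inside caTools; the names `idx_by_class`, `train_idx` and `in_train` are hypothetical.

```r
set.seed(123)

# Simulated response with the same 87:13 imbalance as the article's dataset
target <- factor(rep(c(0, 1), times = c(870, 130)))

# Sample 70% of the row indices separately within each class
idx_by_class <- split(seq_along(target), target)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

in_train <- seq_along(target) %in% train_idx

# Class proportions in the training and holdout parts
print(prop.table(table(target[in_train])))
print(prop.table(table(target[!in_train])))
```

Because sampling happens within each class, both parts keep the original proportion of responders, which is the property the prop.table check is meant to confirm.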
Build the Logistic Regression based Prediction Model

[Figure: Steps in Logistic Regression]

Step 1: Run Logistic Regression on the train data

The Generalized Linear Model command glm is used in R to build the model using Logistic Regression.

    # glm() is part of base R (stats package), so no extra package needs to be loaded
    logreg = glm(TARGET ~ ., data = devdata, family = "binomial")

Step 1a: Variance Inflation Factor check

Let’s check for multi-collinearity among the x variables through the Variance Inflation Factor (VIF) check based on the model we have generated.
    library(car)
    vif(logreg)

Feature (x) variables with a VIF value above 4 indicate a high degree of multi-collinearity. From the table below, we note the following variables have a VIF value above 4: NO_OF_L_CR_TXNS, NO_OF_L_DR_TXNS, NO_OF_ATM_DR_TXNS, AMT_MOB_DR, AVG_AMT_PER_MOB_TXN.

[Figure: VIF output showing variables with VIF value above 4]

We remove these x variables that have high multi-collinearity from the train and test datasets and re-run the model. The re-fitted model is referred to as logreg1 in the steps below, and the reduced validation set as holddata1.
This should help identify variables that clearly impact the response variable (TARGET — y variable) and also build a model that can classify records more accurately.
This process is repeated until we get a clean VIF output with values below 4.
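To make the VIF check concrete, here is a minimal base R sketch of the classic definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The data are simulated and the helper `vif_manual` is hypothetical; `car::vif` works from the fitted model itself, so its values for a glm can differ somewhat from this linear-regression version.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so the two are collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF of each column: regress it on the remaining columns and use 1 / (1 - R^2)
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    fit <- lm(reformulate(others, response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(X)
print(round(vifs, 2))

# Predictors above the article's threshold of 4
high <- names(vifs)[vifs > 4]
print(high)
```

In this toy data, x1 and x2 are flagged because each is almost fully explained by the other, while x3 stays close to 1.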
Step 2: Overall significance of the model

    library(lmtest)
    lrtest(logreg1)

The low p value indicates that the model is highly significant, i.e. the likelihood of a customer responding to the campaign (TARGET) depends on the independent x variables in the dataset.
Step 3: McFadden or pseudo R² interpretation

    library(pscl)
    pR2(logreg1)

Based on the value of McFadden R², we can conclude that 8.7% of the uncertainty of the intercept-only model is explained by the full model (current x variables).
This value indicates a low goodness of fit.
This could be indicative of the fact that more x variables need to be added to explain the variation in the response (y) variable.
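Both the likelihood ratio test of Step 2 and the McFadden R² of Step 3 compare the fitted model against an intercept-only model. The sketch below computes them by hand in base R on simulated data, to show what the two quantities measure; the data, coefficients and variable names are assumptions for illustration and will not reproduce the article's 8.7%.

```r
set.seed(42)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
# Simulated binary outcome whose log-odds depend on x1 only
y   <- rbinom(n, size = 1, prob = plogis(-2 + 0.8 * x1))
sim <- data.frame(y, x1, x2)

full_model <- glm(y ~ x1 + x2, data = sim, family = "binomial")
null_model <- glm(y ~ 1,       data = sim, family = "binomial")

ll_full <- as.numeric(logLik(full_model))
ll_null <- as.numeric(logLik(null_model))

# Likelihood ratio test: twice the log-likelihood gain, compared to chi-square
lr_stat <- 2 * (ll_full - ll_null)
df_diff <- length(coef(full_model)) - length(coef(null_model))
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# McFadden pseudo R-squared: relative improvement in log-likelihood
mcfadden <- 1 - ll_full / ll_null

cat("LR statistic:", round(lr_stat, 2), " p-value:", signif(p_value, 3), "\n")
cat("McFadden R2: ", round(mcfadden, 4), "\n")
```

A small p-value says the predictors jointly improve on the intercept-only model; McFadden R² says by how much, on a scale where values well below those of a linear-regression R² are common.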
Step 4: Individual coefficients significance and interpretation

    summary(logreg1)

From the summary of the generated model, we infer that there are some x variables that are significant based on their p value.
These are the x variables that influence the customer responding to the campaign and are shown in the table below.
The Odds Ratio and Probability of each x variable are calculated based on the formulae:

    Odds Ratio = exp(Co-efficient estimate)
    Probability = Odds Ratio / (1 + Odds Ratio)

[Table: Summary of Logistic Regression Model with Odds Ratio and Probability]

Inferences from the Logistic Regression summary:

- If the customer is self-employed, the odds that he responds are 1.88 times the odds for the reference occupation category, with the other variables held constant. Converted with the formula above, this corresponds to a probability of 65%. This figure is a transformation of the odds ratio alone; the predicted response probability for an individual customer also depends on the intercept and the other variables.
- If he has a credit card, the odds ratio for responding to the campaign is 1.84.
- If the customer has an old loan or is of salaried class, the odds ratio is below 1 and the converted probability drops below 50%, i.e. they are less likely to respond.
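The two formulae can be checked with the odds ratios quoted above (1.88 and 1.84). The snippet below is a small worked calculation; the vector names are illustrative, and the coefficients are simply back-computed from the quoted odds ratios rather than read from the model summary.

```r
# Odds ratios quoted in the article's model summary
odds_ratio <- c(self_employed = 1.88, has_credit_card = 1.84)

# Coefficient implied by each odds ratio, since odds ratio = exp(coefficient)
coefficient <- log(odds_ratio)

# The article's conversion: odds ratio / (1 + odds ratio)
converted <- odds_ratio / (1 + odds_ratio)

print(round(cbind(coefficient, odds_ratio, converted), 3))
```

The same conversion is available directly as `plogis(coefficient)`, since the logistic function maps log-odds to a value between 0 and 1.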
Predict for the validation data

Using the model we just built, let’s try predicting the propensities and class for records in the validation sample.
Note that we already know the class for these records and will use that input to compare and find out how good the new model is when it comes to classifying new records.
    score = predict(logreg1, newdata = holddata1, type = "response")

The model has output propensities for the individual records in the validation set.
To determine the cut-off value for classifying the records based on the probabilities, we look at the distribution of the probabilities predicted.
If we look at the distribution, most of the records have probability values below 10%.
So, we use a cut-off value of 0.1 to predict the class for the records.

    ## Assigning 0 / 1 class based on cutoff value of 0.1
    holddata1$Class = ifelse(score > 0.1, 1, 0)

The model has predicted the class (TARGET value 0 or 1) for each record.
Let’s create the Confusion Matrix to see how well the model has classified the records against the known TARGET values for these records in the validation set.
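The scoring and cut-off steps can be tried end to end on simulated data. The sketch below assumes a single predictor and an imbalanced outcome of roughly the same flavour as the campaign data; the data, the row-order split and the names `sim`, `train`, `valid` and `cutoff` are all hypothetical.

```r
set.seed(7)
n <- 3000
x <- rnorm(n)
# Simulated imbalanced outcome (assumption: responders are a small minority)
target <- rbinom(n, size = 1, prob = plogis(-2.2 + 0.9 * x))
sim <- data.frame(target, x)

train <- sim[1:2100, ]     # illustrative 70:30 split by row order
valid <- sim[2101:3000, ]

model <- glm(target ~ x, data = train, family = "binomial")
score <- predict(model, newdata = valid, type = "response")

# Inspect the spread of predicted probabilities before picking a cut-off
print(quantile(score, probs = c(0.1, 0.25, 0.5, 0.75, 0.9)))

cutoff <- 0.1
valid$Class <- ifelse(score > cutoff, 1, 0)
print(table(valid$Class))
```

With an imbalanced outcome, most predicted probabilities sit far below 0.5, so the default 0.5 threshold would label almost nobody as a responder; looking at the quantiles first is what motivates a lower cut-off.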
Model Performance: Confusion Matrix

The Confusion Matrix provides a clear snapshot of the accuracy of the model.

    ## Creating the confusion matrix
    tabtest = with(holddata1, table(TARGET, Class))
    tabtest

[Figure: Confusion Matrix of Test Data]

Sensitivity is more important than accuracy here, as we are more interested in identifying all those records where a customer will respond to a new campaign than those records where a customer may not respond at all.
The opportunity cost of failing to contact a potential customer who would avail a loan and thereby pay interest to the bank, because the model misclassified them (low sensitivity), is more critical than that of potentially non-responding customers receiving mailers or calls because of misclassification (low specificity).
When we compare the model performance measures between the train and test data (see table below), the model holds good for the test data without a significant drop in performance measures.
This indicates that there is no overfit in the model.
For predicting the class with this model, we have consciously compromised on accuracy to have a better sensitivity, i.e. not to lose out on potential customers who will respond to the loan campaign.
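The performance measures discussed in this section can be read off a confusion matrix as sketched below. The actual and predicted labels are hypothetical values made up for illustration, not the article's results; the table layout (actual TARGET in rows, predicted Class in columns) mirrors the `table(TARGET, Class)` call above.

```r
# Hypothetical actual and predicted classes, for illustration only
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0), levels = c(0, 1))

tab <- table(TARGET = actual, Class = predicted)
print(tab)

tp <- tab["1", "1"]; fn <- tab["1", "0"]  # responders caught / missed
tn <- tab["0", "0"]; fp <- tab["0", "1"]  # non-responders skipped / contacted

sensitivity <- tp / (tp + fn)      # share of actual responders identified
specificity <- tn / (tn + fp)      # share of non-responders correctly left out
accuracy    <- (tp + tn) / sum(tab)

cat("Sensitivity:", sensitivity, "\n")
cat("Specificity:", specificity, "\n")
cat("Accuracy:   ", round(accuracy, 3), "\n")
```

Lowering the cut-off moves records from the "0" column to the "1" column, which raises sensitivity at the expense of specificity and usually of accuracy; that is the trade-off accepted in this analysis.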
Actionable Insights

Based on the model built above, we can make the following recommendations to the bank:

- Instead of cold-calling a random set of customers, target customers with one or more of the following attributes: male, self-employed, holding a credit card, or having (or having had) some charges with the bank.
- Avoid contacting customers who already have a loan and/or are of salaried class.
Conclusion

Here, we have used the Binomial Logistic Regression technique not only to predict the class of customers responding to the loan campaign but also to get a list of statistically significant independent variables that influence the customer response to the campaign.
We are also able to predict the probability of a customer responding or not responding based on the model summary.
These give powerful insights which will help in improving the response rate from the customers and thereby conversion to availing loans with the bank.
Resources

The original dataset, data dictionary and full code in R are available here.