ML basics : loan prediction

Tariq Massaoudi · Jun 6

The problem:

Dream Housing Finance company deals in all home loans.

They have presence across all urban, semi urban and rural areas.

Customers first apply for a home loan; after that, the company validates the customer's eligibility for the loan.

The Company wants to automate the loan eligibility process (real time) based on customer detail provided while filling online application form.

These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others.

To automate this process, they have given us the problem of identifying the customer segments that are eligible for a loan, so that they can specifically target these customers.

It's a classification problem: given information about the application, we have to predict whether the applicant will be able to pay the loan or not.

We'll start with exploratory data analysis, then preprocessing, and finally we'll test different models such as logistic regression and decision trees.

The data consists of the following columns:

Loan_ID : Unique Loan ID
Gender : Male/Female
Married : Applicant married (Y/N)
Dependents : Number of dependents
Education : Applicant Education (Graduate/Under Graduate)
Self_Employed : Self employed (Y/N)
ApplicantIncome : Applicant income
CoapplicantIncome : Coapplicant income
LoanAmount : Loan amount in thousands of dollars
Loan_Amount_Term : Term of loan in months
Credit_History : Credit history meets guidelines (yes/no)
Property_Area : Urban/Semi Urban/Rural
Loan_Status : Loan approved (Y/N) (this is the target variable)

Exploratory data analysis:

We'll be using seaborn for visualisation and pandas for data manipulation.

You can download the dataset from here: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

We'll import the necessary libraries and load the data:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import numpy as np
train = pd.read_csv("train.csv")

We can look at the top few rows using the head function:

train.head()

We can see that there's some missing data; we can explore this further using the pandas describe function:

train.describe()

Some variables have missing values that we'll have to deal with, and there also seem to be some outliers in ApplicantIncome, CoapplicantIncome and LoanAmount.

We also see that about 84% of applicants have a credit history: the mean of the Credit_History field is 0.84, and the field is either 1 (has a credit history) or 0 (does not).

It would be interesting to study the distribution of the numerical variables, mainly ApplicantIncome and LoanAmount. To do this we'll use seaborn for visualization.



sns.distplot(train['ApplicantIncome'], kde=False)

The distribution is skewed and we can notice quite a few outliers.

Since LoanAmount has missing values, we can't plot it directly. One solution is to drop the rows with missing values and then plot it; we can do this using the dropna function:

sns.distplot(train['LoanAmount'].dropna(), kde=False)

People with better education should normally have a higher income; we can check that by plotting the education level against the income:

sns.boxplot(x='Education', y='ApplicantIncome', data=train)

The distributions are quite similar, but we can see that the graduates have more outliers, which means that the people with very high incomes are most likely well educated.

Another interesting variable is credit history. To check how it affects Loan_Status, we can turn the status into a binary variable and then calculate its mean for each value of credit history. A value close to 1 indicates a high loan success rate.

#turn loan status into binary
modified = train
modified['Loan_Status'] = train['Loan_Status'].apply(lambda x: 0 if x == "N" else 1)
#calculate the mean
modified.groupby('Credit_History').mean()['Loan_Status']

OUT :
Credit_History
0.0    0.078652
1.0    0.795789
Name: Loan_Status, dtype: float64

People with a credit history are far more likely to pay their loan: 0.07 vs 0.79. This means that credit history will be an influential variable in our model.
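The same table can also be read off with pd.crosstab, without mutating train. This is a minimal sketch on a hypothetical mini-frame (the values below are made up, not from the dataset):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the loan data
df = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    'Loan_Status':    ['Y', 'Y', 'N', 'N', 'N', 'Y', 'Y', 'N'],
})

# Row-normalized crosstab: each row sums to 1, so the 'Y' column
# is the approval rate for that Credit_History value
rates = pd.crosstab(df['Credit_History'], df['Loan_Status'], normalize='index')
print(rates)
```

The normalize='index' argument divides each row by its total, which is exactly the "mean of a 0/1 target per group" computation done above.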

Data preprocessing:

The first thing to do is to deal with the missing values. Let's first check how many there are for each variable:

train.apply(lambda x: sum(x.isnull()), axis=0)

OUT:
Loan_ID              0
Gender              13
Married              3
Dependents          15
Education            0
Self_Employed       32
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
Property_Area        0
Loan_Status          0
dtype: int64

For numerical variables a good solution is to fill missing values with the mean; for categorical ones we can fill them with the mode (the value with the highest frequency):

#categorical
train['Gender'].fillna(train['Gender'].mode()[0], inplace=True)
train['Married'].fillna(train['Married'].mode()[0], inplace=True)
train['Dependents'].fillna(train['Dependents'].mode()[0], inplace=True)
train['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0], inplace=True)
train['Credit_History'].fillna(train['Credit_History'].mode()[0], inplace=True)
train['Self_Employed'].fillna(train['Self_Employed'].mode()[0], inplace=True)
#numerical
train['LoanAmount'].fillna(train['LoanAmount'].mean(), inplace=True)

Next we have to handle the outliers. One solution is simply to remove them, but we can also log transform them to nullify their effect, which is the approach we went for here.
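To see why the log transform tames outliers, here is a minimal sketch with made-up income values (the numbers are illustrative, not from the dataset):

```python
import numpy as np

# Made-up incomes with one extreme outlier
incomes = np.array([2000.0, 3000.0, 4000.0, 5000.0, 150000.0])
log_incomes = np.log(incomes)

# On the raw scale the outlier is 37.5x the median;
# on the log scale it is only about 1.44x
print(incomes.max() / np.median(incomes))
print(log_incomes.max() / np.median(log_incomes))
```

The extreme value still sits at the right tail, but it no longer dominates the scale of the distribution, so it distorts a model far less.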

Some people might have a low income but a strong CoapplicantIncome, so a good idea is to combine them in a TotalIncome column.

train['LoanAmount_log'] = np.log(train['LoanAmount'])
train['TotalIncome'] = train['ApplicantIncome'] + train['CoapplicantIncome']
train['TotalIncome_log'] = np.log(train['TotalIncome'])

Plotting the histogram of LoanAmount_log, we can see that it now looks like a normal distribution!

Modeling:

We're gonna use sklearn for our models. Before doing that, we need to turn all the categorical variables into numbers.

We'll do that using the LabelEncoder in sklearn:

from sklearn.preprocessing import LabelEncoder
category = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
encoder = LabelEncoder()
for i in category:
    train[i] = encoder.fit_transform(train[i])
train.dtypes

OUT:
Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
Loan_Status            int64
LoanAmount_log       float64
TotalIncome          float64
TotalIncome_log      float64
dtype: object

Now all our variables have become numbers that our models can understand.
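One caveat worth noting: LabelEncoder imposes an arbitrary order on nominal features such as Property_Area (e.g. Rural < Semiurban < Urban). An alternative, if that worries you, is one-hot encoding with pd.get_dummies; a sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame with just the nominal column
df = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban', 'Urban']})

# One indicator column per category, no implied ordering
dummies = pd.get_dummies(df, columns=['Property_Area'])
print(list(dummies.columns))
```

Tree-based models are fairly robust to arbitrary label ordering, but for linear models like logistic regression the one-hot form is usually safer.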

To try out different models we'll create a function that takes in a model, fits it, and measures the accuracy, which means using the model on the train set and measuring the error on that same set. We'll also use a technique called K-fold cross validation, which randomly splits the data into train and test sets, trains the model using the train set and validates it with the test set; it repeats this K times (hence the name K-fold) and takes the average error. The latter method gives a better idea of how the model would perform in real life.
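A version note before the code: the sklearn.cross_validation module used below was later removed from scikit-learn in favour of sklearn.model_selection, where the same K-fold estimate is a single call. A sketch on synthetic stand-in data (make_classification here is just a placeholder for the loan features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the loan predictors and target
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross validation in one call: fit on 4 folds, score on the 5th, repeat
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```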

#Import the models
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

def classification_model(model, data, predictors, outcome):
    #Fit the model:
    model.fit(data[predictors], data[outcome])
    #Make predictions on training set:
    predictions = model.predict(data[predictors])
    #Print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))
    #Perform k-fold cross-validation with 5 folds
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    for train_idx, test_idx in kf:
        #Filter training data
        train_predictors = data[predictors].iloc[train_idx,:]
        #The target we're using to train the algorithm
        train_target = data[outcome].iloc[train_idx]
        #Training the algorithm using the predictors and target
        model.fit(train_predictors, train_target)
        #Record the score from each cross-validation run
        error.append(model.score(data[predictors].iloc[test_idx,:], data[outcome].iloc[test_idx]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

Now we can test different models. We'll start with logistic regression:

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, train, predictor_var, outcome_var)

OUT :
Accuracy : 80.945%
Cross-Validation Score : 80.946%

We'll now try a decision tree, which should give us a more accurate result:

model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, train, predictor_var, outcome_var)

OUT:
Accuracy : 80.945%
Cross-Validation Score : 78.179%

We've got the same score on accuracy but a worse score in cross validation; a more complex model doesn't always mean a better score.

Finally we'll try random forests:

model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'LoanAmount_log', 'TotalIncome_log']
classification_model(model, train, predictor_var, outcome_var)

OUT:
Accuracy : 100.000%
Cross-Validation Score : 78.015%

The model gives us a perfect score on accuracy but a low score in cross validation; this is a good example of overfitting. The model has a hard time generalizing since it fits the train set perfectly.

Solutions to this include reducing the number of predictors or tuning the model parameters.
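For the first of those fixes, one common heuristic (a sketch, not the article's code) is to rank predictors by the forest's feature_importances_ and keep only the top few; synthetic data stands in for the loan columns here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 10 features, only 3 actually informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Importances sum to 1; keep the indices of the 5 most important features
top5 = np.argsort(model.feature_importances_)[::-1][:5]
print(top5)
```

Retraining on just those columns, or constraining the trees themselves (for example a smaller max_depth or a larger min_samples_leaf), usually narrows the gap between training accuracy and the cross-validation score.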

Conclusion:

We've gone through a good portion of the data science pipeline in this article, namely EDA, preprocessing and modeling, and we've used essential classification models such as logistic regression, decision trees and random forests.

It would be interesting to learn more about the backbone logic behind these algorithms, and also tackle the data scraping and deployment phases.

We’ll try to do that in the next articles.

