Building an Employee Churn Model in Python to Develop a Strategic Retention Plan

(Photo by rawpixel on Unsplash)

Data Description and Exploratory Visualisations

First, we import the dataset and make a copy of the source file for this analysis.

The dataset contains 1,470 rows and 35 columns.

The dataset contains several numerical and categorical columns providing various information on employees’ personal and employment details.

Let’s break down the columns by their type (i.e. int64, float64, object):

Data source

The data provided has no missing values.

In HR Analytics, employee data is unlikely to feature a large ratio of missing values, as HR departments typically have all personal and employment data on file.

However, the type of documentation the data is kept in (i.e. whether it is paper-based, Excel spreadsheets, databases, etc.) has a massive impact on the accuracy of the HR data and on how easy it is to access.

Numerical features overview

A few observations can be made based on the information and histograms for numerical features:

Several numerical features are tail-heavy; indeed, several distributions are right-skewed (e.g. MonthlyIncome, DistanceFromHome, YearsAtCompany).

Data transformation methods may be required to approach a normal distribution prior to fitting a model to the data.

Age distribution is a slightly right-skewed normal distribution with the bulk of the staff between 25 and 45 years old.

EmployeeCount and StandardHours are constant values for all employees.

They’re likely to be redundant features.

Employee Number is likely to be a unique identifier for employees given the feature’s quasi-uniform distribution.

source code: df_HR.hist() (isn’t Python a beautiful thing?)

Feature distribution by target attribute

In this section, a more detailed Exploratory Data Analysis is performed.

For complete code, please refer to this GitHub repo and/or the Kaggle Kernel.

Age

The age distributions for active and ex-employees differ by about four years, with the average age of ex-employees at 33.6 years old and 37.6 years old for current employees.

Let’s create a kernel density estimation (KDE) plot colored by the value of the target.

A kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable.
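To make the idea concrete, here is a minimal NumPy sketch of a Gaussian KDE written by hand. It is an illustration only: the ages are synthetic draws centred on the two reported group means, and the spread of 9 years, the bandwidth of 3 and the helper name `gaussian_kde_1d` are assumptions, not values from the dataset or the original notebook.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Estimate a density on `grid` by averaging Gaussian bumps centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average over samples, then rescale by the bandwidth so the result is a density.
    return kernels.mean(axis=1) / bandwidth

# Synthetic ages (NOT the real data); the standard deviation of 9 is an assumption.
rng = np.random.default_rng(0)
ages_left = rng.normal(33.6, 9.0, size=200)
ages_stayed = rng.normal(37.6, 9.0, size=1000)

grid = np.linspace(0, 80, 801)
density_left = gaussian_kde_1d(ages_left, grid, bandwidth=3.0)
density_stayed = gaussian_kde_1d(ages_stayed, grid, bandwidth=3.0)

# A density should integrate to roughly 1 over a wide enough grid.
step = grid[1] - grid[0]
area_left = density_left.sum() * step
area_stayed = density_stayed.sum() * step
```

Plotting `density_left` and `density_stayed` against `grid` gives the two curves to compare. With seaborn installed, `sns.kdeplot` would draw comparable curves directly from the raw columns.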

Gender

Gender distribution shows that the dataset features a higher relative proportion of male ex-employees than female ex-employees, with the normalised gender distribution of ex-employees in the dataset at 17.0% for Males and 14.8% for Females.

Marital Status

The dataset features three marital statuses: Married (673 employees), Single (470 employees), Divorced (327 employees).

Single employees show the largest proportion of leavers at 25%.

Role and Work Conditions

A preliminary look at the relationship between Business Travel frequency and Attrition Status shows that the largest normalised proportion of leavers is among employees who travel “frequently”.

Travel metrics associated with Business Travel status were not disclosed (i.e. how many hours of travel is considered “Frequent”).

Several Job Roles are listed in the dataset: Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, Human Resources.

Years at the Company and Since Last Promotion

The average number of years at the company is 7.37 years for currently active employees and 5.13 years for ex-employees.

Years with Current Manager

The average number of years with the current manager is 4.37 years for currently active employees and 2.85 years for ex-employees.

Overtime

Some employees have overtime commitments.

The data clearly show that a significantly larger proportion of employees with overtime (OT) have left the company.

Monthly Income

Employee Monthly Income varies from $1009 to $19999.

Target Variable: Attrition

The feature “Attrition” is what this Machine Learning problem is about.

We are trying to predict the value of the feature ‘Attrition’ by using other related features associated with the employee’s personal and professional history.

In the supplied dataset, the percentage of Current Employees is 83.9% and of Ex-employees is 16.1%.


Hence, this is an imbalanced class problem.

Machine learning algorithms typically work best when the number of instances of each class is roughly equal.

We will have to address this target feature imbalance prior to implementing our Machine Learning algorithms.
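As a rough sketch of what the imbalance looks like in numbers, the snippet below builds synthetic labels with roughly the split reported above and derives "balanced" class weights. Re-weighting is only one of several ways to handle imbalance and is an assumption here, not necessarily the approach used in the original notebook. The formula is the one scikit-learn applies for `class_weight="balanced"`.

```python
import numpy as np

# Synthetic labels mirroring a roughly 84% / 16% split
# (0 = current employee, 1 = ex-employee). Not the real data.
y = np.array([0] * 839 + [1] * 161)

counts = np.bincount(y)
proportions = counts / counts.sum()

# "Balanced" weights: n_samples / (n_classes * class_count).
# The minority class receives the larger weight.
class_weights = counts.sum() / (len(counts) * counts)
```

With these weights each class contributes the same total weight to the loss, so the model is no longer rewarded for simply favouring the majority class.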

Correlation

Let’s take a look at some of the most significant correlations.

It is worth remembering that correlation coefficients only measure linear correlations.

As shown above, “Monthly Rate”, “Number of Companies Worked” and “Distance From Home” are positively correlated to Attrition; while “Total Working Years”, “Job Level”, and “Years In Current Role” are negatively correlated to Attrition.
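The caveat about linear correlation is easy to check on toy data. In this small sketch (made-up values, unrelated to the HR dataset) a perfectly deterministic but symmetric quadratic relationship yields a Pearson coefficient of essentially zero, while a straight line yields 1.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
y_linear = 2.0 * x + 1.0   # exact linear relationship
y_quadratic = x**2         # exact, but non-linear and symmetric around 0

# Pearson correlation is the off-diagonal entry of the 2x2 correlation matrix.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
```

So a near-zero coefficient against Attrition does not by itself rule out a feature being informative.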

EDA Concluding Remarks

The dataset does not feature any missing or erroneous data values, and all features are of the correct data type.

The strongest positive correlations with the target feature are: Performance Rating, Monthly Rate, Num Companies Worked, Distance From Home.

The strongest negative correlations with the target feature are: Total Working Years, Job Level, Years In Current Role, and Monthly Income.

The dataset is imbalanced, with the majority of observations describing Currently Active Employees.

Single employees show the largest proportion of leavers, compared to Married and Divorced counterparts.

About 10% of leavers left when they reached their 2-year anniversary at the company.

People who live further away from their work show a higher proportion of leavers compared to their counterparts.

People who travel frequently show a higher proportion of leavers compared to their counterparts.

People who have to work overtime show a higher proportion of leavers compared to their counterparts.

Employees who have already worked at several companies previously (already “bounced” between workplaces) show a higher proportion of leavers compared to their counterparts.

Pre-processing Pipeline

In this section, we undertake data pre-processing steps to prepare the datasets for Machine Learning algorithm implementation.


Encoding

Machine Learning algorithms can typically only have numerical values as their predictor variables.

Hence Label Encoding becomes necessary, as it encodes categorical labels with numerical values.

To avoid introducing feature importance for categorical features with large numbers of unique values, we will use both Label Encoding and One-Hot Encoding.
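A minimal sketch of combining the two encodings might look like the following. The rule used here (Label Encoding for columns with at most two unique values, One-Hot Encoding for the rest) is an assumption for illustration, and the toy frame only borrows column names from the dataset; the values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame: column names borrowed from the dataset, values invented.
df = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

encoded = df.copy()
categorical_cols = ["OverTime", "BusinessTravel"]

# Assumed rule: binary columns get Label Encoding, the others One-Hot Encoding.
binary_cols = [c for c in categorical_cols if encoded[c].nunique() <= 2]
multi_cols = [c for c in categorical_cols if c not in binary_cols]

for col in binary_cols:
    # Classes are sorted alphabetically, so "No" -> 0 and "Yes" -> 1.
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# One new 0/1 column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(encoded, columns=multi_cols, dtype=int)
```

One-hot columns avoid telling the model that, say, “Travel_Rarely” is numerically larger than “Non-Travel”, which a single integer code would imply.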

Feature Scaling

Feature Scaling using MinMaxScaler essentially rescales each feature so that its values fall between 0 and n.

Machine Learning algorithms perform better when input numerical variables fall within a similar scale.

In this case, we are scaling between 0 and 5.
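For illustration, scaling to the 0 to 5 range with scikit-learn's `MinMaxScaler` could be sketched as below. The first column uses the income range quoted earlier; the second column and the middle row are made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: monthly income, and a second invented numeric feature.
X = np.array([
    [1009.0, 18.0],
    [5000.0, 35.0],
    [19999.0, 60.0],
])

# Each column is mapped independently: min -> 0, max -> 5.
scaler = MinMaxScaler(feature_range=(0, 5))
X_scaled = scaler.fit_transform(X)
```

In a full pipeline it is usually safer to fit the scaler on the training split only and then call `transform` on the test split, so that no information from the test data leaks into training.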

Splitting data into training and testing sets

Prior to implementing or applying any Machine Learning algorithms, we must decouple training and testing datasets from our master dataframe.
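A hedged sketch of such a split is shown below. The 75/25 ratio, the `random_state` and the use of `stratify` are assumptions chosen for the example, and the data is synthetic; stratifying simply keeps the proportion of leavers similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the Attrition labels.
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 839 + [1] * 161)

# stratify=y preserves the class ratio in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)
```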

Building Machine Learning Models

Baseline Algorithms

Let’s first use a range of baseline algorithms (using out-of-the-box hyper-parameters) before we move on to more sophisticated solutions.

The algorithms considered in this section are: Logistic Regression, Random Forest, SVM, KNN, Decision Tree Classifier, Gaussian NB.

Let’s evaluate each model in turn and provide accuracy and standard deviation scores.
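One way to run such a comparison is with cross-validation, as in the sketch below. It uses a small synthetic dataset from `make_classification` rather than the HR data, and the 5-fold setup, the seeds and `max_iter=1000` (set only to avoid convergence warnings) are assumptions, so the scores it produces say nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the pre-processed HR data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.84, 0.16], random_state=7
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=7),
    "SVM": SVC(random_state=7),
    "KNN": KNeighborsClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=7),
    "Gaussian NB": GaussianNB(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# For every model, store (mean, standard deviation) per scoring metric.
results = {}
for name, model in models.items():
    results[name] = {}
    for metric in ("accuracy", "roc_auc"):
        scores = cross_val_score(model, X, y, cv=cv, scoring=metric)
        results[name][metric] = (scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean shows how stable each model is across folds, which matters when two models have similar averages.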


Classification Accuracy is the number of correct predictions made as a ratio of all predictions made.

It is the most common evaluation metric for classification problems.

However, it is often misused as it is only really suitable when there are an equal number of observations in each class and all predictions and prediction errors are equally important.

This is not the case in this project, so a different scoring metric may be more suitable.
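A quick worked example shows why accuracy can mislead on this kind of data. With synthetic labels in roughly the same proportions as the dataset, a "model" that always predicts that the employee stays looks accurate while never identifying a single leaver.

```python
import numpy as np

# Synthetic labels: 0 = stays, 1 = leaves, in roughly the dataset's proportions.
y_true = np.array([0] * 839 + [1] * 161)

# A trivial model that predicts "stays" for everyone.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()                 # share of correct predictions
recall_leavers = (y_pred[y_true == 1] == 1).mean()   # share of leavers caught
```

Here `accuracy` comes out at 0.839 while `recall_leavers` is 0.0, even though catching leavers is the whole point of the exercise.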

Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.

The AUC represents a model’s ability to discriminate between positive and negative classes, and is better suited to this project.

An area of 1.0 represents a model that made all predictions perfectly.

An area of 0.5 represents a model as good as random.
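The two reference values can be checked on a tiny hand-made example. The sketch below also includes a hypothetical helper, `pairwise_auc`, which computes AUC as the share of (leaver, stayer) pairs in which the leaver receives the higher score; this is the ranking interpretation of the same metric.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])

perfect_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.8, 0.9])  # every positive ranked on top
constant_scores = np.full(6, 0.5)                           # no discrimination at all
mixed_scores = np.array([0.1, 0.6, 0.3, 0.4, 0.5, 0.9])    # one pair ranked wrongly

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_constant = roc_auc_score(y_true, constant_scores)
auc_mixed = roc_auc_score(y_true, mixed_scores)

def pairwise_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc_mixed_by_hand = pairwise_auc(y_true, mixed_scores)
```

In the mixed case, 7 of the 8 positive/negative pairs are ordered correctly, so both computations give 0.875.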

Based on our ROC AUC comparison analysis, Logistic Regression and Random Forest show the highest mean AUC scores.

We will shortlist these two algorithms for further analysis.

See below for more details on these two algorithms.

Logistic Regression

GridSearchCV allows us to fine-tune hyper-parameters by searching over specified parameter values for an estimator.

The results from GridSearchCV provided us with fine-tuned hyper-parameters, using ROC AUC as the scoring metric.
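A minimal sketch of such a search is given below. The parameter grid, the 5-fold setting and `class_weight="balanced"` are assumptions made for the example, and the data is synthetic, so the selected value of `C` is not the one found in the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.84, 0.16], random_state=7
)

# Hypothetical grid over the inverse regularisation strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid=param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated ROC AUC
    cv=5,
)
search.fit(X, y)

best_C = search.best_params_["C"]
best_auc = search.best_score_  # mean cross-validated AUC of the best candidate
```

After fitting, `search.best_estimator_` holds a model refitted on all of the supplied data with the winning parameters, ready for `predict` or `predict_proba`.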

Confusion Matrix

The confusion matrix provides us with a much more detailed representation of the accuracy score and of what’s going on with our labels: we know exactly which labels were correctly and incorrectly predicted, and how.

The accuracy of the Logistic Regression Classifier on test set is 75.
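To show how accuracy and the other metrics fall out of a confusion matrix, here is a small sketch on ten made-up predictions (not the model's actual output).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = stays, 1 = leaves.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # all correct predictions
recall = tp / (tp + fn)           # share of actual leavers that were caught
precision = tp / (tp + fp)        # share of predicted leavers who really left
```

For this toy data the matrix is `[[4, 2], [1, 3]]`, giving an accuracy of 0.7, a recall of 0.75 and a precision of 0.6.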


Label Probability

Instead of getting binary estimated target features (0 or 1), a probability can be associated with the predicted target.

The output provides a first index referring to the probability that the data belong to class 0 (employee not leaving), and the second refers to the probability that the data belong to class 1 (employee leaving).

Predicting probabilities of a particular label provides us with a measure of how likely an employee is to leave the company.
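In scikit-learn this corresponds to `predict_proba`. The sketch below fits a logistic regression on synthetic data purely to show the shape of the output; the numbers themselves are meaningless for the HR problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: label 1 plays the role of "employee leaving".
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84, 0.16], random_state=7
)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One row per employee, one column per class, ordered as in model.classes_.
proba = model.predict_proba(X[:5])
p_stay = proba[:, 0]    # probability of class 0 (not leaving)
p_leave = proba[:, 1]   # probability of class 1 (leaving)
```

Each row sums to 1, so `p_leave` alone is enough to rank employees by how likely they are to leave.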

Random Forest Classifier

Let’s take a closer look at using the Random Forest algorithm.

I’ll fine-tune the Random Forest algorithm’s hyper-parameters by cross-validation against the AUC score.

Random Forest allows us to know which features are of the most importance in predicting the target feature (“Attrition” in this project).

Below, we plot features by their importance.

Random Forest helped us identify the Top 10 most important indicators (ranked in the table below) as: (1) MonthlyIncome, (2) OverTime, (3) Age, (4) MonthlyRate, (5) DistanceFromHome, (6) DailyRate, (7) TotalWorkingYears, (8) YearsAtCompany, (9) HourlyRate, (10) YearsWithCurrManager.
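A ranking like this can be read from a fitted forest's `feature_importances_` attribute. The following sketch uses synthetic data and placeholder feature names (`feature_0`, `feature_1`, ...), so it only illustrates the mechanics, not the ranking reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=7
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y)

# Impurity-based importances, normalised to sum to 1, sorted from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
top_3 = importances.head(3)
```

These importances are impurity-based, which tends to favour features with many distinct values, so it is worth treating the ranking as a guide rather than a definitive ordering.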

The accuracy of the Random Forest Classifier on test set is 86.


Below the corresponding Confusion Matrix is shown.

Predicting probabilities of a particular label provides us with a measure of how likely an employee is to leave the company.

The AUC when predicting probabilities using RandomForestClassifier is 0.


ROC Graphs

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings.

ROC is a probability curve and AUC represents the degree or measure of separability.

It tells us how capable the model is of distinguishing between classes.

The green line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

As shown above, the fine-tuned Logistic Regression model showed a higher AUC score compared to the Random Forest Classifier.
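For reference, the points of a ROC curve can be computed with `roc_curve`, which sweeps the decision threshold over the predicted scores. The sketch below uses four made-up predictions rather than either model's output.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Four made-up employees: true labels and predicted probabilities of leaving.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) point per threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the piecewise-linear curve through those points.
area = auc(fpr, tpr)
```

The curve always runs from (0, 0) to (1, 1); a purely random classifier follows the diagonal between those two points, and here the area works out to 0.75.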

Concluding Remarks

Risk Category

As the company generates more data on its employees (on New Joiners and recent Leavers), the algorithm can be re-trained using the additional data and theoretically generate more accurate predictions to identify employees at high risk of leaving, based on the probabilistic label assigned to each feature variable (i.e. employee) by the algorithm.

Employees can be assigned a “Risk Score” based on the predicted label such that:

Low-risk for employees with label < 0.6

Medium-risk for employees with label between 0.6 and 0.8

High-risk for employees with label > 0.8

Good work completing the walk-through of this analysis, but what now? How does this help decision-makers?

Strategic Retention Plan

The stronger indicators of people leaving include:

Monthly Income: people on higher wages are less likely to leave the company.

Hence, efforts should be made to gather information on industry benchmarks in the current local market to determine if the company is providing competitive wages.

Over Time: people who work overtime are more likely to leave the company.

Hence efforts must be taken to appropriately scope projects upfront with adequate support and manpower so as to reduce the use of overtime.

Age: Employees in the relatively young age bracket of 25–35 are more likely to leave.

Hence, efforts should be made to clearly articulate the long-term vision of the company and how young employees fit in that vision, as well as to provide incentives, for instance in the form of clear paths to promotion.

DistanceFromHome: Employees who live further from home are more likely to leave the company.

Hence, efforts should be made to provide support in the form of company transportation for clusters of employees living in the same area, or in the form of a Transportation Allowance.

Initial screening of employees based on their home location is probably not recommended: as long as employees make it to work on time every day, it would be regarded as a form of discrimination.

TotalWorkingYears: The more experienced employees are less likely to leave.

Employees who have between 5–8 years of experience should be identified as potentially having a higher-risk of leaving.

YearsAtCompany: Loyal employees are less likely to leave.

Employees who hit their two-year anniversary should be identified as potentially having a higher-risk of leaving.

YearsWithCurrManager: A large number of leavers leave 6 months after their Current Managers.

By using Line Manager details for each employee, one can determine which Managers have experienced the largest numbers of employees resigning over the past year.

Several metrics can be used here to determine whether action should be taken with a Line Manager:

Number of years the Line Manager has been in a particular position: this may indicate that the employees may need management training or be assigned a mentor (ideally an Executive) in the organisation.

Patterns in the employees who have resigned: this may indicate recurring patterns in employees leaving, in which case action may be taken accordingly.

Final Thoughts

A strategic “Retention Plan” should be drawn up for each Risk Category group.
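Assigning each employee to a Risk Category group from the thresholds given earlier (below 0.6, between 0.6 and 0.8, above 0.8) could be sketched as follows. The function name `risk_category` is hypothetical, and the sketch assumes that the “label” is the predicted probability of leaving and that the boundary values 0.6 and 0.8 both count as medium-risk.

```python
def risk_category(p_leave):
    """Map a predicted probability of leaving to a risk group.

    Assumption: exactly 0.6 and exactly 0.8 are treated as medium-risk.
    """
    if p_leave < 0.6:
        return "Low-risk"
    if p_leave <= 0.8:
        return "Medium-risk"
    return "High-risk"

# Made-up probabilities for four employees.
predicted = [0.12, 0.60, 0.75, 0.93]
groups = [risk_category(p) for p in predicted]
```

For the four made-up probabilities this yields one low-risk, two medium-risk and one high-risk employee, and each group can then be routed to its own retention plan.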

In addition to the suggested steps for each feature listed above, face-to-face meetings between an HR representative and employees can be initiated for medium- and high-risk employees to discuss work conditions.

Also, a meeting with those employees’ Line Managers would make it possible to discuss the work environment within the team and whether steps can be taken to improve it.

I hope you enjoyed reading this article as much as I enjoyed writing it.

Once again, for complete code, please refer to this GitHub repo and/or the Kaggle Kernel.

