In other words, after how much time will this customer churn? How long will this machine last after running successfully for a year? What is the relative retention rate of different marketing channels? What is the likelihood that a patient will survive after being diagnosed? If you find any of these questions (or questions even remotely related to them) interesting, then read on.
The purpose of this article is to build an intuition, so that we can apply this technique in different business settings.
Table of Contents
Introduction
Definitions
Mathematical Intuition
Kaplan-Meier Estimate
Cox Proportional Hazard Model
End Note
Additional Resources

Introduction
Survival Analysis is a set of statistical tools that addresses questions such as ‘how long will it be before a particular event occurs?’; in other words, we can also call it ‘time to event’ analysis.
This technique is called survival analysis because it was primarily developed by medical researchers, who were interested in finding the expected lifetime of patients in different cohorts (e.g., Cohort 1: treated with Drug A, and Cohort 2: treated with Drug B).
This analysis can be further applied to not just traditional death events, but to many different types of events of interest in different business domains.
We will discuss more on the definition of events and time to events in the next section.
Definitions
As mentioned above, Survival Analysis is also known as Time to Event analysis.
Thus, from the name itself, it is evident that the definitions of the Event of interest and the Time are vital for Survival Analysis.
In order to understand the definition of time and event, we will define the time and event for various use cases in the industry.
Predictive Maintenance in Mechanical Operations: Survival Analysis applies to mechanical parts and machines to answer ‘how long will the machine last?’.
Predictive Maintenance is one of its applications.
Here, Event is defined as the time at which the machine breaks down.
Time of origin is defined as the time at which the machine starts continuous operation.
Along with the definition of time, we should also define the time scale (which could be weeks, days, or hours).
The difference between the time of event and the time origin gives us the time to event.
Customer Analytics (Customer Retention): With the help of Survival Analysis, we can focus churn prevention efforts on high-value customers with low survival time.
This analysis also helps us to calculate Customer Lifetime Value.
In this use case, Event is defined as the time at which the customer churns or unsubscribes.
Time of origin is defined as the time at which the customer starts the service or subscription with a company.
Time scale could be months, or weeks.
The difference between the time of event and the time origin gives us the time to event.
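To make the arithmetic concrete, here is a minimal sketch of a time-to-event calculation for one customer. The dates and the helper name `time_to_event` are hypothetical and chosen only for illustration; the same subtraction applies to any of the use cases above.

```python
from datetime import date


def time_to_event(origin: date, event: date, scale_days: int = 1) -> float:
    """Elapsed time between the time of origin and the event, in units of `scale_days`."""
    return (event - origin).days / scale_days


# Hypothetical customer: subscription start (time of origin) and churn date (event)
subscribed = date(2023, 1, 1)
churned = date(2023, 3, 12)

tenure_days = time_to_event(subscribed, churned)                  # time scale: days
tenure_weeks = time_to_event(subscribed, churned, scale_days=7)   # time scale: weeks
print(tenure_days, tenure_weeks)
```

Changing `scale_days` is all it takes to switch the time scale, which is why the scale should be fixed before any analysis starts.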
Marketing Analytics (Cohort Analysis): Survival Analysis evaluates the retention rates of each marketing channel.
In this use case, Event is defined as the time at which the customer unsubscribes from a marketing channel.
Time of origin is defined as the time at which the customer starts the service or subscription through a marketing channel.
Time scale could be months, or weeks.
Actuaries: Given the risks of a population, survival analysis evaluates the probability that members of the population die within a particular time range.
This analysis helps insurance companies to evaluate insurance premiums.
Guess the event and time definitions for this use case!
I hope the definitions of event, time origin, and time to event are clear from the above discussion.
Now it's time to delve a bit deeper into the mathematical formulation of the analysis.

Mathematical Intuition
Let's assume a non-negative continuous random variable T, representing the time until some event of interest.
For example, T might denote:
• the time from the customer’s subscription to the customer’s churn.
• the time from the start of a machine to its breakdown.
• the time from diagnosis of a disease until death.
Since we have assumed a random variable T (a random variable is generally represented by a capital letter), we should also talk about some of its attributes.
T is a random variable, so what is random here?
To understand this, we will again use our earlier examples as follows.
• T is the time from a randomly selected customer’s subscription to that customer’s churn.
• T is the time from the start of a randomly selected machine to its breakdown.
• T is the time from diagnosis of a disease until death of a randomly selected patient.
T is a continuous random variable, therefore it can take any real value within its range.
T is non-negative, therefore it can only take non-negative real values (zero or positive).
For such random variables, probability density function (pdf) and cumulative distribution function (cdf) are commonly used to characterize their distribution.
Thus, we will assume that this random variable has a probability density function f(t) , and cumulative distribution function F(t) .
pdf: f(t)
cdf: F(t)
By the definition of a cdf from a given pdf, F(t) = P(T < t); here, F(t) gives us the probability that the event has occurred by duration t.
In simple words, F(t) gives us the proportion of the population with a time to event value less than t.
The cdf as the integral of the pdf:
$$F(t) = \int_0^t f(u)\,du$$

Survival Function: S(t) = 1 − F(t) = P(T ≥ t)
S(t) gives us the probability that the event has not occurred by time t.
In simple words, S(t) gives us the proportion of the population with a time to event value more than t.
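As a quick numerical illustration of these definitions, the sketch below assumes an exponential distribution for T (the rate 0.2 is a made-up value, not something taken from this article) and checks that integrating the pdf reproduces the cdf and that S(t) = 1 − F(t).

```python
import math

RATE = 0.2  # hypothetical event rate per unit time (an assumption for illustration)


def pdf(t):
    """f(t) for an exponential distribution."""
    return RATE * math.exp(-RATE * t)


def cdf(t):
    """F(t): probability that the event has occurred by time t."""
    return 1.0 - math.exp(-RATE * t)


def survival(t):
    """S(t) = 1 - F(t): probability that the event has not occurred by time t."""
    return 1.0 - cdf(t)


def integrate(func, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


t = 3.0
cdf_from_pdf = integrate(pdf, 0.0, t)  # numerical integral of f from 0 to t
print(round(cdf(t), 4), round(cdf_from_pdf, 4), round(survival(t), 4))
```

Any other non-negative continuous distribution could be substituted here; the exponential is used only because its pdf and cdf have simple closed forms.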
The survival function in integral form of the pdf:
$$S(t) = \int_t^{\infty} f(u)\,du$$

Hazard Function: h(t)
Along with the survival function, we are also interested in the rate at which the event is taking place among the surviving population at any given time t.
In medical terms, we can define it as “out of the people who have survived up to time t, what is the rate at which those people are dying?”
Let's make it even simpler by writing it in the form of its definition:
$$h(t) = \lim_{dt \to 0} \frac{\left(S(t) - S(t + dt)\right)/dt}{S(t)}$$
From the formulation above, we can see that it has two parts.
Let's understand each part.
Instantaneous rate of event: (S(t) − S(t + dt))/dt; this can also be seen as the (negative) slope of the survival curve at any point t, or the rate of dying at time t.
Let's also assume the total population is P.
Here, the difference S(t) − S(t + dt) gives the proportion of the population that died during the interval dt.
The number of people surviving at t is S(t)*P and the number of people surviving at t + dt is S(t + dt)*P.
The number of people who died during dt is (S(t) − S(t + dt))*P.
The instantaneous rate of people dying at time t is (S(t) − S(t + dt))*P/dt.
Proportion surviving at time t: S(t); we also know the surviving population at time t, which is S(t)*P.
Thus, dividing the rate of people dying during dt by the number of people who have survived up to time t gives us the hazard function, a measure of the RISK of dying for the people who have survived up to time t.
The hazard function is not a density or a probability.
However, we can think of h(t)·dt as the approximate probability of failure in an infinitesimally small time period between t and t + dt, given that the subject has survived up to time t.
In this sense, the hazard is a measure of risk: the greater the hazard between times t1 and t2, the greater the risk of failure in this time interval.
We have h(t) = f(t)/S(t), since (S(t) − S(t + dt))/dt tends to f(t) as dt → 0. This is a very important derivation.
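The two ways of writing the hazard can be compared numerically. The sketch below assumes a Weibull survival function with made-up shape and scale parameters and a hypothetical population P of 1000; it evaluates the hazard once from the finite-difference definition and once as f(t)/S(t).

```python
import math

SHAPE, SCALE = 2.0, 10.0  # hypothetical Weibull parameters (assumptions for illustration)


def survival(t):
    """S(t) for a Weibull distribution."""
    return math.exp(-((t / SCALE) ** SHAPE))


def pdf(t):
    """f(t) = -dS/dt for the same Weibull distribution."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * survival(t)


def hazard_from_definition(t, dt=1e-6):
    """[(S(t) - S(t + dt)) / dt] / S(t) with a small but finite dt."""
    return ((survival(t) - survival(t + dt)) / dt) / survival(t)


def hazard_from_ratio(t):
    """h(t) = f(t) / S(t)."""
    return pdf(t) / survival(t)


t = 5.0
population = 1000                        # hypothetical total population P
alive_at_t = survival(t) * population    # S(t) * P people are still at risk at time t
print(round(alive_at_t), hazard_from_definition(t), hazard_from_ratio(t))
```

With a small dt the two values agree closely, which is the numerical counterpart of the limit in the definition.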
The beauty of this function is that the survival function can be derived from the hazard function and vice versa.
The utility of this will be more evident while deriving a survival function from a given hazard function in the Cox Proportional Hazard Model (last segment of the article).
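For readers who want the intermediate steps, one way to sketch the link between the two functions is shown below. The symbol H(t), the cumulative hazard, is introduced here for convenience, and the derivation assumes S(0) = 1 (nobody has experienced the event at the time of origin).

```latex
\begin{align*}
h(t) &= \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\ln S(t)
  && \text{since } f(t) = F'(t) = -S'(t) \\
H(t) &= \int_0^t h(u)\,du = -\ln S(t)
  && \text{since } S(0) = 1 \\
S(t) &= \exp\bigl(-H(t)\bigr)
\end{align*}
```

Read from top to bottom this gives the survival function from the hazard; read from bottom to top it gives the hazard from the survival function.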
These were the most important mathematical definitions and the formulations required to understand the survival analysis.
We will end our mathematical formulation here and move forward towards estimation of survival curve.
Kaplan-Meier Estimate
In the mathematical formulation above, we assumed the pdf and thereby derived the survival function from it.
Since we don’t have the true survival curve of the population, we will estimate the survival curve from the data.
There are two main methods to estimate the survival curve.
The first method is a parametric approach.
This method assumes a parametric model based on a certain distribution, such as the exponential distribution; we then estimate the parameter and finally form the estimator of the survival function.
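As a minimal sketch of the parametric route, assume the time to event follows an exponential distribution with a constant rate. With right-censored data, the maximum-likelihood estimate of that rate is the number of observed events divided by the total observed time. The six durations and event flags below are the same toy values used in the Kaplan-Meier code later in this article; the function names are hypothetical.

```python
import math

# Toy data: durations in minutes, and 1 if the event was observed, 0 if censored
durations = [5, 6, 6, 2.5, 4, 4]
event_observed = [1, 0, 0, 1, 1, 1]


def fit_exponential_rate(durations, event_observed):
    """Maximum-likelihood rate for an exponential model with right censoring."""
    total_events = sum(event_observed)   # only observed events count as failures
    total_time = sum(durations)          # every subject contributes observed time
    return total_events / total_time


def parametric_survival(t, rate):
    """Estimated S(t) = exp(-rate * t) under the exponential assumption."""
    return math.exp(-rate * t)


rate = fit_exponential_rate(durations, event_observed)
print(rate, parametric_survival(4.5, rate))
```

The estimate is only as good as the distributional assumption, which is the main reason a non-parametric alternative is attractive.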
A second approach is a powerful non-parametric method called the Kaplan-Meier estimator.
We will discuss it in this section.
In this section we will also try to create the Kaplan-Meier curve manually as well as by using the Python library (lifelines).
The Kaplan-Meier estimator is the product-limit formula
$$\hat{S}(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}$$
Here, ni is defined as the population at risk just prior to time ti, and di is defined as the number of events that occurred at time ti.
This will become clearer with the example below.
We will discuss an arbitrary example from a very small self-created data set, to understand the creation of the Kaplan-Meier estimate curve, manually as well as using a Python package.
Event, Time and Time Scale Definition for the Example:
The example below (refer to Fig 1) shows the data of 6 users of a website.
These users visit the website and leave it after a few minutes.
Thus, the event of interest is the time at which a user leaves the website.
Time of origin is defined as the time of opening the website by a user and the time scale is in minutes.
The study starts at time t=0 and ends at time t=6 minutes.
Censorship:
A point worth noting here is that during the study period, the event happened for 4 out of 6 users (shown in red), while two users (shown in green) continued and the event didn’t happen till the end of the study; such data is called censored data.
In case of censorship, as here for user 4 and user 5, we don’t know at what time the event will occur, but we still use that data to estimate the probability of survival.
If we choose not to include the censored data, then it is highly likely that our estimates would be biased and under-estimated.
The inclusion of censored data in calculating the estimates makes Survival Analysis very powerful, and it stands out compared to many other statistical techniques.
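A small sketch can show the direction of that bias. It uses the six durations and event flags from the Kaplan-Meier code later in this section and compares two simple relative-frequency estimates of the probability of staying longer than 4.5 minutes: one that drops the censored users and one that keeps all six. The helper name is hypothetical.

```python
# Toy data: durations in minutes, and 1 if the event was observed, 0 if censored
durations = [5, 6, 6, 2.5, 4, 4]
event_observed = [1, 0, 0, 1, 1, 1]
t = 4.5


def share_longer_than(values, t):
    """Relative frequency of durations that exceed t."""
    return sum(1 for v in values if v > t) / len(values)


# Dropping censored users keeps only the four users whose exit was observed
events_only = [d for d, e in zip(durations, event_observed) if e == 1]
naive_estimate = share_longer_than(events_only, t)

# Keeping all six users: the censored ones are known to have stayed past 4.5 minutes
all_users_estimate = share_longer_than(durations, t)

print(naive_estimate, all_users_estimate)
```

Discarding the censored users throws away exactly the people who stayed longest, so the survival probability comes out too low.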
Calculations for KM Curve and the interpretation:
Now, let's talk about the calculations done to create the KM curve below (refer to Fig 1).
In Fig 1, the Kaplan-Meier estimate curve, the x-axis is the time of event and the y-axis is the estimated survival probability.
From t=0 till t<2.5, or t ∈ [0, 2.5): the number of users at risk (ni) at time t=0 is 6 and the number of events (di) at time t=0 is 0; therefore, for all t in this interval, the estimated S(t) = 1.
From the definition of the event, we can say there is a 100% probability that the time between a user opening the website and exiting it is greater than any t in this interval.
From t=2.5 till t<4, or t ∈ [2.5, 4): the number of users at risk (ni) just before 2.5 minutes (2.4999* mins) is 6 and the number of events (di) at t=2.5 minutes is 1; therefore, for all t in this interval, the estimated S(t) = 5/6 ≈ 0.83.
From the definition of the event, we can say there is an 83% probability that the time between a user opening the website and exiting it is greater than any t in this interval.
From t=4 till t<5, or t ∈ [4, 5): the number of users at risk (ni) just before 4 minutes (3.999* mins) is 5 and the number of events (di) at t=4 minutes is 2; therefore, for all t in this interval, the estimated S(t) = 0.83 × (3/5) = 0.5.
This result can also be verified by the simple mathematics of relative frequency.
For any t ∈ [4, 5), let's say t=4.5, the total number of users at the start was 6 and the number remaining at t is 3.
Therefore, the probability of a user spending more than 4.5 minutes (or any time t ∈ [4, 5)) on the website is 3/6, which is 50%.
Similarly, we can estimate the probability for the other time intervals (refer to the table calculations in Fig 1).
Mathematically, for any time t ∈ [t1, t2), we have
S(t) = P(survive in [0, t1)) × P(survive in [t1, t] | survive in [0, t1))

Fig 1: a. Shows the user-level time data in color. b. Shows the Kaplan-Meier (KM) estimate curve. c. Formula for estimation of the KM curve. d. Table showing the calculations.

    # Python code to create the above Kaplan Meier curve
    from lifelines import KaplanMeierFitter

    ## Example data
    durations = [5, 6, 6, 2.5, 4, 4]
    event_observed = [1, 0, 0, 1, 1, 1]

    ## Create a kmf object
    kmf = KaplanMeierFitter()

    ## Fit the data into the model
    kmf.fit(durations, event_observed, label='Kaplan Meier Estimate')

    ## Create an estimate
    ## ci_show controls the confidence interval; since our data set is too tiny, I am not showing it.
    kmf.plot(ci_show=False)
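The manual calculation described above can also be sketched in a few lines of plain Python, without lifelines. The function name `kaplan_meier` is hypothetical; it simply applies the product of (ni − di)/ni over the observed event times to the same six durations.

```python
def kaplan_meier(durations, event_observed):
    """Return [(event_time, survival_estimate)] using the product-limit formula."""
    # Distinct times at which at least one event (not a censoring) was observed
    event_times = sorted({t for t, e in zip(durations, event_observed) if e == 1})
    survival = 1.0
    curve = []
    for t_i in event_times:
        # n_i: users still at risk just before t_i (duration at least t_i)
        n_i = sum(1 for t in durations if t >= t_i)
        # d_i: events observed exactly at t_i
        d_i = sum(1 for t, e in zip(durations, event_observed) if t == t_i and e == 1)
        survival *= (n_i - d_i) / n_i
        curve.append((t_i, survival))
    return curve


durations = [5, 6, 6, 2.5, 4, 4]
event_observed = [1, 0, 0, 1, 1, 1]

km_curve = kaplan_meier(durations, event_observed)
for t_i, s in km_curve:
    print(f"t = {t_i}: S(t) = {s:.2f}")
```

The censored users never trigger a drop in the curve, but they still count in ni for as long as they are observed, which is how the estimator makes use of censored data.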
Real World Example:
As mentioned earlier, Survival Analysis can be used for cohort analysis to gain insights.
So, here we will use the Telco-Customer-Churn data set to gain insight into the lifelines of customers in different cohorts.
Github link for the code: Link
Let's create two cohorts of customers based on whether a customer has subscribed to Streaming TV or not.
We want to know which cohort has the better customer retention.
The required code for plotting the Survival Estimates is given below.
    kmf1 = KaplanMeierFitter()  ## instantiate the class to create an object

    ## T and E are assumed to hold the durations and the event indicator,
    ## e.g. T = df['tenure'] and E = df['Churn'] encoded as 1/0.

    ## Two cohorts are compared: Cohort 1, Streaming TV not subscribed by users,
    ## and Cohort 2, Streaming TV subscribed by the users.
    groups = df['StreamingTV']
    i1 = (groups == 'No')   ## boolean pandas series selecting the 1st cohort
    i2 = (groups == 'Yes')  ## boolean pandas series selecting the 2nd cohort

    ## fit the model for 1st cohort
    kmf1.fit(T[i1], E[i1], label='Not Subscribed StreamingTV')
    a1 = kmf1.plot()

    ## fit the model for 2nd cohort
    kmf1.fit(T[i2], E[i2], label='Subscribed StreamingTV')
    kmf1.plot(ax=a1)

Fig 2: Kaplan Meier Curve of the two cohorts.
We have two survival curves, one for each cohort.
From the curves, it is evident that the customers who have subscribed to Streaming TV have better retention than the customers who have not.
At any point t across the timeline, we can see that the survival probability of the cohort in blue is less than that of the cohort in red.
For the cohort in blue, the survival probability decreases at a high rate in the first 10 months and gets relatively better after that; for the red cohort, however, the rate of decrease in survival is fairly constant.
Therefore, for the cohort that has not subscribed to Streaming TV, efforts should be made to retain the customers in the first 10 volatile months.
We can do more such cohort analysis from the survival curves of the different cohorts.
This cohort analysis represents only a limited use of the potential of survival analysis, because we are applying it at an aggregated level of the data.
We can even create survival curves for individual users, based on the effects of covariates on the baseline survival curve.
This will be our focal point of the next section of this article.
Cox Proportional Hazard Model
The time to event for an individual in the population is very important for the survival curves at the aggregate level; however, in real-life situations, along with the event data we also have the covariates (features) of that individual.
In such cases, it is very important to know about the impact of covariates on the survival curve.
This would help us in predicting the survival probability of an individual, if we know the associated covariates values.
For example, in the telco-churn example discussed above, we have each customer’s tenure when they churned (the event time T) along with the customer’s Gender, MonthlyCharges, Dependents, Partner, PhoneService, etc.
These other variables are the covariates in this example.
We are often interested in how these covariates impact the survival probability function.
In such cases, the quantity of interest is the conditional survival function S(t|x) = P(T > t|x).
Here x denotes the covariates.
In our example, we are interested in S(tenure > t | (Gender, MonthlyCharges, Dependents, Partner, PhoneService, etc.)).
The Cox (proportional hazard) model is one of the most popular models combining the covariates and the survival function.
It starts with modeling the hazard function:
$$h(t \mid x) = h_0(t)\,\exp(x^{T}\beta)$$
Here, β is the vector of coefficients of each covariate.
The function h0(t) is called the baseline hazard function.
The Cox model assumes that the covariates have a multiplicative effect on the hazard function (linear on the log-hazard scale) and that the effect stays the same across time.
The idea behind the model is that the log-hazard of an individual is a linear function of their static covariates, and a population-level baseline hazard that changes over time.
[Source: lifelines documentation]
From the above equation, we can also derive the cumulative conditional hazard function:
$$H(t \mid x) = \int_0^t h(u \mid x)\,du = H_0(t)\,\exp(x^{T}\beta), \qquad H_0(t) = \int_0^t h_0(u)\,du$$
As we already know, the survival function can be derived from the hazard function, using the relationship between the two noted in the section above:
$$S(t \mid x) = \exp\bigl(-H(t \mid x)\bigr)$$
Thus, we can get the survival function for each subject/individual/customer.
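To see how a baseline curve turns into an individual curve, here is a minimal sketch under stated assumptions: the baseline cumulative hazard values, the coefficients, and the two customers are all made up for illustration, and the covariates are used as-is (fitted libraries may centre them first). It applies S(t|x) = exp(−H0(t)·exp(x·β)).

```python
import math

# Hypothetical baseline cumulative hazard H0(t) at a few tenures (months)
baseline_cumulative_hazard = {6: 0.05, 12: 0.12, 24: 0.30}

# Hypothetical coefficients, not fitted values from any data set
beta = {"MonthlyCharges": -0.01, "Partner_Yes": -0.8}


def partial_hazard(x, beta):
    """exp(x . beta): the factor that scales the baseline hazard for one individual."""
    return math.exp(sum(beta[name] * x[name] for name in beta))


def survival_curve(x, beta, baseline_cumulative_hazard):
    """S(t | x) = exp(-H0(t) * exp(x . beta)) at each baseline time point."""
    risk = partial_hazard(x, beta)
    return {t: math.exp(-h0 * risk) for t, h0 in baseline_cumulative_hazard.items()}


customer_a = {"MonthlyCharges": 30, "Partner_Yes": 1}  # hypothetical customer
customer_b = {"MonthlyCharges": 90, "Partner_Yes": 0}  # hypothetical customer

curve_a = survival_curve(customer_a, beta, baseline_cumulative_hazard)
curve_b = survival_curve(customer_b, beta, baseline_cumulative_hazard)
print(curve_a)
print(curve_b)
```

Because the covariates only rescale the baseline hazard, the two curves never cross: the customer with the larger partial hazard has the lower survival probability at every time point.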
Basic implementation in Python:
We will now discuss its basic implementation in Python with the help of the lifelines package.
We have used the same telco-customer-churn data set that we have been using in the above sections.
We will run Python code for predicting the survival function at the customer level.
    from lifelines import CoxPHFitter
    import pandas as pd

    ## My objective here is to introduce you to the implementation of the model.
    ## Thus, only a subset of the columns present in the original data is used to train the model.
    ## Churn is assumed to be already encoded as 1 (churned) / 0 (not churned).
    df_r = df.loc[:, ['tenure', 'Churn', 'gender', 'Partner', 'Dependents', 'PhoneService',
                      'MonthlyCharges', 'SeniorCitizen', 'StreamingTV']]
    df_r.head()  ## have a look at the data

    ## Create dummy variables (drop_first=True assumed, which yields columns such as gender_Male)
    df_dummy = pd.get_dummies(df_r, drop_first=True)
    df_dummy.head()

    # Using Cox Proportional Hazards model
    cph = CoxPHFitter()  ## Instantiate the class to create a cph object
    cph.fit(df_dummy, 'tenure', event_col='Churn')  ## Fit the data to train the model
    cph.print_summary()  ## Have a look at the significance of the features

The summary statistics above indicate the significance of the covariates in predicting the churn risk.
Gender doesn’t play any significant role in predicting the churn, whereas all the other covariates are significant.
An interesting point to note here is that the β (coef) values for the covariates MonthlyCharges and gender_Male are both approximately zero (~-0.01), yet MonthlyCharges plays a significant role in predicting churn, while the latter is insignificant.
The reason is that MonthlyCharges is a continuous value that can vary from the order of tens to hundreds to thousands, so when it is multiplied by the small coef (β=-0.01), the product has a material effect on the hazard.
On the other hand, the covariate gender can only take the value 0 or 1, and in both cases [exp(-0.01 * 0) = 1, exp(-0.01 * 1) ≈ 0.99] its effect on the hazard is negligible.
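The size argument above can be checked with a few lines of arithmetic. The sketch below takes the approximate coefficient of -0.01 quoted in the text for both covariates; the 50-unit difference in MonthlyCharges is a made-up value for illustration, and `hazard_ratio` is a hypothetical helper name.

```python
import math


def hazard_ratio(coef, change_in_covariate):
    """Multiplicative change in hazard, exp(coef * change), under the Cox model."""
    return math.exp(coef * change_in_covariate)


coef = -0.01  # approximate coefficient quoted in the text for both covariates

gender_effect = hazard_ratio(coef, 1)     # gender_Male switches from 0 to 1
charges_effect = hazard_ratio(coef, 50)   # hypothetical 50-unit difference in MonthlyCharges

print(round(gender_effect, 4), round(charges_effect, 4))
```

A hazard ratio close to 1 means the covariate barely moves the hazard; the further the ratio is from 1, the larger the effect on the predicted survival curve. Statistical significance itself is read from the p-values in the model summary.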
    ## We want to see the survival curve at the customer level.
    ## Therefore, we have selected 5 customers (rows 5 till 9).
    tr_rows = df_dummy.iloc[5:10, 2:]
    tr_rows

    ## Let's predict the survival curve for the selected customers.
    ## Customers can be identified with the help of the number mentioned against each curve.
    cph.predict_survival_function(tr_rows).plot()

Fig 3: Survival curves at the customer level for customer numbers 5, 6, 7, 8, and 9.
Creating the survival curves at each customer level helps us in proactively creating a tailor made strategy for high-valued customers for different survival risk segments along the timeline.
End Note
There are many other things still to be covered in survival analysis, such as ‘checking the proportionality assumption’ and ‘model selection’. However, a basic understanding of the mathematics behind the analysis, together with the basic implementation (using the lifelines package in Python), will help us apply this model in any pertinent business use case.
Additional Resources
The following resources were extremely helpful, not only in motivating me to study survival analysis but also in writing this article.
Check them out for more on survival analysis.
• Lifelines Python Documentation
• SciPy 2015 lecture by Allen Downey
• IPPCR 2015: Conceptual Approach to Survival Analysis
• Nonparametric Statistics by Yen-Chi Chen
• Princeton University Lecture Notes: Survival Models