# How to Conquer Cohort Analysis With a Powerful Clinical Research Tool

Directly from the data.

KM says our best estimate of the probability of survival from one month to the next is exactly the weighted average retention rate for that month in our dataset (also called the maximum likelihood estimator in statistics parlance).

So if in a group of cohorts we have 1000 customers from month one, of which 600 survive until month two, our best guess of the “true” probability of survival from month 1 to 2 is 60%.

We do the same for the next month.

Divide the number of customers that survived through month 3 by the number of customers who survived through month 2 to get the estimated probability of survival from month 2 to 3.

If we don’t have month 3 data for a cohort because it’s only two months old, we exclude those customers from our calculations for month 3 survival.

Repeat for as many cohorts / months as you have, excluding in each calculation any cohorts missing data for the current period.

Then, to calculate the probability of survival through any given month, multiply the individual monthly (conditional) probabilities up through that month.
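The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kaplan_meier`, the list-of-lists cohort layout, and the numbers in `example_cohorts` are all assumptions made up for this sketch, and month 0 is taken to be each cohort's starting size.

```python
from typing import List


def kaplan_meier(cohorts: List[List[int]]) -> List[float]:
    """Return cumulative survival estimates S(0), S(1), ... for a cohort table.

    cohorts[c][t] is the number of customers from cohort c still active in
    month t, with t = 0 being the cohort's starting size. Younger cohorts
    simply have shorter rows.
    """
    horizon = max(len(row) for row in cohorts)
    survival = [1.0]  # everyone is "alive" at month 0
    for t in range(1, horizon):
        # Keep only cohorts old enough to have data for month t.
        observed = [row for row in cohorts if len(row) > t]
        at_risk = sum(row[t - 1] for row in observed)
        survived = sum(row[t] for row in observed)
        if at_risk == 0:
            break  # nobody left to observe
        conditional = survived / at_risk  # P(survive month t | survived t-1)
        survival.append(survival[-1] * conditional)
    return survival


# Hypothetical numbers, not the article's dataset.
example_cohorts = [
    [1000, 600, 420],  # older cohort: three months of data
    [1000, 700],       # newer cohort: month 2 not yet observed
]
km_curve = kaplan_meier(example_cohorts)
print([round(s, 3) for s in km_curve])
```

With these made-up numbers, month 1 pools both cohorts (1300 of 2000 survive, or 65%), while month 2 uses only the older cohort (420 of 600, or 70%), giving a cumulative two-month estimate of about 45.5%.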

Though a morbid thought, measuring patient survival is functionally equivalent to measuring customer retention, so we can easily transfer KM to customer cohort analysis!

## Putting Kaplan-Meier to the test

Let’s make this clearer by applying the Kaplan-Meier estimator to our previous example.

The probability of surviving month 1 is 69% (total customers alive in month 1 divided by total in month 0).

The probability of surviving month 2, given a customer survived month 1, is 72% (total customers alive in month 2 divided by total in month 1, excluding the last cohort which is missing month 2 data).

So the cumulative probability of surviving at least two months is 69% x 72% = 50%.

Rinse and repeat for each subsequent month.
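In general form, the calculation looks like this. The notation is an assumption introduced for illustration: $N_{c,k}$ is the number of customers from cohort $c$ still active in month $k$, and $C_k$ is the set of cohorts old enough to have month-$k$ data.

$$
\hat{p}_k = \frac{\sum_{c \in C_k} N_{c,k}}{\sum_{c \in C_k} N_{c,k-1}},
\qquad
\hat{S}(t) = \prod_{k=1}^{t} \hat{p}_k
$$

For the example above, $\hat{S}(2) = \hat{p}_1 \cdot \hat{p}_2 = 0.69 \times 0.72 \approx 0.50$.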

A side-by-side comparison reveals the superiority of KM.

What’s great about KM is it leverages all the data we have, even the younger cohorts for whom we have fewer observations.

For example, the average of all the available cohorts at month 3 uses only the data for cohorts 1–3. The KM estimator, because it is cumulative, also incorporates the improved early retention of the newer cohorts.

This yields a 3-month retention estimate of 38%, which is higher than any of the cohorts we can actually measure at month 3.

This is exactly what we want: cohorts 4 and 5 are both larger and better retaining than cohorts 1–3.

Hence, it is likely that the 3-month retention rate for a random customer picked among these cohorts will exceed the historical average, as the customer will likely be in cohorts 4 or 5.

Using all the data is also nice because it makes our estimates of the tail probabilities much more precise than if we could only rely on the data of customers who we retained that long.

Kaplan-Meier curves also fix the wonky behavior in the right tail of the retention curve by respecting a fundamental law of probability: each monthly survival probability is at most 1, so the cumulative product can never rise as you multiply in more months.
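To see the difference concretely, here is a small sketch comparing the two approaches. Everything in it is an assumption for illustration: the cohort numbers are invented, and `naive_average` is just one plausible reading of "average the available cohorts at each month", not necessarily the exact calculation used earlier.

```python
from typing import List


def naive_average(cohorts: List[List[int]]) -> List[float]:
    """Average each month's cumulative retention over cohorts that have data."""
    horizon = max(len(row) for row in cohorts)
    curve = []
    for t in range(horizon):
        rates = [row[t] / row[0] for row in cohorts if len(row) > t]
        curve.append(sum(rates) / len(rates))
    return curve


def kaplan_meier(cohorts: List[List[int]]) -> List[float]:
    """Multiply pooled month-over-month survival rates into a cumulative curve."""
    horizon = max(len(row) for row in cohorts)
    curve = [1.0]
    for t in range(1, horizon):
        observed = [row for row in cohorts if len(row) > t]
        at_risk = sum(row[t - 1] for row in observed)
        survived = sum(row[t] for row in observed)
        if at_risk == 0:
            break
        curve.append(curve[-1] * survived / at_risk)
    return curve


# Hypothetical data: the older cohort retains much better than the newer one.
cohorts = [
    [100, 60, 50],  # older cohort
    [100, 30],      # newer cohort, no month-2 data yet
]
naive = naive_average(cohorts)
km = kaplan_meier(cohorts)
print("naive:", [round(x, 3) for x in naive])
print("KM:   ", [round(x, 3) for x in km])
```

In this toy dataset the naive curve climbs from month 1 to month 2, because the weaker cohort drops out of the average. The KM curve cannot do that, since every step multiplies by a ratio no greater than 1.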

## Recommended by 95% of doctors

This analysis could easily be extended.

Let’s go back to the 2016 vs 2017 example: we could run the Kaplan-Meier calculation on each respective group of cohorts and then compare the resulting survival curves, highlighting differences in expected retention between the two groups.

While I won’t cover it here, you can also calculate p-values, confidence intervals, and statistical significance tests for Kaplan-Meier curves.
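For the curious, one common way to put error bars on a KM curve is Greenwood's formula, which approximates the variance of the survival estimate as $\hat{S}(t)^2 \sum_{k \le t} d_k / \big(n_k (n_k - d_k)\big)$, where $n_k$ customers were at risk in month $k$ and $d_k$ of them churned. The sketch below is a rough illustration only: the function name, the cohort layout, the data, and the choice of a simple normal-approximation 95% interval are all assumptions, not something the analysis above depends on.

```python
import math
from typing import List, Tuple


def km_with_greenwood(
    cohorts: List[List[int]], z: float = 1.96
) -> List[Tuple[float, float, float]]:
    """Return (estimate, lower, upper) for months 1, 2, ... of a cohort table.

    cohorts[c][t] is the number of customers from cohort c active in month t.
    """
    horizon = max(len(row) for row in cohorts)
    survival = 1.0
    greenwood_sum = 0.0
    results = []
    for t in range(1, horizon):
        # Only cohorts with month-t data contribute to this step.
        observed = [row for row in cohorts if len(row) > t]
        at_risk = sum(row[t - 1] for row in observed)
        survived = sum(row[t] for row in observed)
        if at_risk == 0 or survived == 0:
            break  # the variance term is undefined once nobody survives
        churned = at_risk - survived
        survival *= survived / at_risk
        greenwood_sum += churned / (at_risk * survived)
        std_err = survival * math.sqrt(greenwood_sum)
        # Normal-approximation interval, clipped to the valid range [0, 1].
        lower = max(0.0, survival - z * std_err)
        upper = min(1.0, survival + z * std_err)
        results.append((survival, lower, upper))
    return results


# Hypothetical numbers, not the article's dataset.
intervals = km_with_greenwood([[1000, 600, 420], [1000, 700]])
for month, (est, lo, hi) in enumerate(intervals, start=1):
    print(f"month {month}: {est:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

A quick sanity check on the design: at month 1 the Greenwood standard error reduces to the familiar binomial one, $\sqrt{p(1-p)/n}$, and the intervals widen in later months as fewer customers remain under observation.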

This lets you make rigorous statements like “the improvement of cohort retention in 2018 relative to 2017 was statistically significant (at the 5% level)”. Cool stuff.

Kaplan-Meier is a powerful tool for anyone who spends time analyzing customer cohort data.

KM has been battle-tested in rigorous clinical trials. If anything, it’s surprising it hasn’t caught on more among technology operators and investors.

If you’re a product manager, growth hacker, marketer, data scientist, investor, or anyone else who understands the deep importance of customer retention analysis, the Kaplan-Meier estimator should be a valuable weapon in your analytics arsenal.

Originally published at https://whoisnnamdi.