First, a period of interest is selected (ex: last 12 months), that is the period for which we are going to get a retention story.
Second, this period is broken down into smaller windows of time (ex: weekly or monthly), (that is how we achieve the when).
Finally users are grouped into cohorts, (that is how we achieve the who).
Then all we have to do is calculate the retention rate of each user cohort during each window.
Here is a simple example: say our period of interest is last year, broken down by month (the windows) and that a user’s cohort is the month when they made first contact.
Then range retention is going to be a table containing these retention rates:Users who made first contact in January: what proportion came back 1, 2 … 11 months later?Users who made first contact in February: what proportion came back 1, 2 … 10 months later?(…)Users who made first contact in November: what proportion came back 1 month later?You can see we are about to calculate a triangular table containing relevant retention rates in the cells shaded in green and NAs in elsewhere.
Calculating the range retention tableThis section, I will detail an approach to calculate the range retention table.
To make things less theoretical, we will be using an example: we want to get to the retention story of a ficticious store by analyzing its orders data.
Our period of interest is the year 2004, our windows are going to be monthly and a user cohort is the month when they made their first purchse at the store.
Note: data has been filtered to retain only customers who made first contact in 2004, we want to limit our analysis to new customers.
1 Raw dataLet’s start with a table containing the orders data.
While all we need for our pupose is a customer ID and an order date, our raw orders data has more information and looks like this:2.
2 Add column capturing the window IDIt all starts simply by creating one column capturing the window ID.
For our use case, this is going to be the month extracted from theorderdate.
3 Define the cohortsLet’s assign each user to a cohort.
For us the cohort ID is going to be the month of the first purchase so we will name our cohort ID column join_month (the month the user joined the store).
We create a table mapping each unique customerid with its join_month.
4 Get the cohort sizesNext, we want to get the number of unique customerid in each cohort.
The result is a simple table containing 12 rows, one per join_month (the cohort ID).
Note: while there is a cohort for month 12, we will not be reporting any retention values for it since we are looking at 12 months of data (we would need to have 13 months to report how cohort #12 is doing on month 12 + 1).
5 Assign a cohort ID and activity index to each interactionThe next step is to go back to the raw orders data and add a couple of useful columns to it.
For each order row, we will want to add: • join_month (based on the customerid and the first table we created) and • activity_index (based on order date month and joined_month).
We will filter out rows where activity_index = 0.
The activity_index captures how many windows after join_monththe order was placed at.
Example: for a join_month of Janurary (1), an order placed in March will have an activity_index of 2 (activity is 2 months after first contact).
We now have a table containing one row per order with columns identifying which cohort placed the order (join_month) and the order date as a number of months after first contact (activity_index).
6 Get the activity sizeWith a bit of aggregation, it is easy to count the number of customers of each cohort who were active n months after joining.
Note: if some customers of a particular cohort did not return at all during an entire month you would want to fill in the blanks to ensure you have a 0 for the missingjoin_month and activity_index combinations (although you may no longer have a business at this point…).
7 Form the range retention tableThis is a last part, we join the counts (cohort size and activity size) and make a simple division to get the retention percentages.
For the final touches: pivot the table, add some color and you’re done!Note: the 0s in the lower triangle are in fact NAs which have been replaced to allow for the nice cell-by-cell shading, see this issue for details.
Turning the table into a retention storyLet’s look at the first row, it deals with a specific cohort of users: the customers who made their first purchase in January.
This cohort has 979 distinct customers.
The green values can be read from left to right to see which percentage of these customers made (at least) one purchase 1, 2, … 11 months after first contact (4.
39% , 4.
01%) — that is the evolution of retention for this particular user cohort over the period of interest (2004).
Realistically, for our store use case, we want this to stay leveled over time after an inital decrease.
But we can imagine a company in the business of releasing products with a limited lifetime (ex: free mobile games).
In this case, it is normal and acceptable to lose users steadily during the lifetime of the product (as long as the next product which will attract users again is released soon enough).
Now, let’s look at the first green column from the left, instead of being for a fixed cohort, it is for a fixed activity_index.
That is quite useful to see if customers come back one month after their first purchase and how that is evolving throughout the period of interest (2004).
Ideally, this is something we slowly but constantly get better at.
The best stuff is always freeWanna get hands-on and see some code?.We’re glad you asked!.This Notebook provides a complete code example in Python (use Github’s “try again” if the notebook fails to load — don’t ask me why this happens sometimes).
The easiest way to get all the artifacts locally is to clone this Github repo.
$ git clone https://github.
gitThanks for reading.
Please come again!.. More details