# Survival analysis and the stratified sample

This is determined by the hazard rate, which is the proportion of events in a specific time interval (for example, deaths in the 5th year after beginning cancer treatment), relative to the size of the risk set at the beginning of that interval (for example, the number of people known to have survived 4 years of treatment).

As an example of hazard rate: 10 deaths out of a million people (hazard rate 1/100,000) probably isn’t a serious problem.

But 10 deaths out of 20 people (hazard rate 1/2) will probably raise some eyebrows.
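The two figures above can be reproduced with a few lines of Python. This is only a minimal sketch of the definition given earlier (events in an interval divided by the risk set at the start of that interval); the function name `hazard_rate` is an illustrative choice, not something from the original analysis.

```python
from fractions import Fraction

def hazard_rate(events: int, at_risk: int) -> Fraction:
    """Events in an interval divided by the risk set at the start of that interval."""
    if at_risk <= 0:
        raise ValueError("the risk set must contain at least one subject")
    if not 0 <= events <= at_risk:
        raise ValueError("events must be between 0 and the size of the risk set")
    return Fraction(events, at_risk)

# The two examples from the text
rare = hazard_rate(10, 1_000_000)   # 10 deaths out of a million people
alarming = hazard_rate(10, 20)      # 10 deaths out of 20 people
```

Using `Fraction` keeps the results exact, so the two rates read as 1/100,000 and 1/2 rather than as rounded decimals.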

The present study examines the timing of responses to a hypothetical mailing campaign.

While the data are simulated, they are closely based on actual data, including data set size and response rates.

The population-level data set contains 1 million “people”, each with between 1 and 20 weeks’ worth of observations.

The data are normalized such that all subjects receive their mail in Week 0.

Subjects’ probability of response depends on two variables, age and income, as well as a gamma function of time.

Thus, the unit of analysis is not the person, but the person*week.

Above: The formula used to calculate the log-odds of response probability in a given week.

z is the log-odds of response probability.

The function g(week) is a gamma function of time.
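To make the relationship between z and the weekly response probability concrete, here is a rough Python sketch. Everything specific in it is an assumption made for illustration: the coefficient values, the choice to add the log of a gamma density as g(week), and the shape and scale parameters are hypothetical, since the original formula is not reproduced in the text. Only the logistic link between z and probability is standard.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
INTERCEPT = -9.0
BETA_AGE = 0.01
BETA_INCOME = 0.00001

def g(week: int, shape: float = 1.5, scale: float = 2.0) -> float:
    """Assumed time term: the log of a gamma density evaluated at `week` (week >= 1)."""
    return ((shape - 1) * math.log(week) - week / scale
            - math.lgamma(shape) - shape * math.log(scale))

def log_odds(age: float, income: float, week: int) -> float:
    """z: linear in age and income, plus the time term g(week)."""
    return INTERCEPT + BETA_AGE * age + BETA_INCOME * income + g(week)

def response_probability(age: float, income: float, week: int) -> float:
    """Convert the log-odds z into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(age, income, week)))

# Weekly response probabilities for one hypothetical subject over weeks 1 to 20
weekly = [response_probability(age=40, income=50_000, week=w) for w in range(1, 21)]
```

With these assumed parameters the probabilities fall from week to week, which matches the qualitative behavior the article describes; the actual magnitudes depend entirely on the made-up coefficients.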

A quick check confirms that our data adhere to the general shape we’d predict.

Above: the gamma function used in the hazard function (line) overlaid against a density histogram of hazard rates in the simulated data set (bars).

As time passes after mailing, subjects become less and less likely to respond.

An individual has about a 1/10,000 chance of responding in each week, depending on their personal characteristics and how long ago they were contacted.

Below is a snapshot of the data set.

It zooms in on Hypothetical Subject #277, who responded 3 weeks after being mailed.

As described above, they have a data point for each week they’re observed.
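The person*week layout can be sketched in Python as follows. This is an illustrative reconstruction, not the code used to build the data set: the field names, the age and income values, and the convention that observation starts in week 1 are all assumptions.

```python
def expand_to_person_weeks(subject_id, age, income, last_week, responded):
    """Return one row per observed week; only the final week can carry a response."""
    rows = []
    for week in range(1, last_week + 1):
        rows.append({
            "subject": subject_id,
            "week": week,
            "age": age,
            "income": income,
            # 1 only in the week the subject responded, 0 in every other week
            "response": int(responded and week == last_week),
        })
    return rows

# A subject like #277, who responded 3 weeks after being mailed
# (the age and income values are hypothetical)
rows_277 = expand_to_person_weeks(277, age=45, income=60_000, last_week=3, responded=True)
```

A subject who never responds would get the same kind of rows, one per observed week, with `response` equal to 0 throughout.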

The probability values which generate the binomial response variable are also included; these probability values will be what a logistic regression tries to match.

## Part II: Case-control sampling and regression strategy

Due to resource constraints, it is unrealistic to perform logistic regression on data sets with millions of observations and dozens (or even hundreds) of explanatory variables.

Luckily, there are proven methods of data compression that allow for accurate, unbiased model generation.

### Traditional logistic case-control

Case-control sampling is a method that builds a model based on random subsamples of “cases” (such as responses) and “controls” (such as non-responses).

Regardless of subsample size, the effect of explanatory variables remains constant between the cases and controls, so long as the subsample is taken in a truly random fashion.

For example, if women are twice as likely to respond as men, this relationship would be borne out just as accurately in the case-control data set as in the full population-level data set.

Thus, we can get an accurate sense of what types of people are likely to respond, and what types of people will not respond.

After the logistic model has been built on the compressed case-control data set, only the model’s intercept needs to be adjusted.

While relative probabilities do not change (for example male/female differences), absolute probabilities do change.

For example, take a population with 5 million subjects, and 5,000 responses.

If the case-control data set contains all 5,000 responses, plus 5,000 non-responses (for a total of 10,000 observations), the model would predict that response probability is 1/2, when in reality it is 1/1000.

When all responses are used in the case-control set, the offset added to the logistic model’s intercept is log(n_0 / N_0). Here, N_0 is equal to the number of non-events in the population, while n_0 is equal to the number of non-events in the case-control set.
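The 5-million-subject example can be checked numerically. The sketch below assumes the standard intercept correction for the case where every response is kept and only non-responses are subsampled, namely adding log(n_0 / N_0) to the fitted intercept; the function names are illustrative.

```python
import math

def intercept_offset(n0_sample: int, n0_population: int) -> float:
    """Offset added to the intercept when all events are kept: log(n_0 / N_0)."""
    return math.log(n0_sample / n0_population)

def inv_logit(z: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

responses = 5_000
population = 5_000_000
N0 = population - responses   # non-events in the population
n0 = 5_000                    # non-events kept in the case-control set

# An intercept-only model fit on the case-control set sees 5,000 vs 5,000
sample_logit = math.log(responses / n0)
naive_probability = inv_logit(sample_logit)

# Adding the offset restores the population-level response probability
corrected_probability = inv_logit(sample_logit + intercept_offset(n0, N0))
```

Here `naive_probability` is the misleading 1/2 from the text, and `corrected_probability` comes back to 1/1000, the true population rate.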

As a reminder, in survival analysis we are dealing with a data set whose unit of analysis is not the individual, but the individual*week.

The following very simple data set demonstrates the proper way to think about sampling.

The first technique incorrectly picks a few individuals and follows them over time.

The second technique captures much more variability by randomly selecting individual observations from the data set.

### Survival analysis case-control and the stratified sample

Things become more complicated when dealing with survival analysis data sets, specifically because of the hazard rate.

For example, if an individual is twice as likely to respond in week 2 as they are in week 4, this information needs to be preserved in the case-control set.

And the best way to preserve it is through a stratified sample.

With stratified sampling, we hand-pick the number of cases and controls for each week, so that the relative response probabilities from week to week are fixed between the population-level data set and the case-control set.

This way, we don’t accidentally skew the hazard function when we build a logistic model.

This can easily be done by taking a set number of non-responses from each week (for example 1,000).

This method requires that a variable offset be used, instead of the fixed offset seen in the simple random sample.

The offset value changes by week: for week j it is log(n_0j / N_0j), where N_0j is the number of non-events in the population during week j and n_0j is the number of non-events sampled from week j. Again, the formula is the same as in the simple random sample, except that instead of looking at response and non-response counts across the whole data set, we look at the counts on a weekly level, and generate different offsets for each week j.

Because the offset is different for each week, this technique guarantees that data from week j are calibrated to the hazard rate for week j.
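A stratified draw with weekly offsets might look like the Python sketch below. It is an illustration under stated assumptions rather than the procedure used in the study: it keeps every response, takes a fixed number of non-responses per week, and uses log(n_0j / N_0j) as the offset for week j. The toy counts and the function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def stratified_case_control(rows, controls_per_week, seed=0):
    """Keep every response; draw a fixed number of non-responses from each week.

    Returns the sampled rows and a dict mapping week j to log(n_0j / N_0j).
    """
    if controls_per_week < 1:
        raise ValueError("controls_per_week must be at least 1")
    rng = random.Random(seed)
    sample = [r for r in rows if r["response"] == 1]
    controls_by_week = defaultdict(list)
    for r in rows:
        if r["response"] == 0:
            controls_by_week[r["week"]].append(r)
    offsets = {}
    for week in sorted(controls_by_week):
        pool = controls_by_week[week]
        k = min(controls_per_week, len(pool))   # cannot draw more than exist
        sample.extend(rng.sample(pool, k))
        offsets[week] = math.log(k / len(pool))
    return sample, offsets

# Toy person-week data (hypothetical counts): week -> (responses, non-responses)
toy_counts = {1: (8, 3992), 2: (2, 1998)}
toy_rows = [
    {"week": week, "response": flag}
    for week, (n_resp, n_non) in toy_counts.items()
    for flag in [1] * n_resp + [0] * n_non
]

sample, offsets = stratified_case_control(toy_rows, controls_per_week=100)

def recovered_hazard(week):
    """Weekly log-odds in the sample plus that week's offset, mapped to a probability."""
    cases = sum(1 for r in sample if r["week"] == week and r["response"] == 1)
    controls = sum(1 for r in sample if r["week"] == week and r["response"] == 0)
    z = math.log(cases / controls) + offsets[week]
    return 1.0 / (1.0 + math.exp(-z))
```

In the toy data the true hazards are 8/4000 in week 1 and 2/2000 in week 2. After the weekly offsets are applied, `recovered_hazard` returns those same rates, so the 2:1 ratio between the weeks is preserved even though both weeks contribute the same number of controls.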

### Code for logistic regression

The following R code reflects what was used to generate the data (the only difference was the sampling method used to generate sampled_data_frame):

```r
glm_object = glm(response ~ age + income + factor(week), data = sampled_data_frame, family = "binomial")
```

Using factor(week) lets R fit a unique coefficient to each time period, an accurate and automatic way of defining a hazard function.

It is possible to manually define a hazard function, but while this manual strategy would save a few degrees of freedom, it does so at the cost of significant effort and chance for operator error, so allowing R to automatically define each week’s hazards is advised.

## Part III: Comparing sampling methods

By this point, you’re probably wondering: why use a stratified sample? What’s the point? And it’s true: until now, this article has presented some long-winded, complicated concepts with very little justification.

The point is that the stratified sample yields significantly more accurate results than a simple random sample.

To prove this, I looped through 1,000 iterations of the process below.

First, I took a sample of a certain size (or “compression factor”), either a simple random sample (SRS) or a stratified sample.

I then built a logistic regression model from this sample.

I used that model to predict outputs on a separate test set, and calculated the root mean-squared error between each individual’s predicted and actual probability.
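For reference, the root mean-squared error between predicted and actual probabilities can be computed as in this short Python sketch. The function name and the example values are illustrative only; they are not results from the study.

```python
import math

def rmse(predicted, actual):
    """Root mean-squared error between two equal-length sequences of probabilities."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predicted and actual weekly response probabilities for three test rows
example_error = rmse([0.0002, 0.0001, 0.0003], [0.0001, 0.0001, 0.0001])
```

A lower value means the model's predicted probabilities sit closer to the probabilities that generated the test data.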

Below are the results of this iterated sampling.

While both techniques become more accurate as the sample approaches the size of the original data set (smaller compression factor), it is clear that the RMSE for stratified samples is lower regardless of data set size.

It can easily be seen (and is confirmed via multi-factorial ANOVA) that stratified samples have significantly lower root mean-squared error at every level of data compression.

Again, this is specifically because the stratified sample preserves changes in the hazard rate over time, while the simple random sample does not.

## Part IV: Conclusions

First, we looked at different ways to think about event occurrences in a population-level data set, showing that the hazard rate was the most accurate way to buffer against data sets with incomplete observations.

Then, we discussed different sampling methods, arguing that stratified sampling yielded the most accurate predictions.

This was demonstrated empirically with many iterations of sampling and model-building using both strategies.

This strategy applies to any scenario with low-frequency events happening over time.

In social science, stratified sampling could look at the recidivism probability of an individual over time.

In medicine, one could study the time course of probability for a smoker going to the hospital for a respiratory problem, given certain risk factors.

In engineering, such an analysis could be applied to rare failures of a piece of equipment.

While these types of large longitudinal data sets are generally not publicly available, they certainly do exist — and analyzing them with stratified sampling and a controlled hazard rate is the most accurate way to draw conclusions about population-wide phenomena based on a small sample of events.
