At one extreme we can treat every individual survey response as entirely independent ignoring the fact that we surveyed individuals multiple times and pretending each survey is totally unique.

At the other end, we can assume that the relationship between survey responses and demographics come from a higher-level distribution and specific people’s estimates are instances of this distribution, preserving the fact that each person’s own responses are more similar to each other than they are to anyone else’s responses.

I’ll return to this a bit more below.

Simulations can help us build intuitions.

Often in cases like this we can use simulated data, designed to vary in particular ways, to help us gain some insight as to how these things influence our different analysis strategies.

So let’s see how that looks.

I’m going to be primarily using the pymer4 Python package that I wrote to simulate some data and compare these different models.

I wrote this package originally so I could reduce the switch cost I kept experiencing bouncing between R and Python for my work.

I quickly realized that my primary need for R was using the fantastic lme4 package for multi-level modeling and so I wrote this Python package as a way to use lme4 from within Python while playing nicely with the rest of the scientific Python stack (e.

g.

pandas, numpy, scipy, etc).

Since then the package has grown quite a bit (Jolly, 2018), including the ability to fit the different types of models discussed above and simulate different kinds of data.

Ok let’s get started:# Import what we need import pandas as pd import numpy as np from pymer4.

simulate import simulate_lmm, simulate_lm from pymer4.

models import Lm, Lmer, Lm2 import matplotlib.

pyplot as plt import seaborn as sns sns.

set_context('poster') sns.

set_style("whitegrid") %matplotlib inlineStarting smallLet’s start out with a single simulated dataset and fit each type of model discussed above.

Below I’m generating multi-level data similar to our toy example above.

The dataset is comprised of 50 “people” with 50 “replicates” each.

For each person, we measured 3 independent variables (e.

g.

3 survey questions) and would like to relate them to 1 dependent variable (e.

g.

1 demographic outcome).

num_obs_grp = 50 num_grps = 50 num_coef = 3 formula = 'DV ~ IV1 + IV2 + IV3' data, blups, betas = simulate_lmm(num_obs_grp, num_coef, num_grps) data.

head()We can see that the overall dataset is generated as described above.

Simulating data this way also allows us to generate the best-linear-unbiased-predictions (BLUPs) for each person in our dataset.

These are the coefficients for each individual person.

blups.

head()Finally, we can also checkout the “true” coefficients that generated these data.

These are the “correct answers” we hope that our models can recover.

Since these data have been simulated using the addition of noise to each individual’s data (mu = 0, sigma² = 1), and with variance across individuals (pymer4’s default is sigma² = 0.

25) we don’t expect perfect recovery of these parameters, but something pretty close (we’ll explore this more below).

# Regression coefficients for intercept, IV1, IV2, and IV3print(f"True betas: {betas}")True betas: [0.

18463772 0.

78093358 0.

97054762 0.

45977883]Evaluating performanceOk time to evaluate some modeling strategies.

For each model type I’ll fit the model to the data as described, and then compute 3 metrics:Absolute Error of Coefficient Recovery — this is simply the sum of the absolute value differences between the real coefficients and the estimated ones.

It gives us the total error of our model with respect to the data-generating coefficients.

We could have computed the average instead of the sum, but since our simulated data are all on the same scale, the sum provides us the exact amount we’re off from what we were expecting to recover.

Sum of Model Standard Errors — this and the next measure are more related to the inferences we want to make on our parameters.

SE and the associated confidence intervals tell us the total amount of variance around our estimates given this particular modeling strategy.

Once again, we could have computed the average, but like above, the sum gives us the total variance across all our parameters.

Sum of Model T-statistics — this is the sum of the absolute value of the t-statistics of our model estimates.

This gives us a sense of how likely we would be to walk away with the inference that there is a statistically significant relationship between our independent variables and dependent variable.

All else being equal, larger t-stats generally mean smaller p-values so we can build an intuition about how sensitive our modeling strategy is to tell us “yup this is a statistically significant effect.

”Multi-level modelsLet’s begin with fitting a multi-level model specifying the complete set of all possible parameters we can estimate.

This has the effect of letting each individual have their own set of regression estimates while still treating these estimates as coming from a common distribution.

You can see below we can recover the parameters pretty well and as we expect all our are results are “significant.

”# Fit lmer with random intercepts, slopes, and their correlations lmer = Lmer(formula + '+ (IV1 + IV2 + IV3 | Group)',data=data) lmer.

fit(summarize=False) print(f"Absolute Error of Coef Recovery: {diffs(betas, lmer.

coefs['Estimate'])}") print(f"Sum of Model Standard Errors: {lmer.

coefs['SE'].

sum()}") print(f"Sum of Model T statistics: {lmer.

coefs['T-stat'].

abs().

sum()}")Absolute Error of Coef Recovery: 0.

10582723675727804Sum of Model Standard Errors: 0.

16626160033359066Sum of Model T statistics: 54.

46274837271574Next, let’s see what happens when we fit a multi-level model with the simplest possible “random effects” structure.

Notice that by not letting each individual be free to have their own estimates (aside from their own mean/intercept), our coefficient recovery drops a little bit, but our t-statistics increase dramatically.

This looks to be driven by the fact that the variance estimates of the coefficients (standard errors) are quite a bit smaller.

All else being equal, we would be much more likely to identify “significant” relationships using a simpler, or in this case, “misspecified” multi-level model, since we know that the data were generated such that each individual did in fact, have different BLUPs.

# Fit lmer with random-intercepts only lmer_mis = Lmer(formula + '+ (1 | Group)',data=data) lmer_mis.

fit(summarize=False) print(f"Absolute Error of Coef Recovery: {diffs(betas,lmer_mis.

coefs['Estimate'])}") print(f"Sum of Model Standard Errors: {lmer_mis.

coefs['SE'].

sum()}") print(f"Sum of Model T statistics: {lmer_mis.

coefs['T-stat'].

abs().

sum()}")Absolute Error of Coef Recovery: 0.

1311098578975485 Sum of Model Standard Errors: 0.

10466304433776347 Sum of Model T statistics: 101.

04453264690632Cluster-robust modelsNext, let’s evaluate the cluster-robust-error modeling approach.

Remember, this involves estimating a single regression model to obtain coefficient estimates, but then applying a correction factor to the SEs, and thereby the t-statistics to adjust our inferences.

It looks like our coefficient recovery is about the same as our simple multi-level model above, but our inferences are far more conservative due to the larger standard-errors and smaller t-statistics.

In fact, these are even a bit more conservative than the fully specified multi-level model we estimated first.

# Fit clustered errors LM lm = Lm(formula,data=data) lm.

fit(robust='cluster',cluster='Group',summarize=False)print(f"Absolute Error of Coef Recovery: {diffs(betas,lm.

coefs['Estimate'])}") print(f"Sum of Model Standard Errors: {lm.

coefs['SE'].

sum()}") print(f"Sum of Model T statistics: {lm.

coefs['T-stat'].

abs().

sum()}")Absolute Error of Coef Recovery: 0.

13278703905990247 Sum of Model Standard Errors: 0.

17543089877808005 Sum of Model T statistics: 51.

03327490657406Two-stage-least-squaresLastly, let’s use the two-stage-least-squares approach.

We’ll fit a separate regression to each of our 50 people and then compute another regression on those 50 coefficients.

In this simple example, we’re really just computing a one-sample t-test on these 50 coefficients.

Notice that our coefficient recovery is a tiny bit better than our fully-specified multi-level model and our inferences (based on T-stats and SEs) would largely be similar.

This suggests that for this particular dataset we could have gone with either strategy and walked away with the same inference.

# Fit two-stage OLS lm2 = Lm2(formula,data=data,group='Group') lm2.

fit(summarize=False)print(f"Absolute Error of Coef Recovery: {diffs(betas, lm2.

coefs['Estimate'])}") print(f"Sum of Model Standard Errors: {lm2.

coefs['SE'].

sum()}") print(f"Sum of Model T statistics: {lm2.

coefs['T-stat'].

abs().

sum()}")Absolute Error of Coef Recovery: 0.

09216204521686983 Sum of Model Standard Errors: 0.

16523098907361963 Sum of Model T statistics: 55.

00162203260664Simulating a universe.

Now, this was only one particular dataset with a particular size and particular level of between-person variability.

Remember the dimensions outlined above?.The real question we want to answer is how these different modeling strategies vary with respect to each of those dimensions.

So let’s expand our simulation here.

Let’s generate a “grid” of settings such that we simulate every combination of dimensions we can in a reasonable amount of time.

Here’s the grid we’ll try to simulate:Going down the rows we’ll vary dimension 1 the sample size of the units we’re making inferences over (number of people) from 5 -> 100.

Going across columns we’ll vary dimension 2, the sample size of the units nested within the units we’re making inferences over (number of observations per person) from 5 -> 100.

Going over the z-plane we’ll vary dimension 3 the variance between the units we’re making inferences over (between-person variability) from 0.

10 -> 4 standard deviations.

Since varying dimension 1 and dimension 2 should make intuitive sense (they’re different aspects of the sample size of our data) let’s explore what varying dimension 3 looks like.

Here are plots illustrating how changing our between person variance influences coefficients.

Each figure below depicts a distribution of person level coefficients; these are the BLUPs we discussed above.

When simulating a dataset with two parameters described by an intercept and a slope (IV1), notice how each distribution is centered on the true value of the parameter, but the width of the distribution increases as we increase the between-group variance.

These distributions are the distributions that our person level parameters come from.

So while they average out to be the same value, they are increasingly dispersed around that value.

As these distributions become wider it becomes more challenging to recover the true coefficients of the data if a dataset is too small, as models need more data in order to stabilize their estimates.

For the sake of brevity I’ve removed the plotting code for the figures below, but am happy to share them on request.

Setting it upThe next code block sets up this parameter grid and defines some helper functions to compute the metrics defined above.

Since this simulation took about ~50 minutes to run on a 2015 quad-core Macbook Pro, I also defined some functions to save each simulation to a csv file.

# Define the parameter grid nsim = 50 # Number of simulated datasets per parameter combination num_grps = [10, 30, 100] # Number of "clusters" (i.

e.

people) obs_grp = [5, 25, 100] # Number of observations per "cluster" grp_sigmas = [.

1, .

25, 1.

, 2.

, 4.

] # Between "cluster" variance num_coef = 3 # Number of terms in the regression equation noise_params = (0, 1) # Assume each cluster has normally distributed noise seed = 0 # to repeat this simulation formula = 'DV ~ IV1 + IV2 + IV3' # The model formula # Define some helper functions.

diffs() was used above examining each model in detail def diffs(a, b): """Absolute error""" return np.

sum(np.

abs(a – b)) def calc_model_err(model_type, formula, betas, data): """ Fit a model type to data using pymer4.

Return the absolute error of the model's coefficients, the sum of the model's standard errors, and the sum of the model's t-statistics.

Also log if the model failed to converge in the case of lme4.

""" if model_type == 'lm': model = Lm(formula, data=data) model.

fit(robust='cluster',cluster='Group',summarize=False) elif model_type == 'lmer': model = Lmer(formula + '+ (IV1 + IV2 + IV3 | Group)',data=data) model.

fit(summarize=False, no_warnings=True) elif model_type == 'lmer_mis': model = Lmer(formula + '+ (1 | Group)',data=data) model.

fit(summarize=False, no_warnings=True) elif model_type == 'lm2': model = Lm2(formula,data=data,group='Group') model.

fit(n_jobs = 2, summarize=False) coef_diffs = diffs(betas, model.

coefs['Estimate']) model_ses = model.

coefs['SE'].

sum() model_ts = model.

coefs['T-stat'].

abs().

sum() if (model.

warnings is None) or (model.

warnings == []): model_success = True else: model_success = False return coef_diffs, model_ses, model_ts, model_success, model.

coefs def save_results(err_params, sim_params, sim, model_type, model_coefs, df, coef_df, save=True): """Aggregate and save results using pandas""" model_coefs['Sim'] = sim model_coefs['Model'] = model_type model_coefs['Num_grp'] = sim_params[0] model_coefs['Num_obs_grp'] = sim_params[1] model_coefs['Btwn_grp_sigma'] = sim_params[2] coef_df = coef_df.

append(model_coefs) dat = pd.

DataFrame({ 'Model': model_type, 'Num_grp': sim_params[0], 'Num_obs_grp': sim_params[1], 'Btwn_grp_sigma': sim_params[2], 'Coef_abs_err': err_params[0], 'SE_sum': err_params[1], 'T_sum': err_params[2], 'Fit_success': err_params[3], 'Sim': sim }, index = [0]) df = df.

append(dat,ignore_index=True) if save: df.

to_csv('.

/sim_results.

csv',index=False) coef_df.

to_csv('.

/sim_estimates.

csv') return df, coef_df # Run it results = pd.

DataFrame() coef_df = pd.

DataFrame() models = ['lm', 'lm2', 'lmer', 'lmer_mis'] for N in num_grps: for O in obs_grp: for S in grp_sigmas: for I in range(nsim): data, blups, betas = simulate_lmm(O, num_coef, N, grp_sigmas=S, noise_params=noise_params) for M in models: c, s, t, success, coefs, = calc_model_err(M, formula, betas, data) results, coef_df = save_results([c,s,t, success], [N,O,S], I, M, coefs, results, coef_df)ResultsFor the sake of brevity I’ve removed the plotting code for the figures below, but am happy to share them on request!Coefficient RecoveryOk, let’s take first take a look at our coefficient recovery.

If we look from the top left of the grid to the bottom right the first thing to jump out is that when we increase our overall sample size (number of clusters * number of observations per cluster), and our between cluster variability is medium to low, all model types do a similarly good job of recovering the true data generating coefficients.

In other words, under good conditions (lots of data that isn’t too variable) we can’t go wrong picking any of the analysis strategies.

In the converse, going from bottom left to top right, when between cluster variability is high, we quickly see the importance of having more clusters rather than more observations per cluster; without enough clusters to observe, even a fully specified multi-level model does a poor job of recovering the true coefficients.

When we have small to medium sized datasets and lots of between-cluster variability all models tend to do a poor job of recovering the true coefficients.

Interestingly, having particularly few observations per cluster (left-most column) disproportionately affects two-stage-least-squares estimation (orange boxplots).

This is consistent with Gelman, 2005 who suggests that with few per cluster observations, the first-level OLS estimates are pretty poor with high-variance and there are none of the multi-level modeling benefits to help offset the situation.

This situation also seems to favor fully-specified multi-level models the most (green boxplots), particularly when between cluster variability is high.

It’s interesting to note that cluster-robust, and misspecified (simple) multi-level models seem to perform similarly in this situation.

In medium data situations (middle column) cluster-robust models seem to do a slightly worse job across the board of recovering coefficients.

This is most likely due to the fact that the estimates completely ignore the clustered nature of the data and have no smoothing/regularization applied to them either through averaging (in the case of the two-stage-ls models) or through random-effects estimation (in the case of the multi-level models).

Finally, in the high observations per cluster situation (right-most column), all models seem to perform rather similarly suggesting that each modeling strategy is about as good as any other when we densely sampling the unit of interest (increasing number of observations per cluster) even if the desire is to make inferences about the clusters themselves.

Making Inferences (SEs + T-stats)Next, let’s look at both standard errors and t-statistics to see how our inferences might vary.

The effect of increased between cluster variance has a very notable effect on SEs and t-stats values generally making it less likely to identify a statistically significant relationship regardless of the size of the data.

Interestingly, what two-stage-least-squares models exhibit in terms of poorer coefficient recovery in situations with few observations per cluster, they make up for with higher standard error estimates.

We can see that their t-statistics are low in these situations suggesting that in these situations this approach may tip the scales towards lower false-positive, higher false-negative inferences.

However, unlike other model types, they do not necessarily benefit from more clusters overall (bottom left panel) and run the risk of an inflated level of false negatives.

Misspecified multilevel models seem to have the opposite properties: they have higher t-stats and lower SEs in most situations with medium to high between-cluster variability and benefit the most from situations with a high number of observations per cluster.

This suggests they might run the risk of introducing more false positives in situations where other models may behave more conservatively, but also may be more sensitive to detecting true relationships in the face of high between-cluster variance.

They also seem to benefit most from increasing observations within cluster.

Inferences from cluster-robust and full-specified multi-level models seem to be largely comparable which is consistent with the proliferate use of both these model types across multiple literatures.

Bonus: When fully-specified multi-level models failFinally, we can take a brief look at what situations most often cause convergence failures for our fully-specified multi-level models (note: the simple multi-level models examined earlier never failed to converge in these simulations).

In general, this seems to occur when between cluster variability is low, or the number of observations per cluster is very small.

This makes sense because even though the data were generated in a multi-level manner, clusters are quite similar and simplifying models by discarding terms which try to model variance that may not be exhibited by the data in a meaningful way (e.

g.

dropping “random-slopes”) achieve better estimation overall.

In other words, the model may be trying to fit a variance parameter that is small enough to cause it to run out of optimizer iterations before it reaches a suitably small change in error.

This is like trying to find the lowest point on a “hill” that has a very shallow declining slope by comparing the height of your current step to the height of your previous step.

ConclusionsSo what have we learned?.Here are some intuitions that I think this exercise has helped flesh out:Reserve the two-stage-least squares for situations when there are enough observations per cluster.

This is because modeling each cluster separately without any type of smoothing/regularization/prior imposed by multi-level modeling, produces poor first-level estimates in these situations.

Be careful using misspecified/simple multi-level models.

While they may remove some of the complexity involved in specifying the “random effects” part of the model, and they converge pretty much all the time, they are more likely to lead to statistically significant inferences relative to other approaches (all else being equal).

This may be warranted if your data don’t exhibit enough between-cluster variance.

It may be generally preferable then to specify a model structure that accounts for variance confounded with predictors of interest (Barr et al, 2013) (i.

e.

dropping the correlation term between a random intercept and random slope rather than dropping the random slope), in other words the most “maximal” structure you can get away with, with respect to the inferences you want to make of your data.

Cluster-robust models appear to be an efficient solution if your primary goal is making inferences and you can live with coefficient estimates that are a bit less accurate than other approaches.

These are harder to specify if there are multiple levels of clustering in the data (e.

g.

survey responses within-person, within-city, within-state, etc) or if accounting for item-level effects are important (Baayen et al, 2008).

However, there are techniques to incorporate two-way or multi-way cluster-robust errors and such approaches are reasonably common in economics.

This lecture and this paper discuss these approaches further.

Pymer4 used for this post only implements one-way clustering.

Consider using two-stage-least-squares or cluster-robust errors instead of misspecified multi-level models as your inferences maybe be largely similar to fully-specified multi-level models that converge successfully.

This may not be true if item-level variance or multiple-levels of clustering need to be taken into account, but for relatively straight forward cases illustrated in this post, they seem to fair just fine.

Generally, simulations can be helpful ways to build statistical intuitions especially if the background mathematics feels daunting.

This is has been one of my preferred approaches for learning statistical concepts in more depth and has made reading literature heavy on mathematical formalisms far more approachable.

Caveats and cautionsI don’t want to end this post with the feeling that we’ve figured everything out and are expert analysts now, but rather appreciate that there are some limitations to this exercise that are worth keeping in mind.

While we can build some general intuitions, there are conditions under which these intuitions may not always hold and it’s incredibly important to be aware of them:In many ways, the data generated in these simulations were “ideal” for playing around with these different approaches.

Data points were all on the same scale, came from a normal distribution with known means and variances, contained no missing data points, and adhered to other underlying assumptions of these statistical approaches not discussed.

Similarly, I built-in “real relationships” into the data and then tried to recover them.

I made mention of false-positives and negatives throughout the post, but I did not formally estimate the false-positive or false-negative rate for any of these approaches.

Again this was by design to leave you with some general intuitions, but there exist several papers that utilize this approach to more explicitly defend the use of certain inference techniques (e.

g.

Luke, 2017).

The space of parameters we explored (i.

e.

the different “dimensions” of our data) spanned a range I thought reasonably covered a variety of datasets often collected in empirical social science laboratory studies.

In the real world, data are far messier and increasingly, far larger.

More data is almost always better, particularly if it’s of high quality, but what constitutes quality can be very different based on the inferences one wants to make.

Sometimes high variability between clusters is desirable, other times densely sampling a small set of clusters is more important.

These factors will vary based on the questions one is trying to answer.

The metrics I chose to evaluate each model with are simply the ones that I wanted to know about.

There are certainly other metrics that could be more or less informative based on what intuitions you would like to build.

For example, what is the prediction accuracy of each model?ConclusionI hope this was useful for some folks out there and even if it wasn’t, it certainly helped me build some intuitions about the different analysis strategies that are available.

Moreover, I hope if nothing else, this might motivate people who feel like they have limited formal training in statistics/machine-learning to take a more tinker/hacker approach to their own learning.

I remember when as a kid, breaking things and taking them apart was one of my favorite ways to learn about how things worked.

With the mass availability of free and open-source tools like scientific Python and R, I see no reason why statistical education can’t be the same.

AppendixThis is a nice quick guide that defines much of the terminology across different fields and reviews a lot of the concepts covered here (plus more) in a much more pithy way.

For those interested, p-values for multi-level models were computed using the lmerTest R package using Satterthwaite approximation for degrees of freedom calculation; note that based on the random-effects structure specified, these degrees of freedom can change dramatically.

P-values for other model types were computed using a standard t-distribution, but pymer4 also offers non-parametric permutation testing and boot-strapped confidence intervals for other styles of inference.

At the time of this writing, fitting two-stage-least-squares models is only available in development branch on github, but should be incorporated in a new release in the future.

Eshin Jolly © 2019Originally published at eshinjolly.

com on February 18, 2019.

.