Most people seem to think about power as being computed for the true value if there is indeed an effect.
This is a misconception.
It would make a significance test into a sort of pseudo-Bayesian test, where we’d need to specify some quantity that we believed was plausible.
But that’s not how significance tests work.
The same design and statistical test combination has the same power (curve) regardless of whether we apply it to a Stroop experiment with a large, easily-replicable effect or, say, an extrasensory perception experiment which has none.
Some people find this strange, but this is exactly how we think of sensitivity in other contexts.
The same design of smoke alarm in a burning house has the same sensitivity as one in a house that is not on fire because the sensitivity is based on a counterfactual state, not the actual state of the world.
What could be, not what is.
Imagine if it were otherwise: a person in a house that was not on fire might say that their smoke alarm has “low power” (sensitivity) merely because it has never gone off.
This is analogous to the way that some try to “estimate” power from data (e.g., post hoc power, or estimating “unknown” power).
In truth, none of that sort of work has anything to do with statistical power, except in the trivial sense that the mathematical equations are the same.
We’d really be better off thinking about “design sensitivity curves” rather than the confused monster “power” has become.
Talking about Type I and Type II errors/probabilities (and the “state of nature”) has made people think that these are discrete, true probabilities rather than reflecting possibilities on a curve.
Let’s build up an example.
For a few minutes, forget about α, β, and Type I and Type II error “rates”.
Power by example

Suppose we worked for a candy company and had determined that our new candy would be either green or purple.
We’ve been tasked with finding out whether people like green or purple candy better, so we construct an experiment where we give people both and see which one they reach for first.
For each person, the answer is either “green” or “purple”.
Let’s call θ the probability of picking purple first, so we’re interested in whether θ>.5 (that is, purple is preferred).
There’s no reason we can’t test other hypotheses that might be interesting to a candy maker (e.g., θ>.7, “purple is substantially preferred”); we’re just building this test for demonstration.
Power/sensitivity curves for the candy example.
The green region (left) represents when green candies are preferred; the purple region (right) represents when purple candies are preferred.
A is the curve for deciding that “purple is preferred” when 31 or more people pick purple first; B is the curve for deciding that “purple is preferred” when 26 or more people pick purple first.
Suppose we fix our design at N=50 people picking candy colors.
We now need a test.
Obviously, in this case the evidence in the data is carried by the number of people we observe who pick purple first.
So we set a criterion on that number, for example: “If 31 or more people pick purple, we’ll say that purple is preferred (i.e., θ>.5).”
We can now draw the power/sensitivity curve for the design and test, given all the potential, hypothetical effect sizes (shown in the figure to the left, as curve “A”).
A “power analysis” is simply noting the features of this curve (perhaps along with changing the potential design by increasing N).
Look at curve A.
If green candies are preferred (θ<.5) we have a very low chance of mistakenly saying that purple candies are preferred (this is good!).
If purple is substantially preferred (θ>.7), we have a good chance of correctly saying that purple is preferred (also good!).
These are all counterfactuals that would hold in any situation where we would apply this test with this design: candies, true/false tests, coin flips, whatever.
The power doesn’t depend on what’s true, only what could be true, and how we set up the design/test.
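The power curve for curve A can be computed directly: at any hypothetical θ, it is just the binomial probability that 31 or more of 50 people pick purple first. A minimal sketch in Python (the helper name `power` is ours, not from the text):

```python
from math import comb

def power(theta, n=50, criterion=31):
    """Chance of declaring "purple is preferred" if the true rate is theta.

    This is the height of the power/sensitivity curve at theta for the test
    "say purple is preferred if `criterion` or more of n people pick purple".
    """
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(criterion, n + 1))

# The curve is defined over hypothetical values of theta, not the true one:
print(power(0.3))  # green strongly preferred: essentially no false "purple" claims
print(power(0.5))  # no preference: small chance of a false "purple" claim
print(power(0.7))  # purple substantially preferred: high chance of detection
```

Nothing in the computation refers to what θ actually is; the curve is a property of the design and test alone.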
Now let’s consider another test for this design: “If 26 or more people pick purple, we’ll say that purple is preferred (θ>.5).”
This could be motivated by saying that we’ll claim that purple is truly preferred whenever the data seem to “prefer” purple.
This is curve “B” in the figure above.
Let’s do a power analysis.
If purple is substantially preferred (θ>.7), we are essentially sure to correctly say that purple is preferred (good!).
If green candies are preferred (θ<.5), we could have a high chance (over 40%) of mistakenly saying that purple candies are preferred (this is bad!).
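The contrast between the two criteria is easy to check numerically. A sketch comparing curve A (criterion 31) and curve B (criterion 26) at a few hypothetical values of θ:

```python
from math import comb

def power(theta, criterion, n=50):
    # Chance that `criterion` or more of n people pick purple, given rate theta
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(criterion, n + 1))

# With no true preference (theta = .5), curve B's looser criterion gives a
# high chance of a false "purple is preferred" claim; curve A's does not:
print(power(0.5, criterion=31))  # curve A: small
print(power(0.5, criterion=26))  # curve B: over 40%

# But if purple is substantially preferred (theta = .7), curve B is
# essentially certain to detect it:
print(power(0.7, criterion=26))
```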
Crucially, what determines our judgment about a power/sensitivity curve is whether it meets our purposes.
We want to be sufficiently protected from often making false claims (saying “purple is preferred” when green actually is), and we want to be able to say “purple is preferred” when it is to some important degree.
What is important is determined by us.
Do I care if purple candies are only barely preferred (θ just above .5)? I don’t care about detecting effects that small.
But if 70% of the time people reach for the purple candy (θ>.7), I would care.
So I make sure my power/sensitivity curve is high in that region.
A design sensitivity analysis — what is often called a power analysis — is just making sure the sensitivity is low in the region where the “null” is true (in common lingo, “controlling” α), and making sure the power/sensitivity is high where we’d care about it.
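That two-part check can be sketched as a tiny program. Here we pick the smallest criterion whose false-claim chance at θ=.5 is at most 5% (a conventional choice for α, not one the text commits to), then read off the sensitivity where we care:

```python
from math import comb

def power(theta, criterion, n=50):
    # Chance that `criterion` or more of n people pick purple, given rate theta
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(criterion, n + 1))

# Keep the curve low where the "null" is true: smallest criterion whose
# chance of wrongly declaring "purple is preferred" at theta = .5 is <= 5%...
criterion = next(c for c in range(51) if power(0.5, c) <= 0.05)

# ...then check that the curve is high where it matters (theta = .7):
print(criterion, power(0.7, criterion))
```

No previous results and no “true” effect enter anywhere; both numbers come from the design and the hypothetical θ values we decided to care about.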
None of this has anything to do with “estimating” power from previous results, or anything to do with the actually true effect.
Critiquing power

Now that we’ve explored the concept of power from the proper perspective, we can better understand what a good power critique looks like.
Here are some things we could say to critique a study on power grounds.
Power critiques are design critiques!

Pick the effect size first: “This design has poor sensitivity to even very large effect sizes of X; the experiment should never have been done in the first place.” This is a perfectly cogent design critique, even after the study is done.
You don’t get credit for a badly designed study just because it happened to be “significant”.
Importantly, though, this critique involves the critic committing to some effect size of interest.
You can’t make a power critique without that commitment.
“This sample size feels small” is not good enough.
Pick the power first: “This design has no more than 0.5 power to detect effect sizes as large as X. It seems like the authors would want to detect an effect if it were indeed that large.”
Many people have difficulty understanding what an “important” effect size might be.
I sympathise; that’s why I think it is easier to choose the “basement” power of 0.5 and then work out to what effect size the design has 0.5 power than it is to pick an effect size that is just important enough. If power is 0.5, you’ll be as likely to miss an effect as to detect it, so a power of 0.5 is a good starting place to evaluate a design.
It’s not the end-point, but it can easily catch very bad designs.
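Working outward from the basement is mechanical in the candy example: scan hypothetical effect sizes until the curve first reaches 0.5. A sketch (the scan step of 0.001 is our choice):

```python
from math import comb

def power(theta, criterion=31, n=50):
    # Chance that `criterion` or more of n people pick purple, given rate theta
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(criterion, n + 1))

# Find the smallest theta at which the N = 50, criterion-31 design has at
# least a 50/50 chance of detection; effects much smaller than this are
# likely to be missed by the design.
theta = 0.5
while power(theta) < 0.5:
    theta += 0.001
print(round(theta, 3))
```

A critic can then ask whether effects below that θ would have been worth detecting, which is a judgment about the design, not the data.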
Focus on other aspects of the design: “The test of interest is appropriate, but the sample size in the chosen design is so small that checking the assumptions of the test is difficult or impossible.
” This is a critique of the sensitivity to things that might cause you to question the results.
A crucial part of a data analysis is quality control, and if you can’t do quality control before your analysis of interest, you should have little faith in your analysis.
A design plan should not only encompass the test/effect of interest, but also any quality checks.
Importantly, these critiques are of the experimental design, not the results.
You could go beyond the design critique; the flip side is that if the effect in question is significant, the estimated effect size must be very large, because if the true effect were smaller, it would have had little chance of being detected.
Scientific judgment then comes into play about the plausibility of the results, given what might actually be true.
But this extension is not a power critique, and critics should not hide behind “power” when their actual critique is one of plausibility.
Plausibility critiques are harder to make because they are subjective, but you just have to own the subjectivity and take responsibility for it.
For more on this issue, you can see Mayo and Morey (preprint) on why power critiques involving the so-called “posterior predictive value” or “false discovery rate” are questionable (at best), or Morey and Lakens (preprint) on how power is poorly conceived in popular replication studies.
Yes, I’m writing this to procrastinate on getting those papers revised and submitted, but I hope it was a good primer.