Precision and Recall Trade-off and Multiple Hypothesis Testing

Measured chance probability of rejecting the " "null: %.

3f (should be %.

3f)" % (type_I_error_p, alpha))> 5000 out of 5000 completed (fraction of rejections so far: 0.

05)> Measured chance probability of rejecting the null: 0.

047 (should be 0.

050)It worked.

The type I error probability is indeed very close to the nominal 5%.

Issues with Bonferroni-like corrections: Global vs local hypothesisUntil now we were dealing with the “global” null hypothesis for the problem “is there any pair where the average is different between the two populations?” The null hypothesis is that all pairs have the same average between the populations, the alternative being that at least one does not.

However, often we are interested in another problem: “find all the pairs where the average is different”.

Or, “find all the stellar explosions”, as in my Astrophysics problem.

In this second case, each pair gets its own null and alternative hypothesis, and we are interested in how many null hypotheses are rejected.

It is clear that the Bonferroni correction will still guarantee a global α type I error probability of rejecting one or more nulls by chance, but it penalizes all tests in order to do so, because α for each test is given by the Sidak formula and 1−(1−α)^(1/m) < α for m>1.

Moreover, as m grows, the global null hypothesis is still tested with the same type I error probability, but each of the m null hypotheses gets tested more and more restrictively, and as m→∞ we have α′→0 so it would be extremely difficult to find any deviation from the null hypothesis.

In other words, “the more you look, the less you find.

”Let’s illustrate this by considering the Type II error of one single test, i.

e.

the probability of not rejecting the null when we should have.

First, let’s generate and test a pair where the null is false:# Let's get a dataset with 1 group and# the null hypothesis is Falsew1, w2, ground_truth = generate_dataset(n_datasets=1, n_null_true=0)# Let's now apply the testalpha = 0.

05pvalue = apply_ttest(w1, w2)if null_hyp_status(pvalue, alpha) is True: print("We do not reject the null hyp.

")else: print("We reject the null hyp.

")> We reject the null hyp.

We have rightfully rejected the null hypothesis.

Now let’s see how many times we fail to reject the null even if it is false over many repetitions of the same experiment (type II error probability):type_II_error_p = 1 – measure_rejection_prob(5000, apply_ttest, null_hyp_status, alpha, n_datasets=1, n_null_true=0)print(".Measured chance probability of *not* rejecting the " "null: %.

3f" % (type_II_error_p))> 5000 out of 5000 completed (fraction of rejections so far: 0.

94)> Measured chance probability of *not* rejecting the null: 0.

062So, for one test we have a probability of around 6% (β=0.

06) of not rejecting the null even if it is false (of course, β depends on α, as well as on the size of the effect — in this case, the difference between the two averages).

Now, let’s see what happens if we use the Bonferroni-corrected test on 50 pairs, where only one has the null hypothesis false:type_II_error_p = 1 – measure_rejection_prob(5000, apply_ttest, null_hyp_status_bonferroni, alpha, n_datasets=50, n_null_true=49)print(".Measured chance probability of *not* rejecting the " "null: %.

3f" % (type_II_error_p))> 5000 out of 5000 completed (fraction of rejections so far: 0.

59)> Measured chance probability of *not* rejecting the null: 0.

410Now we have a 41% probability of not rejecting the null when we should have.

We clearly have lost a lot of sensitivity now that the difference in one pair is buried in a sample of 50 pairs.

To some extent, this is inevitable and is the price we pay for not knowing exactly where to look.

However, when trying to test all the local null hypotheses instead of the global one, things get out of hand very quickly.

In order to get an idea, let’s make several larger and larger datasets with 50 false null hypotheses each, and see how the type II error changes as a function of the number of pairs/tests m:NOTE: from now on, we will repeatedly use the concepts of Precision and Recall.

The former describes the fraction of correct “detections” (i.

e.

, null hypothesis rejected) among all detections, i.

e.

describes the purity of the output sample of our procedure.

The latter describes the fraction of null hyp.

that we have rejected among the one that we should have rejected (i.

e.

the completeness of our output sample).

# Test the Bonferroni method with alpha=0.

05methods = [('bonferroni', 0.

05)]# Number of pairs per datasetms = np.

array([70, 80, 90, 100, 120, 150, 175, 220, 280, 350, 500, 700, 1000])print("Generating %s datasets" % len(ms))# Pairs with a false null hypothesis for each datasetn_false = 50(selections, false_positives, false_negatives, global_typeI) = characterize_methods(apply_ttest, methods, ms, [n_false] * ms.

shape, niter=800, plot=True)> Generating 13 datasets> Method bonferroni with alpha 0.

05.

.

.

completedWe can see that the purity of the output sample is constant to 1.

0, but the completeness is small and it also falls very quickly as the number of tests increases.

In other words, we are detecting fewer and fewer anomalies as m increases, but the ones we detect are always correct.

The type I error probability of detecting any false positive is always below the declared α level, although very conservatively so for small m.

Can we do better?The answer is, fortunately, yes!The Holm-Sidak methodThere are several corrections that have been proposed to the vanilla Bonferroni/Sidak method.

You can find them described here.

Without going into the details of each one of them (see the Wikipedia page for that), let’s just test them:# Test different Bonferroni-like methods with alpha=0.

05methods = [('bonferroni', 0.

05), ('holm', 0.

05), ('holm-sidak', 0.

05), ('simes-hochberg', 0.

05)]# Number of pairs per datasetms = np.

array([70, 80, 90, 100, 120, 150, 175, 220, 280, 350, 500, 700, 1000])print("Generating %s datasets" % len(ms))# Pairs with a false null hypothesis for each datasetn_false = 50(selections, false_positives, false_negatives, global_typeI) = characterize_methods(apply_ttest, methods, ms, [n_false] * ms.

shape, niter=800, plot=True)> Generating 13 datasets> Method bonferroni with alpha 0.

05.

.

.

completed> Method holm with alpha 0.

05.

.

.

completed> Method holm-sidak with alpha 0.

05.

.

.

completed> Method simes-hochberg with alpha 0.

05.

.

.

completedThe new methods conserve the absolute purity of the vanilla Bonferroni and a type I error below or at the nominal value, but improve the completeness a little bit.

However, we can do a lot better than this.Let’s see how.

False Discovery Rate vs Family-Wise Error RateUp until now, our solutions to the problem of multiple hypothesis testing have been trying to keep the Family-Wise Error Rate (FWER) under control, i.

e.

the rate of type I errors committed in the entire set of m tests (what we called the global α in the plots above).

However, in the case where we expect several “detections” (i.

e.

several false null hypotheses), we can sacrifice our desire for complete purity a little and decide that we can accept a controlled amount of false positives if this helps to improve the completeness.

This is a very familiar idea for a Data Science practitioner: we can trade some precision for increased recall.

In other words, we can accept to have a certain number of “impostors” in our output sample of detections.

This is the idea behind the FDR.

Benjamini and Holdberg (1995) presented a procedure that does just that.

There, α does not represent the type I error probability anymore but rather controls the purity of the output sample (i.

e.

directly affects the Precision instead of the global α, as in our previous methods).

The (expected) Precision is guaranteed to be >1−α.

As before, we refer to the paper for details.

Here I want to illustrate the difference with respect to our earlier methods.

Let’s use the same procedure as before.

For simplicity, let’s consider only the best method for our problem among the Bonferroni-like ones, according to the previous plot (“holm-sidak”):# Let's use two values of alpha per method to illustrate# what they affectmethods = [('holm-sidak', 0.

1), ('holm-sidak', 0.

05), ('fdr_bh', 0.

1), ('fdr_bh', 0.

05)]# Number of testsms = np.

array([70, 80, 90, 100, 120, 150, 175, 220, 280, 350, 500, 700, 1000])# False null hypothesis that we are going to generaten_false = 50(selections, false_positives, false_negatives, global_typeI) = characterize_methods(apply_ttest, methods, ms, [n_false] * ms.

shape, niter=800, plot=True)> Method holm-sidak with alpha 0.

10.

.

.

completed> Method holm-sidak with alpha 0.

05.

.

.

completed> Method fdr_bh with alpha 0.

10.

.

.

completed> Method fdr_bh with alpha 0.

05.

.

.

completedWe can immediately see that the BH method provides a much larger Recall (“completeness”, second panel) by sacrificing a controlled amount of Precision (“purity”).

Indeed, as promised, the Precision is >1−α.

Going from α=0.

01 to α=0.

05 in the BH method increases the purity as expected but decreases the completeness.

Also, the global α (bottom panel) for the BH method is large and close to 1, which means that in any experiment there is a high probability of getting one or more false positives.

This is the price to pay for increasing the completeness, where we almost gain a factor of 2 especially for a large or even very large number of tests m with respect to Bonferroni-like methods.

Now, we understand the key difference between FWER-controlling and FDR-controlling methods: the former put an upper limit α on the FWER (“global” α, bottom panel), while the latter put a lower limit 1−α on the Precision (“purity,” upper panel).

Up to now, we have studied the case where the number of false null hypotheses are constant and the number of tests increases.

What happens when the number of false hypotheses increases with the number of tests.This happens, for example, when we are expanding a search to a previously-unexplored part of the parameter space when we expect the false null hypotheses/anomalies (or stellar explosions, as in our Astronomical example) to have the same density as before.

# This time we have 30% of false hypothesis for each m(selections, false_positives, false_negatives, global_typeI) = characterize_methods(apply_ttest, methods, ms, np.

ceil(0.

3 * ms).

astype(int), niter=800)> Method holm-sidak with alpha 0.

10.

.

.

completed> Method holm-sidak with alpha 0.

05.

.

.

completed> Method fdr_bh with alpha 0.

10.

.

.

completed> Method fdr_bh with alpha 0.

05.

.

.

completedResults are similar as before, but now the completeness (Recall) for the BH method is essentially constant (middle panel), independently of m.

ConclusionsWe have illustrated the problem with multiple hypothesis testing, and two very different methods of dealing with it.

Methods controlling the FWER (like the Bonferroni correction) maximize the Precision (“purity”) at the expense of Recall (“completeness” and sensitivity).

They are appropriate when the number of the expected false null hypothesis, anomalies, or detections is small and the price to pay for false positives is high, so purity is more important than completeness.

For example, they are very appropriate when looking for a new effect/new physics for the first time, as the expected number of false null hypotheses is at most one and the global null hypothesis really matters.

In this case, indeed, making even one false claim is, of course, very consequential.

However, using FWER can introduce a significant Malmquist bias where only very strong effects can be seen.

Methods controlling the FDR (like the Benjamini — Hochberg) increase the completeness (Recall) substantially with respect to FWER-controlling methods by allowing a controlled number of false positives to sneak in (i.

e.

, sacrificing Precision in a controlled way).

They are appropriate when we expect to have several false null hypotheses, anomalies, or detections and we can afford to have some false claims.

NOTE: in this illustrative example we used one single effect size for all false null hypotheses.

This is almost never the case, so the distribution of the effect size (the magnitude of the difference between the average of the two populations in our t-test example, or the brightness of the stellar explosion for our Astrophysical example) is going to obviously affect the number of anomalies/detections.

However, the general idea presented here still holds.

Using the FDR instead of the FWER allows detecting more and smaller effect sizes (increase the sensitivity/recall) at the expenses of some false positives.

Codeimport numpy as npimport pandas as pdimport scipy.

statsimport matplotlib.

pyplot as pltimport sysimport multiprocessingimport itertoolsimport functoolsfrom statsmodels.

stats.

multitest import multipletests#Let's set the random seed so the results of the notebook# are always the same at every runnp.

random.

seed(0)%matplotlib inlinedef generate_dataset(n_datasets, n_null_true, n_samples=100, seed=0): # This is to make the results predictable np.

random.

seed(seed) n_null_false = n_datasets – n_null_true w1 = [] w2 = [] null_status = [] for i in range(n_null_true): wn_1 = np.

random.

normal(loc=90, scale=10, size=n_samples) wn_2 = np.

random.

normal(loc=90, scale=10, size=n_samples) w1.

append(wn_1) w2.

append(wn_2) null_status.

append(True) for i in range(n_null_false): wn_1 = np.

random.

normal(loc=95, scale=10, size=n_samples) wn_2 = np.

random.

normal(loc=90, scale=10, size=n_samples) w1.

append(wn_1) w2.

append(wn_2) null_status.

append(False) return w1, w2, np.

array(null_status)def worker_function(i, generate_dataset_kw, test, null_hyp_status): generate_dataset_kw['seed'] = (i+1) * 1000 w1, w2, _ = generate_dataset(**generate_dataset_kw) pvalue = test(w1, w2) return null_hyp_status(pvalue, alpha) def measure_rejection_prob(n_iter, test, null_hyp_status, alpha, **generate_dataset_kw): n_rejected = 0 worker = functools.

partial(worker_function, generate_dataset_kw=generate_dataset_kw, test=test, null_hyp_status=null_hyp_status) pool = multiprocessing.

Pool() try: for i, res in enumerate(pool.

imap(worker, range(n_iter), chunksize=100)): if not res: n_rejected += 1 if (i+1) % 100 == 0: sys.

stderr.

write("
%i out of %i completed (fraction of " "rejections so far: %.

2f)" % (i+1, n_iter, n_rejected / float(i+1))) sys.

stderr.

write(".") sys.

stderr.

flush() except: raise finally: pool.

close() pool.

join() return n_rejected / float(n_iter)def worker_function2(i, generate_dataset_kw, test, method, alpha): generate_dataset_kw['seed'] = (i+1) * 1000 w1, w2, null_hyp = generate_dataset(**generate_dataset_kw) pvalues = test(w1, w2) reject, _, _, _ = multipletests(pvalues, alpha, method=method, is_sorted=False, returnsorted=False) # False positives: I rejected when I shouldn't have n_false_pos = np.

sum((reject == True) & (null_hyp==True)) # False negatives: I didn't reject when I should have n_false_neg = np.

sum((reject == False) & (null_hyp==False)) return np.

sum(reject), n_false_pos, n_false_negdef measure_detections(n_iter, test, method, alpha, **generate_dataset_kw): n_false_pos = [] n_false_neg = [] n_selected = [] worker = functools.

partial(worker_function2, generate_dataset_kw=generate_dataset_kw, test=test, method=method, alpha=alpha) pool = multiprocessing.

Pool() try: for i, (s, fp, fn) in enumerate(pool.

imap(worker, range(n_iter), chunksize=100)): n_selected.

append(s) n_false_pos.

append(fp) n_false_neg.

append(fn) except: raise finally: pool.

close() pool.

join() global_typeI = np.

sum(np.

array(n_false_pos) > 0) / float(n_iter) return (np.

average(n_selected), np.

average(n_false_pos), np.

average(n_false_neg), global_typeI)def characterize_methods(test, methods, ms, n_false, niter=800, plot=True): selections = {} false_positives = {} false_negatives = {} global_typeI = {} for method, alpha in methods: # Clear output sys.

stderr.

write("Method %s with alpha %.

2f" % (method, alpha)) s = np.

zeros(len(ms), int) fp = np.

zeros_like(s) fn = np.

zeros_like(s) gtI = np.

zeros(s.

shape, float) for i, (m, nf) in enumerate(zip(ms, n_false)): s[i], fp[i], fn[i], gtI[i] = measure_detections(niter, test, method, alpha, n_datasets=m, n_null_true=m – nf) sys.

stderr.

write(".

") selections[(method, alpha)] = s false_positives[(method, alpha)] = fp false_negatives[(method, alpha)] = fn global_typeI[(method, alpha)] = gtI sys.

stderr.

write("completed.") if plot: fig, subs = plt.

subplots(3, 1, sharex=True, figsize=(4,10), gridspec_kw={'hspace': 0.

0, 'top': 0.

95}) for key in methods: true_positives = selections[key] – false_positives[key] precision = true_positives.

astype(float) / selections[key] recall = true_positives / (true_positives + false_negatives[key]).

astype(float) label = r"%s (\$alpha\$=%.

2f)" % (key, key) _ = subs.

plot(ms, precision, label=label) _ = subs.

plot(ms, recall, label=label) _ = subs.

plot(ms, global_typeI[key], label=label) subs.

set_ylabel("Precision.(purity)") subs.

set_ylabel("Recall.(completeness)") subs.

set_ylabel(r"Global \$alpha\$") subs.

set_xlabel("Number of tests") subs.

set_xscale("log") plt.

axes(subs) plt.

legend(bbox_to_anchor=(1.

05, 1), loc=2, borderaxespad=0.

) return selections, false_positives, false_negatives, global_typeI.