Tutorial for Using Confidence Intervals & Bootstrapping

Laura E Shummon Maass
May 15

In this tutorial I will show how bootstrapping and confidence intervals can help highlight statistically significant differences between sample distributions.

(Caption: First 5 rows of albums_data.)

To start off, imagine we have a dataset called albums_data that has album reviewer names, the scores they gave to each album, and the genre each album fell under.

Since some albums have multiple genres, the genres are recorded using dummies (in other words, each genre has its own column).
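To make the dummy-column idea concrete, here is a minimal sketch using pandas. The reviewer names match the tutorial, but the rows, the scores, the `genres` column, and the `|` separator are all made up for illustration; the real albums_data may be laid out differently.

```python
import pandas as pd

# Hypothetical toy data: every value here is invented for illustration.
raw = pd.DataFrame({
    "reviewer": ["Mark", "Stephen", "Mark"],
    "score": [7.5, 6.8, 8.1],
    "genres": ["electronic", "rock|electronic", "rock"],
})

# One 0/1 column per genre; an album with two genres gets a 1 in both columns.
genre_dummies = raw["genres"].str.get_dummies(sep="|")

# Attach the dummy columns to the reviewer and score columns.
albums_toy = pd.concat([raw[["reviewer", "score"]], genre_dummies], axis=1)
print(albums_toy)
```

The second album is tagged with both rock and electronic, so it has a 1 in both genre columns.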

If we want to ask the question “Do reviewer X and reviewer Y have a statistically significant difference in how they review albums of a specific genre?” we can answer this using bootstrapping and confidence intervals.

Mark and Stephen

First, we need to define the question that we are asking.

Let’s assume we want to know whether two reviewers, Mark and Stephen, have a statistically significant difference in how they score electronic albums.

To answer this question we create two hypotheses called the null and alternative.

Our null will always be the scenario where there is no difference.

Here are our two hypotheses:

Null Hypothesis: “There is NO statistically significant difference between Mark’s electronic scores and Stephen’s electronic scores.”

Alternative Hypothesis: “There IS a statistically significant difference between Mark’s electronic scores and Stephen’s electronic scores.”

(Caption: Filtering the albums_data to only electronic albums.)

Second, we need to filter our albums_data to only include albums falling under the electronic genre.

(Caption: Creating two separate datasets for both reviewers.)

Third, we need to create two separate datasets for the two reviewers that we want to compare.

We do this so that we can reference each sample specifically in the bootstrapping formulas.

The two datasets we need are for Mark and Stephen.
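The filtering and splitting steps above can be sketched as follows. This is a rough illustration, not necessarily the original code: the column names `reviewer`, `score`, and `electronic`, the variable names `mark_elec` and `stephen_elec`, and every row of data are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for albums_data; column names and values are assumptions.
albums_data = pd.DataFrame({
    "reviewer": ["Mark", "Stephen", "Mark", "Stephen", "Mark"],
    "score": [7.5, 6.8, 8.1, 7.0, 6.9],
    "electronic": [1, 1, 0, 1, 1],
    "rock": [0, 1, 1, 0, 0],
})

# Step two: keep only the albums flagged as electronic.
electronic = albums_data[albums_data["electronic"] == 1]

# Step three: one dataset per reviewer.
mark_elec = electronic[electronic["reviewer"] == "Mark"]
stephen_elec = electronic[electronic["reviewer"] == "Stephen"]

print(len(mark_elec), len(stephen_elec))
```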

We’re almost ready to go! However, before we dive in, let’s look at the current distribution of electronic scores for Mark and Stephen.

(Caption: Current electronic score distribution, pre-bootstrapping.)

As we can see above, there is a lot of overlap between Mark and Stephen’s scores.

We can also see that Stephen’s distribution is slightly more skewed to the left than Mark’s distribution.

However, what we cannot see by looking at the graph is whether there is a statistically significant difference between the two samples.

In other words, we cannot decide whether to reject our null hypothesis just by looking at the graph above.

What is Bootstrapping

Ok, quick break.

I know I am halfway through the tutorial and haven’t given a proper explanation for what bootstrapping is.

The setup was important, but let me quickly explain the idea behind bootstrapping.

We know we want to answer the question of whether either our null or alternative hypothesis is correct.

The goal behind any hypothesis testing is to answer this question for an entire population, not just for the data we have available to us.

Most of the time we will only have a sample of the population’s data to work with.

As a result, we need to maximize the usefulness of our sample.

To do this, we repeatedly draw random mini samples from our sample data (bootstrapping draws them with replacement, so the same score can be picked more than once), calculate the mean of each mini sample, and build a new distribution from these means.

If we do 10,000 mini samples, we will have 10,000 means in this new distribution.

We do this for both of our samples (in the example above, we would do it for both Mark and Stephen).
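Here is a minimal sketch of that idea with numpy. The function name `bootstrap_means`, its parameters, and the toy scores are all hypothetical, chosen only to illustrate resampling with replacement.

```python
import numpy as np

def bootstrap_means(scores, n_resamples=10_000, sample_size=None, seed=0):
    """Return n_resamples means, each computed from a resample drawn with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    if sample_size is None:
        # Classic bootstrap: each resample is as large as the original sample.
        sample_size = len(scores)
    # Draw all resamples at once: one row per resample.
    resamples = rng.choice(scores, size=(n_resamples, sample_size), replace=True)
    return resamples.mean(axis=1)

# Made-up scores, purely for illustration.
toy_scores = [6.5, 7.0, 7.2, 7.8, 8.1, 6.9, 7.4]
means = bootstrap_means(toy_scores, n_resamples=1_000)
print(means[:5])
```

Running the same function once per reviewer gives the two distributions of means that the rest of the tutorial compares.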

That is the express explanation.

If this is a new concept for you, I promise it will all make more sense as we move through the tutorial.

Back to Mark and Stephen

So, as we just learned in the section above, our first step in bootstrapping is to take a bunch of mini samples from our datasets and calculate the means for each sample.

We do this for both Mark and Stephen individually.

Here is the code to create these lists of means.

To quickly run through what is happening, we are creating 10**4 (aka 10,000) means for each reviewer.

Each mean is calculated from 100 review scores selected at random from that reviewer’s dataset.

We do this for both Mark and Stephen.

The final output is two lists (sample_mark_elec_means and sample_stephen_elec_means) that hold 10,000 means each.
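A minimal sketch of this step might look like the following. The list names `sample_mark_elec_means` and `sample_stephen_elec_means` come from the tutorial; the score arrays are synthetic stand-ins (not the real review data), and drawing with replacement is an assumption about how the random selection is done.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for each reviewer's electronic scores (NOT the real data).
mark_elec_scores = rng.normal(loc=7.0, scale=1.2, size=300).clip(0, 10)
stephen_elec_scores = rng.normal(loc=6.5, scale=1.1, size=300).clip(0, 10)

n_means = 10**4     # number of bootstrap means per reviewer
sample_size = 100   # number of scores drawn for each mean

# Each list entry is the mean of 100 randomly drawn scores.
sample_mark_elec_means = [
    rng.choice(mark_elec_scores, size=sample_size, replace=True).mean()
    for _ in range(n_means)
]
sample_stephen_elec_means = [
    rng.choice(stephen_elec_scores, size=sample_size, replace=True).mean()
    for _ in range(n_means)
]

print(len(sample_mark_elec_means), len(sample_stephen_elec_means))
```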

Believe it or not, that’s pretty much it as far as bootstrapping complexity goes.

However, it’s not particularly useful in the current list format.

Let’s view these distributions on a graph.

Wow! That is way more telling than the previous graph we made.

Remember, the only difference here is that this graph is using the 10,000 randomly generated means we got from our bootstrapping work.

The previous graph was a distribution of all of the actual scores.

Although it looks like these distributions are pretty different, we can’t actually say whether there is a statistically significant difference.

Do you see where there is some overlap in the distributions? We need to know whether the confidence intervals for each sample overlap.

If they do overlap, we fail to reject our null hypothesis and conclude that there is not a statistically significant difference between the electronic scores of Mark and Stephen.

Confidence Intervals

Ok, the moment of truth.

Are Mark and Stephen’s electronic scores statistically significantly different? First, we need to determine whether our hypothesis is one-tailed or two-tailed.

One-Tailed: A one-tailed test requires that the difference between the samples is in a particular direction.

For example, if our alternative hypothesis were to have been “Mark’s electronic scores are statistically significantly higher than Stephen’s electronic scores” then we would use a one-tailed test.

Two-Tailed: A two-tailed test is when there is no directional difference required in the hypothesis.

Our alternative hypothesis was “There is a statistically significant difference between Mark’s electronic scores and Stephen’s electronic scores” which does not specify either sample needing to be higher than (or lower than) the other.

It just requires that there is a difference in either direction.

(Caption: Two-tailed test. Distribution sans 2.5% on each end for a 95% confidence interval.)

So we know we have a two-tailed test.

But what does this mean when it comes to confidence intervals? Well, it is probably important to clarify what a confidence interval is.

A confidence interval is a range of plausible values for the quantity we are estimating (here, each reviewer’s mean electronic score), built at a chosen confidence level.

If we want to be 95% confident that there is a difference between Mark and Stephen’s electronic scores, we use a 95% confidence interval, which leaves 5% of each distribution outside the interval.

However, as we just learned above, we have a two-tailed test.

This means that this 5% is going to be split across both tails of our sample distributions (2.5% on the left tail and 2.5% on the right tail, for both Mark and Stephen).

This range can also be written as: between 2.5% and 97.5% (97.5% is 100% minus 2.5%).

Essentially this just cuts off the ends of both Mark and Stephen’s distributions, like the image above.

(Caption: Calculating the 2.5% and 97.5% confidence interval cutoffs for both samples.)

(Caption: Results from the above code.)

Now we want to know, do Mark and Stephen’s distributions (with the 2.5% on each end cut off) overlap in our bootstrap graph? This is a quick and easy question to answer.

All we need to do is find out what the 2.5% and 97.5% values are for both Mark and Stephen.

Thankfully, the Python library numpy finds the values at these percentages for us.
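As a small sketch of how this can be done, `np.percentile` accepts a list of percentages and returns the value at each one. The array below is a synthetic stand-in for a list of bootstrapped means such as sample_mark_elec_means, so the printed numbers will not match the tutorial's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bootstrap means, used only to demonstrate the call.
sample_means = rng.normal(loc=7.0, scale=0.1, size=10_000)

# Values below which 2.5% and 97.5% of the means fall.
lower, upper = np.percentile(sample_means, [2.5, 97.5])
print(f"95% interval: {lower:.2f} to {upper:.2f}")
```

The same call would be repeated for the second reviewer's list of means.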

As we can see in the output, Mark’s 2.5% and 97.5% cutoffs are 7.13 and 7.61, while Stephen’s cutoffs are 6.67 and 7.11.

The question we ask ourselves is: “Do these distributions overlap without 2.5% on each end?”

We can see 7.11 is less than 7.13, meaning there is no overlap between the two distributions.
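The comparison above can also be written as a small check. The helper `intervals_overlap` is a hypothetical name introduced here for illustration; the cutoff values are the ones reported in the output above.

```python
def intervals_overlap(a, b):
    """Return True if intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

mark_ci = (7.13, 7.61)      # Mark's 2.5% and 97.5% cutoffs from the tutorial
stephen_ci = (6.67, 7.11)   # Stephen's 2.5% and 97.5% cutoffs from the tutorial

print(intervals_overlap(mark_ci, stephen_ci))  # False, because 7.11 < 7.13
```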

In other words, we can reject our null hypothesis and conclude, with 95% confidence, that there is a statistically significant difference between the electronic scores of Mark and Stephen.

That was a lot.

I hope this tutorial was useful for you.

Remember, this was a two-tailed test.

If it had been a one-tailed test (hypothesizing that one sample was higher than the other), we would put the entire 5% into one tail.

Which tail we put it into would depend on which sample we were hypothesizing to be larger than the other.
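As a rough sketch of what putting the entire 5% into one tail looks like in numpy, the cutoff becomes a single percentile: the 5th to trim the lower tail, or the 95th to trim the upper tail. The helper `one_tailed_cutoff` and the synthetic means are assumptions made for illustration, and the sketch does not prescribe which tail goes with which reviewer.

```python
import numpy as np

def one_tailed_cutoff(sample_means, tail, alpha=0.05):
    """Return the percentile value that cuts alpha off a single tail."""
    if tail == "lower":
        return np.percentile(sample_means, 100 * alpha)
    if tail == "upper":
        return np.percentile(sample_means, 100 * (1 - alpha))
    raise ValueError("tail must be 'lower' or 'upper'")

rng = np.random.default_rng(1)

# Synthetic bootstrap means, used only to demonstrate the call.
sample_means = rng.normal(loc=7.0, scale=0.1, size=10_000)

lower_cut = one_tailed_cutoff(sample_means, "lower")  # 5% of means fall below this
upper_cut = one_tailed_cutoff(sample_means, "upper")  # 5% of means fall above this
print(f"lower-tail cutoff: {lower_cut:.2f}, upper-tail cutoff: {upper_cut:.2f}")
```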

Summary

I want to very quickly run through the process one last time.

First, we determine what question we want to answer and create the appropriate null and alternative hypotheses.

Second, we create the two datasets for the two samples that we want to compare.

Third, we create our bootstrapping mean lists for each sample.

Fourth, we visualize the bootstrapped mean distributions.

Fifth, we calculate the confidence intervals (either 2.5% on each end for two-tailed, or 5% on a specific end for one-tailed).

And finally, we compare the confidence intervals and look for any overlap.

If there is no overlap, we can reject our null hypothesis and accept our alternative hypothesis.

If there is overlap, we fail to reject our null hypothesis.