Typically, the answers are just as speculative as the questions imply, but this year we have added an extra piece based on our own data analysis.
We set out to use statistics to provide an empirical answer to the following question: what would happen in American presidential elections if voting were mandatory, as it is in some other countries? This set off a months-long inquiry that was much more difficult than we anticipated.
Early on, we found that some scholars have proposed specific answers to this question, but they all either defined it slightly differently, did not provide enough data for a visualisation, or both.
This ruled out a piece that relied solely on existing academic research.
So we decided to do it ourselves.
We soon discovered that none of the commonly used computational tools in our arsenal would do the trick.
Simple summary statistics and regression analysis were obviously not enough; although we could use public polling to make predictions for individual citizens, the country doesn’t elect its president by popular vote.
Even popular machine-learning algorithms were not sufficiently suited for the task.
In the end, we needed to estimate how demographic variables, like race and education, interact with geography and voting behaviour.
This is one example of the output of our model

Some models are useful

Ultimately, what we needed was a technique to make predictions for each state under varying degrees of voter turnout, to figure out the electoral-college winner under a system of mandatory voting.
The method would have to account for many factors, such as increasing turnout among minorities, who vote less often but lean to the left, and higher turnout among whites without degrees, who lean to the right.
We also had to answer the crucial question of whether a voter and a non-voter with the same demographic profile would vote in similar ways (for the most part, they do).
More questions popped up along the way.
A solution was lurking in the background, but The Economist had never attempted it before: a statistical method, popular among leading quantitative social scientists, called “multi-level regression and post-stratification” (MRP, or “Mr P” among its super-fans).
It involves combining national polls with information about individual voters to make predictions at different geographic levels.
Thanks to American political scientists, all of the necessary data are readily available, and the method has good documentation.
In the interests of methodological transparency, I have outlined our approach below.
MRP allowed us to predict how many extra votes Clinton and Trump would have won from non-voters in different demographic groups.
This was a prototype graphic that broke down the differences by state

Problems of induction

To use MRP, one starts with polling data about the voting habits of a medium-to-large number of individuals.
In our case, these data came from a national poll of 64,600 Americans called the Co-operative Congressional Election Study (CCES), which is conducted every two years and led by researchers at Harvard University.
We decided to focus our attention on the 2016 election, in which small changes in turnout would have made a big difference; Hillary Clinton lost the electoral college by just 78,000 votes.
The CCES provides detailed demographic data about all of its interviewees.
We can tell, for example, that 75% of its adult respondents are white, 12% are black and just under 51% are female.
But we can also combine categories; 52% are white and don’t have a college degree, according to the CCES, while 10% are men younger than 30.
This would come in handy later.
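As a toy illustration of how such joint categories are computed from individual-level survey data (the rows and numbers below are hypothetical, not real CCES records):

```python
# Hypothetical survey rows, standing in for CCES respondents.
respondents = [
    {"race": "white", "college": False, "sex": "M", "age": 25},
    {"race": "white", "college": True,  "sex": "F", "age": 45},
    {"race": "black", "college": False, "sex": "F", "age": 60},
    {"race": "white", "college": False, "sex": "M", "age": 28},
]

def share(rows, predicate):
    """Fraction of rows satisfying a predicate, e.g. a joint category."""
    return sum(predicate(r) for r in rows) / len(rows)

# Combined categories, as described above: e.g. white without a college
# degree, or men younger than 30.
white_noncollege = share(respondents, lambda r: r["race"] == "white" and not r["college"])
young_men = share(respondents, lambda r: r["sex"] == "M" and r["age"] < 30)
print(white_noncollege)  # 0.5
print(young_men)         # 0.5
```

The same predicate-counting idea scales to a 64,600-person survey; real survey work would also apply the survey's weights.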
The CCES also includes data on whether Americans voted and, if so, who they preferred for president.
In the data, supporters of Mrs Clinton amount to 48% of all respondents, whereas supporters of Mr Trump clock in at 46% (both are the same percentages that the candidates won in 2016).
Crucially, the researchers in charge have taken the extra step of validating respondents’ turnout with the actual record of whether they voted.
This way, we can properly predict which Americans are likely to be actual voters.
No one who said they voted, but actually didn’t, is treated as an actual voter.
A random slice of the formatted CCES data

With the CCES alone, we could assess the relationship between demographics, turnout and vote choice.
But due to small sample sizes in select states — only 115 Alaskans filled out the survey — we could not make reliable state-level projections.
To do so, we needed to know precisely what types of voters live in which states, and in what numbers; states with more non-white Americans will be more favourable to Mrs Clinton, for example, while those with more whites without college educations will tilt toward Mr Trump.
Fortunately, the US Census Bureau provides this information in the form of the American Community Survey (ACS), which is carried out every year and which features interviews with millions of Americans all over the country.
I crunched the Census Bureau’s data by having my laptop ingest a random, representative sample of 175,000 people surveyed by the ACS and calculating how common each demographic group is in each state.
We could find out, for example, that roughly 1.4% of all Floridians are older than 65, female and have no more than a high-school education.
About 13% of all Texans are middle-aged whites without a college degree.
This way, our target population contains the same demographic data present in the CCES — necessary for prediction purposes — but we also have the most precise numbers available on what types of people live in each state.
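The tallying step can be sketched as follows (toy rows and weights, not real ACS microdata): count up each demographic cell within each state, using the survey weights so the sample reflects the true population.

```python
from collections import defaultdict

# Hypothetical ACS-style rows: each person has a state, a demographic
# cell (age band, sex, education) and a survey weight.
acs_rows = [
    {"state": "FL", "cell": ("65+", "F", "HS or less"),   "weight": 2.0},
    {"state": "FL", "cell": ("65+", "F", "HS or less"),   "weight": 1.0},
    {"state": "FL", "cell": ("30-44", "M", "College"),    "weight": 3.0},
    {"state": "TX", "cell": ("45-64", "M", "HS or less"), "weight": 4.0},
]

cell_weight = defaultdict(float)   # (state, cell) -> weighted count
state_weight = defaultdict(float)  # state -> total weighted count
for row in acs_rows:
    cell_weight[(row["state"], row["cell"])] += row["weight"]
    state_weight[row["state"]] += row["weight"]

# Share of each demographic cell within its state.
cell_share = {key: w / state_weight[key[0]] for key, w in cell_weight.items()}
print(cell_share[("FL", ("65+", "F", "HS or less"))])  # 0.5
```

With real data this yields statements like the ones above: what fraction of Floridians fall into each age-sex-education cell.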
A random slice of the ACS data after it has been formatted to match the CCES

What would you do if you had all the data?

With data-wrangling finished, we could then move on to training a multi-level regression model (the “Mr” of “Mr P”) on the relationship between demographics and candidate preference (per the CCES).
There are several packages for R, a statistical programming language, that enable the training of these complex models, and we tried them all.
Below I’ve shown what the code looks like for a particular package called “rstanarm” that lets us interface with a separate language for Bayesian statistics called Stan.
A sample of the code I ran to fit one of our models

Over thousands of iterations, the model gradually learns the relationships between demographics and political behaviour.
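Our actual model was a Bayesian multilevel regression fitted in Stan; but the essence of the multi-level part — partial pooling — can be caricatured in a few lines of Python. The function, the `prior_strength` parameter and the numbers below are my own illustrative assumptions, not our production code:

```python
def partial_pool(successes, trials, overall_rate, prior_strength=10.0):
    """Shrink a cell's raw rate toward the overall rate. Cells with few
    observations are pulled further toward the overall rate; large cells
    keep roughly their raw rate. (A caricature of what a multilevel
    model does, not rstanarm itself.)"""
    return (successes + prior_strength * overall_rate) / (trials + prior_strength)

overall = 0.48  # e.g. a candidate's overall share among respondents

# Two cells with the same 60% raw rate, but very different sample sizes.
small_cell = partial_pool(3, 5, overall)       # only 5 respondents
large_cell = partial_pool(600, 1000, overall)  # 1,000 respondents
print(round(small_cell, 3))  # 0.52
print(round(large_cell, 3))  # 0.599
```

This is why MRP can say something sensible about the 115 Alaskans in the sample: sparse cells borrow strength from the national pattern instead of being taken at face value.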
We can use those relationships to predict the voting habits of each demographic group in each state (per the ACS).
The aforementioned female Floridian seniors are predicted to vote for Donald Trump over Hillary Clinton by five percentage points, for example.
We compute the same for each of the tens of thousands of demographic groups in our data.
Once that has finished, all that’s left is to calculate the estimates for each state.
This is done by adding up (or “post-stratifying” — the “P” of “Mr P”) the predicted number of Clinton voters in each group in each state.
We obtain her vote share in each state by dividing the number of eligible voters favouring Mrs Clinton by the total number of adult citizens who live there.
The same is done for Mr Trump.
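The post-stratification arithmetic itself is simple. A sketch with toy numbers (the cells and shares below are invented for illustration, not our actual estimates):

```python
# Toy post-stratification for one state: multiply each cell's predicted
# Clinton support by the number of adult citizens in that cell, sum,
# and divide by the state's total adult population.
cells = [
    # (predicted Clinton share, number of adult citizens in cell)
    (0.90, 100_000),
    (0.35, 300_000),
    (0.55, 200_000),
]

clinton_votes = sum(share * n for share, n in cells)
adults = sum(n for _, n in cells)
clinton_share = clinton_votes / adults
print(round(clinton_share, 3))  # 0.508
```

The real calculation is identical in shape, just summed over tens of thousands of cells per state, and repeated for Mr Trump.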
Since we are only concerned with votes for Mrs Clinton and Mr Trump (third parties were excluded from this analysis for computational reasons, though in testing this made little difference), electoral votes are allocated to whichever candidate is projected to win more than 50% of votes in a given state.
Probabilities of victory are derived by simulating each state’s outcome thousands of times, accounting for the errors from predictions made by a similar model we built to make ex post facto predictions of the actual results of the 2016 presidential election.
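A toy version of that simulation step, assuming (for illustration only) normally distributed prediction errors and invented numbers:

```python
import random

def win_probability(mean_share, error_sd, n_sims=10_000, seed=0):
    """Simulate a state's two-party vote share many times with normal
    error around the model's prediction; return the fraction of
    simulations in which the candidate clears 50%. (A sketch of the
    error-propagation idea, not our actual simulation code.)"""
    rng = random.Random(seed)
    wins = sum(rng.gauss(mean_share, error_sd) > 0.5 for _ in range(n_sims))
    return wins / n_sims

# A predicted 52% share with 2 points of error is far from a sure thing.
p = win_probability(0.52, 0.02)
print(p)  # roughly 0.84
```

Repeating this for every state and tallying electoral votes in each simulation yields an overall probability of victory.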
The results are presented in this week’s Graphic Detail piece:

The main chart from the print article

First, second and …n principles

From start to finish, our approach was not an easy one.
Although the process resembled that of a typical social-science research article, the time frame was much more compressed: journalistic demands required that the work be completed within roughly a month and a half.
Should anyone want to repeat the method I have described above, they might want to keep a few things in mind.
First, familiarity with concepts like Bayesian statistics was important to our approach (several of our Data Team members, myself included, are sticklers for uncertainty), but it is not strictly necessary. Other R packages exist to accomplish nearly identical tasks — in fact, we ended up using one of them, “lme4”, to compute the final data because it generated identical point predictions.
But either way, an understanding of subjects like public opinion polling, survey weights and American voting behaviour is crucial.
Had we not completed similar projects before, this one would have taken even longer.
Second, MRP is an effective tool to extract reliable estimates of state-level opinion from national polling, but it is not perfect.
Even with a validated record of who voted in 2016, the model still cannot precisely predict the election; the average absolute error in our predictions of Hillary Clinton’s state-level vote share was just under two percentage points.
Predictions made before the election, without the knowledge of who actually voted, could have had larger errors.
The quality of the national survey is key; you cannot weight your way out of unrepresentative data.
Finally, there is a certain utility in pursuing a complex approach, but a parsimonious one that accomplishes the same task with as few bells and whistles as possible will make things much easier to explain to the reader.
As that is our ultimate goal at The Economist, I did not do things like extract probabilities from posterior predictive distributions, include random effects terms with varying slopes or other such fanciness that a reader will only interpret as sociological gobbledygook, if they are communicated at all.
That being said, this is not an endeavour for Ockhamites; there is danger in being too simplistic.
What you may be asking yourself after reading all of this text

Story time

In the end, the madness was worth it.
Our team produced a phenomenal story.
The finished product is a highly detailed answer to the question of how America’s political landscape would change if every adult citizen had been required to vote in its most recent presidential election.
We have quantified for the reader just how left-leaning America’s non-voters are.
We have shown how an increase in voter turnout would produce varying political swings in states with different populations of whites and non-whites, holders of college degrees and high-school diplomas, millennials and baby boomers, etc.
And although the numbers didn’t make it onto the page — we had fewer than 300 words to work with in this week’s chart-filled Graphic Detail — we were also able to show the persistence of a built-in electoral advantage for working-class whites in America, a frequently covered topic in this newspaper.
Finally, we have provided a data-driven answer to a quintessential Economist “What If?” question — something rarer in the era preceding this newspaper’s data team.
Elliott Morris is a data journalist at The Economist.
You can follow The Economist’s Data Team on Twitter.