Ultimate causality is hard so I’m going to recast that as “what predicts whether the newspaper someone reads most often is a Broadsheet or a Tabloid?”.
To answer that, I tossed the entirety of the British Election Study online panel dataset (a dataset with literally hundreds of political, demographic, attitudinal and personal variables) at a machine learning classification algorithm to see which questionnaire variables floated to the top as predictors of whether people’s preferred newspaper was a Broadsheet (→) or a Tabloid (←), as categorised by the above diagram.
Basically, dynamite fishing.
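For the curious, here is a minimal sketch of how the prediction target might be set up before any classifier gets involved. It is not the actual notebook code: the field name `preferred_paper` and the two-entry `PAPER_TYPE` map are assumptions for illustration (only the two papers named in this post are included; the real categorisation comes from the diagram).

```python
# Hypothetical category map: the real one comes from the diagram above.
# Only the two papers mentioned by name in this post are listed here.
PAPER_TYPE = {
    "The Guardian": "Broadsheet",
    "The Sun": "Tabloid",
}


def build_target(respondents):
    """Return (features, label) pairs; label is 1 for Broadsheet, 0 for Tabloid.

    Respondents whose preferred paper is not in PAPER_TYPE
    (regional / None / Other) are dropped from consideration.
    """
    rows = []
    for person in respondents:
        paper_type = PAPER_TYPE.get(person.get("preferred_paper"))
        if paper_type is None:
            continue  # uncategorised paper: drop this respondent
        # Everything except the paper itself is a candidate predictor.
        features = {k: v for k, v in person.items() if k != "preferred_paper"}
        rows.append((features, 1 if paper_type == "Broadsheet" else 0))
    return rows


# Made-up respondents, purely to show the shape of the data.
respondents = [
    {"preferred_paper": "The Guardian", "age": 34},
    {"preferred_paper": "The Sun", "age": 61},
    {"preferred_paper": "None", "age": 45},
]
labelled = build_target(respondents)
```

The resulting list of features and 0/1 labels is the kind of input a classification algorithm would then be fitted on.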
Here’s the output (via an algorithm that makes machine learning results that bit less opaque):

Choose Your Own Blog Adventure: Do you want to know how to read the opaque graph above (read below), or do you just want to skip to the breakdown (scroll down)?

Do Want To Read The Graph: Top is most important, bottom is least important.
Each blob in each row is a person, coloured by the value of that variable and positioned by the *impact* of that variable on the final prediction in the fitted model (which turned out to be 80–90% accurate, pretty good for real-world data).
The variables are (mostly) guessable at first glance, but here is the link to the pdf showing which variables relate to which questions — I’ve automatically modified the variable names for (slightly) enhanced legibility.
I’ll take the top variable as an example: Still at school/20+ is bright red/red (you participated in Higher Education).
Bluer than that means you left education earlier, at ages 19 down to 15.
HE participation => more likely to prefer Broadsheet; no HE participation => more likely to prefer Tabloid.
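If "impact of a variable on the prediction" sounds hand-wavy, here is a minimal sketch of the idea behind it: Shapley values, computed by brute force for one person. Everything in it is an assumption for illustration (the toy scoring function, its made-up coefficients, the feature names and the baseline person); it is not the fitted model, and the SHAP module computes these values far more efficiently than enumerating every ordering.

```python
from itertools import permutations


def toy_model(x):
    """Made-up score (higher = more Broadsheet-leaning); NOT the fitted model."""
    return (2.0 * x["higher_education"]
            - 0.02 * x["age"]
            + 0.5 * x["higher_education"] * x["lives_in_london"])


def shapley_values(model, person, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which the person's features could be revealed."""
    names = list(person)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)      # start from the baseline person
        previous = model(current)
        for name in order:
            current[name] = person[name]   # reveal one real feature value
            score = model(current)
            totals[name] += score - previous
            previous = score
    return {name: total / len(orderings) for name, total in totals.items()}


# Hypothetical respondent and hypothetical "average" baseline.
person = {"higher_education": 1, "age": 30, "lives_in_london": 1}
baseline = {"higher_education": 0, "age": 50, "lives_in_london": 0}

values = shapley_values(toy_model, person, baseline)

# Rank features by absolute impact for this one person.
ranked = sorted(values, key=lambda name: abs(values[name]), reverse=True)
```

The per-feature values always add up to the gap between this person's score and the baseline score, which is what lets each blob be read as "how far this variable pushed this person's prediction". The graph orders variables using impacts across everyone; ranking a single person's values here is a simplification.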
Don’t Want To Read The Graph: Here’s a roughly clustered/ordered breakdown of the above variables:

Higher Education →
Age ←
Social Conservatism (aka Authoritarianism), particularly death penalty/tougher sentences ←
Class self-identification WC ← → MC
Atheism →
Political knowledge/attention variables →
Immigration/black&female equality →
Live in London/high household income →
2005GE Labour voter ←
Energy Price Cap (text of question not in pdf, but reasonable to assume *for*) ←
Heard stuff about euRef during 2016 campaign from radio →

Not a lot of surprises here: age, education, social conservatism and immigration/equality sentiment are all known to correlate heavily (although above we see their separate impacts, ~effectively controlling for each other).
There’s nothing here that automatically tells you which way causation is going (well, I assume picking up a copy of the Guardian doesn’t teleport you to London, and reading The Sun doesn’t cause you to travel back in time to vote Labour in 2005).
The “2005GE Labour vote” link to Tabloid readership is probably indirectly relevant for contemporary debates about when Labour started to lose hold of its Socially Conservative vote (more on that in the next blog), but note that the 2005GE is the *earliest* one in the data (i.e. the significant split point may well have been earlier).
Next Up: UK Party Support by Political Compass Position: Where it Does and Does Not Match

Recommendation: If you liked this, you’d probably enjoy Chris Hanretty’s blog *even more*

Code (&Data):
Yougov Data/Network Diagram (pretty human readable notebook)
Machine Learning Dataset Dredging Code (a horrible mess of mid-refactored uncommented code, and you’d have to run other notebooks in order to prep the BES data)

I may also have been looking for an excuse to experiment with network diagramming software (in this instance, Gephi)
Xgboost classification
Restricted to people who had a preferred paper and whose preferred paper was in that Newsticles diagram; people whose preferred newspaper was regional/None/Other newspapers were dropped from consideration
Scott Lundberg’s SHAP python module
We’re only looking at the top 30 because the further you go down the more you’re probably just looking at statistical noise.
This is only really the exploratory phase of a serious analysis
During processing I’ve automatically added text to the end of variables:
ordinal variables: “blah” becomes “blah__bleh”, where “bleh” is the top category, e.g. “lrUKIPW2__Right” means that a high value means that the respondent thinks UKIP is a right-wing party
categorical variables: “blah” becomes “blah_bleh”, which means the variable had 3+ non-ordered categories, each of which became a separate binary variable, e.g. “subjClassW2_W4W7W9_Yes, working class” means that respondents were asked whether they thought of themselves as belonging to any particular class and “Yes, working class” was one of the options; a high value means they chose this option, a low value means they didn’t
variables that were already numbers (like Age/age) were just left as they are (the convention on age going *up* is widely understood)
WXWY_WZ refers to the waves in which these variables were sampled (i.e. you can probably ignore it)
The use of underscores before the end of the variable is from the original variables, e.g. “profile_past_vote_2005_Labour Party” came from a categorical variable called “profile_past_vote_2005”
“Can’t remember” I have classified as a “don’t know” response (I have a big list of “weasel answers” that automatically replaces all Don’t Know/Don’t Know-alike answers with the code for “no response”) so it appears as gray (if I didn’t, how would you deal with questions where all answers but the DK option are clearly ordered?).
 Worth stressing that this is self-identification — if your self-identification contradicts where the National Statistics Socio-economic Classification would place you they won’t come round to your house and beat you with sticks until you correct your ways (of course, once the UK leaves the EU, there will be many new opportunities).
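The variable-renaming convention described in the notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual processing code: the function names, the two-item `WEASEL_ANSWERS` set, the three-step Left/Centre/Right scale and the “Conservative Party” category are all hypothetical, and using `None` for “no response” is just one way to represent a missing answer.

```python
# Hypothetical subset of the "weasel answers" list; the real list is much longer.
WEASEL_ANSWERS = {"Don't know", "Can't remember"}


def encode_ordinal(name, categories, answers):
    """Ordinal variable: 'blah' becomes 'blah__bleh', where 'bleh' is the top category.

    Answers are coded 0..n-1 in category order; weasel answers become None.
    """
    codes = {category: i for i, category in enumerate(categories)}
    new_name = f"{name}__{categories[-1]}"
    coded = [None if a in WEASEL_ANSWERS else codes[a] for a in answers]
    return new_name, coded


def encode_categorical(name, answers):
    """Categorical variable: one binary 'blah_bleh' column per unordered category.

    Weasel answers are treated as missing (None) in every binary column.
    """
    cleaned = [None if a in WEASEL_ANSWERS else a for a in answers]
    categories = sorted({a for a in cleaned if a is not None})
    return {
        f"{name}_{category}": [None if a is None else int(a == category) for a in cleaned]
        for category in categories
    }


# Assumed three-step scale, purely for illustration.
ordinal_name, ordinal_codes = encode_ordinal(
    "lrUKIPW2", ["Left", "Centre", "Right"], ["Right", "Don't know", "Left"]
)
# ordinal_name  -> "lrUKIPW2__Right"
# ordinal_codes -> [2, None, 0]

dummies = encode_categorical(
    "profile_past_vote_2005", ["Labour Party", "Conservative Party", "Can't remember"]
)
# dummies["profile_past_vote_2005_Labour Party"] -> [1, 0, None]
```

Treating the don’t-know answers as missing keeps the ordinal codes cleanly ordered, which is the point of the “weasel answers” list.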