How Smart is Your News Source?
Text Data Analysis of 21 Different News Outlets
Michael Tauberg
Jan 12
I think it’s more important than ever to understand the perspectives and biases of our news sources.
Unfortunately there is just so much news¹ that it is almost impossible for us to escape our tiny filter bubbles.
Luckily, the same technology that got us into this mess can help us navigate it.
Using computers, it’s possible to get a broad view of multiple news sources and to see what areas they focus on most.
It’s also fun to see how the writing styles of different outlets differ.
While we’ll need many more advancements in natural language processing (NLP) to really get a handle on news bias, there are some fun analyses we can do now.
Towards that end, I’ve used the Python Newspaper library² to collect as many articles as I could from 21 different news outlets over the past 6 months.
Here are some interesting ways that they differ.
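As a rough sketch of what that collection step can look like (not the exact code used for this article), the snippet below uses the Newspaper library's `build`, `download` and `parse` calls. The function names `build_source` and `collect_articles`, the record layout, and the `limit` parameter are assumptions made for illustration.

```python
def build_source(url):
    """Build a Newspaper source for one outlet's home page (needs network access)."""
    import newspaper  # third-party: pip install newspaper3k

    # memoize_articles=False asks the library not to skip articles seen on earlier runs
    return newspaper.build(url, memoize_articles=False)


def collect_articles(source, outlet, limit=50):
    """Download and parse up to `limit` articles, returning plain dict records."""
    records = []
    for article in source.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception:
            # Paywalls, timeouts and malformed pages are common; skip and move on
            continue
        if article.text:
            records.append(
                {"outlet": outlet, "headline": article.title, "text": article.text}
            )
    return records
```

Keeping the download loop separate from `build_source` makes it easy to try the loop on stand-in objects before pointing it at a real site.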
Sentiment of the News
One of the easy and interesting things to look at when it comes to news is story sentiment.
Using the Python VADER library³, we can score all stories from different publications and measure what their average sentiment is.
Positive numbers indicate more upbeat language, while negative scores suggest dark and negative writing.
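A minimal sketch of this kind of scoring is shown below. It assumes the per-story score is VADER's `compound` value (which ranges from -1 to 1) and that stories arrive as `(outlet, text)` pairs; both are assumptions for illustration, as are the helper names `vader_scorer` and `average_sentiment`.

```python
from collections import defaultdict
from statistics import mean


def vader_scorer():
    """Return a function mapping text to VADER's compound score in [-1, 1]."""
    # third-party: pip install vaderSentiment
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def average_sentiment(stories, score):
    """Average a per-story sentiment score for each outlet.

    `stories` is an iterable of (outlet, text) pairs; `score` maps text to a number.
    """
    by_outlet = defaultdict(list)
    for outlet, text in stories:
        by_outlet[outlet].append(score(text))
    return {outlet: mean(scores) for outlet, scores in by_outlet.items()}
```

Calling `average_sentiment(stories, vader_scorer())` would then give one average per outlet. Passing the scorer in as an argument also means the same function works for headlines or for a different sentiment tool.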
As expected, the fluff of USA Today scores quite positive, while the conspiracy theories on Infowars are the most negative.
Surprisingly, most news articles are not strictly negative (although Russia Today, Breitbart and Buzzfeed seem to skew that way).
If we look just at news headlines, we see that they are more negative than story content.
Bad news gets more attention after all.
Again, Infowars headlines are the most negative, while only the Wall Street Journal scores positive.
Readability of the News
Another simple measure of news writing is its readability.
There are multiple systems that have been developed over the years to measure how easy something is to read.
Below I’ve used the Python textstat library⁴ to compare the reading difficulty of various publications.
Flesch–Kincaid Readability Testing⁵
The Flesch-Kincaid readability test is one of the most popular.
It creates a score based on the number of words per sentence and on the number of syllables per word, i.e. long words and long sentences are harder to read.
It then converts this score to a grade level.
Note that this result has nothing to do with the content of the sentences analyzed.
It is solely based on the length of words/sentences.
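To make the formula concrete, here is a small self-contained sketch of the Flesch-Kincaid grade level. The coefficients are the standard published ones; the sentence splitter and the vowel-group syllable counter are crude stand-ins chosen for illustration, so the numbers will not match textstat exactly.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Grade level from average sentence length and average syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Very short, simple text can produce a grade below zero, which is a reminder that the score depends only on lengths and not on meaning.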
Using this method, we see that MSNBC has the highest grade level, making it the hardest to read.
The BBC is at the opposite end of the spectrum and can be read comfortably with a 10th grade education.
This is likely because most BBC stories are short and informative (with fewer meandering editorials).
Dale-Chall Readability Test⁶
The Dale-Chall formula also uses sentence length, but it takes the difficulty of the words themselves into account.
It keeps a list of “easy words” that a 4th grader should understand.
Using this method, the more words that aren’t on this list, the harder something is to read.
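The sketch below shows how such a score can be computed, using the commonly cited form of the Dale-Chall formula. The `EASY_WORDS` set is a tiny hypothetical stand-in for the real word list (which has roughly 3,000 entries), so it is only useful for seeing how the pieces fit together.

```python
import re

# Tiny stand-in for the real Dale-Chall list of familiar words (assumption)
EASY_WORDS = {"the", "a", "cat", "sat", "on", "mat", "dog", "ran", "to", "and", "is", "it"}


def dale_chall_score(text, easy_words=EASY_WORDS):
    """Raw Dale-Chall score from percent difficult words and words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    difficult = sum(1 for w in words if w not in easy_words)
    pct_difficult = 100.0 * difficult / len(words)
    score = 0.1579 * pct_difficult + 0.0496 * len(words) / len(sentences)
    if pct_difficult > 5:
        # Adjustment applied when more than 5% of the words are off the list
        score += 3.6365
    return score
```

A real implementation would also need to handle plurals and verb endings when checking words against the list, which this sketch ignores.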
Based on this system, we again see that MSNBC is the least readable.
Except now the New York Times (NYT) and the Wall Street Journal (WSJ) move up the list with scores around 8.
This means they require an 11th or 12th grade education to fully comprehend.
Smog Grade⁷
The Smog grading system uses the number of polysyllables (words with 3 or more syllables) to assign a difficulty grade.
Here again we see MSNBC, Breitbart, and Politico use the most long words, while the BBC and LA Times use simpler language.
Gunning fog index⁸
The Gunning fog index also uses words with many syllables as a measure of reading difficulty, but the formula is different.
With this new method, USA Today takes the top spot as easiest to read.
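To show how the two polysyllable-based measures differ, here is a side-by-side sketch using their standard published formulas. The syllable counter is the same crude vowel-group heuristic as an assumption, and the fog calculation here counts every word of 3 or more syllables as complex (the full definition excludes some proper nouns and common suffixes).

```python
import math
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def _sentences_and_words(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return sentences, words


def smog_grade(text):
    """SMOG: square root of polysyllables normalised to 30 sentences."""
    sentences, words = _sentences_and_words(text)
    poly = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(poly * 30 / len(sentences)) + 3.1291


def gunning_fog(text):
    """Gunning fog: sentence length plus percentage of complex words, times 0.4."""
    sentences, words = _sentences_and_words(text)
    poly = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100 * poly / len(words))
```

SMOG ignores sentence length except as a normaliser, while fog adds it in directly, which is one reason the two rankings do not agree.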
Words Per Story
Finally, though not a strict readability test, we can get a sense of the complexity of a news outlet’s reporting by measuring how long its stories are.
Note — some news sites are paywalled so it wasn’t always possible to scrape full stories.
Also some websites require clicking “read more” buttons to get the full text.
These sources have been removed.
We see that Vox, with their mission to explain the news, has by far the longest articles (1427 words).
Politico and Buzzfeed are next with averages around 1000 words per story.
If we could include the paywalled sites (the New York Times, LA Times, and Washington Post), they would also be close to 1500 words per article.
Personally, I think that the New York Post is the perfect paper and that 500 words is just right for reporting.
Editorials tend to be longer.
Content of the News
Unfortunately NLP tools are not yet advanced enough to easily suss out political biases.
Nevertheless we can get an idea of an outlet’s political slant by looking at its news headlines.
Below are word clouds based on the frequency of terms used in news headlines (large words are the ones that are used most often).
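The counts behind a word cloud like this can be sketched with a simple term counter. The `STOPWORDS` set and the function name `headline_term_counts` are assumptions for illustration; a real run would use a much fuller stopword list.

```python
import re
from collections import Counter

# Short illustrative stopword list (assumption); real lists are much longer
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "is", "at", "with", "as", "by"}


def headline_term_counts(headlines, stopwords=STOPWORDS):
    """Count how often each non-stopword term appears across a list of headlines."""
    counts = Counter()
    for headline in headlines:
        for word in re.findall(r"[a-z]+", headline.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts
```

`headline_term_counts(headlines).most_common(20)` would list an outlet's twenty most frequent headline terms, which is the information a word cloud displays as font size.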
The three main national news outlets seem to be mostly Trump coverage machines (I do worry about their future business models).
The Kavanaugh confirmation stories are the second most popular with all three sources, while the Mueller investigation is also very big on MSNBC.
The word clouds for the other major news sources are similar.
While they are all obsessed with Trump, they also cover other subjects.
The Wall Street Journal and Washington Post have many stories about Saudi Arabia, while the New York Times has lots of #MeToo coverage (‘accused’, ‘man’, ‘sex’).
NPR still talks Trump a lot, but USA Today spends most of its ink on “best deals” stories.
Meanwhile, the BBC is rightly focussed on Brexit and global politics.
We can also see that newer internet-only publications have a better spread of coverage (except for Politico).
The Huffington Post seems to publish a lot of Greek and Spanish-language content, so its results were mixed up with words from other languages.
Vox’s main stories are ‘explainers’ (on the usual topics of Trump, Saudi Arabia and Kavanaugh) while Breitbart and the Daily Caller devote a lot of coverage to the US-Mexico ‘border’.
Finally, local papers like the LA Times, Boston Globe and New York Post focus on local news (‘California’, ‘Boston’, ‘NYC’) in addition to Trump.
They also have lots of old-fashioned general interest stories (‘home’, ‘school’, ‘man’, ‘woman’ terms are common).
Word2Vec Analysis
Since ‘Trump’ is the dominant item in the news, I thought it would be fun to see how he is viewed by different outlets.
Using a technique called Word2Vec⁹ it is possible to see what words are considered similar to Trump by news outlets.
Below we see that the words surrounding ‘Trump’ are similar to the ones used to describe ‘Obama’, ‘Bush’, ‘Putin’, ‘Xi’ (Jinping), ‘Bolsonaro’, ‘Duterte’, and ‘Macron’.
It seems that outlets on both the left and the right equate Trump with strong (or maybe militant) leaders around the globe.
Note — the fun fact that Neil ‘Gaiman’ got mixed into these results shows that this technique may not be well suited for this type of analysis.
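For readers curious about the mechanics, here is a minimal sketch of a similar-words lookup. It assumes the gensim library (4.x API) rather than whatever implementation was actually used, treats each text as a single "sentence", and uses illustrative parameter values; `tokenize` and `similar_words` are hypothetical helper names.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def similar_words(texts, target="trump", topn=10):
    """Train a small Word2Vec model on `texts` and list the words nearest to `target`."""
    # third-party: pip install gensim (4.x API assumed)
    from gensim.models import Word2Vec

    sentences = [tokenize(t) for t in texts]
    model = Word2Vec(
        sentences=sentences,
        vector_size=100,  # dimensionality of the word vectors
        window=5,         # context words considered on each side
        min_count=5,      # ignore words that appear fewer than 5 times
        workers=1,
        seed=1,
    )
    # Returns (word, cosine similarity) pairs; raises KeyError if target is too rare
    return model.wv.most_similar(target, topn=topn)
```

Because the model only learns which words share contexts, names that happen to appear in similar headlines can rank as "similar" regardless of meaning, and small corpora make this noisier.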
Conclusions
The words that we read most these days are from internet news sites.
These publications have a huge influence on what we see, what we think, and how we feel.
And yet they answer only to rich benefactors or to their bottom lines.
Technology has created this incredibly competitive media landscape.
Maybe technology can also help us to navigate it more thoughtfully.
Bonus Results
As a writer, I’ve always wondered what the ideal sentence length is.
Based on the average number of words in a news headline, the answer is somewhere between 10 and 14 words.
This is the length that all news outlets use to draw in those clicks.
We can also measure the number of words in the average sentence of a news article.
Here we see that most are between 21 and 26 words.
This feels a bit long to me, but my background involves more technical and business writing.
Reporters with English majors are likely more long-winded.
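Both averages are simple to compute. The sketch below splits on whitespace for words and on end-of-sentence punctuation for sentences, which is an assumption for illustration (abbreviations such as "U.S." would be split incorrectly, and a proper tokenizer handles those better).

```python
import re
from statistics import mean


def words_per_headline(headlines):
    """Average number of whitespace-separated words per headline."""
    return mean(len(h.split()) for h in headlines)


def words_per_sentence(text):
    """Average number of words per sentence, splitting after '.', '!' or '?'."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return mean(len(s.split()) for s in sentences)
```

Both functions raise `statistics.StatisticsError` on empty input, which is a useful signal that a scrape returned nothing.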
Worst Sentiment Headlines
For fun, I saved the news headlines with the absolute worst sentiment scores.
The ones below reflect the darkest stories of the last 6 months.
FBI Director Wray: Terrorists Likely To Use Drones To Attack ‘Mass Gatherings’ In US
Donald Trump Blames Deadly California Wildfires On ‘Gross Mismanagement’ Of Forests
Journalist death toll: retaliation killings nearly double in 2018
Mother Suspected of Drowning Son Googled for Child-Killing Tips 100+ Times
Gun deaths in US reach highest level in nearly 40 years, CDC data reveal
Whitey Bulger’s Fatal Prison Beating: ‘He Was Unrecognizable’
Tesla factory is a hotbed of racism, former black employees claim
Thomas Friedman
As I described above, most readability tests are very simple.
They don’t measure anything about how sentences are constructed or how words are used.
I wanted to see how these tests graded what I consider to be truly bad writing.
For this purpose, I selected the last 5 articles from the notoriously inscrutable New York Times columnist Thomas Friedman.
His scores are below.
Average Flesch-Kincaid grade level: 12.1
Average Dale-Chall readability score: 7.27 (10th grade)
Average Gunning fog: 13.03
Average Smog grade: 13.5
Based on how idiosyncratic Friedman’s writing style is, I think the Smog grade is the most accurate.
You definitely need a college degree to get at whatever he’s trying to say.
And in case you’re wondering, this article has the following readability scores: Flesch-Kincaid grade level 7.9, Dale-Chall readability score 6.81, Gunning fog 9.64, Smog grade 11.3.
Notes
All code and data (minus copyrighted news stories) are on GitHub at https://github.com/taubergm/news_readability
1: “There are 2.5 quintillion bytes of data created each day”
2: https://newspaper.…io/en/latest/
3: VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
4: …com/shivam5992/textstat
I had to fix a minor bug in this library associated with counting the number of words in a sentence. I replaced the sentence_count() function with my own implementation using an nltk tokenizer.
5: Flesch-Kincaid score before it is converted to a grade level
6: Dale-Chall formula
7: Smog formula
8: Gunning fog formula
9: From Wikipedia. Word2Vec models “are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words”. I admit I only vaguely understand how it works.