Fake news or not?
Using natural language processing to identify articles
Dawn Graham
Jan 29

(Original photo by Robert Owen-Wahl.)

While the term “fake news” has become a hot topic in recent years, fake news itself is nothing new.
It’s also not new that truth is often stranger than fiction — sometimes real stories are hard to believe.
Researcher Claire Wardle categorized fake news into seven types, loosely measured by the intent to deceive.
7 Types of Mis- and Disinformation (Source: “Fake News.”)

The Onion falls into the first type: satire or parody.
The organization started publishing satirical news articles in print in 1988, then went online in 1996.
Articles from The Onion and its satirical sister sites are shared on the subreddit r/TheOnion, which had 83.1k subscribers as of December 21, 2018.
There is also the subreddit r/nottheonion: “For true stories that are so mind-blowingly ridiculous that you could have sworn they were from The Onion.” This had 14.5m subscribers, far outnumbering the subreddit for the publication it references.
Another subreddit r/AteTheOnion had 229k subscribers, also outnumbering subscribers to r/TheOnion.
This one is dedicated to “screencaps of people who failed to see The Onion’s articles as satire.”

r/nottheonion and r/AteTheOnion point to both the interest in “strange but true” news and the challenge of separating fact from fiction.
This raises the question:

Can we use natural language processing to predict whether an article is from r/TheOnion (fake news) or from r/nottheonion (real news) by the title alone?

Even Reddit users get r/TheOnion and r/nottheonion mixed up.
Data Collection

I originally used the Reddit API to collect data, but was limited to 1,000 posts per subreddit.
I also tried using PRAW, but could not get past the 1,000 post cap since it is simply a wrapper for the Reddit API.
io API, however, is not limited by the cap.
Using this, I collected the 10,000 most recent submissions (at time of collection) to each subreddit, r/TheOnion and r/nottheonion.
Submissions to each span different time ranges:

r/TheOnion: September 22, 2016 to December 17, 2018
r/nottheonion: October 26, 2018 to December 17, 2018

The shorter time span for r/nottheonion reflects both the larger subscriber base and the greater diversity of sources.
People can submit news articles from any “original, reliable source” written in English, whereas r/TheOnion only accepts articles from The Onion or sister sites.
Processing

After collection, the titles of each submission were processed according to the following steps:

Accents were removed so words were not inappropriately split up. (For example, “Pokémon” being turned into “pok” and “mon.”)
Punctuation was removed.
Capitalization was removed so that only lowercase words would be returned.
Words were lemmatized.
Stop words were removed.
These steps turned this:

Fighting Fire With Fire: Mitch McConnell Is Attempting To Channel Alexandria Ocasio-Cortez’s Populist Appeal By Preparing A Supper Of Boiled Dog Live On Instagram

into this “cleaned” title:

fighting fire fire mitch mcconnell attempting channel alexandria ocasio cortez populist appeal preparing supper boiled dog live instagram

From here, the data was split into training and testing data, then vectorized using CountVectorizer().
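The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the function names, the tiny stop word list, and the handling of possessives are all assumptions. Lemmatization is left as a pluggable function that defaults to doing nothing, since the article does not say which lemmatizer was used (NLTK's WordNetLemmatizer would be one option).

```python
import re
import unicodedata

# Tiny illustrative stop word list (an assumption; the article does not
# say which list was used, and a real one would be much longer).
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "is", "to", "by", "of", "on", "with", "in", "for", "at"}
)


def strip_accents(text):
    """Replace accented characters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_title(title, lemmatize=lambda word: word, stop_words=STOP_WORDS):
    """Apply the article's steps in order: accents, punctuation,
    lowercasing, lemmatization, stop word removal."""
    text = strip_accents(title)
    text = re.sub(r"[’']s\b", "", text)          # drop possessive endings (assumption)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remaining punctuation becomes a space
    tokens = text.lower().split()
    tokens = [lemmatize(token) for token in tokens]  # identity unless a lemmatizer is passed in
    return " ".join(token for token in tokens if token not in stop_words)


title = (
    "Fighting Fire With Fire: Mitch McConnell Is Attempting To Channel "
    "Alexandria Ocasio-Cortez’s Populist Appeal By Preparing A Supper Of "
    "Boiled Dog Live On Instagram"
)
print(clean_title(title))
print(strip_accents("Pokémon"))  # Pokemon
```

Removing accents before stripping punctuation is what keeps “Pokémon” as a single token: once “é” becomes “e”, the punctuation step has nothing to split on.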
Modeling

Since I had 10,000 submissions to each subreddit, the baseline accuracy score was 50%.
In other words, a model that classified every title the same way would get half of them correct.
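As a quick check of that baseline, here is a small sketch that computes the accuracy of always predicting the most common class. The function name and label strings are illustrative; only the class sizes come from the article.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Class sizes from the article: 10,000 submissions per subreddit.
labels = ["TheOnion"] * 10_000 + ["nottheonion"] * 10_000
print(baseline_accuracy(labels))  # 0.5
```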
I ran several different classification models and got cross-validation, training, and testing accuracy scores for each.
The testing scores were as follows:

Random Forest (with hyperparameters tuned via GridSearch): 84.5%
Logistic Regression: 86.8%
Multinomial Naive Bayes: 87.1%

All models returned accuracy scores that were better than the baseline.
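A comparison like the one above could be set up along these lines with scikit-learn. This is a sketch under stated assumptions, not the project's code: the titles are made-up placeholders (label 1 for r/TheOnion, 0 for r/nottheonion), the split size and hyperparameters are arbitrary, and the grid search for the random forest is left out for brevity. Scores on such a tiny toy set say nothing about the real results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up placeholder titles, NOT the article's data.
onion_titles = [
    "quiz which incredible nation are you",
    "nation shocked by incredible blog tips",
    "tips for writing an incredible blog",
    "quiz how well do you know the nation",
    "incredible tips from the nation top blog",
    "blog quiz reveals incredible nation",
    "nation demands more quiz tips",
    "incredible quiz blog tips for the nation",
]
not_onion_titles = [
    "man arrested after cops find stolen goat",
    "woman sues city says cops lost her dog",
    "cops say man arrested twice in one day",
    "man sues neighbor over loud parrot",
    "police say woman arrested at bakery",
    "mayor says cops arrested wrong goat",
    "woman arrested after she sues cops",
    "man says he sues everyone cops arrested",
]
titles = onion_titles + not_onion_titles
labels = [1] * len(onion_titles) + [0] * len(not_onion_titles)

X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=42
)

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

scores = {}
for name, classifier in classifiers.items():
    # Putting CountVectorizer inside the pipeline means the vocabulary is
    # learned only from the training portion of each cross-validation fold.
    pipeline = make_pipeline(CountVectorizer(), classifier)
    cv_score = cross_val_score(pipeline, X_train, y_train, cv=3).mean()
    pipeline.fit(X_train, y_train)
    scores[name] = {
        "cross_validation": cv_score,
        "training": pipeline.score(X_train, y_train),
        "testing": pipeline.score(X_test, y_test),
    }

for name, result in scores.items():
    print(name, result)
```

Comparing the three numbers per model is what makes overfitting visible: a training score well above the cross-validation and testing scores suggests the model has memorized the training titles.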
Selected Model Details

Although Multinomial Naive Bayes performed slightly better than Logistic Regression, I ultimately selected Logistic Regression for interpretability.
The accuracy scores were as follows:

Cross-validation: 85.8%

And the classification metrics:

Misclassification rate: 13.2%
Recall / Sensitivity: 86.2%

The model was clearly overfitting to the training data, but performing about as expected from the cross-validation score.
It was able to predict with 86.8% accuracy whether an article was from r/TheOnion or r/nottheonion by title alone.
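For reference, the metrics quoted above relate to one another as in the sketch below. The misclassification rate is simply one minus accuracy, which matches the article's figures (100% - 86.8% = 13.2%). The confusion-matrix counts here are hypothetical, chosen only to make the arithmetic easy to follow, and treating r/TheOnion as the positive class is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, misclassification rate, and recall from
    confusion-matrix counts (true/false positives and negatives)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,  # equals 1 - accuracy
        "recall": tp / (tp + fn),  # share of actual positives that were found
    }


# Hypothetical counts for illustration only, not the article's confusion matrix.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```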
Word clouds showing the most frequent words in r/TheOnion (left, in green) and r/nottheonion (right, in red).
Insights

Associated Words

I generated the word clouds above to show the frequency of words in the “cleaned” titles for each subreddit.
r/TheOnion and r/nottheonion share many of the same most frequent words: man, trump, woman, say, new.
Logistic regression is able to provide far more helpful information: the words that are most associated with each subreddit.
The words that were more likely to be from titles in r/TheOnion were quiz, nation, blog, incredible, and tips.
The words that were more likely to be from titles in r/nottheonion were poop, arrested, sues, says, and cops.
Poop and r/nottheonion

The model identified titles containing the word “poop” as far more likely to be from r/nottheonion.
Indeed, 46 of the 47 titles containing the word are from that subreddit.
Many of these submissions were deleted for not having “an oniony quality” (seeming more like satire than news, not just a funny title) or being from an unreliable news source.
Here’s an example of an article that was submitted to r/nottheonion, but was removed:

Scientists ate Legos to see how long it takes to poop them out
Pediatricians have to deal with all kinds of interesting situations in their daily work with children, and kids eating…

The one submission made to r/TheOnion with the word “poop” was removed for not being from The Onion or a sister site.
It was included in the testing data and was misclassified.
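A per-subreddit count like the 46-of-47 figure above can be checked with a few lines of Python. The function name and the sample titles below are made up for illustration; they are not the article's data.

```python
import re
from collections import Counter


def count_titles_with_word(titles, labels, word):
    """Count, per label, how many titles contain `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return Counter(
        label for title, label in zip(titles, labels) if pattern.search(title)
    )


# Made-up placeholder titles, NOT the article's data.
titles = [
    "scientists ate legos see long take poop",
    "dog poop dispute ends court",
    "nation quiz incredible tips",
    "poop emoji named employee month",
]
labels = ["nottheonion", "nottheonion", "TheOnion", "TheOnion"]

counts = count_titles_with_word(titles, labels, "poop")
print(counts)
```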
More Misclassification

As noted above, some misclassifications could result from a title containing a word the model associates with the other subreddit.
A closer look shows that such submissions may have also been removed from the given subreddit due to not following submission guidelines.
Many of the other misclassifications also turned out to be removed from their given subreddits for a variety of reasons.
This suggests that filtering out submissions that were removed from subreddits could help improve model accuracy.
Duplication

However, many submissions are also removed from both r/TheOnion and r/nottheonion due to duplication.
Being able to see these submissions can provide insight into subscriber interest in an article.
In the case of r/nottheonion, it can also provide insight into what articles subscribers think are appropriate for the subreddit, even if they are ultimately removed.
For example, over 80 submissions of the article below (or articles with similar titles from other sites) were made to r/nottheonion between December 2 and December 4, 2018, showing a flurry of activity shortly after the news came out.
'It's the real me': Nigerian president denies dying and being replaced by clone
Muhammadu Buhari breaks silence about a rumour that has circulated on social media for months

Real-World Events

In the case of r/TheOnion, repeated postings of a satirical article can be related to real-world events.
Between October 2, 2017 and November 8, 2018, there were 27 submissions of this article about gun violence:

'No Way To Prevent This,' Says Only Nation Where This Regularly Happens
THOUSAND OAKS, CA-In the hours following a violent rampage in California in which a lone attacker killed 12…

This article contributed to the model associating the word “nation” with this subreddit.
But a more important insight is that these repostings often corresponded to real mass shootings, including:

Las Vegas, NV (10/1/17)
Sutherland Springs, TX (11/5/17)
Tehama County, CA (11/14/17)
Stoneman Douglas High School, Parkland, FL (2/14/18)
Santa Fe High School, TX (5/18/18)
Capital Gazette, Annapolis, MD (6/29/18)
Bakersfield, CA (9/12/18)
Butler High School, NC (10/29/18)
Thousand Oaks, CA (11/8/18)

Indeed, many r/TheOnion subscribers began to use repostings as a signal that another mass shooting had taken place, with one commenting, “Usually I hear about the shooting before seeing the Onion article. Not this time.”

Next Steps

Some of the next steps I’d like to take with this project include:

Taking a deeper dive into understanding the misclassified titles.
Filtering out submissions that were removed from the subreddits.
Looking at how repeated postings in r/TheOnion relate to real news.
Developing a more robust model.
Thanks for reading! Thoughts, questions, and feedback are always appreciated.
You can also check out the GitHub repo for this project.