Ego Network Analysis for the Detection of Fake News
Using a combination of network analysis and natural language processing to determine the sources of “fake news” on Twitter
Brian Srebrenik
Feb 18
Twitter network of verified users with over 1 million followers.
Circles (nodes) represent users and the lines connecting the circles represent one user “following” another.
Colors represent classes determined through modularity clustering.
While “Fake News” existed long before the age of the internet, it seems that today it is harder than ever to determine the reliability of news sources.
After doing some research on the topic, I found that there was some work being done in graph theory to see if we could use machine learning to assist in the detection of sources of fake news.
I am very interested in the power of networks and the information we can gain from them, so I decided to see if I could build a classification model that would find patterns in ego networks to detect fake news.
What is an Ego Network?
Ego Networks (also known as Personal Networks in human social network analysis) consist of a focal node known as the Ego and the nodes to which the ego is directly connected, called Alters, with edges showing links between the ego and its alters or between alters.
Each alter in an ego network can have its own ego network, and all ego networks combine to form the social network.
In such a network, egos could be human beings or objects like products or services in a business context.
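To make the definition concrete, here is a minimal sketch in plain Python that pulls one ego network out of a "follows" edge list. The account names and the `ego_network` helper are hypothetical and purely illustrative; this is not the code used for the project.

```python
def ego_network(edges, ego):
    """Return (alters, ego_edges) for `ego` in a directed edge list.

    Alters are the nodes directly connected to the ego in either direction;
    ego_edges keeps ego-alter links and alter-alter links.
    """
    alters = set()
    for source, target in edges:
        if source == ego:
            alters.add(target)
        elif target == ego:
            alters.add(source)
    alters.discard(ego)  # ignore self-loops
    members = alters | {ego}
    ego_edges = [(s, t) for s, t in edges if s in members and t in members]
    return alters, ego_edges


# Hypothetical "follows" edges: (follower, followed).
follows = [
    ("alice", "bob"),
    ("carol", "alice"),
    ("bob", "carol"),
    ("dave", "erin"),
    ("erin", "bob"),
]

alters, ego_edges = ego_network(follows, "alice")
print(sorted(alters))
print(ego_edges)
```

Here the edge between the two alters is kept, while the edge from an account outside the ego network is dropped.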
In the image below, I have visualized the ego networks of all Twitter Verified Users with over 1,000,000 followers.
Each circle represents a verified Twitter user node (the size of the circle reflects total follower count) and the lines, or edges, linking them represent nodes “following” one another.
Credit to Luca Hammer, who provided me with the Twitter edge list; be sure to check out his Medium for excellent posts on exploring and visualizing network data.
This graph visualization, like all the others you’ll see in this article, was created using Gephi.
For the purposes of this project, I decided to analyze strictly verified Twitter networks, as I felt there was a natural tendency for users to have more trust in sources that have officially been verified by Twitter.
Training Data Problem: How do I decide which nodes represent fake news sources?
Probably the biggest problem I faced at the outset of this project was how to determine which Twitter accounts to classify as sources of fake news for my training data.
There is no universally agreed-upon way of determining whether news is fake, and if there were, fake news would not be a problem in the first place.
But I had to start somewhere.
Luckily, I was able to find a fantastic dataset in the CREDBANK data that accompanied the ICWSM 2015 paper “CREDBANK: A Large-scale Social Media Corpus With Associated Credibility Annotations”.
If you’ve got the time, I highly suggest checking out the paper, but here is the TL;DR:
“In total, CREDBANK comprises more than 60M tweets grouped into 1049 real-world events, each annotated by 30 Amazon Mechanical Turk workers for credibility (along with their rationales for choosing their annotations).
The primary contribution of CREDBANK is a unique dataset compiled to link social media event streams with human credibility judgements in a systematic and comprehensive way.”
By combining this dataset with the Twitter network data, I was able to create my own dataset for training a classification model.
The data consisted of 69,025 verified users, and all the connections between them.
Of those users, 66,621 were determined to be sources of real news and 2,404 were determined to be sources of fake news.
The sources I determined to be fake were those that had more than 5% of their tweets rated below “partially accurate” by the Amazon Mechanical Turk credibility raters.
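The labeling rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the numeric rating scale, the `PARTIALLY_ACCURATE` cut-off value, the way a rating is attached to each tweet, and the `label_sources` helper are all hypothetical. Only the 5% threshold comes from the text.

```python
from collections import defaultdict

FAKE_SHARE_THRESHOLD = 0.05  # "more than 5%" from the article
PARTIALLY_ACCURATE = 1       # hypothetical numeric code for "partially accurate"


def label_sources(rated_tweets):
    """Map each user to "fake" or "real" from (user, rating) pairs."""
    totals = defaultdict(int)
    low_rated = defaultdict(int)
    for user, rating in rated_tweets:
        totals[user] += 1
        if rating < PARTIALLY_ACCURATE:
            low_rated[user] += 1
    return {
        user: "fake" if low_rated[user] / totals[user] > FAKE_SHARE_THRESHOLD else "real"
        for user in totals
    }


# Made-up ratings: user_a has 1 low-rated tweet out of 10 (10%),
# user_b has 1 low-rated tweet out of 25 (4%).
rated_tweets = (
    [("user_a", 0)] + [("user_a", 2)] * 9
    + [("user_b", 0)] + [("user_b", 2)] * 24
)
labels = label_sources(rated_tweets)
print(labels)
```

Because the comparison is strict, an account sitting at exactly 5% would still be labeled real under this reading of the rule.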
Network EDA
This is the network graph of all sources in my dataset.
Blue dots and lines represent NOT fake sources and red dots and lines represent fake sources.
Same graph as above, but with fake sources only.
After collecting and organizing the data (I used the graph database Neo4j to store the network data), the first step was to do an initial exploratory analysis of the network data.
I used two network algorithms, Eigenvector Centrality and PageRank, in my initial analysis.
The Eigenvector Centrality algorithm was only run on a sample of the data as centrality measures take quite a long time to compute on large networks.
Eigenvector centrality is a measure of the influence of a node in a network.
Relative scores are assigned to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
A high eigenvector score means that a node is connected to many nodes who themselves have high scores.
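As a rough illustration of that idea, here is a small power-iteration sketch in plain Python. It is not the NetworkX implementation used for this project; the toy graph, the function name, and the choice of an undirected graph are assumptions made to keep the example short.

```python
import math


def eigenvector_centrality(adj, max_iter=200, tol=1e-9):
    """Power iteration on (A + I) for an undirected graph {node: set_of_neighbours}."""
    scores = {node: 1.0 / len(adj) for node in adj}
    for _ in range(max_iter):
        new = dict(scores)  # the "+ I" term keeps the iteration from oscillating
        for node, neighbours in adj.items():
            for neighbour in neighbours:
                # Each node passes its current score on to its neighbours.
                new[neighbour] += scores[node]
        norm = math.sqrt(sum(value * value for value in new.values())) or 1.0
        new = {node: value / norm for node, value in new.items()}
        if sum(abs(new[node] - scores[node]) for node in adj) < tol:
            return new
        scores = new
    raise RuntimeError("power iteration did not converge")


# Hypothetical graph: "a" is a hub; "b" and "e" each have exactly one link.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d"},
}
scores = eigenvector_centrality(graph)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

Nodes "b" and "e" have the same number of connections, but "b" is attached to the hub and so ends up with the higher score, which is exactly the "connections to high-scoring nodes count for more" behaviour.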
PageRank is widely recognized as a way of detecting influential nodes in a graph.
It is different to other centrality algorithms because the influence of a node depends on the influence of its neighbours.
com/neo4j-graph-analytics/graph-algorithms-notebooks
I used the Python library NetworkX’s implementation of these algorithms to determine the statistics shown above.
As you can see, although there is a much larger spread of the Eigenvector Centrality measure for the real sources, overall the numbers are quite similar for both the fake and real sources.
I will have to look at some other methods to differentiate between the two types of nodes.
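For readers who want to see what PageRank computes, here is a minimal plain-Python sketch of the usual damped power iteration. It stands in for the NetworkX call used in the project and is not that implementation; the damping factor of 0.85 is the conventional default, and the tiny "follows" graph is made up.

```python
def pagerank(out_links, damping=0.85, max_iter=200, tol=1e-10):
    """PageRank by power iteration on {node: list_of_nodes_it_links_to}.

    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node, targets in out_links.items():
            for target in targets:
                new[target] += damping * rank[node] / len(targets)
        if sum(abs(new[node] - rank[node]) for node in nodes) < tol:
            return new
        rank = new
    return rank


# Hypothetical graph: an edge u -> v means "u follows v".
follows = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"]}
ranks = pagerank(follows)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

In this toy graph "c" is followed by everyone else and ranks first, while "d" ranks second only because the influential "c" follows it.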
Clustering through Louvain Community Detection
The Louvain method of community detection is an algorithm for detecting communities in networks.
It maximizes a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities by evaluating how much more densely connected the nodes within a community are compared to how connected they would be in a random network.
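The modularity score itself is easy to compute for a given assignment of nodes to communities. The sketch below uses the standard formula for an undirected, unweighted graph; treating the graph as undirected is a simplifying assumption, and the two-triangle example is made up. It only scores a partition and does not implement the Louvain search itself.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal_edges = defaultdict(int)
    total_degree = defaultdict(int)
    for u, v in edges:
        total_degree[community_of[u]] += 1
        total_degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Two triangles joined by a single bridge edge (c-d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_communities = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_community = {node: 0 for node in two_communities}

q_split = modularity(edges, two_communities)
q_merged = modularity(edges, one_community)
print(round(q_split, 4), round(q_merged, 4))
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of improvement Louvain looks for as it merges and moves nodes.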
I decided to run this algorithm on the network data I had to see if fake sources were placed in similar classes.
In the first image below, I visualized the network graph with each node in the color of the class it was assigned to.
The second image contains just the fake news sources.
It seems that the vast majority of the fake news sources were put in the purple and green classes, and the fake sources are clearly concentrated in one area of the network graph.
This clustering did a good job of eliminating real sources: 25,838 nodes were placed in classes without any fake sources. It is still not enough, however, to completely isolate the fake news sources.
For that I would have to turn to node2vec.
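The community-based filtering described above (setting aside every node whose community contains no fake source) can be sketched like this. The helper name and the toy assignments are hypothetical; the real community labels came from the modularity clustering.

```python
def split_by_fake_presence(community_of, is_fake):
    """Separate nodes in fake-free communities from all other nodes."""
    communities_with_fake = {
        community_of[node] for node, fake in is_fake.items() if fake
    }
    eliminated, remaining = [], []
    for node, community in community_of.items():
        if community in communities_with_fake:
            remaining.append(node)
        else:
            eliminated.append(node)
    return eliminated, remaining


# Made-up community assignments and labels.
community_of = {"u1": "purple", "u2": "purple", "u3": "green", "u4": "orange", "u5": "orange"}
is_fake = {"u1": True, "u2": False, "u3": True, "u4": False, "u5": False}

eliminated, remaining = split_by_fake_presence(community_of, is_fake)
print(len(eliminated), "nodes eliminated;", len(remaining), "still to classify")
```

Note that real sources sharing a community with a fake source stay in the remaining set, which is why this step narrows the problem without solving it.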
Node2Vec
According to the Stanford Network Analysis Project, the creators of node2vec:
“The node2vec framework learns low-dimensional representations for nodes in a graph by optimizing a neighborhood preserving objective.
The objective is flexible, and the algorithm accommodates for various definitions of network neighborhoods by simulating biased random walks.
Specifically, it provides a way of balancing the exploration-exploitation tradeoff that in turn leads to representations obeying a spectrum of equivalences from homophily to structural equivalence.”
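The "biased random walks" mentioned in that description are the core of the method, so here is a minimal plain-Python sketch of a single node2vec-style walk. It covers only the walk step: in node2vec the collected walks are then fed to a skip-gram model to learn the embeddings. The toy graph and the `p` and `q` values below are hypothetical and are not the parameters used in this project.

```python
import random


def biased_walk(adj, start, length, p, q, rng):
    """One node2vec-style second-order random walk on an undirected graph.

    adj maps each node to a list of neighbours. Given the previous node
    `prev` and the current node, a candidate next node gets unnormalised
    weight 1/p if it is `prev` (step back), 1 if it is also a neighbour of
    `prev` (stay close), and 1/q otherwise (move outward).
    """
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = adj[current]
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # the first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == prev:
                weights.append(1.0 / p)
            elif candidate in adj[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected graph as adjacency lists.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}
rng = random.Random(42)  # seeded so the example is repeatable
walk = biased_walk(adj, "a", length=10, p=1.0, q=0.5, rng=rng)
print(walk)
```

A small `q` pushes walks outward (more exploration), while a small `p` makes them return to where they came from, which is how the method trades off homophily against structural equivalence.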
Basically, the node2vec algorithm will give me the ability to embed all of the nodes in more than one dimension (specifically, for this project, 128 dimensions) as a way of engineering new features for the locations of the nodes on the graph.
For my model, I used this implementation of the algorithm.
Here are the parameters that I selected:
Unfortunately, even after engineering these 128 new features for each node, my initial attempts at building a classification model were unsuccessful.
Due to the large class imbalance (less than 4% of nodes were fake sources), my algorithms would always predict all sources to be real.
I needed some other differentiating features to help these classification algorithms.
Word Embeddings
The idea of node2vec actually came from Word Embeddings, a vectorization strategy that computes word vectors from a text corpus by training a neural network. The result is a high-dimensional embedding space in which each word in the corpus is a unique vector.
In this embedding space, the position of the vector relative to the other vectors captures semantic meaning.
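A common way to compare positions in an embedding space is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real word embeddings are learned from data and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not learned from any corpus.
embeddings = {
    "journalist": [0.9, 0.8, 0.1],
    "reporter": [0.8, 0.9, 0.2],
    "guitarist": [0.1, 0.2, 0.9],
}

close = cosine_similarity(embeddings["journalist"], embeddings["reporter"])
far = cosine_similarity(embeddings["journalist"], embeddings["guitarist"])
print(round(close, 3), round(far, 3))
```

Words used in similar contexts end up pointing in similar directions, so the first pair scores much closer to 1 than the second.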
I decided to use the profile descriptions for each Twitter user for classification in a recurrent neural network.
The Embedding Layer inside the network computes word embedding vectors.
The output of this neural network would then be a probability that the description of the Twitter account comes from a real or fake account.
I would then use these probabilities in conjunction with the features from node2vec to build a final ensemble classification model.
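One simple way to combine the two sources of information is to append the description probability to each node's embedding, giving 129 features per account. The sketch below assumes that layout; the helper name, the dummy values, the direction of the probability, and the choice to skip accounts without a probability are all assumptions rather than details taken from the project.

```python
EMBEDDING_DIM = 128  # node2vec dimensions used in this project


def build_feature_rows(node_embeddings, description_probs):
    """Append each account's description probability to its node2vec vector."""
    rows = {}
    for user, embedding in node_embeddings.items():
        if user in description_probs:  # assumption: skip accounts with no score
            rows[user] = list(embedding) + [description_probs[user]]
    return rows


# Dummy inputs: constant vectors stand in for learned embeddings.
node_embeddings = {
    "user_a": [0.1] * EMBEDDING_DIM,
    "user_b": [0.2] * EMBEDDING_DIM,
    "user_c": [0.3] * EMBEDDING_DIM,
}
description_probs = {"user_a": 0.91, "user_b": 0.07}  # made-up network outputs

rows = build_feature_rows(node_embeddings, description_probs)
print({user: len(features) for user, features in rows.items()})
```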
Below are the details of the recurrent neural network:
Model Summary
Final Classification Models
I performed a grid search on both Support Vector Machine and XGBoost models using the features from node2vec and the probabilities from the neural network.
I decided to focus my search on models with high Recall and Precision scores due to the high class imbalance (predicting all “real” would lead to an accuracy score of about 96%).
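To see why accuracy is misleading here, the sketch below computes precision, recall, and accuracy for a classifier that always predicts "real", using the class counts given earlier (66,621 real and 2,404 fake). The helper function is illustrative, and returning 0.0 when a ratio is undefined is a convention chosen for this example.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Metrics for the positive ("fake") class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined if nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


REAL, FAKE = 66_621, 2_404  # class counts from the dataset described above

# A model that labels every account "real": no fakes are ever flagged.
precision, recall, accuracy = precision_recall_accuracy(tp=0, fp=0, fn=FAKE, tn=REAL)
print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f}")
```

The all-"real" model looks strong on accuracy while finding none of the fake sources, which is why precision and recall are the more informative targets for the grid search.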
XGBoost and SVM Grid Search Results
The following images show the optimal parameters found for my XGBoost and SVM classifiers and a confusion matrix for the final models:
XGBoost
SVM
As you can see above, the XGBoost model performed slightly better on precision while the SVM model performed slightly better on recall.
Conclusion
These classification models performed quite well, especially considering the large class imbalance.
It’s clear that the word embedding features made a major difference in the models’ ability to detect true positives.
While I would have liked to classify the nodes strictly based on network features, there may not be enough to differentiate those nodes classified as fake.
I do think, however, that there is a lot of potential for network analysis in the detection of fake news.
Some of the issues I encountered simply had to do with the vast size of the network (as I mentioned before, I was unable to calculate centrality measures on the entire network due to its size), and there are certainly more patterns to be found in the data.
If you would like to check out my project repo and the entirety of my analysis, you can find that here: https://github.com/briansrebrenik/Final_Project
Tools used:
Neo4j Graph Database
Gephi
Node2Vec from Stanford Network Analysis Project
Node2Vec Algorithm Implementation
NetworkX
Keras
Paperspace Gradient
Data Sources:
“Fake News” data from Credbank
Twitter Network Edges from Luca Hammer