Anomaly Detection on Donald Trump’s Wikipedia Page
Chidi Prince John
May 16
Anomaly detection (aka outlier detection) is a data mining technique that identifies rare observations in a dataset.
This technique is widely used in analytics operations, such as web analytics and fraud analytics.
In web analytics, anomaly detection is used to analyze web traffic and identify periods when unusual events occur.
For instance, Nike received massive global attention across their web platforms when Zion Williamson broke his shoe in a college basketball game.
This surge in web traffic can be categorized as an anomaly because it occurs on rare, unplanned occasions.
In fraud analytics, anomaly detection is used to analyze credit card transactions and identify unusual transactions.
For instance, if an individual spends approximately $50 on drinks every Friday and then suddenly pays $2000 one Friday, that transaction can be flagged immediately, because the credit card may have been stolen and used to make hasty purchases.
Web analytics and fraud analytics are just two of the numerous applications of anomaly detection, and in this article, I will walk you through the process of performing anomaly detection on a Wikipedia page using R.
For this exercise, we will analyze the daily pageviews of Donald Trump’s Wikipedia page from January 2014 till May 2019.
The process is divided into three major steps.
1. Introducing the tools
2. Performing the analysis
3. Explaining the result
Let’s commence!
1: Introducing the Tools
The R programming language is used for this project, and our primary tool for anomaly detection is the AnomalyDetection package.
This package is highly beneficial because its Seasonal Hybrid ESD (S-H-ESD) algorithm accounts for seasonal trends when detecting anomalies.
For instance, if you purchase groceries worth $10 each day from Monday till Saturday, and then, purchase groceries worth $100 on Sunday, we can call the Sunday purchase an anomaly.
However, it ceases to be an anomaly if you buy groceries worth $100 every Sunday, perhaps because you welcome visitors for dinner every Sunday.
Through this algorithm, seasonal trends, such as the one explained, are taken into consideration.
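
To make the grocery example concrete, here is a minimal base R sketch. It is not the Seasonal Hybrid ESD algorithm itself; it only illustrates the idea of comparing each day against its own weekday before flagging it. The spending figures and the $20 threshold are hypothetical.

```r
# Five weeks of hypothetical daily grocery spending, Monday to Sunday:
# roughly $10 on Monday-Saturday and $100 every Sunday.
weekly_pattern <- c(9, 10, 11, 10, 9, 11, 100)
spend   <- rep(weekly_pattern, times = 5)
weekday <- rep(1:7, times = 5)          # 1 = Monday, 7 = Sunday

# Inject one genuinely unusual purchase: $60 on the Wednesday of week 5.
spend[31] <- 60

threshold <- 20  # hypothetical cut-off in dollars

# Naive check: compare every day against the overall median.
naive_flags <- abs(spend - median(spend)) > threshold

# Season-aware check: compare every day against the median for its weekday.
weekday_median <- ave(spend, weekday, FUN = median)
seasonal_flags <- abs(spend - weekday_median) > threshold

sum(naive_flags)       # counts every Sunday plus the odd Wednesday
sum(seasonal_flags)    # counts only the odd Wednesday
which(seasonal_flags)  # position of the flagged day in the series
```

The naive check treats every Sunday as an outlier, while the season-aware check flags only the purchase that breaks the weekly pattern.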
An overview of the tools used for this analysis is given below:
- devtools: This package facilitates package development; here it was used to install the wikipediatrend package (which downloads the Wikipedia dataset) and the AnomalyDetection package (which performs the anomaly detection).
- Rcpp: This package helps to integrate R with C++ functions.
- ggplot2: This package helps us to make visualizations in R.
2: Performing the Analysis
The first step in performing the analysis is to install the packages.
The Rcpp and devtools packages are instrumental in this process; in particular, devtools allows us to install packages directly from GitHub.
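
The installation step might look like the sketch below. The GitHub repository paths are my assumption of where these packages live; they are not stated in the article, so check them before running.

```r
# Install the CRAN packages first (skip any you already have).
install.packages(c("Rcpp", "devtools", "ggplot2"))

# Install the two remaining packages from GitHub through devtools.
# Both repository paths are assumptions, not taken from the article.
devtools::install_github("petermeissner/wikipediatrend")
devtools::install_github("twitter/AnomalyDetection")
```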
[Figure: Install the Tools]
The next step in this process is to load the libraries.
I already had ggplot2 and devtools installed, so I loaded them directly.
The wikipediatrend and AnomalyDetection packages were introduced in the previous section.
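
Loading the libraries could be sketched as follows. The availability check before loading is my own addition so that the snippet fails gently when a package is missing.

```r
# Packages used in the rest of the walkthrough.
pkgs <- c("ggplot2", "devtools", "wikipediatrend", "AnomalyDetection")

# Find out which of them are not installed yet.
installed    <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these packages first: ",
          paste(missing_pkgs, collapse = ", "))
} else {
  # Attach every package to the search path.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```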
[Figure: Load the Libraries]
For this analysis, we will download our data from Trump’s Wikipedia page.
To do this, we will use the wikipediatrend package to download the daily pageviews from January 2014 onward.
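
A minimal sketch of the download step, assuming the `wp_trend()` function from wikipediatrend. The page name `"Donald_Trump"`, the exact date range, and the helper name `fetch_pageviews` are my assumptions; the end date of 2019-05-15 is chosen only because it yields the 1961 daily rows the article reports.

```r
# Hypothetical helper: daily pageviews for one English Wikipedia page.
fetch_pageviews <- function(page, from, to) {
  wikipediatrend::wp_trend(page = page, from = from, to = to, lang = "en")
}

from <- "2014-01-01"  # assumed start date
to   <- "2019-05-15"  # assumed end date

# Number of calendar days in the assumed range.
expected_days <- length(seq(as.Date(from), as.Date(to), by = "day"))
print(expected_days)

if (requireNamespace("wikipediatrend", quietly = TRUE)) {
  # Needs network access; try() keeps the script going if the download fails.
  pageviews <- try(fetch_pageviews("Donald_Trump", from, to))
  if (!inherits(pageviews, "try-error")) {
    print(head(pageviews))  # the article describes four columns here
    print(nrow(pageviews))
  }
}
```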
[Figure: Importing and Viewing the Dataset]
The dataset has four columns: the language (English), the Wikipedia page title, the date, and the number of views.
Since the data runs daily from January 2014 to May 2019, we have 1961 observations.
We only need two columns for our analysis (the date column and the views column).
Therefore, we need to drop the undesired columns.
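
Dropping the unneeded columns can be sketched like this. The small data frame stands in for the downloaded data; its column names are assumed and its view counts are invented for illustration.

```r
# Tiny stand-in for the downloaded data (view counts are made up).
pageviews <- data.frame(
  language = "en",
  article  = "donald_trump",
  date     = as.Date("2014-01-01") + 0:2,
  views    = c(1200, 1350, 1280),
  stringsAsFactors = FALSE
)

# Keep only the two columns needed for the analysis.
trump_views <- pageviews[, c("date", "views")]
str(trump_views)
```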
[Figure: Keeping the Required Variables]
The next step is to visualize the data, and for this we use the ggplot2 package.
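
A line chart of daily views might be drawn as in the sketch below. The series is randomly generated so the snippet runs on its own; it does not reproduce the real pageview numbers, and the object names are assumptions.

```r
library(ggplot2)

set.seed(1)

# Synthetic stand-in for the two-column dataset (date, views).
dates <- seq(as.Date("2014-01-01"), as.Date("2019-05-15"), by = "day")
trump_views <- data.frame(
  date  = dates,
  views = rpois(length(dates), lambda = 30000)
)

# Line chart of daily pageviews over time.
p <- ggplot(trump_views, aes(x = date, y = views)) +
  geom_line() +
  labs(x = "Date", y = "Daily pageviews",
       title = "Daily pageviews (synthetic data)")

print(p)
```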
[Figure: Data Visualization]
From this chart, we can see huge spikes between 2016 and 2017.
After performing our analysis, we can tell if those spikes are anomalies.
[Figure: Anomaly Detection]
The above image shows us how to perform anomaly detection.
The direction keyword sets the directionality of the anomalies to be detected.
The options are positive ('pos'), negative ('neg'), and both ('both').
The next step is to find the anomalies, which is what the next figure does.
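
The detection call and the extraction of the flagged points might look like the sketch below, assuming the `AnomalyDetectionTs()` function from the AnomalyDetection package. The input is a synthetic series with a weekly cycle and two injected spikes, and the values chosen for `max_anoms` and `direction` are illustrative, not the article's settings.

```r
set.seed(42)

# Synthetic stand-in: 20 weeks of daily counts with a weekly cycle.
days   <- seq(as.POSIXct("2016-01-01", tz = "UTC"), by = "day",
              length.out = 140)
weekly <- rep(c(100, 105, 110, 108, 104, 140, 150), times = 20)
counts <- weekly + round(rnorm(140, sd = 3))

# Inject two artificial spikes that a detector should pick up.
counts[c(50, 120)] <- counts[c(50, 120)] + 400

# The function expects timestamps in the first column, values in the second.
series <- data.frame(timestamp = days, count = counts)

if (requireNamespace("AnomalyDetection", quietly = TRUE)) {
  # try() is used because behaviour may vary across package and R versions.
  res <- try(AnomalyDetection::AnomalyDetectionTs(
    series,
    max_anoms = 0.02,    # flag at most 2% of the points
    direction = "both",  # look for unusually high and unusually low values
    plot      = FALSE
  ))
  if (!inherits(res, "try-error")) {
    print(res$anoms)     # data frame of flagged timestamps and values
  }
}
```

Setting `plot = TRUE` additionally returns a chart of the series with the anomalies marked, available as `res$plot`.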
[Figure: Finding the Anomalies]
The result is shown below.
[Figure: Anomalies]
In total, we have 107 anomalies out of 1961 observations.
That is about 5.5% of our dataset.
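
Dividing the two counts reported above gives the share of days flagged as anomalous; a quick check in R:

```r
n_anomalies    <- 107   # anomalies reported in the article
n_observations <- 1961  # daily observations reported in the article

# Share of observations flagged as anomalies, as a percentage.
share_pct <- 100 * n_anomalies / n_observations
share_label <- sprintf("%.2f%%", share_pct)
print(share_label)
```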
3: Explaining the Result
Our analysis will be incomplete if we don’t explain our outcome.
In this final part of the article, I will review three chunks of those anomalies and do deeper research to understand why they were picked up as anomalies.
For the first part of this section, I will analyze the reason for the huge spike in pageviews on the dates shown below.
[Figure: First Anomaly]
To do this, we will use Google’s advanced search filters to look for news articles within that timeframe and see whether anything notable happened.
[Figure: News Search]
It is interesting to note that Donald Trump unexpectedly lost a Republican presidential nominating contest to Ted Cruz, and the reactions to this event sparked massive online activity, which led to more Wikipedia pageviews for Mr. Trump.
Let’s proceed to our next chunk of anomalies, which runs from November 6, 2016, to November 12, 2016.
[Figure: Second Anomaly]
We can see the massive spike in pageviews in this period.
We also recorded more than 1 million pageviews on four of those days.
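
Pulling out a date window and counting the days above one million views can be sketched as follows. The view counts are invented purely to illustrate the filtering; they are not the real figures for these dates.

```r
# Hypothetical daily views around the window (numbers are invented).
trump_views <- data.frame(
  date  = as.Date("2016-11-04") + 0:10,
  views = c(90e3, 110e3, 250e3, 400e3, 1.2e6, 2.5e6,
            1.6e6, 1.1e6, 700e3, 300e3, 200e3)
)

# Keep only the rows from November 6 to November 12, 2016.
in_window <- trump_views$date >= as.Date("2016-11-06") &
             trump_views$date <= as.Date("2016-11-12")
window <- trump_views[in_window, ]

# Count the days in the window with more than one million views.
days_over_1m <- sum(window$views > 1e6)
print(window)
print(days_over_1m)
```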
Google’s advanced search again suggests an explanation for this trend.
[Figure: Advanced Google Search]
The result is intuitive for followers of US politics: Trump won the US presidential election in November 2016.
This event sparked massive online activity, which led to more pageviews for Mr. Trump.
[Figure: Third Anomaly]
The third chunk of anomalies covers January 19 to January 26.
Let’s consult Google for more insights.
[Figure: Advanced Google Search]
The third chunk of anomalies falls within the period of Mr. Trump’s inauguration as US President (he was sworn in on January 20, 2017).
This event sparked massive online activity, which led to more pageviews for him.
In summary, we have seen how to perform anomaly detection on the pageviews of a Wikipedia page.
In advanced cases, these tools can be used in crime detection, digital analytics, and forensics.
Thank you for reading and do give me some claps.