Data scientists who use R have a reputation for writing clumsy code that is neither very readable nor very efficient. That trend has been changing thanks to the tidy principles popularized by Hadley Wickham, who needs no introduction in the R universe: his tidyverse underpins the day-to-day work of many R data scientists.
Now, the new package anomalize, open-sourced by Business Science, performs time series anomaly detection in a way that fits in with other tidyverse packages (or packages supporting tidy data). In particular, it is compatible with the pipe operator %>%, one of the most used tidyverse features, which helps in writing readable and reproducible data pipelines.

anomalize: Installation

The stable version of the R package anomalize is available on CRAN and can be installed like this:

    install.packages('anomalize')

The latest development version of anomalize is available on GitHub and can be installed like this:

    #install.packages('devtools')
    devtools::install_github("business-science/anomalize")

Since the development version doesn't require compiling tools, it can be worth installing it from GitHub to get the latest features and bug fixes.

Case: Bitcoin Price Anomaly Detection

It's easier to learn a new concept or piece of code by actually doing it and relating it to what we are already aware of. So, to understand tidy anomaly detection in R, we will try to detect anomalies in the Bitcoin price since 2017.

Loading Required Packages

We use the following 3 packages to solve the above case:

    library(anomalize) #tidy anomaly detection
    library(tidyverse) #tidyverse packages like dplyr, ggplot, tidyr
    library(coindeskr) #bitcoin price extraction from coindesk

Data Extraction

We use get_historic_price() from coindeskr to extract the historic bitcoin price from Coindesk. The resulting dataframe is stored in the object btc.

    btc <- get_historic_price(start = "2017-01-01")

Data Preprocessing

For anomaly detection using anomalize, we need to have either a tibble or a tibbletime object.
Hence we have to convert the dataframe btc into a tibble object that follows a time series shape and store it in btc_ts.

    btc_ts <- btc %>%
      rownames_to_column() %>%             # dates are stored as row names
      as_tibble() %>%                      # as_tibble() supersedes the deprecated as.tibble()
      mutate(date = as.Date(rowname)) %>%  # parse the former row names into Date values
      select(-one_of('rowname'))           # drop the helper column

Just looking at the head of btc_ts to see sample data:

    head(btc_ts)
      Price date
    1 998. [...] 2017-01-06

Time Series Decomposition with Anomalies

One of the important things to do with time series data before starting time series forecasting or modelling is time series decomposition, where the series is decomposed into seasonal, trend and remainder components.
anomalize provides the function time_decompose() to do this. Once the components are decomposed, anomalize() can detect and flag anomalies in the remainder component, which can then be visualized with plot_anomaly_decomposition().

    btc_ts %>%
      time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%
      anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
      plot_anomaly_decomposition()

This gives a plot of the decomposition with the detected anomalies flagged.

As you can see from the above code, the decomposition is based on the 'stl' method, which is a common method of time series decomposition. If you have been using Twitter's AnomalyDetection, the same can be implemented in anomalize by combining time_decompose(method = "twitter") with anomalize(method = "gesd").
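
To make the idea of decomposition concrete without any extra packages, here is a minimal sketch using base R's stl() on a synthetic weekly series. The series, its parameters and the variable names are invented for illustration. The sketch only shows the general seasonal/trend/remainder split and does not claim to reproduce what time_decompose() does internally.

```r
# Synthetic daily series with a weekly cycle (all values are made up).
set.seed(42)
n <- 7 * 20                                   # 20 full weeks
day <- seq_len(n)
x <- 100 + 0.5 * day +                        # linear trend
  10 * sin(2 * pi * day / 7) +                # weekly seasonality
  rnorm(n)                                    # noise
x[70] <- x[70] + 40                           # inject one artificial spike

series <- ts(x, frequency = 7)                # 7 observations per cycle
fit <- stl(series, s.window = "periodic")     # base R STL decomposition

components <- fit$time.series                 # columns: seasonal, trend, remainder
head(components)

# The three components add back up to the original series.
recomposed <- as.numeric(rowSums(components))
all.equal(recomposed, x)

# Position of the largest absolute remainder, a natural anomaly candidate.
which.max(abs(components[, "remainder"]))
```

Anomaly detection then operates on the remainder column only, which is why the pipeline above passes remainder to anomalize().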
The 'stl' method of decomposition can also be combined with anomalize(method = "iqr") for IQR-based anomaly detection.
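
For intuition about the two detection methods named above, the following base R sketch implements a textbook generalized ESD test (Rosner's formulation, using mean and standard deviation) and a simple interquartile-range rule. Both helper functions, the sample values and the IQR multiplier k = 3 are assumptions made for illustration. The variants implemented inside anomalize may differ in their details, so treat this as a sketch of the underlying ideas only.

```r
# Hypothetical helper: textbook generalized ESD test.
# Returns the indices of x flagged as outliers (at most max_outliers).
gesd_outliers <- function(x, max_outliers, alpha = 0.05) {
  n <- length(x)
  idx <- seq_len(n)            # indices still under consideration
  removed <- integer(0)        # candidates in order of removal
  exceeds <- logical(0)        # whether each test statistic beats its critical value
  for (i in seq_len(max_outliers)) {
    dev <- abs(x[idx] - mean(x[idx]))
    j <- which.max(dev)
    r_i <- dev[j] / sd(x[idx])                      # test statistic R_i
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda_i <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))    # critical value
    removed <- c(removed, idx[j])
    exceeds <- c(exceeds, r_i > lambda_i)
    idx <- idx[-j]
  }
  # Number of outliers = largest i for which R_i exceeds its critical value.
  n_out <- if (any(exceeds)) max(which(exceeds)) else 0
  sort(removed[seq_len(n_out)])
}

# Hypothetical helper: flag values outside [Q1 - k * IQR, Q3 + k * IQR].
iqr_outliers <- function(x, k = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

# Made-up remainder values with one obvious spike in the last position.
remainder <- c(10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0,
               10.1, 9.9, 10.4, 9.6, 10.0, 10.2, 9.8, 25)

gesd_idx <- gesd_outliers(remainder, max_outliers = 3)
iqr_idx <- which(iqr_outliers(remainder))
gesd_idx
iqr_idx
```

On this toy vector both approaches flag only the final value. On real remainders they can disagree, since the ESD test removes candidates one at a time and re-estimates the spread, while the IQR rule applies one fixed band.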

Anomaly Detection

Detecting anomalies and plotting them is very similar to what we saw above with time series decomposition. The only difference is that, after anomaly detection, the decomposed components are recomposed with time_recompose() and plotted with plot_anomalies(). The package itself automatically takes care of a lot of parameter settings such as index, frequency and trend, making it easier to run anomaly detection out of the box with less prior expertise in the domain.

    btc_ts %>%
      time_decompose(Price) %>%
      anomalize(remainder) %>%
      time_recompose() %>%
      plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

The resulting plot shows how accurately the anomaly detection picks out the Bitcoin price madness that happened around early 2018.

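
To illustrate what recomposition means, here is a small base R sketch. It assumes that the band around the observed values is formed by adding the remainder limits back onto the seasonal and trend components. The column names mirror those in anomalize's output, but the numbers and the exact way the band is built here are illustrative assumptions, not a statement of how time_recompose() is implemented.

```r
# Toy decomposed data (all values are made up).
decomposed <- data.frame(
  observed     = c(10, 12, 30, 11),
  season       = c(1, -1, 1, -1),
  trend        = c(10, 11, 12, 13),
  remainder_l1 = -3,   # lower limit for the remainder
  remainder_l2 = 3     # upper limit for the remainder
)

# The remainder is what is left after removing season and trend.
decomposed$remainder <- with(decomposed, observed - season - trend)

# Recompose: shift the remainder limits back onto the scale of the observed values.
decomposed$recomposed_l1 <- with(decomposed, season + trend + remainder_l1)
decomposed$recomposed_l2 <- with(decomposed, season + trend + remainder_l2)

# A point is anomalous when it falls outside the recomposed band.
decomposed$anomaly <- with(
  decomposed,
  ifelse(observed < recomposed_l1 | observed > recomposed_l2, "Yes", "No")
)

decomposed
```

Only the third observation lies outside its band, so it is the only row marked "Yes". Expressing the limits on the scale of the observed values is what allows a plot to draw the band directly around the price series.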
If you are interested in extracting the actual data points that are anomalies, the following code can be used:

    btc_ts %>%
      time_decompose(Price) %>%
      anomalize(remainder) %>%
      time_recompose() %>%
      filter(anomaly == 'Yes')

    Converting from tbl_df to tbl_time.
    Auto-index message: index = date
    frequency = 7 days
    trend = 90.5 days
    # A time tibble: 58 x 10
    # Index: date
       date       observed season trend remainder remainder_l1 remainder_l2 anomaly recomposed_l1
     1 2017-11-12    5857.
     2 2017-12-04   11617.
     3 2017-12-05   11696.
     4 2017-12-06   13709.
     5 2017-12-07   16858.
     6 2017-12-08   16057.
     7 2017-12-09   14913.
     8 2017-12-10   15037.
     9 2017-12-11   16700.
    10 2017-12-12   17178.
    # ... with 48 more rows, and 1 more variable: recomposed_l2

Thus, anomalize makes it easier to perform anomaly detection in R, with cleaner code that can also be used in any data pipeline built with the tidyverse.
The code used here is available on my GitHub.
If you would like to know more about time series forecasting in R, check out Professor Rob Hyndman's course on DataCamp.