Classical Time-Series vs Machine Learning MethodsA Tutorial Comparing Methods of Forecasting in PythonFergus O'BoyleBlockedUnblockFollowFollowingJul 9Photo by Parij Borgohain on UnsplashIn an earlier article, I discussed the use of supervised machine learning to predict a poverty metric using 1000+ World Development Indicators (WDIs).

In this article, I will use the same dataset to compare the lasso with a classical time-series method (exponential smoothing) and some other machine learning methods such as random forest and XGBoost).

In the end, XGBoost was, by far, the best performer.

XGBoost can handle sparse input data and interestingly, the best score by a large margin was obtained when no attempt to impute missing values in the input data was made.

More on all of this later in the article.

The problem, as discussed in the previous article, had a number of challenges:Over 60% of the WDI data was missing.

The number of features far outnumbered the number of observations.

The WDIs are measured on a yearly basis and most have a history of between only 20 and 40 years.

Discontinuous historical records of our the poverty metric that we are trying to predict (see the first plot below)In the previous article, a number of linear regression methods were compared with a naive forecaster.

Lasso regression performed the best of the lot.

I speculated that the lasso performed so well because of its tendency to produce a model with much less variance compared to an Ordinary Least Squares (OLS) model.

Secondly, compared to ridge regression, according to An Introduction to Statistical Learning:In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero.

Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size.

Classical time-series methods vs machine learning methodsStandard textbooks on time-series such as Forecasting: Principles and Practices discuss many methods for forecasting such as exponential smoothing, ARIMA and vector autoregression (VAR).

Although these methods are dominant in time-series (see this post from economist Francis X.

Diebold), Machine Learning is starting to emerge as an alternative method in time-series, finding utility, especially in multivariate time-series.

Here is a paper discussing machine learning strategies for time-series and here is an interesting paper on the success of random forest models on predicting avian influenza outbreaks.

In this walkthrough, we will compare results obtained from exponential smoothing and some machine learning methods such as linear models, random forest, bagging, and XGBoost.

Introducing the dataIn 2014 the UN called for a data revolution to put the best available tools and methods to work in service of achieving the Sustainable Development Goals (SDGs).

Here, we attempt to use machine learning to forecast one of the indicators for SDG 1, it’s objective being to “end poverty in all it’s forms everywhere”.

Machine learning is starting to find a place in understanding societal development, helping to uncover complex relations between different economic indicators and between other disparate data sources.

For some interesting research in this area, check out this paper.

The World Bank is also pursuing research on these topics:“Collecting household survey data on poverty is expensive and time-consuming, which means that policymakers are often making decisions based on old data…machine learning could drastically change the game, making poverty measurement cheaper and much closer to real-time.

” — Asli Demirguc-Kunt, Director of Research at the World BankIn this project, we will try to forecast one particular WDI, “Poverty headcount ratio at $1.

90 a day (2011 PPP) (% of population)”.

This is one of the principal indicators used to measure progress towards meeting the first Sustainable Development Goal.

We will forecast the indicator for the year 2010 with machine learning techniques using data up to and including the year 2009 from all indicators (including the target indicator).

Choosing a target year (i.

e.

2010) for which we already have data, allows us to measure our predictions against the actual reported values and measure the success (or failure) of our algorithms.

This will tell us how well the forecaster will perform in the future on unseen data.

The next graph shows the target values up to the year, 2010.

Where there are continuous values for a country a line is drawn, otherwise, the value is represented by a mark.

A number of countries have high values of SI.

POV.

DDAY in 2010, without any value for the indicator for a number of years beforehand.

Due to the discontinuity in the data for these countries, they will likely be difficult to predict accurately.

The graph is interactive, so you can display the source country of any data by hovering your cursor over the point.

For more details on the data wrangling (using Python and Pandas) needed to parse and clean this dataset see my previous blog.

In addition, missing values are imputed using Pandas very own linear interpolation.

Consideration given to the imputation process is discussed in my previous blog as well.

Note that for some implementations of tree-based learning algorithms, missing values are accommodated for.

Some interesting experimentation with this is detailed below.

What we end up with after all of the above data wrangling is something that looks like this:Windowed data for 2 indicators, for 2 countries if we selected a sliding window length of 3 and if we go back 4 yearsFor training a machine learning algorithm with time-series data, we use a sliding window to create observations.

The above table shows how this might look like for 2 indicators, for 2 countries if we selected a sliding window length of 3 and if we go back 4 years.

Again, if this doesn’t make sense have a look at my previous article on this.

In the table above, I have used the year as shorthand for the value for a given indicator and a given country in that year.

Each row comprises of a set of features and a target variable value that can be used as input to any supervised machine learning algorithm.

For the classical time-series methods used in this comparison, we stick to univariate time-series data.

Therefore, rather than use the windowed data above, we simply use the historical values of the target variable, SI.

POV.

DDAY.

The number of observations that we create for each country is a tunable parameter, but it is a safe bet that our model will improve the more data that we have for training.

I settled on 10 observations per country.

Considering we have 83 countries in the dataset and that the most recent observation for each country is reserved for a final test, this gives 747 observations.

However, we discard observations that have no value for the target for that year.

This leaves us with 453 observations for training, n.

The number of features that we create for each indicator is also tunable.

Let’s consider the case where we have chosen to create 3 features per indicator.

There are 1594 indicators in the WDI dataset.

After pruning the indicators as per the Reintroducing the data section above we are left with 820 indicators.

This gives us 2460 features in total, p.

As you can see, p >> n.

We can consider this data to be high dimensional.

This makes choosing a machine learning model and method of evaluation a bit more difficult as will be discussed below.

Predictive ModellingA number of predictors are considered and compared in this section.

We start off by looking at a naive predictor to set a baseline for our predictive models to beat.

Naive forecasterOur naive forecaster uses the last value of SI.

POV.

DDAY for a given country as a prediction of the target for that country.

Exponential smoothingAn excellent introduction to exponential smoothing can be found in Chapter 7 of Forecasting: Principles and Practices.

Here I implement Holt’s linear trend method using the python library, statsmodels, as I consider it the best match for the data having a trend but no seasonal component.

This is an extension of the simple exponential smoothing method to include a separate term in the prediction for the trend.

For working with classical time-series techniques in python using pandas, numpy and statsmodels, I found Jose Marcial Portilla’s course on Udemy, Python for Time Series Data Analysis to be an excellent resource.

Note that for the exponential smoothing method a separate model is created for every country.

This is in contrast to the machine learning methods that follow, where I built a single model, trained from data from all countries, that can be applied to any country thereafter.

The data wrangling process I used to prepare the data for exponential smoothing was different from that described above.

For exponential smoothing, it is far simpler, in that only historical values of the indicator that we are trying to predict are of interest (SI.

POV.

DDAY).

First, we convert the year index to a timestamp using pandas to_datetime() before passing the data to statsmodels.

The data is organised to look like this:Each column of the table contains the value of our poverty metric, SI.

POV.

DDAY, for a given country up to and including 2009.

Remember that the goal of our predictive algorithm is to forecast SI.

POV.

DDAY for the year 2010.

The following code was used to perform Holt’s linear trend exponential smoothing.

For the entire notebook refer to the source code.

Lasso RegressionLasso (and other linear methods) applied to this dataset were discussed in detail in a previous article.

Lasso regression performed the best.

I speculated that the lasso performed so well because of its tendency to produce a model with much less variance compared to an Ordinary Least Squares (OLS) model.

Secondly, compared to ridge regression, according to An Introduction to Statistical Learning:In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero.

Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size.

Random ForestI’m not going to present a background on tree-based machine learning methods here.

A great introduction to the Random Forest can be found in chapter 15 of The Elements of Statistical Learning or you can refer to countless online articles.

The expectation is that random forest will perform well with this high dimensional dataset as each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

This has the effect of decorrelating the resultant trees and therefore reducing the variance of the aggregated result.

With our high dimensional dataset, this is exactly what the doctor ordered.

XGBoostA good explanation of gradient boosting is presented here.

Gradient boosting is a method whereby trees are built one at a time, where each new tree helps to correct errors made by the previously built tree.

The risk of overfitting is greater than with random forests, therefore, tuning of parameters to help to make sure the model generalises well is important.

In my notebook, I go through some iterations of tuning of some of the parameters that make the model more conservative and prevent overfitting such as gamma, max_depth, min_child_weight, alpha L1 regularisation and lambda L2 regularisation.

One of the interesting features of the XGBoost implementation is that it is able to handle missing values in the input.

During the training process, a default direction (for missing values) is learned for every node in the system.

This is described in detail in the XGBoost paper.

As a result, I experimented with using XGBoost without any attempt to impute missing values like I had done for all the other predictive models.

This turned out to be, by far, the most successful model I ran (see the results below).

Evaluation of the modelsDue to the high dimensional nature of our data, any performance measure on our training set such as R², p-values, etc could be very misleading due to the high risk of overfitting.

Therefore, it is vital that we use some measure on a separate test set.

This could either be a score on a hold-back test set or the scores associated with the cross-validation test sets used in training.

In this project, I measure performance using hold-back test data.

Note though that cross-validation is also used during the training process to prevent over-fitting.

The performance metric we use for each fold of the cross-validation is the Root Mean Squared Error (RMSE).

It is also the performance metric we use to in the final test of performance using the hold-back test set.

The RMSE is chosen as it penalizes large prediction errors more compared with Mean Absolute Error (MAE).

ResultsThe following table of results shows the RMSE obtained from a number of different linear models compared with the naive model in predicting our target variable, SI.

POV.

DDAY for all countries in our target year, 2010 compared with the real values.

XGBoost without any attempt to impute data prior to training turned out to be by far the most successful model I tried.

I did some hand-tuning on parameters for XGBoost to obtain the RMSE score of 3.

11.

The parameters I used were:Parameters for XGBoost (no imputation)Interestingly, a key parameter for success with XGBoost was tuning of reg_lambda.

This is the L2 regularisation term, similar to what is used in ridge linear regression.

Therefore, our two best performing models, XGBoost and lasso both perform regularisation.

Lasso regression remains one of our best models.

According to An Introduction to Statistical Learning, lasso deals very well with the case where the majority of the predictors are unrelated to the target variable and this might explain why lasso is so effective.

It would be great to hear any opinions on why you think XGBoost performed so well on this high dimensional sparse dataset and why imputation using linear interpolation seems to be so harmful to performance?.