Time Series Analysis & Climate Change

An introductory, hands-on guide to time series analysis and forecasting; investigating climate data using Python, Pandas, and Facebook’s Prophet library

Peter Turner
Jun 3

Contents: What to Expect in this Post

[Figure: Contents of this Blog Post]

Why time series?

All of life’s scenes are placed in the foreground of time; take her away and there isn’t a picture left that we can comprehend.
Understanding time itself is not a pursuit for the faint-hearted (see here and here), and we as humans are pretty much stuck comprehending time as a linear concept.
Time series analysis is useful for two major reasons:

- It allows us to understand and compare things without losing the important, shared background of ‘time’
- It allows us to make forecasts

‘Make-up’ of a time series

A time series is a set of repeated measurements of the same phenomenon, taken sequentially over time. It is thus an interesting variation of data type: it encapsulates this background of time, as well as whatever else is being measured.
Time is (usually) the independent variable in a time series, whilst the dependent variable is the ‘other thing’.
It is useful to think of a time series as being made up of different components — this is known as decomposition modeling, and the resulting models can be additive or multiplicative in nature.
The four main components are:

- Trend
- Seasonality
- Cyclicity
- Irregularity

[Figure: The four main components of a time series]

Trend

Persistent over a relatively long period of time, the trend is the overall increase or decrease of the series during that time.
See in the picture above how the series exhibits an upwards trend (shown with the two straight green lines).
Seasonality

Seasonality is the presence of variations that occur at specific regular intervals; it is the component of the series that experiences regular and predictable changes over a fixed period.
Seasonality is illustrated in the picture above — notice the six identical ‘up-down’ fluctuations seen at regular intervals of x minutes.
The peaks and troughs could also be illustrative of a seasonal component.
Cyclicity

Cyclicity refers to variation caused by circumstances that repeat at irregular intervals.
Seasonal behavior is very strictly regular, meaning there is a precise amount of time between the peaks and troughs of the data; cyclical behavior, on the other hand, can drift over time because the time between periods isn’t precise.
For example, the stock market tends to cycle between periods of high and low values, but there is no set amount of time between those fluctuations.
Cyclicity is illustrated in the picture above; in this case, the cyclicity seems to be due to specific events taking place before each occurrence.
Irregularity

Irregularity is the unpredictable component of a time series: the ‘randomness’.
This component cannot be explained by any other component and includes variations which occur due to unpredictable factors that do not repeat in set patterns.
In the picture above, the magnifying glass illustrates this rough, random, and irregular component.
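To make the idea of components concrete, here is a minimal sketch that builds a synthetic monthly series from a trend, a seasonal term and random noise, combined once additively and once multiplicatively. All of the numbers (slope, amplitude, noise level) are made up for illustration, and cyclicity is left out for brevity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Ten years of month-end timestamps (arbitrary dates, for illustration only)
index = pd.date_range("2000-01-31", periods=120, freq=pd.offsets.MonthEnd())
steps = np.arange(len(index))

trend = 0.05 * steps                           # slow, persistent increase
seasonality = np.sin(2 * np.pi * steps / 12)   # repeats exactly every 12 months
irregularity = rng.normal(scale=0.2, size=len(index))  # unpredictable noise

# Additive model: the components are summed
additive = pd.Series(trend + seasonality + irregularity, index=index)

# Multiplicative model: the components scale one another
multiplicative = pd.Series(
    (1 + trend) * (1 + 0.3 * seasonality) * (1 + 0.1 * irregularity),
    index=index,
)
```

In the additive series the seasonal swing keeps a constant size, whereas in the multiplicative one it grows as the trend grows.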
Datasets

Pulling the data

This tutorial assumes that you are familiar with Jupyter notebooks and have at least some experience with Pandas.
In this section, we will start getting our feet wet by getting hold of some climate data, and pulling it into our Jupyter Notebook with Pandas.
We will be using two datasets:

- An estimate of global surface temperature change, from NASA (download here)
- An estimate of CO₂ emissions, in metric tons per capita, from the World Bank (download here)

First, download the datasets in CSV format (using the links above) and read them in using Pandas:

[Code: Using Pandas to read in climate data in CSV format]

Notice that in these cases, when reading in the datasets, we have to skip several rows in order to get what we want; this is due to how the datasets are structured.
Note that the link for the CO₂ emissions dataset given above downloads a compressed folder that contains 3 files; we are using the file ‘API_EN….csv’ in this tutorial.
I have moved both raw CSVs into their own folder named ‘data’.
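As a rough sketch of the row-skipping described above, the snippet below reads a tiny in-memory stand-in for a downloaded file. The sample text, the column names and the number of rows to skip are assumptions for illustration; inspect the real files to find the right `skiprows` value for each.

```python
import io

import pandas as pd

# Stand-in for a downloaded CSV: one descriptive line, then the real header.
# (Hypothetical content; the real files are larger and may differ in layout.)
sample_csv = io.StringIO(
    "Land-Ocean: Global Means\n"
    "Year,Jan,Feb,Mar\n"
    "2018,0.78,0.84,0.89\n"
    "2019,0.87,0.92,***\n"
)

# skiprows=1 drops the descriptive line so that the second line becomes the header
raw_t = pd.read_csv(sample_csv, skiprows=1)
print(raw_t)
```

With the real data, the `io.StringIO` object would be replaced by a path into the ‘data’ folder.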
Let's have a look at our newly acquired datasets:

[Figure: GISTEMP temperature anomaly data from NASA]

[Figure: CO₂ emissions data, in metric tons per capita, from the World Bank]

Background

The temperature data represents temperature anomalies (differences from the mean/expected value) per month and per season (DJF=Dec-Feb, MAM=Mar-May, etc).
We will not be working with absolute temperature data because, in climate change studies, anomalies are more important than absolute temperatures.
A positive anomaly indicates that the observed temperature was warmer than the baseline, while a negative anomaly indicates that it was cooler than the baseline.
The CO₂ dataset gives us the average CO₂ emissions (in metric tons) per person.
The dataset is divided up by countries and other categories such as ‘World’ or ‘Upper middle income’; in this tutorial we are only interested in ‘World’, as we are looking at things on a global scale.
For more information on both datasets, see their respective download pages here and here.
Now that we have the data, let’s wrangle it a bit so as to make it easier to work with.
Wrangling

“Data wrangling is the process of transforming and mapping data from one “raw” data form into another format, so as to make it more valuable for downstream processes such as analytics”

Wrangling temperature data

We will first wrangle NASA’s temperature anomaly data.
In doing so, we will look at several things:

- Using a DateTime index
- Basic manipulation and dealing with missing values
- Resampling to a different frequency

Using a DateTime index:

Using a DateTime index can allow us to be more productive in our time series manipulation; it will allow us to make selections and take slices using timestamps or time ranges.
For the temperature data, we will create an empty DataFrame with a DateTime index of monthly frequency — we will then use our raw data in order to populate this new DataFrame.
The empty DataFrame will range from 1880 to March 2019.
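A minimal sketch of such an empty, month-indexed DataFrame is shown below. The variable name `t` and the column name `Avg_Anomaly_deg_C` are assumptions chosen for illustration, and month-end timestamps are used because the post later mentions the last day of each month.

```python
import pandas as pd

# One month-end timestamp per month, from January 1880 to March 2019
date_index = pd.date_range(
    start="1880-01-31",
    end="2019-03-31",
    freq=pd.offsets.MonthEnd(),
)

# Empty frame: the index is in place, the values are all missing for now
t = pd.DataFrame(index=date_index, columns=["Avg_Anomaly_deg_C"])
print(t.head())
```

Passing the offset object rather than a frequency string sidesteps the alias renames in recent pandas releases (for example ‘M’ becoming ‘ME’).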
[Code: Creating a new Pandas DataFrame to hold the temperature anomaly data using a DateTime index]

[Figure: Resulting empty DataFrame, with DateTime index]

Okay, so we have our DataFrame; now we need to populate it.
The idea is that each row will represent the anomaly for that month. We could therefore have used a discrete index of ‘month’ (or even ‘year’ with the average of the months’ anomalies).
I chose to do it this way because this is a tutorial on time series analysis and I thought it’d be a more useful exercise.
Basic manipulation and dealing with missing values

In order to populate our DataFrame, we will basically be stepping through a slice of the raw data that we are interested in (i.e. the year and month columns), and assigning the corresponding anomaly values to our new DataFrame.
Let's first select only the data that we want; to do this we make use of Pandas’ selection functionality.
We only want the year column, as well as the month columns, and will leave out the season columns.

raw_t = raw_t.… head()

[Figure: Slice of raw temperature data]

We will now make use of Pandas’ apply function to ‘step through’ the rows of our raw data (axis=1 for rows, 0 for columns).
We will also make use of a couple of additional libraries, namely:

- datetime
- calendar

Python’s datetime library will be useful as it can help us parse dates and times of various formats (see here).
The calendar library is used to get the last day of each month.
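The selection and population steps described above could look roughly like this. It is a sketch on a tiny made-up sample rather than the original notebook code: the column names (`Year`, `Jan`, ...), the frame names `raw_t` and `t`, and the helper `populate_row` are all assumptions.

```python
import calendar
from datetime import datetime

import pandas as pd

# Tiny made-up stand-in for the raw data: a year column, three month columns
# and one season column that we do not want.
raw_t = pd.DataFrame({
    "Year": [2018, 2019],
    "Jan": [0.78, "0.87"],
    "Feb": [0.84, "0.92"],
    "Mar": [0.89, "***"],
    "DJF": [0.80, "0.90"],
})
month_cols = ["Jan", "Feb", "Mar"]

# Keep only the year and month columns
raw_t = raw_t[["Year"] + month_cols]

# Empty target frame with a month-end DateTime index
t = pd.DataFrame(
    index=pd.date_range("2018-01-31", "2019-03-31", freq=pd.offsets.MonthEnd()),
    columns=["Avg_Anomaly_deg_C"],
)

def populate_row(row):
    """Copy one raw row (one year) into the month-indexed frame."""
    year = int(row["Year"])
    for month, name in enumerate(month_cols, start=1):
        # calendar.monthrange returns (weekday of first day, number of days)
        last_day = calendar.monthrange(year, month)[1]
        t.loc[datetime(year, month, last_day), "Avg_Anomaly_deg_C"] = row[name]

# axis=1 calls populate_row once per row
raw_t.apply(populate_row, axis=1)
```

Months that have no value in the sample (April to December 2018 here) simply stay missing.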
[Code: Using NASA data to populate our DataFrame using DateTime index]

[Figure: Now populated DataFrame]

You may have noticed that the anomaly values seem to be a bit messy: they are a mixture of strings and floats, with a few unusable ‘***’ values mixed in (2019).
Let's clean them up.
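One possible way to do this clean-up, sketched on a small made-up series: coerce everything to numbers so that unusable entries become NaN, then forward fill. The sample values are illustrative only.

```python
import pandas as pd

# Mixed floats, numeric strings and an unusable '***' marker
messy = pd.Series([0.78, "0.87", "0.92", "***"])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(messy, errors="coerce")

# Forward fill: each NaN takes the last valid value before it
clean = numeric.ffill()
print(clean)
```

Forward filling is a simple choice; it assumes the most recent observation is a reasonable stand-in for the missing one.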
[Code: Cleaning up anomaly values, and dealing with NaNs using Pandas’ ‘Forward Fill’]

[Figure: Final temperature DataFrame, after wrangling]

Great! We have successfully wrangled our NASA temperature anomaly data into a nice, usable form.
We will cover the plotting of our data in more detail later, but in the meantime you can render a simple plot within your notebook using Matplotlib:

[Code: Code to plot our temperature anomaly data using Matplotlib]

[Figure: Resulting plot of temperature anomaly data using Matplotlib]

Resampling to a different frequency:

Now this is all well and good, but the above plot looks a bit messy; it seems that over such a long time period the data is too granular to visualize nicely.
Pandas offers a very convenient function known as ‘resample’, which can change our frequency from months to years (this will also be helpful when comparing it with the CO₂ data later on).
Let’s downsample our temperature data into years; the string ‘A’ represents ‘calendar year-end’.
For all of the frequency strings, see here.
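A minimal sketch of downsampling monthly values to yearly means is given below, on made-up numbers. It passes a `YearEnd` offset object instead of the ‘A’ string because pandas 2.2 deprecated that alias in favour of ‘YE’; the offset object should behave the same on older and newer versions.

```python
import numpy as np
import pandas as pd

# Two years of made-up monthly values: 0, 1, ..., 23
index = pd.date_range("2017-01-31", periods=24, freq=pd.offsets.MonthEnd())
monthly = pd.DataFrame({"Avg_Anomaly_deg_C": np.arange(24, dtype=float)}, index=index)

# Group the months by calendar year and average them;
# each result row is stamped with the last day of its year
yearly = monthly.resample(pd.offsets.YearEnd()).mean()
print(yearly)
```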
… head()

[Figure: Resulting DataFrame after resampling to yearly frequency (end-of-year)]

The resulting plot is much cleaner:

[Code: Code to plot resampled temperature anomaly data]

[Figure: Resulting plot of resampled temperature anomaly data]

Alright, I think we have wrangled our temperature data to a state in which we will be able to use it productively. Let's move on to the CO₂ emissions data.
Wrangling of CO₂ emissions data

This section will tackle the wrangling of our carbon dioxide emissions data.
We will use some of the same techniques used above, as well as looking at some new ones:

- Slicing and Searching
- Useful functions

Familiar techniques

From our DataFrame, we will use only the row representing the CO₂ emissions for the entire world.
Like before, we will create a new DataFrame that uses a DateTime index — and then use the raw data to populate it.
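As a sketch of this step, the snippet below picks out the ‘World’ row from a miniature, made-up table laid out with one column per year, and turns it into a DataFrame with a year-end DateTime index. The column names, the placeholder values and the names `raw_e` and `e` are assumptions for illustration.

```python
import pandas as pd

# Made-up miniature of the raw table: one row per country or group,
# one column per year. The numbers are placeholders, not real emissions.
raw_e = pd.DataFrame({
    "Country Name": ["Aruba", "World"],
    "Country Code": ["ABW", "WLD"],
    "2012": [10.0, 4.0],
    "2013": [10.1, 4.1],
    "2014": [10.2, 4.2],
    "2015": [None, None],
})

# Select the single row that describes the whole world
world = raw_e.loc[raw_e["Country Name"] == "World"].iloc[0]

# The year columns are the ones whose names are all digits
year_cols = [c for c in raw_e.columns if c.isdigit()]

# New frame indexed by the last day of each year
e = pd.DataFrame(
    {"CO2_per_capita": pd.to_numeric(world[year_cols], errors="coerce").to_numpy()},
    index=pd.to_datetime([f"{year}-12-31" for year in year_cols]),
)
print(e)
```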
[Code: Creating a DataFrame, and populating it, with world emissions data]

[Figure: Resulting emissions DataFrame]

Slicing and Searching

DateTime indexes make for convenient slicing of data. Let’s select all of our data after the year 2011:

e[e.…year>2011]

[Figure: Slice of emissions data after the year 2011 (notice the missing data)]

Hmm.
There seem to be a few NaNs towards the end of our data; let's use Pandas’ fillna method to deal with this.

…year>2011]

[Figure: Slice of emissions data after the year 2011 (no missing data)]

Much better! We can also make use of the DateTime index to search for values within a specific range:

e['1984-01-04':'1990-01-06']

[Figure: Resulting slice of emissions data within the specified range]

This functionality starts to become very useful with more granular time-based data. In our case we have years, and so a range index would probably have been sufficient.
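The slicing and filling steps from this section can be sketched as follows on a small made-up frame. Filtering on `e.index.year` and filling forward with `ffill` are assumptions about how one might do it; the values are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up yearly values with missing data at the end
e = pd.DataFrame(
    {"CO2_per_capita": [4.0, 4.1, 4.2, np.nan, np.nan]},
    index=pd.to_datetime(
        ["2011-12-31", "2012-12-31", "2013-12-31", "2014-12-31", "2015-12-31"]
    ),
)

# Boolean mask built from the year of each timestamp in the index
after_2011 = e[e.index.year > 2011]

# Fill the trailing NaNs with the last known value
e_filled = e.ffill()

# Label-based slice: every row whose timestamp falls inside the range
window = e_filled.loc["2012-01-04":"2014-01-06"]
print(window)
```

The range endpoints do not need to exist in the index; pandas returns whatever rows fall between them.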
Useful functions

Pandas provides a whole range of other functions that can be very useful when dealing with time series data. We cannot cover them all in this tutorial, but some are listed below:

- DataFrame.rolling → provides rolling window calculations
- pandas.to_datetime → a replacement for datetime.datetime’s strptime function; it is more useful as it can infer the format
- Series.shift & Series.tshift → allow for shifting or lagging of the values of a time series backward and forward in time (tshift was deprecated in pandas 1.1 and removed in pandas 2.0; shift with a freq argument covers the same use)

For more information and functionality, see this great Pandas page on time series.
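Here is a short sketch of the functions listed above, applied to a made-up five-point yearly series. The window size and the example date string are arbitrary choices.

```python
import pandas as pd

index = pd.date_range("2015-12-31", periods=5, freq=pd.offsets.YearEnd())
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index)

# Rolling window: mean of the current value and the two before it
rolling_mean = s.rolling(window=3).mean()

# shift moves the values one step later and leaves the index alone
lagged = s.shift(1)

# shift with freq moves the index instead (the job tshift used to do)
moved = s.shift(1, freq=pd.offsets.YearEnd())

# to_datetime infers the format of a date string
parsed = pd.to_datetime("31 March 2019")
```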
Visualizing

Now that we have our datasets nicely wrangled, let’s look at how to plot them.
We will be using two plotting libraries, namely:

- Matplotlib
- Plotly

Plotting with Matplotlib

Matplotlib is a very popular 2D plotting library for Python, and can easily be installed using pip.
Let's plot our temperature data again using Matplotlib; this time we will do it more fancily, adding axis labels, titles, and so on:

[Code: Code to plot temperature anomalies using Matplotlib]

[Figure: Resulting temperature plot using Matplotlib]

And our CO₂ emissions data:

[Code: Code to plot emissions using Matplotlib]

[Figure: Resulting emissions plot using Matplotlib]

Plotting with Plotly

Plotly is a great library for generating plots that are both interactive and suitable for the web.
The Plotly Python package is an open-source library built on plotly.js, which is in turn built on d3.js.
In this tutorial, we will be using a wrapper called cufflinks, which makes it easy to use Plotly with Pandas DataFrames.

[Code: Importing Plotly and Cufflinks correctly for offline mode]

Now that we have the library correctly imported, let’s plot both datasets again, this time using Plotly and Cufflinks:

[Code: Plotting temperature data using Plotly]

[Code: Plotting emissions data using Plotly]

The resulting plots look much nicer, and are interactive:

[Figure: Resulting temperature plot using Plotly]

[Figure: Resulting emissions plot using Plotly]

Time Series Correlation

Although it seems relatively obvious that both series are trending upwards, what we’d actually like to do here is determine whether the temperature change is a result of the CO₂ emissions.
Granger Causality

Now, proving causation is actually very difficult, and just because two things are correlated does not mean that one causes the other (as any statistician will earnestly tell you!).
What we will do instead is determine how helpful the CO₂ emissions data is in forecasting the temperature data; to do this we will use the Granger Causality test.
Do not be fooled by the name: Granger Causality does not test for true causality.
It would actually be more apt to call it Granger Predictability, or something along those lines.
Anyway — what this test does is determine whether one time series will be useful in forecasting another.
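To give a feel for the idea, here is a bare-bones sketch of the F-statistic behind a Granger-style test, written with NumPy only. It compares a regression of y on its own past with one that also includes the past of x. The function names and the synthetic data are made up for illustration; for real work a tested implementation such as the one in statsmodels would be the safer choice, and both series should be made stationary first.

```python
import numpy as np

def lag_matrix(series, lags):
    """Columns hold series[t-1], ..., series[t-lags] for t = lags .. n-1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])

def granger_f_stat(y, x, lags=1):
    """F-statistic for 'past values of x help to predict y'."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[lags:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lag_matrix(y, lags)])         # y's own past only
    unrestricted = np.hstack([restricted, lag_matrix(x, lags)])  # plus x's past

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)

# Synthetic example in which y is driven by the previous value of x
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.empty(300)
y[0] = 0.0
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=299)

f_x_helps_y = granger_f_stat(y, x, lags=1)
f_y_helps_x = granger_f_stat(x, y, lags=1)
print(f_x_helps_y, f_y_helps_x)
```

A large F-statistic in one direction and a small one in the other suggests that x is useful for forecasting y but not the reverse, which is all the test claims.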
Dynamic Time Warping

We humans have developed a number of clever techniques to help us measure the similarity between time series; Dynamic Time Warping (DTW) is one such technique.
What DTW is particularly good at is measuring similarities between series that ‘differ in speed’.
For example, and to quote Wikipedia:

“Similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation.”

As this blog post was running awfully long, I have decided to separate out both the Granger Causality and the Dynamic Time Warping material into a separate post.
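For the curious, the classic dynamic-programming form of DTW fits in a few lines. The sketch below is a plain, unoptimised version using absolute difference as the point-to-point cost; the function name and the toy series are made up for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Cost of the cheapest alignment between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i, j] = best cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b[j-1], repeat a[i-1], advance both
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

walk = [1, 2, 3]
slow_walk = [1, 1, 2, 2, 3, 3]   # same shape, half the speed
other = [2, 3, 4]

same_shape = dtw_distance(walk, slow_walk)
different = dtw_distance(walk, other)
print(same_shape, different)
```

The slowed-down copy aligns perfectly with the original, so its distance is zero even though the two sequences have different lengths.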
Modeling and Forecasting

Okay, so we can pull, wrangle and visualize our data, and we can do Granger Causality tests to determine whether one time series can be used to predict another; but what about forecasting?

Forecasting is fun because it allows us to take a stab at predicting the future.
In this section, we will look at forecasting using Facebook’s Prophet library.
We will also briefly look at ARIMA models — though so as to keep this blog post from getting unmanageably long, we will not go through ARIMA in too much detail (at least not in this post).
Forecasting using Facebook’s Prophet

Our blue overlords, Facebook, have released an extremely powerful and user-friendly library named ‘Prophet’.
Prophet makes it possible for those with little-to-no experience to predict time series, whilst providing intuitive parameters that are simple to tune.
The library works in a similar way to the models in sklearn: an instance of Prophet is created, and then the fit and predict methods are called.
This may come as a breath of fresh air to you machine learning enthusiasts out there; it certainly did to me.
Creating, fitting and plotting a model for Temperature

We will first import Prophet, and then create a separate DataFrame into which we will copy the data across in the correct format. Prophet takes a DataFrame with two columns, one for the date and one for the values.
The date column must be called ‘ds’ whilst the value column must be called ‘y’.
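A minimal sketch of getting a DateTime-indexed frame into that two-column shape is shown below. The frame `t`, its column name and its values are placeholders. The Prophet calls are left as comments so that the snippet runs without the library installed; the frequency string in them is an assumption that depends on your pandas version.

```python
import pandas as pd

# Made-up yearly anomalies standing in for the wrangled temperature data
t = pd.DataFrame(
    {"Avg_Anomaly_deg_C": [0.1, 0.2, 0.3]},
    index=pd.to_datetime(["2016-12-31", "2017-12-31", "2018-12-31"]),
)

# Prophet expects the dates in a column named 'ds' and the values in 'y'
t_prophet = pd.DataFrame({
    "ds": t.index,
    "y": t["Avg_Anomaly_deg_C"].to_numpy(),
})
print(t_prophet)

# With Prophet installed, fitting and forecasting would then look roughly like:
# from prophet import Prophet            # older releases: from fbprophet import Prophet
# m = Prophet()
# m.fit(t_prophet)
# future = m.make_future_dataframe(periods=100, freq="YE")  # "Y" or "A" on older pandas
# forecast = m.predict(future)
# m.plot(forecast)
```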
You could do this by modifying your original DataFrame, but I opted to just create a new one:

[Code: Python code to train a temperature anomaly model using Prophet]

[Figure: Resulting forecast for temperature anomalies]

And there we have it, a prediction for global temperature anomalies over the next 100 years! Notice the light blue region, which widens as we move further into the future? That is the forecast’s uncertainty, which grows as we proceed further forward in time.
Remember that this forecast only takes into account past anomaly data and nothing else.
In reality, this could prove problematic as the heat retained may actually increase exponentially with increased CO₂ in the atmosphere.
The picture below shows NASA’s current global surface temperature forecast for different levels of emissions into the future (hey, we didn’t do too badly!).
[Figure: NASA’s forecast]

Decomposition

As was stated at the beginning of this article, it can be useful to think of a time series as being made up of several components; and luckily for us, Prophet can help us break up our model into these components so that we can see them visually.
We have already instantiated our model above; to split the model up into its components, run the following code:

# Plot the forecast components
m.plot_components(forecast);

By default, you’ll see the trend and yearly seasonality of the time series.
If you include holidays, you’ll see those here, too.
(With more granular data, the weekly seasonality will be shown too.)
What can you tell by looking at the components? Leave your comments below.
Forecasting using ARIMA

Autoregressive Integrated Moving Average, or ARIMA, is a forecasting technique which is able to project future values of a series.
ARIMA is part of a broader class of time series models, all of which are very useful in that they provide a means through which we can use linear regression type models on non-stationary data.
Stationarity basically means that your data is not evolving with time (see explanation in the next section).
Linear models require stationarity; they are good at dealing with and predicting stationary data.
So the basic intuition is that we’d like to achieve a stationary time series that we can do linear regression on, and ARIMA is just linear regression with some terms which ‘force’ your time series to be stationary.
As this blog post is getting rather long — I have decided to leave Autoregressive Integrated Moving Average Modeling for another day, and another post.
For now, see this post for a great introduction to the various forecasting methods (including ARIMA).
General Tips, Terms, and Common Pitfalls

Terms

Autocorrelation

Autocorrelation is an important concept to understand when doing time series analysis; the term refers to (and is a mathematical representation of) the degree of similarity between a given time series and a lagged version of itself over successive time intervals.
Think of autocorrelation as the correlation of a time series with itself; it is thus sometimes referred to as lagged correlation (or serial correlation).
If you are interested in doing ARIMA modeling (see above), an understanding of autocorrelation is doubly important.
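As a small illustration of lagged correlation, the sketch below uses pandas’ `Series.autocorr` on a made-up, perfectly seasonal series with a period of 12 steps.

```python
import numpy as np
import pandas as pd

# A noiseless wave that repeats every 12 steps (illustrative only)
steps = np.arange(120)
seasonal = pd.Series(np.sin(2 * np.pi * steps / 12))

# Correlation of the series with itself shifted by a full period...
lag_12 = seasonal.autocorr(lag=12)

# ...and by half a period, where peaks line up with troughs
lag_6 = seasonal.autocorr(lag=6)
print(lag_12, lag_6)
```

A strong positive value at the seasonal lag and a strong negative value at half that lag is the typical signature of a seasonal component.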
Spurious Correlation

Spurious correlations are actually not-altogether uncommon phenomena in statistics; a spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related.
This can be due to either coincidence or the presence of a third, unseen factor (sometimes called a “common response variable”, “confounding factor”, or “lurking variable”).
Stationarity

A stationary time series is one in which several statistical properties (namely the mean, variance, and covariance) do not vary with time.
This means that, although the values can change with time, the way the series itself changes with time does not change over time.
[Figure: Non-Stationary Time Series]

For more on this, check here and here.
We will not dive too deep into stationarity here, but we do go over how we can test for stationarity, and how we can make our two series stationary (for the purpose of the Granger Causality test), in this post.
Tips

Correlation is not Causation

What has come to be a basic mantra in the world of statistics is that correlation does not equal causation.
This means that just because two things appear to be related to one another does not mean that one causes the other.
This is a worthwhile lesson to learn early on.
[Figure: Correlation does not have to equal causation (original)]

Beware of trend

Trends occur in many time series, and before embarking on an exploration of the relationship between two different time series, you should first attempt to measure and control for this trend.
In doing so, you will lessen the chance of encountering spurious correlations.
But even de-trending a time series cannot protect you from all spurious correlations; patterns such as seasonality, periodicity and autocorrelation can produce them too.

Be aware of how you deal with a trend

It is possible to de-trend naively.
Attempting to achieve stationarity using (for example) a first differences approach may spoil your data if you are looking for lagged effects.
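The effect of a shared trend, and of removing it with first differences, can be sketched with two made-up series that have nothing in common except that both drift upwards. The slopes, noise and seed are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
steps = np.arange(200)

# Two unrelated series: different slopes, independent noise
a = pd.Series(0.5 * steps + rng.normal(size=200))
b = pd.Series(0.3 * steps + rng.normal(size=200))

# Correlation of the raw levels is dominated by the shared upward trend
level_corr = a.corr(b)

# First differences remove the linear trend, leaving only the noise
diff_corr = a.diff().corr(b.diff())
print(level_corr, diff_corr)
```

The raw series look strongly related while their differences do not. As the tip above warns, differencing also discards information, so it can hide genuine lagged relationships as well as spurious ones.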
Thanks for reading!

I hope that this post has been somewhat enlightening to those who are getting started with time series analysis. I really enjoyed writing it, and learned loads doing so.
All of the source code can be found on Github.
Please post feedback or questions below.