Intro to Time-Series Analysis | Towards AI

An Introduction to Time-series Analysis Using Python and Pandas
First steps on analyzing and stationarising time series
Oscar Arzamendia, Apr 12

Very recently I had the opportunity to work on building a sales forecaster as a POC.
It was a challenging project with a cool MVP as an outcome, and through this post, I will share part of my journey and findings on analyzing the data I was provided with.

Assumptions

I will assume you have previous knowledge of both Python and Pandas.

First things first…

This project started like every other data science project: by checking the data we had in hand.
I did this by importing the CSV file provided as data source.

import pandas as pd

path_to_csv = r'path\to\csvfile.csv'
data_df = pd.read_csv(path_to_csv)
data_df.head()

Once I had a clear idea of what the data looked like, I proceeded with the initial exploration and usual transformations.

import numpy as np

# Check for nulls:
data_df[data_df['cantidad'].isnull()]

# Check for other odd values:
data_df[data_df['cantidad'].isin([np.inf, -np.inf])]

# Convert column fecha to datetime
data_df['fecha'] = data_df['fecha'].astype('datetime64[ns]')

# Replace NaN values
data_df['cantidad'] = data_df['cantidad'].fillna(0)

To simplify future manipulations over the pandas dataframe, I made ‘fecha’ the index of the dataframe.
Since the records already came in order, it was simple to perform this transformation and convert the dataframe into a series with a ‘daily-level’ frequency by resampling the entire dataframe.

data_df.index = data_df.fecha
data_df = data_df.resample('D').mean()

Updating the index is also helpful for retrieving data in a more intuitive way.

# this will work for ranges:
data_df.loc['2014-02-01':'2015-02-02']

# this will work for a given year:
data_df.loc['2016']

After completing the above transformations, the data was ready to be plotted.
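
The indexing and resampling steps above can be sketched on a tiny made-up sample. The column names 'fecha' and 'cantidad' come from this post; the rows, and the names `raw`, `daily`, `missing_days`, `sales_2016` and `date_range`, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample mirroring the post's columns ('fecha', 'cantidad').
raw = pd.DataFrame({
    'fecha': ['2015-12-30', '2015-12-30', '2016-01-01', '2016-01-02'],
    'cantidad': [10.0, 20.0, 5.0, 7.0],
})
raw['fecha'] = pd.to_datetime(raw['fecha'])

# Use the date column as the index, then aggregate to one row per calendar day.
daily = raw.set_index('fecha').resample('D').mean()

# Days without any record (2015-12-31 here) show up as NaN after resampling.
missing_days = daily[daily['cantidad'].isnull()]

# Partial-string indexing on the DatetimeIndex: a whole year, then a date range.
sales_2016 = daily.loc['2016']
date_range = daily.loc['2015-12-30':'2016-01-01']

print(daily)
```

Note that resampling to a daily frequency can introduce new NaN rows for days with no sales, so it may be worth checking for nulls again after this step.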
With the help of the matplotlib library, I was able to display a graph of the quantity of product sold per day throughout the years.

import matplotlib.pyplot as plt

# The title is Spanish for 'Sales per day'
data_df.plot(figsize=(15,8), title='Ventas Por Día', fontsize=14)
plt.show()

So… what’s a time series and what makes it special?

From the initial data exploration, it was clear that we are dealing with what is known as a time series.
A time series is just a fancy way of saying we are dealing with data points indexed in time order.
Usually, when dealing with time series we look for some special characteristics in our data to be able to make predictions based on it.
Specifically, we look for a time series that is stationary.

Stationarity of a time series

We can say that a time series is stationary when its mean and variance are not a function of time (i.e., they are constant through time).
Stationarity is important because most of the statistical methods to perform analysis and forecasting work on the assumption that the statistical properties (mean, variance, correlation, etc.) of the series are constant in time.
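
To make the definition concrete, here is a minimal sketch on synthetic data (not the sales data from this project). It compares the mean and variance of the first and second halves of two made-up series: white noise, whose statistics stay put, and the same noise with a linear trend added, whose mean drifts with time. The names `noise`, `trending` and `half_stats` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
t = np.arange(n)

# Synthetic series (assumptions for illustration only).
noise = rng.normal(loc=0.0, scale=1.0, size=n)   # constant mean and variance
trending = 0.05 * t + noise                      # mean grows with time

def half_stats(x):
    """Return (mean, variance) for the first and second half of x."""
    first, second = x[: len(x) // 2], x[len(x) // 2 :]
    return (first.mean(), first.var()), (second.mean(), second.var())

(noise_m1, noise_v1), (noise_m2, noise_v2) = half_stats(noise)
(trend_m1, trend_v1), (trend_m2, trend_v2) = half_stats(trending)

print(f"noise:    mean {noise_m1:.2f} -> {noise_m2:.2f}")
print(f"trending: mean {trend_m1:.2f} -> {trend_m2:.2f}")
```

The noise keeps roughly the same mean in both halves, while the mean of the trending series depends on which stretch of time you look at, which is exactly what stationarity rules out.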

How to test the stationarity of a time series?

Stationarity can be assessed in two ways:
1. Visually inspect the data points and check how the statistical properties vary in time.
2. Perform a Dickey-Fuller test.
Let us take a visual approach first and see how it goes.
By plotting the standard deviation and mean along with the original data points, we can see that both of them are somewhat constant in time.
However, they seem to follow a cyclical behavior.
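
A rough sketch of this visual check on synthetic data, assuming a made-up series with a 12-step cycle: rolling statistics computed with a window shorter than the cycle inherit the cyclical pattern, while a window equal to the cycle length averages it out. The series and the window sizes are assumptions for illustration, not the values used in this project.

```python
import numpy as np
import pandas as pd

# Made-up series with a pure 12-step cycle (assumption for illustration).
t = np.arange(120)
cyclical = pd.Series(np.sin(2 * np.pi * t / 12))

# Rolling mean with a window shorter than the cycle: still oscillates.
short_mean = cyclical.rolling(window=5).mean().dropna()

# Rolling mean and std with a window equal to the cycle: essentially flat.
full_mean = cyclical.rolling(window=12).mean().dropna()
full_std = cyclical.rolling(window=12).std().dropna()

print(short_mean.max() - short_mean.min())   # clearly above zero
print(full_mean.abs().max())                 # close to zero
```

So a cyclical look in the rolling statistics may say as much about the chosen window as about the series itself.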
Although the visual approach can give us a clue, applying the Dickey-Fuller test (DF-test) can provide a more precise way to measure the stationarity of our series.

[Figure: Results of DF-test]

I will not go into much detail on how the DF-test works, but let’s say all we need to care about are the numbers we see in “Test Statistic” and “Critical Values”.
We always want the former to be less than the latter: the test’s null hypothesis is that the series is non-stationary, and a test statistic below a critical value lets us reject it at that level.
The lower (more negative) the test statistic, the better.
Our series is stationary given that the Test Statistic is less than all the Critical Values, though not by much.
In case you ever need it, below goes the code I used to evaluate the stationarity.

from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):
    # Determining rolling statistics (window=12 is an assumed value; tune it to your data)
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    # Plot rolling statistics:
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    # adfuller expects 1-D input, so take the first column of a dataframe
    if isinstance(timeseries, pd.DataFrame):
        timeseries = timeseries.iloc[:, 0]
    dftest = adfuller(timeseries.values, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    # The critical values are the fifth element of the result
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

What if our time series were non-stationary?

There are some techniques one can apply to stationarise a time series.
The two I am most familiar with are:
1. Transformation: apply a transformation that penalizes higher values more than smaller values. This can be taking a log, square root, cube root, etc. This method helps in reducing the trend.
2. Differencing: take the difference between the observation at a particular instant and the one at the previous point in time. This deals with both trend and seasonality, hence improving stationarity.
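
A small worked sketch of both techniques on made-up numbers (not the project's data): a log transform turns multiplicative growth into additive growth, and a first difference then removes the resulting linear trend. The series `growth` and its 10% growth rate are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up series growing 10% per step (assumption for illustration).
growth = 100 * 1.1 ** np.arange(10)

# Transformation: the log compresses large values more than small ones,
# so exponential growth becomes a straight line.
log_growth = np.log(growth)

# Differencing: subtract the previous observation from each observation.
raw_diff = np.diff(growth)       # still growing, the trend survives
log_diff = np.diff(log_growth)   # constant, equal to log(1.1)

print(raw_diff[:3])
print(log_diff[:3])
```

Differencing the raw values still leaves a growing series here, whereas differencing after the log gives a constant, which is why the two techniques are often combined.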
Pandas and numpy provide you with very practical ways to apply these techniques.
For the sake of demonstration, I will apply a log transformation to the dataframe.

# Transform the dataframe:
ts_log = np.log(data_df)

# Replace infs with NaN
ts_log.replace([np.inf, -np.inf], np.nan, inplace=True)

# Remove all the NaN values
ts_log.dropna(inplace=True)

Bonus track: we can even apply a smoothing technique over the transformed data set to remove the noise that may be present.
A common smoothing technique is to subtract the Moving Average from the data set.
This can be achieved as easily as:

# Get the moving average of the series
moving_avg = ts_log.rolling(window=12).mean()  # 12 periods (assumed from the original "12 months" note)

# Subtract the moving average from the log-transformed dataframe
ts_log_moving_avg_diff = ts_log - moving_avg

# Remove all the NaN values
ts_log_moving_avg_diff.dropna(inplace=True)

test_stationarity(ts_log_moving_avg_diff)

Clearly, we can see that applying a log transformation plus moving average smoothing to our original series resulted in a better series in terms of stationarity.
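
As a sanity check of the idea, here is a minimal sketch on a made-up straight-line series: subtracting a rolling mean leaves a constant, so the trend is gone. The window of 3 and the series itself are assumptions for illustration.

```python
import pandas as pd

# Made-up series with a pure linear trend: 0, 2, 4, ...
trend = pd.Series([2.0 * i for i in range(10)])

# Rolling mean over the current and two previous points; the first two are NaN.
moving_avg = trend.rolling(window=3).mean()

# Subtract the moving average and drop the leading NaN values.
detrended = (trend - moving_avg).dropna()

print(detrended.tolist())
```

The leading NaN values are the reason for the dropna() call: the first window - 1 points have no complete window to average over.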
To apply differencing, the Pandas shift() function can be used.
In this case, first-order differencing was applied using the following code.

ts_log_diff = ts_log - ts_log.shift()
plt.plot(ts_log_diff)

[Figure: Log-transformed data set after differencing]

Let us perform a DF-test on this new resulting series.

ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)

With the log transformation and differencing, the test statistic is significantly smaller than the critical values; therefore, this series is also more stationary than the original series.
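
For reference, a minimal sketch of differencing with pandas on made-up data. `Series.diff()` is a shorthand for subtracting `shift()`, and passing a lag equal to the season length (often called seasonal differencing, which the code above does not use) removes a repeating pattern. The 4-step season and the numbers are assumptions for illustration.

```python
import pandas as pd

# Made-up series: a repeating 4-step pattern on top of a linear trend.
pattern = [5.0, 1.0, 3.0, 7.0]
series = pd.Series([pattern[i % 4] + 0.5 * i for i in range(16)])

# First-order differencing, written two equivalent ways.
first_diff = (series - series.shift()).dropna()
first_diff_short = series.diff().dropna()

# Seasonal differencing: compare each point with the one a full season earlier.
seasonal_diff = series.diff(4).dropna()

print(first_diff.tolist())      # trend removed, seasonal pattern still visible
print(seasonal_diff.tolist())   # constant: 4 steps of trend, 0.5 each
```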

Wrapping up…

When we face a predictive task that involves a time series, we need to analyze said series and determine whether it is stationary or not.
To determine the stationarity, we can either plot the data and visually inspect the mean and other statistical properties or perform a Dickey-Fuller Test and look at the Test Statistic and Critical Values.
In case the series happens to be non-stationary, we can apply techniques such as transformation or differencing to stationarise the series.
After all this analysis and preparation, the next step in the project was to forecast with the time series, but that’s a topic for another post :).