It would be interesting to see the weather information for the previous dates and map it to my steps (this would require a bit of work, since I would have to fetch weather data per location and date… so maybe in the near future, when I feel more productive :P).
Normally, there are three types of time series structures: univariate, external regressors, and multivariate.
In our case, it is univariate: we have a numeric vector with a timestamp, and our prediction model will use historical data to forecast how many steps I take.
Because we are dealing with timestamped, ordered univariate data, we can apply a time series model to make a prediction.
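As a sketch of the structure being described, a univariate time series is just one numeric value per ordered timestamp. The values and dates below are synthetic stand-ins, not my actual Health-app export:

```python
import numpy as np
import pandas as pd

# Hypothetical daily step counts standing in for the real Health-app export
rng = np.random.default_rng(0)
dates = pd.date_range("2018-08-19", periods=365, freq="D")
steps_new = pd.Series(rng.integers(0, 25000, size=365), index=dates, name="steps")

print(steps_new.head())  # one numeric value per ordered timestamp
```

The ordered `DatetimeIndex` is what makes time series models applicable at all.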
Time Series Analysis

Since we have determined that our dataset is suitable for time series analysis, we need to identify a few characteristics.
These four characteristics help us determine which model to use and which steps to perform prior to modeling.
Seasonality: I would dare say there is seasonality if I saw the same spike during the summer and lower activity in the winter, but given that I have only a little over a year of data, I will give myself the benefit of the doubt that this is not the case. Logically, it does make sense that I would walk more during the spring and summer, though…

Trend: there is not really a trend that I see; no real upward or downward movement in the line plot. This means we have a constant mean, which is a consequence of the lack of trend and the absence of shifts in levels.

Variability: the deviation from the mean, and it is not consistent across the dataset. If we split the dataset, you could say the low is around ~5,000 and the high ~20,000 for 2019, while the low is ~0 and the high ~25,000 for 2018.

Mean: the mean seems to be somewhere around 10,000 steps; we will explore this more when we apply smoothers.
Testing

Once we have delved into the characteristics, we need to look at the time series statistics: autocorrelation (ACF and PACF plots) and stationarity (the augmented Dickey-Fuller test). To analyze these statistics, we will conduct some testing.
Stationarity

Now, we will perform the augmented Dickey-Fuller test to check for stationarity. Remember, stationarity answers the question: does the data have the same statistical properties (variance, mean, autocorrelation) throughout the time series? The ADF test (one of several unit root tests) is very powerful, since it accounts for autocorrelation while testing for stationarity.
Let's run the test:

```python
# Test for stationarity (assumes pandas has been imported as pd earlier)
def stationarity_test(timeseries):
    """Augmented Dickey-Fuller test for stationarity"""
    from statsmodels.tsa.stattools import adfuller
    print("Results of Dickey-Fuller Test:")
    df_test = adfuller(timeseries, autolag="AIC")
    df_output = pd.Series(df_test[0:4],
                          index=["Test Statistic", "p-value", "#Lags Used",
                                 "Number of Observations Used"])
    print(df_output)

stationarity_test(steps_new)
```

figure 5

Analyze the results: if p < 0.05, the series is stationary. The output shows the test statistic, the p-value, and the number of lags used (how far the test looks into the past). The p-value is indeed < 0.05, so our dataset is stationary.
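Had the p-value come back above 0.05, a common remedy is first-order differencing before modeling. A minimal sketch on a toy trending series (not the step data itself):

```python
import numpy as np
import pandas as pd

# Toy non-stationary series: a deterministic upward trend
dates = pd.date_range("2019-01-01", periods=100, freq="D")
trend = pd.Series(8000 + 50 * np.arange(100), index=dates, dtype=float)

# First-order differencing: each value minus the previous one
diffed = trend.diff().dropna()

# The trend is gone; the differenced series is constant at 50
print(diffed.nunique())
```

This is exactly what the "I" (integrated) part of ARIMA does when its middle order parameter is greater than zero.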
Autocorrelation

Autocorrelation answers the question: do earlier observations influence later observations? To test for this, we use the ACF and PACF plots.
ACF and PACF

The ACF shows the autocorrelation between lags. The PACF is adjusted for all earlier lags, which is why the two graphs differ. The blue area is the confidence interval boundary; marks outside it hint at autocorrelation.
```python
# Classic ACF and PACF plots for autocorrelation
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Autocorrelation and partial autocorrelation in the steps dataset
# Two plots on one figure
%matplotlib inline
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = plot_acf(steps_new, lags=20, ax=ax1)
ax2 = fig.add_subplot(212)
fig = plot_pacf(steps_new, lags=20, ax=ax2)
```

figure 6

Not many lags fall outside the highlighted blue area. This is good.
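If you want a numeric check to go with the plots, pandas can compute lag autocorrelations directly, and the approximate 95% confidence bound for white noise is ±1.96/√n. This sketch uses synthetic data, not the actual step series:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in series; real data would come from the Health export
rng = np.random.default_rng(1)
series = pd.Series(rng.normal(10000, 3000, size=200))

n = len(series)
bound = 1.96 / np.sqrt(n)  # approximate 95% CI half-width for white noise

for lag in range(1, 6):
    r = series.autocorr(lag=lag)
    print(f"lag {lag}: r = {r:+.3f}, outside CI: {abs(r) > bound}")
```

A lag whose coefficient exceeds the bound is the numeric equivalent of a spike poking out of the blue area in the plot.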
Prediction

I applied smoothers to remove some of the huge spikes in the data; here is the dataset before the prediction. Notice how it is a lot flatter now:

figure 7

We will be using the autoregressive integrated moving average (ARIMA) model for predictions.
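For reference, the kind of smoothing described above can be approximated with a rolling mean. The window size and data here are illustrative assumptions, not the article's exact method:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-08-19", periods=30, freq="D")
rng = np.random.default_rng(2)
steps = pd.Series(rng.integers(0, 25000, size=30).astype(float), index=dates)

# A 7-day rolling mean flattens single-day spikes while keeping the overall level
smoothed = steps.rolling(window=7, min_periods=1).mean()
print(smoothed.round(0).tail())
```

The smoothed series has noticeably less day-to-day variance, which is why the plot looks "flatter".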
I invite you to learn more about it here: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/

The Python package statsmodels makes ARIMA easy to use, but you still have to know how to choose the parameters correctly.
For our dataset, we will look into two different ARIMA models and compare them.
The ARIMA model gives you both model diagnostics and prediction functions.
Here are the diagnostics for my model:

```python
# ARIMA model setup
from statsmodels.tsa.arima_model import ARIMA

# Model diagnostics
results_AR.summary()
```

figure 8

AIC and BIC are both criteria for model selection and comparison; they balance goodness of fit against model complexity, with BIC penalizing extra parameters more heavily.
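To make that trade-off concrete, here is a toy illustration of the AIC and BIC formulas. The log-likelihood values and parameter counts below are made up for illustration, not taken from my fitted models:

```python
import numpy as np

def aic(log_lik, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: penalizes parameters harder as n grows."""
    return k * np.log(n) - 2 * log_lik

n = 365                       # hypothetical number of observations
ll_ar4, k_ar4 = -3500.0, 5    # e.g. ARIMA(4,0,0): 4 AR terms + constant
ll_303, k_303 = -3495.0, 7    # e.g. ARIMA(3,0,3): 3 AR + 3 MA terms + constant

print("AIC:", aic(ll_ar4, k_ar4), aic(ll_303, k_303))
print("BIC:", bic(ll_ar4, k_ar4, n), bic(ll_303, k_303, n))
```

With these (hypothetical) numbers, AIC prefers the richer model while BIC prefers the simpler one, which is exactly the disagreement the two criteria are known for.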
Since our data has monthly frequency, let's forecast five months forward:

```python
# ARIMA forecasts

# Model 1: ARIMA(4, 0, 0) and its fitted values
model_AR4 = ARIMA(steps_new, order=(4, 0, 0))
results_AR4 = model_AR4.fit()
Fcast400 = results_AR4.predict(start='08/19/2018', end='08/19/2019')

# Model 2: ARIMA(3, 0, 3) and its fitted values
model303 = ARIMA(steps_new, order=(3, 0, 3))
results_M303 = model303.fit()
Fcast303 = results_M303.predict(start='08/19/2018', end='08/19/2019')
```

Visualization

```python
# Comparing the forecasts via data visualization
plt.figure(figsize=(12, 8))
plt.plot(steps_new, linewidth=2, label="original")
plt.plot(Fcast400, color='red', linewidth=2, label="ARIMA 4 0 0")
plt.plot(Fcast303, color='blue', linewidth=2, label="ARIMA 3 0 3")
plt.legend()
```

figure 9

Both models' predictions suggest I will be meandering slightly above 10,000 steps over the next five months.
The reality is that the data will be more volatile, but I'm excited to see how close or far off I am in real life.
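One way to check "how close or far off" the forecasts end up is a simple holdout comparison once the real numbers arrive. The step values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical actual vs. forecast steps for five future periods
actual = pd.Series([11000.0, 9500.0, 12000.0, 8000.0, 10500.0])
forecast = pd.Series([10200.0, 10100.0, 10300.0, 10050.0, 10200.0])

# Root mean squared error: the typical forecast miss, in steps
rmse = float(np.sqrt(((actual - forecast) ** 2).mean()))
print(f"RMSE: {rmse:.0f} steps")
```

An RMSE on the order of a thousand steps would mean the near-constant ARIMA forecast tracks the average well even while missing the daily swings.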
Conclusion

You may say: okay, so what, Stephen? It was cool to read through this (or not! haha), but how is this a step towards well-being?

Well-being is a personal and continuous pursuit, and I have only analyzed one of the several components that well-being comprises. In this exercise, my hope was to add targets to my exercise regimen to hold myself accountable. Beyond that, I hope to establish baseline metrics across a number of different measurements to track progress, continued good health, and prosperity.
I recognize the limitations and it is not my intention to generalize that well-being is only composed of exercise, let alone steps taken.
Mental health, financial health, social well-being, spiritual prosperity, nutrition and more are all part of the equation.
Activities other than exercise like meditation, sleep, mindfulness should also be considered.
There are many possible improvements to this process.
For example, I could map my past data to the weather and see the correlation there. This would be a bit tricky, since I would also have to consider my location for each date stamp.
Quantifying intangibles like mood and stress, and tangibles like diet (caloric intake, macros), water intake, and amount of sleep, could also affect and complement the forecast.
Some of the data sources I have mentioned are readily available now, but the data is still disparate, and I haven't found one device that captures it all.
Building a comprehensive, customizable dashboard would also be interesting.
It would also be very interesting to use big data and leverage cloud computing to get some more macro analysis.
Instead of personalized experience, seeing general trends would be beneficial.
This leaves me to say… thank you for reading along, and I invite you to try this code for yourself! A few caveats to consider:

Location & lifestyle: data will differ based on these two factors. Some people walk more, whereas others drive more. An easy example is NYC vs. LA: commute patterns are different, and that is only one component.
Device: I used my iPhone, but Fitbit, Apple Watch, Garmin, and other devices track more comprehensive metrics.