This is scary, as almost every view touches the database!

Step 2: load layers time series data into Pandas

To start, I want to look for correlations between the layers (e.g., SQL, MongoDB, View) and the average response time of the Django app.
There are fewer layers (10) than views (150+) so it’s a simpler place to start.
I’ll grab this time series data from Scout and initialize a Pandas DataFrame.
I’ll leave this data wrangling to the notebook.
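The wrangling itself depends on how the metrics are exported. As a minimal sketch, assuming each series arrives as a list of (timestamp, value) pairs keyed by layer name (a hypothetical format used only for illustration, not Scout's actual API response), the DataFrame could be built like this:

```python
import pandas as pd

# Hypothetical export format: series name -> list of (ISO timestamp, avg ms).
# Both the shape and the numbers are made up for illustration.
raw = {
    "total": [("2018-01-01T00:00:00", 120.0),
              ("2018-01-01T00:01:00", 135.0),
              ("2018-01-01T00:02:00", 128.0)],
    "SQL": [("2018-01-01T00:00:00", 60.0),
            ("2018-01-01T00:01:00", 74.0),
            ("2018-01-01T00:02:00", 66.0)],
    "View": [("2018-01-01T00:00:00", 30.0),
             ("2018-01-01T00:01:00", 31.0),
             ("2018-01-01T00:02:00", 29.0)],
}


def to_frame(raw):
    """Build a DataFrame with one column per series, indexed by timestamp."""
    columns = {
        name: pd.Series(
            [value for _, value in points],
            index=pd.to_datetime([ts for ts, _ in points]),
        )
        for name, points in raw.items()
    }
    return pd.DataFrame(columns).sort_index()


df = to_frame(raw)
print(df)
```

Keeping one column per layer and a shared timestamp index means every later step can operate on the whole frame at once.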
After loading the data into a Pandas DataFrame, we can plot these layers.

Step 3: layer correlations

Now, let's see if any layers are correlated to the Django app’s overall average response time.
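Before comparing each layer time series to the response time, we want to calculate the first difference of each time series. Differencing compares the change from one interval to the next rather than the raw values, so a trend shared by two series doesn't show up as a misleading correlation.
With Pandas, we can do this very easily via the diff() function:

df.diff()

After calculating the first difference, we can then look for correlations between each time series via the corr() function.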
The correlation value ranges from −1 to +1: +1 indicates a perfect positive linear relationship, −1 a perfect negative one, and 0 no linear relationship at all.
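To see why the first difference matters, here is a small sketch on synthetic data (the series below are generated for illustration and are not the app's metrics). Two series share an upward trend but are otherwise unrelated, so the raw correlation looks strong while the correlation of the differenced series should stay close to zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
trend = np.linspace(0, 50, n)

# Synthetic assumption: both series follow the same trend, but their
# interval-to-interval changes are independent noise.
df = pd.DataFrame({
    "total": trend + rng.normal(0, 1, n),
    "SQL": trend + rng.normal(0, 1, n),
})

# Correlation of the raw values, dominated by the shared trend.
raw_corr = df.corr().loc["total", "SQL"]

# Correlation of the first differences; dropna() removes the first row,
# which has no previous value to subtract.
diff_corr = df.diff().dropna().corr().loc["total", "SQL"]

print(f"raw: {raw_corr:.2f}, first difference: {diff_corr:.2f}")
```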
My notebook generates the following result:

SQL appears to be correlated to the overall response time of the Django app.
To be sure, let's determine the Pearson Coefficient p-value.
A low value (< 0.05) indicates that the overall response time is highly likely to be correlated to the SQL layer:

df_diff = df.diff().dropna()
p_value = scipy.….values)
print("first order series p-value:", p_value)

The p-value is just 1.
I'm very confident that slow SQL queries are related to an overall slow Django app.
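For reference, a self-contained version of this check might look like the sketch below. It assumes scipy.stats.pearsonr is the function behind the p-value, and the column names "total" and "SQL" and the generated data are placeholders rather than the notebook's actual values:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 300

# Synthetic assumption: the overall response time is the SQL time plus
# some smaller, unrelated drift.
sql = np.cumsum(rng.normal(0, 1, n))
total = sql + np.cumsum(rng.normal(0, 0.5, n))
df = pd.DataFrame({"total": total, "SQL": sql})

# First difference, dropping the leading NaN row.
df_diff = df.diff().dropna()

# pearsonr returns the correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(df_diff["total"].values, df_diff["SQL"].values)
print("first order series correlation:", r)
print("first order series p-value:", p_value)
```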
It's always the database, right?

Step 4: Rinse+repeat for Django view response times

Layers are just one dimension we should evaluate.
Another is the response time of the Django views.
The overall app response time could increase if a view starts responding slowly.
We can see if this is happening by looking for correlations in our view response times versus the overall app response time.
We’re using the exact same process as we used for layers, just swapping out the layers for time series data from each of our views in the Django app.

After calculating the first difference of each time series, apps/data does appear to be correlated to the overall app response time.
With a p-value of just 1.64e-46, apps/data is very likely to be correlated to the overall app response time.
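With 150+ views, it helps to wrap the process in a function that ranks every column against the overall response time. The sketch below is one way to do that; rank_correlations is a hypothetical helper, and the "views/a" and "views/b" columns and all of the data are invented for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats


def rank_correlations(df, target):
    """Correlate the first difference of each column with the target's."""
    df_diff = df.diff().dropna()
    rows = []
    for name in df_diff.columns.drop(target):
        r, p = stats.pearsonr(df_diff[target].values, df_diff[name].values)
        rows.append({"series": name, "r": r, "p_value": p})
    # Strongest correlation first, regardless of sign.
    ranked = pd.DataFrame(rows).sort_values("r", key=abs, ascending=False)
    return ranked.reset_index(drop=True)


rng = np.random.default_rng(2)
n = 400


def walk():
    """Synthetic random-walk series standing in for a view's response time."""
    return np.cumsum(rng.normal(0, 1, n))


df = pd.DataFrame({"apps/data": walk(), "views/a": walk(), "views/b": walk()})
# Synthetic assumption: the overall response time mostly follows apps/data.
df["total"] = df["apps/data"] + 0.3 * walk()

ranking = rank_correlations(df, "total")
print(ranking)
```

The same function works for layers, view response times, or throughputs, since only the columns of the DataFrame change.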
We’re almost done extracting the signal from the noise.
We should check to see if traffic to any views triggers slow response times.
Step 5: Rinse+repeat for Django view throughputs

A little-used, expensive view could hurt the overall response time of the app if throughput to that view suddenly increases.
For example, this could happen if a user writes a script that quickly reloads an expensive view.
To determine correlations, we’ll use the exact same process as before, just swapping in the throughput time series data for each Django view.

endpoints/sparkline appears to have a small correlation.
The p-value is 0.004, which means that if traffic to endpoints/sparkline and the overall app response time were unrelated, there would be only a 4 in 1,000 chance of seeing a correlation at least this strong.
So, it does appear that traffic to the endpoints/sparkline view triggers slower overall app response times, but it is less certain than our other two tests.
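One way to build intuition for what a p-value measures is a permutation test: shuffle one series to break any real relationship, and count how often the shuffled data still produces a correlation at least as strong as the observed one. This is a different technique from the analytic p-value used above, shown here only as an illustration. The data is synthetic, permutation_p_value is a hypothetical helper, and shuffling assumes the differenced values are roughly interchangeable in time, which may not hold for real metrics:

```python
import numpy as np


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


rng = np.random.default_rng(3)
n = 200

# Synthetic assumption: response time changes depend only weakly on
# changes in throughput.
throughput_diff = rng.normal(0, 1, n)
response_diff = 0.2 * throughput_diff + rng.normal(0, 1, n)

p = permutation_p_value(throughput_diff, response_diff)
print("permutation p-value:", p)
```

A small result says that a correlation this strong rarely appears by chance alone; it does not measure how large the effect is.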
Conclusion

Using data science, we’ve been able to sort through far more time series metrics than we ever could with intuition.
We’ve also been able to make our calculations without misleading trends muddying the waters.
We know that our Django app response times are:

- strongly correlated to the performance of our SQL database.
- strongly correlated to the response time of our apps/data view.
- correlated to endpoints/sparkline traffic. While we're confident in this correlation given the low p-value, it isn't as strong as the previous two correlations.
Now it’s time for the engineer! With these insights in hand, I’d:

- investigate if the database server is being impacted by something outside of the application. For example, if we have just one database server, a backup process could slow down all queries.
- investigate if the composition of the requests to the apps/data view has changed. For example, has a customer with lots of data started hitting this view more? Scout's Trace Explorer can help investigate this high-dimensional data.
- hold off investigating the performance of endpoints/sparkline, as its correlation to the overall app response time wasn't as strong.
It’s important to realize when all of that hard-earned experience doesn’t work.
My brain simply can’t analyze thousands of time series data sets the way our data science tools can.
It’s OK to reach for another tool.
If you’d like to work through this problem on your own, check out the shared Google Colab notebook I used when investigating this issue.
Just import your own data from Scout next time you have a performance issue and let the notebook do the work for you!

Originally published at scoutapp.