Exploring, visualizing, and modeling the Minnesota Vikings offense
William Butler
Jan 29
After a brief search, I acquired play-by-play data for the entire 2018 season from NFL savant.
Since I’m a Minnesota fan at heart, I decided to try out some machine learning techniques to see whether I could find some quantifiable ways in which the Vikings might have fallen short in the past year.
For some initial EDA, I decided to look at the yards accumulated through passes, broken down by both pass location and by receiver.
I aggregated all of the receiving yards for each Vikings receiver with at least 10 catches over the course of the season, according to where the pass was directed.
Because the mean yards/reception is so strongly influenced by a few long receptions (and because median yards/reception was frequently 0 due to incomplete passes), I used the total number of yards (sum) to aggregate the data.
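To make the choice of aggregate concrete, here is a minimal Python sketch. The yardage values are made up for illustration (they are not the actual play-by-play data), and the receiver and location labels are hypothetical. It shows how one long reception inflates the mean, how incompletions recorded as 0 yards can drag the median to 0, and how the sum is computed per receiver and pass location.

```python
from collections import defaultdict
from statistics import mean, median

# Illustrative targets only: (receiver, pass_location, yards gained).
# Incomplete passes are recorded as 0 yards, as described in the text.
targets = [
    ("RECEIVER_A", "DEEP RIGHT", 0),
    ("RECEIVER_A", "DEEP RIGHT", 0),
    ("RECEIVER_A", "DEEP RIGHT", 0),
    ("RECEIVER_A", "DEEP RIGHT", 12),
    ("RECEIVER_A", "DEEP RIGHT", 68),
    ("RECEIVER_A", "SHORT LEFT", 4),
    ("RECEIVER_A", "SHORT LEFT", 7),
]


def summarize(rows):
    """Group yards by (receiver, location) and compute sum, mean, and median."""
    grouped = defaultdict(list)
    for receiver, location, yards in rows:
        grouped[(receiver, location)].append(yards)
    return {
        key: {"sum": sum(v), "mean": mean(v), "median": median(v)}
        for key, v in grouped.items()
    }


summary = summarize(targets)
for key, stats in summary.items():
    print(key, stats)
```

In this toy example the deep-right targets have a median of 0 and a mean driven almost entirely by one 68-yard play, while the sum simply reports how much yardage that part of the field produced.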
This data is visualized at left, and sorted by total receiving yards/player (most to least).
A few things stood out. Thielen and Diggs tend to split deep on opposite sides of the field: Thielen gained far more yards running deep on the right side of the field, while Diggs had most of his deep receiving yards on the left.
I think that this also highlights the relative scarcity of a reliable third WR; Kyle Rudolph is far and away the team’s best third option, but Aldrick Robinson (17 receptions total) and Laquon Treadwell (35 receptions total) are pretty steep dropoffs if they have to line up as WRs when Thielen and Diggs sit a play out.
Dot plots of penalties/team, grouped by defensive penalties, offensive penalties, and all penalties.
MN Vikings are marked in purple.
Given how they were at least partially undone due to penalties in their final game of the season, I next looked at how the Vikings fared in penalties overall.
Minnesota was the second least penalized team overall.
Throughout the 2018 season, only 88 penalties were accepted against them (47 on offense, 41 on defense) while the average team had 106 penalties accepted against them (56 on offense, 50 on defense).
These penalties were good for only 729 yards, third best in the league (on average, teams had 881 yards of penalties called against them).
Though there are certainly additional confounds not accounted for by this basic summary (not all penalties are created equal, both in terms of absolute yardage given up and in terms of how much they impact the outcome of the game), the data suggests that Minnesota is well-coached and disciplined.
So then I wondered whether Minnesota was doing something odd in their play-calling, perhaps tipping their hand to opposing defenses and giving their opponents extra information with which to prepare.
First, I simply used logistic regression to try and predict whether a play was a pass or not.
As predictors, I used the current game conditions: the amount of time left in the game (computed by combining quarter, minute, and second into a single measure), the current score differential, the down and yards to go, the distance from the endzone, and the identity of the opposing team.
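As a rough sketch of the time predictor, the function below combines quarter, minute, and second into a single number of seconds left in regulation. It assumes that the minute and second columns hold the game clock remaining in the current quarter, that quarters are 15 minutes long, and that overtime plays are simply treated as having no regulation time left. The function name and that overtime handling are my assumptions, not something taken from the data set's documentation.

```python
QUARTER_SECONDS = 15 * 60      # length of one NFL quarter
REGULATION_QUARTERS = 4


def seconds_remaining(quarter: int, minute: int, second: int) -> int:
    """Seconds left in regulation, given the quarter and the clock within it.

    Assumes `minute` and `second` are the time remaining in the current quarter.
    """
    if quarter > REGULATION_QUARTERS:
        return 0  # assumption: treat overtime as no regulation time left
    full_quarters_left = REGULATION_QUARTERS - quarter
    return full_quarters_left * QUARTER_SECONDS + minute * 60 + second


print(seconds_remaining(1, 15, 0))   # opening kickoff
print(seconds_remaining(3, 7, 30))   # midway through the third quarter
```

A single monotonic measure like this is easier for a linear model to use than three separate clock columns, though it cannot by itself capture end-of-half situations.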
Across 1,000 training splits, logistic regression predicted the pass status of the holdout data at about 58% accuracy (±3%).
While this is significantly better than chance (if you just predicted “pass” on every play, you would be 53.7% accurate; running a one-sample t-test on our observed accuracies against this value yields t=15.18, p < 2e-16), it’s not particularly impressive in terms of explaining the data.
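The comparison against the "always predict pass" baseline can be sketched in a few lines of Python. The labels and accuracies below are small made-up values used only to show the mechanics; they are not the observed holdout accuracies, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common = max(set(labels), key=labels.count)
    return labels.count(most_common) / len(labels)


def one_sample_t(values, mu0):
    """t statistic for H0: mean(values) == mu0, with len(values) - 1 degrees of freedom."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / sqrt(n))


# Illustrative data: 6 passes and 4 rushes, so the baseline accuracy is 0.6.
labels = ["pass"] * 6 + ["rush"] * 4
baseline = majority_baseline(labels)

# Illustrative holdout accuracies from five hypothetical train/test splits.
accuracies = [0.62, 0.66, 0.64, 0.68, 0.60]
t_stat = one_sample_t(accuracies, baseline)
print(baseline, round(t_stat, 3))
```

With many splits the standard error shrinks quickly, so even a modest gain over the baseline produces a large t statistic. That is why a significant result here says little about how useful the model is in practice.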
Looking at the coefficients, time had no effect on whether or not a play was a pass; while it’s true that a team shouldn’t always pass more as the clock winds down, there are definitely common situations in which they should.
However, our logistic model doesn’t take into account predictor interactions, such as whether there is an effect of the clock when down by two or more possessions, or whether it’s third down and inches or third down and long, which helps to illustrate some of the shortcomings of this relatively simplistic approach.
Given these results, I started to wonder whether being able to predict whether a play was going to be a pass or not (even without all of the additional information that’s on the field) was actually a good thing, and whether more “predictable” teams might have won fewer games.
I re-ran the same logistic modeling code on each team individually; Minnesota was the fifth most “predictable” team by this method, after Jacksonville, Seattle, Pittsburgh, and Indianapolis.
Across all 32 teams, there was a non-significant correlation between win % and logistic predictability (r=0.30, see figure below).
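For a sense of why r=0.30 across 32 teams does not reach significance, the sketch below converts a Pearson correlation into a t statistic with n - 2 degrees of freedom. It assumes the standard two-sided test at the 0.05 level, which may not be exactly the test used for the figure. The helper names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# Values reported in the text: r = 0.30 across 32 teams.
t_stat = r_to_t(0.30, 32)

# Two-sided 5% critical value of the t distribution with 30 degrees of freedom.
T_CRIT_DF30 = 2.042

print(round(t_stat, 2), t_stat > T_CRIT_DF30)  # about 1.72, below the cutoff
```

With only 32 teams, a correlation needs to be roughly 0.35 or larger in magnitude to clear that threshold.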
However, the non-significant positive correlation between logistic accuracy and wins could probably be explained by the fact that the winning-est teams tend to pass until they’re winning by multiple scores, then run the clock out as quickly as possible on the ground.
Plus, some of the least predictable teams (San Francisco, Oakland, and Tampa Bay, for example) were essentially eliminated from the playoffs already halfway through the season.
When there’s less incentive to just maximize your win numbers, I’d imagine teams are more willing to diversify their play-calling, get their players as many different kinds of reps as possible, and give themselves more experience and better odds next season.
Confusion matrix for the random forest model of Minnesota’s offensive play calls.
To improve upon the basic logistic regression results, and to take into account likely interactions between the predictors that may be contributing, I next used a random forest model to predict play type by game conditions.
Rather than simply estimating the “is_pass” outcome, I instead allowed the outcome to have the multiple levels actually found in the data.
A total of 500 trees were used, with 2 variables selected at each split.
Overall, the random forest model predicted plays better than the logistic regression model of pass probability, with about 63% accuracy across all conditions.
And looking at the confusion matrix (left), the random forest model does pretty well at predicting most specific play types.
It predicted field goals with 81% accuracy, passes with 69% accuracy, and punts with 91% accuracy.
Naturally, fumbles and sacks were never predicted by the model, which makes sense given their rarity and inherent unpredictability (most offensive coordinators probably don’t draw up intentional fumbles).
Where the random forest model faltered the most was in predicting rushes; only 52% of the plays it predicted would be rushes ended up actually staying on the ground, while 43% ended up being passes after all.
This could be due to shortcomings of the data the model was built on (e.g., using formation and personnel information would probably help), but it could also be due to the Vikings’ tendency (at least early on in the season) to be a pass-first team.
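The rush figures above are read off the confusion matrix by predicted class: of all the plays the model called a rush, what share turned out to be each actual play type. A minimal sketch of that calculation follows, using made-up (actual, predicted) pairs rather than the model's real output. The function name is hypothetical.

```python
from collections import Counter


def breakdown_by_predicted(pairs):
    """For each predicted label, the share of each actual label among those predictions."""
    counts = Counter(pairs)  # keys are (actual, predicted)
    predicted_totals = Counter()
    for (actual, predicted), n in counts.items():
        predicted_totals[predicted] += n
    return {
        predicted: {
            actual: n / predicted_totals[predicted]
            for (actual, pred), n in counts.items()
            if pred == predicted
        }
        for predicted in predicted_totals
    }


# Illustrative pairs only: 10 plays predicted RUSH, 5 plays predicted PASS.
pairs = (
    [("RUSH", "RUSH")] * 6
    + [("PASS", "RUSH")] * 4
    + [("PASS", "PASS")] * 4
    + [("RUSH", "PASS")] * 1
)

breakdown = breakdown_by_predicted(pairs)
print(breakdown["RUSH"])  # share of actual play types among predicted rushes
```

Normalizing by the predicted class answers the question a defense would care about: when the model says "rush", how often is it right? Normalizing by the actual class would instead measure how many real rushes the model catches.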
One of the biggest data sources that I had chosen not to include in the initial random forest model was the formation type (for the Vikings, these included shotgun, under center, no huddle, no huddle shotgun, punt, and field goal formations).
Originally, I was more interested in trying to take the coaches’ perspective, and see if I could predict what play they would come up with given the game situation.
However, another approach would be to take on the role of the opposing defense, and try to predict the play call given the game situation and the formation (which is open information at the line of scrimmage).
Adding in the formation to the random forest model yielded much higher accuracy, with an error rate of only ~29%.
More importantly, the previous difficulty in predicting when the Vikings would rush the ball was greatly reduced, such that when the model was predicting a rush play, 63% of the time it was correct (compared to 52% accuracy without the formation data).
Trying out this random forest+ model on the other teams yielded similar improvements in accuracy.
The mean error rate across all 32 teams was 33%, ranging from 24–40%.
Minnesota was one of the most “predictable” teams again (5th overall), but considering that the 1st and 3rd most predictable teams (New England and the LA Rams, respectively) are playing in the Super Bowl, it again appears that the ability of the model to predict what type of play a team will run doesn’t necessarily make it any easier to stop them from being successful.
To sum up, using a random forest approach to modeling the play-by-play data was fairly accurate at identifying what type of play would be run given the current game conditions, and was a definite improvement over a simple logistic regression on pass/no pass.
Given the Vikings’ well above-average performance on “mistake” metrics like penalties and fumbles, I also don’t think there’s too much of a case to be made for them simply being unlucky.
Some analysts think Kirk Cousins is to blame, but I certainly wouldn’t say he’s the main issue.
During his most recent seasons in Washington, Cousins had five different receivers with 500+ receiving yards over the season, whereas he only had Thielen, Diggs, and Rudolph to rely on this year.
Part of that was due to Cook missing part of the season, but even if he had been healthy all year there would still be fewer reliable receiving options than Cousins was used to.
The offensive line’s difficulty on both rushes and pass blocking was also an area to target for improvement, possibly in the draft.
Overall, though, this modeling exercise suggests that the types of plays being called given game conditions are right in line with some of the most successful teams in the league, which makes me optimistic for improvement next year, especially if the team can add another consistent receiving option.