Probability will only break your heart — Or — Trust the Process, Doubt the Procedure: NBA playoff win chances

Settling a bet with SQL data-nesting & spicy statistical takes (Bayesian vs frequentist decision analysis)

Daniel McNichol · May 28

This is a parable about simple, straightforward questions of fact, & how they often devolve into complex matters of data processing, analysis & decision-making under fragile epistemic limits, in the real world.
(This is Part I of II. Will link to Part II here when it is published.)

Prologue: The Challenge

Game 1: NBA Playoffs, 2019 Eastern Conference Semifinals (best of 7)

Our heroes, the Philadelphia 76ers — a team forged by a notoriously brazen, analytics-driven, hyperrational “process” — were down 20 points with 4 minutes left against the Toronto Raptors.
The game took place in Toronto, which held home-court advantage in the best-of-7 series (2 games in Toronto, 2 in Philly, then, if necessary, alternating locations over the next 3 games until one team reaches 4 total wins).
My colleague, advanced analytics nerd & fellow process-truster, Nat, was distraught, but I preached equanimity.
This outcome was well within my prior expectation, despite predicting a 76ers series win in 6 games¹, if healthy & full-strength.
I still liked our chances to win 1 of the 2 initial road games, which is generally considered a good outcome for the away team in a playoff series.
A win followed by a loss was vastly preferable to a loss followed by a win, according to him, despite the identical outcome heading into game 3: taking the series back to Philly tied 1–1.
“If u lose the previous game ur more likely to win the next game.”

– Nat, wrongly…probably

I was aghast.
A #nerdfight ensued, names were called, gambler’s fallacy accusations flew.
A bet was made:

This was essentially a bet about probability theory, independence of events & cognitive bias as much as basketball.
In my mind, teams have roughly stable probabilities of winning each home & away game in a playoff series, which shouldn’t be affected by the order of those wins or losses (at least early in a 7 game series).
Thus, Nat’s argument sounded like the common innumerate mistake of believing heads-tails-heads is a more probable outcome of 3 fair coin flips than heads-heads-heads, because the former feels ‘more random’.
In fact, each sequence has precisely a 12.5% probability of occurring, assuming a 50% chance of heads or tails on a given flip.
This is true because each new coin flip is in no way influenced by the result of past flips — an attribute known as ‘independence’.
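The arithmetic can be checked in a couple of lines of Python:

```python
from itertools import product

# Each specific sequence of 3 fair, independent flips has probability (1/2)^3.
p_sequence = 0.5 ** 3
print(p_sequence)  # 0.125, i.e. 12.5%

# Exhaustive check: there are 2^3 = 8 equally likely sequences,
# so HHH & HTH alike each occur with probability 1/8.
sequences = list(product("HT", repeat=3))
print(len(sequences))  # 8
```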
Probability axiom for independent events in standard notation: P(A ∩ B) = P(A) · P(B)

There could be a plausible mechanism of dependence among playoff wins, foremost because NBA playoff series are known to be a game of strategic adjustments as much as talent & athleticism.
The losing team is expected to make adjustments to increase their winning chances, which could theoretically lead to a series of shifting advantages, at least among closely-matched teams.
In this case, striking first would give the best chance to win the series, because strictly alternating wins means the first winner is also the first to 4 wins, which ends the series.
But I had serious doubts that this could be a strong enough effect to overcome the forces of overall team quality & matchup dynamics that should govern the results of a large enough sample of best of 7 series, given the fact that the first 2 games were split, in any order.
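To see what independence implies for the bet itself, here is a minimal Monte Carlo sketch; the per-game win probabilities (55% at home, 40% away) are purely hypothetical placeholders, not estimates:

```python
import random

# Hypothetical per-game win probabilities for the (initially) away team,
# depending only on game location, i.e. assuming independence between games.
# These numbers (55% at home, 40% away) are placeholders, not estimates.
P_WIN = {"home": 0.55, "away": 0.40}

def rest_of_series_win(first_two, rng):
    """Given the away team's results in games 1 & 2 ('W_L' or 'L_W'),
    play out games 3-7 of a best-of-7 (games 3, 4 & 6 at home; 5 & 7 away)
    & return True if the away team reaches 4 wins first."""
    wins = first_two.count("W")
    losses = 2 - wins
    for loc in ("home", "home", "away", "home", "away"):
        if rng.random() < P_WIN[loc]:
            wins += 1
        else:
            losses += 1
        if wins == 4 or losses == 4:
            break
    return wins == 4

rng = random.Random(42)
n = 100_000
rates = {order: sum(rest_of_series_win(order, rng) for _ in range(n)) / n
         for order in ("W_L", "L_W")}
print(rates)  # the two rates agree up to simulation noise
```

Under independence, the simulated series-win rates for W_L & L_W starts are identical up to noise, since the remaining schedule & win count are the same either way; that is exactly the claim the bet puts to the data.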
Nat is undoubtedly the bigger fan & basketball stat expert, but I felt compelled to stick to my probabilistic guns on this one.
¹ This was contrary to conventional wisdom (Raptors in 6–7) as well as analytics nerd consensus (Raptors in 5), but appeared, at some points, to be under-optimistic, if anything.
Data Collection & Preprocessing

Finding the data

A short search for the best data to settle this question led to 538’s expertly curated Historical NBA Elo dataset (under CC BY license).
(Of course the eminent basketball-reference.com has the data, but not in as convenient a format, that I could tell. Only later did I learn of a useful ‘frivolities’ back alley of the site.)
The 126,314-row CSV file (17 MB) contains entries for every NBA (& ABA?) game between 1946–47 & 2014–15, with 2 rows for each game (1 from the perspective of each team), as well as an is_playoffs binary indicator field & a game_location indicator field (Home or Away), among many others.

This data will more than suffice, but as with any real world data analysis, substantial preprocessing & data preparation is necessary, informed by relevant domain knowledge.
Most notably, the NBA playoff format varied dramatically over its history, only recently settling into the 4-round, best of 7 format across the board.
Per Wikipedia:

“Finally in 1984, the tournament expanded to its present 16-team, four-round knockout, and the now-complete set of first-round series were expanded to a best-of-five. In 2003 the first round was changed to also be best-of-seven.”
So only post-2002 playoff series are fully in line with the spirit of the original question, but the 1984–2002 seasons’ series are quite close as well.
Wrangling the data

While I’d normally read such smedium datasets directly into R for analysis, I wanted a more robust & persistent repository for exploring & iterating over the data.
So I uploaded the file into my cloud data warehouse of choice: Google BigQuery, which boasts not only outstanding performance on massive datasets, but also my favorite “standard” SQL engine due to its support for nested data structures, DDL & DML (to say nothing of next-gen innovations like BigQuery ML & deep integration w/ Google Cloud Platform).
I made the BigQuery dataset public here for your enjoyment (under the same CC BY license as 538).
The first wrangling step was to isolate only playoff series after 1983.
I also wanted to add some fields:

- series_id: unique identifier for each playoff series
- series_results_array: full results for each series, nested in an array in each single-game-level row
- series_first_two: an indicator of the results of only the first 2 games, underscore-separated (e.g. W_W, W_L, etc.)
- series_wins & series_losses: total number of wins & losses in the playoff series by the team of interest
- series_result: indicator of the series result, determined by comparing series_wins & series_losses
- series_home_court: indicator of whether the team of interest was home or away for the first 2 games

This was accomplished via the following query, taking advantage of some of my favorite features of BigQuery standard SQL mentioned above (particularly array functions).

The BigQuery output table is here, & looks something like this:

Note the series_results_array column, which nests results for the entire playoff series inside of a single cell in each row.
I actually did some extra formatting to allow these arrays to be represented as a single string & exported / rendered in a csv file.
BigQuery itself represents nested arrays as a table-within-a-table, & actually stores the results on the backend as JSON, which can also be viewed or exported (a single row represented as JSON). This is a powerful feature (reminiscent of tibble “list-columns” in R).
Once stored as an array accessible on a row-wise basis, we can also retrieve values from the array based on their index, using e.g. [OFFSET(0)] (0-based indexing) or [ORDINAL(1)] (1-based indexing).
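The same indexing idea, transliterated to Python’s 0-based lists (the sample array is made up):

```python
# Made-up sample of one series' nested results, away team's perspective.
series_results_array = ["W", "L", "W", "W", "L", "W"]

# BigQuery's series_results_array[OFFSET(0)] (or [ORDINAL(1)]) is plain
# series_results_array[0] here; concatenate the first two entries.
first_game = series_results_array[0]
second_game = series_results_array[1]
series_first_two = f"{first_game}_{second_game}"
print(series_first_two)  # W_L
```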
This is how I concatenated the first two game results of each series into a single indicator in the above query.

Well & good, but this was essentially an intermediate preprocessing step.
I also wanted to collapse rows down from a single-game level to a playoff-series level: since I now had the full series game results in each row, I no longer needed a row for each game; in fact, analysis & aggregation would be easier without them.
That is accomplished by this simple query & analytic / window function:

Now only the 1st game / row of each playoff series remains, along with all other relevant series-level info added in the first wrangling query above.
Table output here; sample:

Yet there remains row data extraneous to the conditions of our original bet:

- Each series contains duplicate entries, one from each team’s perspective
- We’re only interested in the away team’s perspective
- We’re only interested in series where the first 2 games were split, in some order — “W_L” or “L_W”

This is resolved by a few statements in a WHERE clause (which could also simply be applied during any subsequent aggregation query or analysis step, but done here for clarity). Table output here; sample:

This brings us to the moment of truth…kinda.
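For concreteness, the series-level collapse & the WHERE-clause filters can be mimicked in plain Python on a few made-up rows (field names follow the article; the series_id & values are hypothetical):

```python
# Each dict mimics one game-level row from the wrangled table; values made up.
rows = [
    {"series_id": "1984_ECF_hypothetical", "team": "PHI", "game_num": 1,
     "game_location": "A", "series_first_two": "W_L", "series_result": "L"},
    {"series_id": "1984_ECF_hypothetical", "team": "PHI", "game_num": 2,
     "game_location": "A", "series_first_two": "W_L", "series_result": "L"},
    {"series_id": "1984_ECF_hypothetical", "team": "BOS", "game_num": 1,
     "game_location": "H", "series_first_two": "L_W", "series_result": "W"},
]

# 1) Collapse to series level: keep only each perspective's first game/row
#    (the analytic / window-function step in the BigQuery query).
series_level = [r for r in rows if r["game_num"] == 1]

# 2) WHERE clause: away-team perspective only, first 2 games split either way.
filtered = [r for r in series_level
            if r["game_location"] == "A"
            and r["series_first_two"] in ("W_L", "L_W")]
print(filtered)  # one row per series, away perspective only
```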
Data Analysis

Counts, summary stats / descriptive statistics

Beginning with the cleanest period (post-2002), we can start with simple counts & percentages, conventionally represented in a contingency table. Output:

So at first blush, independence prevails: away teams winning then losing the first 2 games have an essentially identical series win % to teams losing then winning: 39%.
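The contingency-table tally reduces to simple counting; a minimal Python sketch on made-up records (not the real 538 counts):

```python
from collections import Counter

# Made-up series-level records, not the real 538 counts:
# (series_first_two, series_result) from the away team's perspective.
records = [("W_L", "W"), ("W_L", "L"), ("L_W", "W"), ("L_W", "L"), ("L_W", "L")]

table = Counter(records)  # contingency cell counts
for first_two in ("W_L", "L_W"):
    total = sum(v for (ft, _), v in table.items() if ft == first_two)
    wins = table[(first_two, "W")]
    print(first_two, f"{wins}/{total} = {wins / total:.0%} series wins")
```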
Still, a sample size of 84 total series, while meaningful, leaves a bit to be desired.
So let’s try it on the full post-1983 table (use your imagination as to the necessary modification to the above query).

Now it gets interesting.
A 49% to 35% advantage for the win-first away teams in playoff series since 1984.
Moreover, the entirety of that advantage must have come between 1984-2002, since we’ve already established that 2003–2014 series were essentially tied.
The 193-series sample size is substantial, but also tainted by the best of 5 first round series from 1984–2002, which change the dynamics of the original wager to some degree.
So let’s try to filter those out by removing series ending with fewer than 4 wins. Output:

That cut the gap in half, from 14 percentage points to 7, with a sample size of 134 playoff series.
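As a rough scale check (not the formal analysis deferred to Part II), assuming the 134 series split about evenly between the two groups & a pooled series-win rate around 42% — both assumptions, not data — the sampling noise alone is comparable to the observed gap:

```python
import math

# Back-of-the-envelope noise check; both inputs below are assumptions:
n_per_group = 134 / 2  # assume the 134 series split evenly between W_L & L_W
p = 0.42               # assume a pooled series-win rate around 42%

se_group = math.sqrt(p * (1 - p) / n_per_group)  # SE of one group's win %
se_diff = math.sqrt(2) * se_group                # SE of the difference
print(f"SE of one group's win %: {se_group:.3f}")
print(f"SE of the difference:    {se_diff:.3f}")
```

On this crude view the standard error of the difference is in the neighborhood of 8 points, so a 7-point gap sits within about one standard error of zero — which is exactly why the question deserves the fuller treatment to come.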
In-conclusion

So is this a “real” difference? Or simply expected random fluctuation around otherwise equal winning chances?

This sounds like a question about statistical significance, but, as you might have heard, the scientists are rising up against it, as the (mostly Bayesian) statisticians have long advocated.
Traditional so-called Null Hypothesis Significance Testing (NHST) is on the outs, but surely the Bayesian alternatives will save us?…oh…oh no.
How to proceed? Bets must be settled. Decisions must be made. Science must advance.

Enter the hairy field of decision theory / science / analysis, …which we’ll explore in Part II of this post.
So stay tuned!

…But since you’re here, left momentarily hanging, please accept as consolation this interactive dashboard built in Google Data Studio, fed by the intermediate BigQuery table created above.
The filters are set to relevant configurations for the question at hand, but feel free to modify & explore on your own!

Epilogue: Flus, Flukes & Tears

(Photo: property of the Philadelphia Inquirer – Charles Fox, staff photographer)

As for our heroes, they indeed won game 2 in enemy territory, bringing the tied series home & going on to win game 3 by blowout.
Up 2–1 in the series with another home game on deck, the odds appeared to be resolutely in our favor.
Then the oddness began.
Joel Embiid — the 76ers’ crown jewel — required an IV at 6am the morning of game 4, after vomiting all night due to an upper respiratory infection.
This after being nearly sidelined with gastroenteritis in game 2 & struggling with knee tendonitis throughout the latter half of the season.
This hard luck resulted in a close loss at home in game 4, a blowout loss on the road in game 5, then a convincing recovery win at home in game 6.
Thus, the stage was set for a final showdown in Toronto, the series tied 3–3: winner takes all.
After an epic back & forth battle, the game was tied with 4.2 seconds left.

…Let’s just say, our heroes ended up on the extremely unfortunate side of chance, several times over, when the flukiest buzzer-beating, game-winning shot I’ve ever seen happened to transpire:

Angle of maximal preposterousness:

The close up:

………. felt real bad.
So put 1 more in the loss column of the L_W group.
(However, for the purposes of our wager, this loss was offset by the contemporaneous Portland Trail Blazers vs Denver Nuggets Western Conference Semifinals series, which exactly mirrored the above results through 6 games, but ended in favor of the away team.)
While I concoct absurd numbers of statistical models & tests for this silly bet out of pure pettiness & write them up in Part II, follow me & check out my other posts.
Some relevant selections:

- On Average, You’re Using the Wrong Average: Geometric & Harmonic Means in Data Analysis — When the Mean Doesn’t Mean What You Think it Means (towardsdatascience.com)
- On Average, You’re Using the Wrong Average — Part II: Real & Simulated Data + Summary Statistic Dynamics using R & BigQuery (towardsdatascience.com)
- The Logistic Map & the Onset of Chaos, Sonified: System Dynamics Modeling & Audio Synthesis in Max/MSP (medium.com)
- Simulating Misanthropic Neighbors: Using R & Shiny to solve a FiveThirtyEight Riddle (towardsdatascience.com)
- The Empire of Chance: How probability changed science & everyday life — Solo Book Club vol. 1, skimmable notes (towardsdatascience.com)

Follow on twitter: @dnlmc
LinkedIn: linkedin.com/dnlmc