I used machine learning to predict an Ultimate Frisbee tournament and it kinda worked

Daniel Walton, Jun 13

Photo by Paul Rutherford of UltiPhotos.com

Daniel Walton has a PhD in Atmospheric Sciences and an MA in Mathematics.

He is passionate about statistics and modeling and is an aspiring data scientist.

You can connect with him on LinkedIn.

He is also an experienced ultimate player, qualifying for USAU Club Nationals eight times in a row, most recently with Seattle Mixtape.

The Challenge

This will only take a few hours, I thought.

Famous last words…

Ultiworld.com, the premier news website for Ultimate Frisbee, just posted the seedings for College Nationals.

More importantly, they also posted the rules for entering their annual College Nationals fantasy game, called #TheGame.

Pick five total teams: two from the women’s division, two from the men’s division, and one from either division.

The person whose five-team combination scores the most fantasy points at College Nationals wins.

Hold on, let’s step back for a second.

What is ultimate exactly?

This is ultimate. (Photo by Paul Rutherford of UltiPhotos.com)

Yes, this is ultimate too. (Photo by Paul Rutherford of UltiPhotos.com)

No, this is not ultimate.

This is actually Frisbee Dog.

Ultimate Frisbee or “ultimate” for short (and for trademark reasons) is a sport played between two teams of seven.

Players advance the disc down the field by completing passes and score by catching a pass in the endzone.

Games are usually played to 15, win by two, though they are often shortened by a time cap.

USA Ultimate College Nationals is the culminating tournament of the college season.

It’s a four-day event that features 20 of the best teams in each of the women’s and men’s divisions.

The first stage is pool play, where the teams are divided into four pools of five and play round-robin style against the other teams in their pool.

For example, here are the women’s division pools:

Women’s division pools. (Overall seed is in parentheses.)

So for example, in Pool B, Ohio State is the top-seeded team (B1) and Washington is the lowest-seeded team (B5).

Ohio State will be favored in all of its pool-play games, while Washington will be the underdog in all of theirs.

Teams that finish in the top three of their pool advance to a 12-team single elimination bracket to determine the national champion.

Here is the championship bracket:

A couple of notes on the bracket:

The top finisher in each pool gets a bye into the quarterfinal round.

The second- and third-place finishers in each pool must play in the pre-quarters round.

OK, back to #TheGame.

As I was saying, a participant picks five teams: two women’s, two men’s, and one that can be from either division.

The participant whose teams score the most combined fantasy points wins #TheGame.

Fantasy points are scored based on the total number of wins a team has (pool play and bracket), multiplied by the team’s seed in their pool.

This weighs things more evenly, so bottom seeds can still earn lots of points by pulling a few upsets.

So, if you are Oregon, the five-seed in pool A, and you win two games in pool play and one game in the bracket you’d earn 3 wins × 5 points/win = 15 points.

A team can also earn bonus points.

Being the top finisher in the pool is worth 1 bonus point.

Winning a quarters game earns 1 bonus point.

Winning semis earns 2 bonus points.

Winning finals earns 3 bonus points.

If North Carolina (the top seed and title favorite) won all their pool-play games and won the championship bracket, they would score 14 points.

If Oregon, the 5-seed in their pool, won all their pool-play games and the championship bracket, they would earn 42 points.

So, lower seeds have higher ceilings, but are less likely to advance.
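The scoring rules above can be sketched as a small Python function. This is a minimal illustration, not the code from my repository; the function name and its arguments are hypothetical, chosen only for this example.

```python
def fantasy_points(pool_seed, pool_wins, bracket_wins,
                   won_pool=False, won_quarters=False,
                   won_semis=False, won_finals=False):
    """Fantasy points for one team under the #TheGame rules described above."""
    # Every win (pool play or bracket) is worth the team's seed in its pool.
    points = (pool_wins + bracket_wins) * pool_seed
    # Bonus points: +1 top of pool, +1 quarters win, +2 semis win, +3 finals win.
    if won_pool:
        points += 1
    if won_quarters:
        points += 1
    if won_semis:
        points += 2
    if won_finals:
        points += 3
    return points


# Oregon (5-seed): two pool-play wins and one bracket win, no bonuses.
print(fantasy_points(5, pool_wins=2, bracket_wins=1))            # 15
# A team that wins its pool (4 wins, bye to quarters) and then the title (3 wins).
print(fantasy_points(1, 4, 3, True, True, True, True))           # 14
print(fantasy_points(5, 4, 3, True, True, True, True))           # 42
```

The last two calls reproduce the North Carolina and Oregon ceilings quoted above: the bonus points are the same 7 in both cases, so the whole gap comes from the seed multiplier.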

Thus, to accurately assess the fantasy value of a team, we need to know how likely they are to win each game as they advance deeper into the tournament.

Like I said, piece of cake.

The tournament was seven days away, but it should take only a couple hours for me to whip up some Python code, run some Monte Carlo simulations, and calculate the projected distribution of fantasy points for each team.

I was wrong.

Simulating an ultimate tournament

The key building block of a tournament is a game.

If we can model the likelihood of a game finishing with a certain score, then we can model a tournament as a succession of games.

Here are some facts about how an ultimate game works that we want to account for:

Teams randomly decide which team starts on offense.

This team receives the kickoff, known as the “pull”.

If Team A starts a point on offense and scores that point, then the other team, Team B, will start on offense the next point.

If Team A starts a point on offense and does not score, they will continue to start points on offense until they score.

[The statisticians out there might already be thinking, “A series of failures before the first success, this sounds like a negative binomial process.” You’re right.]

Here are some simplifying assumptions:

A game is over when one team reaches 15 points.

No time caps, no winning by two.

No halftime.

[Note: It would have been easy to implement halftime. I should have done it because it could have an impact in close games. Next year!]

Now, the interesting part.

Here’s how I modeled the games:

When Team A starts on offense, we simulate them winning the point by flipping a weighted coin with P(heads) = pₐ.

If it comes up heads, Team A scores a point and Team B starts on offense the next point.

If it comes up tails, Team B scores a point and Team A starts again on offense the next point.

When Team B starts a point on offense, we flip a second, differently weighted coin with P(heads) = pᵦ to see if Team B scores on offense.

Successive flips of the same coin are independent (i.e., teams aren’t influenced by good or poor outcomes of previous points).

Statistically speaking, our process is basically a series of negative binomial processes where the probability of success depends on which team started the point on offense.
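Here is a minimal sketch of the two-coin game model in Python. It follows the rules and simplifying assumptions listed above (game to 15, no cap, no halftime); the function name, argument names, and the example probabilities are illustrative choices for this sketch, not necessarily what appears in my repository.

```python
import random


def simulate_game(p_a, p_b, a_starts_on_offense=True, game_to=15, rng=random):
    """Simulate one game and return (score_a, score_b).

    p_a: chance Team A scores a point it starts on offense.
    p_b: chance Team B scores a point it starts on offense.
    """
    score_a = score_b = 0
    a_on_offense = a_starts_on_offense
    while score_a < game_to and score_b < game_to:
        # Flip the coin that belongs to whichever team starts on offense.
        p_hold = p_a if a_on_offense else p_b
        offense_scores = rng.random() < p_hold
        # Team A scores if it held on offense or broke on defense.
        a_scored = (a_on_offense == offense_scores)
        if a_scored:
            score_a += 1
        else:
            score_b += 1
        # The team that was just scored on starts the next point on offense.
        a_on_offense = not a_scored
    return score_a, score_b


# Estimate how often Team B pulls the upset (hypothetical p_a and p_b).
rng = random.Random(2019)
n_sims = 1000
upsets = 0
for _ in range(n_sims):
    a_first = rng.random() < 0.5          # random choice of who starts on offense
    a, b = simulate_game(0.70, 0.56, a_first, rng=rng)
    upsets += b > a
print(f"Estimated upset frequency: {upsets / n_sims:.3f}")
```

A whole tournament is then just many calls to a function like this, one per scheduled game, with the winners fed into the next round.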

Here are some frequently asked questions:

Q: Why two coins?

A: Having two different coins reflects the fact that teams typically play different players when they start a point on offense versus when they start a point on defense.

Team A’s offensive players against Team B’s defensive players is a fundamentally different matchup than Team B’s offensive players against Team A’s defensive players.

Q: Is such complexity warranted?

A: Yes.

My initial model, which flipped the same weighted coin for all points, produced the right average scores, but the variance was far too high (many more upsets than the game data supports).

Now, to simulate a game between Team A and Team B, we need to know two probabilities, pₐ and pᵦ.

Intuitively, they should depend on the relative strength of the teams.

Essentially, I tuned these two parameters until I got the right expected score and the right frequency of upsets based on actual game data.

But given two teams, what is the expected score and the frequency of an upset?

I came here for the machine learning, where does that fit in?

I’m glad you asked.

I used machine learning techniques to model how the expected score and upset frequency depend on the strength of the two teams.

But to do that we need actual data.

[Machine learning not your thing? Click here to jump to my fantasy picks. Interested in my code? It’s all on GitHub.]

To gather data, I scraped the USAU website using BeautifulSoup and pandas.

Here’s my scraping code.

I scraped USAU power ratings and game data for any game involving a top-50 team.

Game data is simply the outcome (W/L) and score of each game.

A team’s power rating is a numerical value measuring their strength based on who they played and what the score was.

Big wins or wins against good competition usually increase a team’s power rating; big losses or losses against inferior competition usually decrease their power rating.

Thus, the power ratings estimate how strong your team is based on all games up to that point.

I grabbed ratings from the end of the regular season (4/4/2019) and game data from the postseason, 4/4/2019 to 5/23/2019 (called Sectionals and Regionals; they precede nationals).

By only using game data from after the ratings are released, we can test whether power ratings are predictive of future games.

(We know they are reflective of past games, as that’s what they are generated from.) Assessing the predictive capacity of our modeling framework is important, since we’ll ultimately apply this same framework to predict what will happen at Nationals.

Snapshot of USA Ultimate women’s power ratings from the end of the regular season (4/4/2019)

Onto the nitty gritty of the machine learning.

The short version is, I fit models on the training data; they performed sufficiently well on test data.

This made me confident that our framework would be useful in predicting game outcomes for College Nationals.

Here’s my machine learning code.

Like any good data scientist, I split my data into training and test sets (a 50/50 split).

Some of you might be wondering why I chose a split with such a large test set.

The answer is that upsets are relatively rare in this dataset and I was concerned that if I did a typical 5-way or 10-way split, I would come up with test sets with no upsets at all.

First, I used non-linear regression to model the score as a function of the difference in power rating between the two teams.

Actually, to be specific, the target variable was the adjusted victory margin, which accounts for the fact that in some games the winner doesn’t get to 15 points.

Adjusted victory margin is calculated by first normalizing the score to a game to 15 and then subtracting the underdog’s score from the score of the favorite.
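The definition above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes "normalizing" means scaling both scores so that the winner lands on exactly 15.

```python
def adjusted_victory_margin(favorite_score, underdog_score, game_to=15):
    """Favorite's margin after rescaling the score to a game to `game_to`."""
    winner_score = max(favorite_score, underdog_score)
    scale = game_to / winner_score        # e.g. 15 / 12 = 1.25 for a 12-8 game
    return (favorite_score - underdog_score) * scale


print(adjusted_victory_margin(12, 8))   # favorite wins 12-8  ->  5.0
print(adjusted_victory_margin(8, 12))   # underdog wins 12-8  -> -5.0
print(adjusted_victory_margin(15, 11))  # already a game to 15 -> 4.0
```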

So, if the favorite won 12–8, then the normalized score would be 15–10 and the adjusted victory margin would be +5. (If the underdog won 12–8, then the adjusted victory margin would be −5.)

So, like I said, I used non-linear regression to model victory margin.

I actually used a logistic function because its shape and interpretation fit the distribution of adjusted victory margin.

[Note: this is different than logistic regression, which is actually a type of classification. Later, I use logistic regression, but for win/loss outcomes.]

Logistic curve fit to training data. Of the 229 games in the training data, there were 23 upsets (i.e., victory margin < 0).

To predict game outcome (whether the higher-rated team won or lost), I used logistic regression with the power rating difference as the predictor variable.

Logistic regression is appropriate here because our data has binary outcomes and we want to be able to predict the probability of those outcomes.

Logistic regression used to predict win probability of the higher-rated team based on rating difference.
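For readers who have not seen logistic regression before, here is a rough, dependency-free sketch of the idea: fit an intercept and a slope so that a sigmoid of the rating difference matches the observed win/loss outcomes. This uses plain gradient descent, which is an assumption made to keep the example self-contained and is not necessarily how my actual code fits the model. The rating differences and outcomes below are made-up toy values for illustration only, not the USAU data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent on log loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y   # gradient of log loss w.r.t. the logit
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# TOY DATA (invented): rating difference in hundreds of points, and whether
# the higher-rated team won (1) or lost (0).
rating_diff = [0.1, 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
favorite_won = [1, 0, 1, 0, 1, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(rating_diff, favorite_won)


def win_probability(diff):
    """Predicted chance that the higher-rated team wins."""
    return sigmoid(b0 + b1 * diff)


for d in (0.5, 2.0, 4.0):
    print(f"rating diff {d:.1f}: P(favorite wins) = {win_probability(d):.2f}")
```

The fitted slope is positive whenever bigger rating gaps go with more favorite wins, so the predicted win probability rises smoothly with the rating difference.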

OK, so now we fit models for victory margin and win probability.

But why did we want to know the expected victory margin and the win probabilities again? Because we need those to constrain the probabilities pₐ and pᵦ, the respective chances that each team scores when they start a point on offense.

Basically, the victory margin tells us the ratio of pₐ and pᵦ, while the win probabilities allow us to fit pₐ.

Interestingly, it turns out that if Team A is the favorite, then the expected score is approximately 15 to (pᵦ / pₐ) · 15.

[I won’t get into the details of the math here, but try it at home if you’re so inclined. It’s exactly true that E(Team B’s score) / E(Team A’s score) → pᵦ / pₐ as the winning score → ∞.]

For example, if Team A scores on 60% of its offensive points and Team B scores on 40% of its offensive points, then the expected score would be approximately 15 to 10, since (0.4 / 0.6) · 15 = 10.

This allows us to express pᵦ as a function of pₐ and the adjusted victory margin:

pᵦ = pₐ · (15 − adj. victory margin) / 15

Ex: If the expected adj. victory margin between teams is 3, then

pᵦ = pₐ · (15 − 3) / 15 = pₐ · 12/15 = 0.8 · pₐ.
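As a quick sanity check, the relationship between pₐ, pᵦ, and the adjusted victory margin can be written as two tiny helper functions. The names are hypothetical and the numbers are the ones from the examples above.

```python
def underdog_hold_rate(p_a, adj_margin, game_to=15):
    """p_b implied by the favorite's hold rate p_a and the expected adjusted margin."""
    return p_a * (game_to - adj_margin) / game_to


def approx_expected_score(p_a, p_b, game_to=15):
    """Approximate expected score when Team A is the favorite: 15 to (p_b / p_a) * 15."""
    return game_to, game_to * p_b / p_a


# Expected margin of 5 with p_a = 0.6 gives p_b of about 0.4 ...
p_b = underdog_hold_rate(0.6, 5)
print(round(p_b, 3))
# ... which maps back to an approximate 15-10 expected score.
print(tuple(round(s, 3) for s in approx_expected_score(0.6, p_b)))
# Expected margin of 3 gives p_b = 0.8 * p_a.
print(round(underdog_hold_rate(0.6, 3) / 0.6, 3))
```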

Now, we just need to constrain pₐ, which we can do with the win probabilities.

But first, let’s develop some intuition for pₐ.

What does a high pₐ scenario represent? Let’s look at an extreme scenario.

If pₐ is really high (say 1.0), then Team A scores every time they start a point on offense.

And if they get to start on offense first, they will win the game every time, even if pᵦ is high.

So, a high value of pₐ means very few upsets.

On the other hand, suppose pₐ is relatively low (say 0.1).

Since Team A is the favorite and Team B is the underdog, pᵦ < pₐ, so pᵦ is also low.

Having low probabilities of the offense scoring means that the teams are likely to go on long runs where the defense scores many points in a row.

Such long runs lead to much more variable scores and a higher chance of an upset.

So, a low pₐ value means more upsets.
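This intuition can be checked without any simulation noise by computing the favorite's win probability exactly, stepping through every possible score. The sketch below is a small dynamic program over (Team A's score, Team B's score, who is on offense) under the same simplified rules as before; it is an illustration rather than the Monte Carlo code I actually ran, and the 50/50 averaging over who starts on offense is an assumption standing in for the coin flip.

```python
from functools import lru_cache


def favorite_win_probability(p_a, p_b, a_starts_on_offense=True, game_to=15):
    """Exact chance that Team A reaches `game_to` first in the two-coin model."""

    @lru_cache(maxsize=None)
    def win(score_a, score_b, a_on_offense):
        if score_a == game_to:
            return 1.0
        if score_b == game_to:
            return 0.0
        if a_on_offense:
            # A holds (B receives next) or B breaks (A receives again).
            return (p_a * win(score_a + 1, score_b, False)
                    + (1 - p_a) * win(score_a, score_b + 1, True))
        # B holds (A receives next) or A breaks (B receives again).
        return (p_b * win(score_a, score_b + 1, True)
                + (1 - p_b) * win(score_a + 1, score_b, False))

    return win(0, 0, a_starts_on_offense)


def upset_probability(p_a, p_b):
    """Chance the underdog wins, averaging over who starts on offense."""
    p_fav = 0.5 * (favorite_win_probability(p_a, p_b, True)
                   + favorite_win_probability(p_a, p_b, False))
    return 1.0 - p_fav


# Same ratio p_b / p_a = 0.8 (same expected margin), very different hold rates.
print(f"high p_a: upset chance = {upset_probability(0.9, 0.72):.3f}")
print(f"low  p_a: upset chance = {upset_probability(0.1, 0.08):.3f}")
```

Both scenarios imply the same expected margin, but the low-pₐ game produces long defensive runs and therefore a noticeably larger upset chance.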

Based on limited data I was able to get from Ultianalytics.com, it appears that pₐ is fairly stable for elite teams, in the 0.6–0.8 range.

If it were truly constant, it would be quite remarkable.

Essentially, it would say that the favorite is expected to convert on offense at the same rate, regardless of the skill of the underdog.

Is it true? My intuition says that it is probably not true when the favorite is much better than the underdog.

However, it might work for Nationals, where the disparities between teams are smaller than other tournaments.

Still, I don’t have enough data to determine one way or another.

It does make things much simpler in terms of our calculations.

And did I mention I was running out of time? At this point, the first games of Nationals were less than 24 hours away.

So I made the simplifying assumption that pₐ is constant.

As we’ll see, our resulting model performed well, so the assumption didn’t overly hinder this experiment.

So, I decided to assume a constant pₐ, but what should its value be? Essentially, it is a hyperparameter for our model.

I used a grid search in the parameter space to determine what the optimal value would be.

Based on the training data, for women it was 0.65; for men it was 0.78.

The interpretation of these values is that men’s favorites hold serve on offense at a slightly higher clip than women’s favorites.

Thus, if two men’s teams and two women’s teams differ by the same amount in the power ratings, then we would expect fewer upsets in the men’s matchup and more upsets in the women’s matchup.

I used the test set to evaluate the model.

I compared the win probability curve inferred from the test data to the win probability predictions of my model.

Women’s test data (W/L; blue dots), win probability inferred from test data (blue line), and predicted win probability from my model based on training data (orange dots).

For the women’s side, the agreement between the predicted win probabilities (orange dots) and those inferred from the test data (blue line) is pretty good.

The root mean square error was 0.008.

The agreement is weaker for the men’s side, with a root mean square error of 0.016.

The men’s model had a noticeable bias, as it underestimated the probability that the favorite would win in the test set.

Men’s test data (blue dots), win probability inferred from test data (blue line), and predicted win probability from my model based on training data (orange dots).

The errors may seem lower than expected.

This is partly because a large fraction of the games were between extremely mismatched teams.

Many of the games have the favorites above a 0.95 chance of winning, so even if the model doesn’t get it exactly right, the difference in probability is small.

Still, even for the less certain games, the model predicts the win probabilities within about 0.1.

Also, I don’t want to belabor the model validation component, because now that Nationals actually happened, we can directly test how well the model did!

But, before I get ahead of myself, let’s talk about making fantasy picks.

Fantasy Picks

The goal of simulating College Nationals is to produce picks for fantasy ultimate.

But let’s first see what the results would look like for a single game.

Here is the likelihood of a team beating the overall #1 seed on the men’s side, North Carolina, in a single game based on 1,000 simulations.

Probability of upsetting North Carolina in the men’s division based on 1,000 simulations for each rating difference.

Brown has a nearly 50/50 chance of beating North Carolina in a single game.

Meanwhile, Northeastern would be expected to pull off the upset only 1 time in 10 games.

Now we get into the tournament itself.

How likely is each team to advance?

Back to the women’s division.

Chance of each team in the women’s division of reaching a given round.

Our simulations give North Carolina a 45% chance of winning! Dartmouth, the two-time defending champ, is given only a 2% chance of winning it all.

Dartmouth had some subpar results this year and aren’t even the top seed in their pool, but I imagine a lot of people are still betting on them.

Over to the men’s division.

Chance of each team in the men’s division advancing to at least the given round.

The simulations suggest North Carolina and Brown are the likely finalists, with North Carolina the favorite to win it all.

As we’ll see, the simulations got a lot of that right.

OK, but what does that mean in terms of fantasy points?

Let’s first check the clock.

How much time do I have left? Well, the first game of College Nationals starts at 6:00 AM Pacific.

What time is it here? 1:00 AM Pacific.

I only have 5 hours left.

With most of my week gone and really needing to sleep, I had little time to make my picks in a perfectly data-driven manner.

I would characterize the way I made them as “mostly data-driven”.

Here are the distributions of fantasy points for #TheGame based on simulating College Nationals 1,000 times.

[I know the standard in statistics is 10,000 simulations, but again, time was not on my side.]

Women’s division:

Distribution of fantasy points in #TheGame for each women’s team.

Columns are pools A-D.

On the women’s side, North Carolina was expected to score the most fantasy points (10.52) and it’s not even close.

Next is UCSD with 8.64 and Dartmouth with 8.63.

There are some valuable middle seeds too, with the Texas women (seeded third in their pool) expected to score 7.79 points.

I made sure not to pick Cornell (expected score 0.26).

Now for the men’s division:

Distribution of fantasy points in #TheGame for each men’s team.

Columns are pools A-D.

The overall favorites, UNC and Brown, are both expected to score a lot of fantasy points, 9.56 and 9.17, respectively.

However, the simulations suggest some excellent value from 2-seeds Oregon (9.14) and Colorado (9.21), and 3-seeds Wisconsin (8.36) and Washington (9.35).

However, I came across a problem in my thinking.

If I just pick based on expected value, I will be trying to maximize my average score.

But will I actually win any of the games? Or will I just finish pretty high in most games, but never win?