library(tidyverse); library(repmis); library(httr)
Simulating 2018 NCAA March Madness Tournament Performance in R
It’s that time of the year again–MARCH MADNESS.
A few of us graduate students decided it would be fun to simulate this year’s NCAA Men’s National Championship tournament using R. When we all met to discuss our method to simuluate tournament performance, we quickly realized that each of us had a different method in mind. These methods differed in not only how we determined probability of win via seed, but also the techniques and packages implemented within R. Below showcases 3 different methods we used, each with different results. The methods are arranged in order of increasing use of historical data.
Matt’s Method
The packages needed for this method:
Simulation
This method (that Nick actually invented) relies on the seed of each team in the tournament. The idea is to assign a range of numbers to each seed with the ranges of lower seeds (e.g., #1, #2, #3, …) containing greater numbers than the ranges of higher seeds (e.g., …, #14, #15, #16). Just like this:
Seed | Lower.Bound | Upper.Bound |
---|---|---|
1 | 160 | 320 |
2 | 150 | 310 |
3 | 140 | 300 |
4 | 130 | 290 |
5 | 120 | 280 |
6 | 110 | 270 |
7 | 100 | 260 |
8 | 90 | 250 |
9 | 80 | 240 |
10 | 70 | 230 |
11 | 60 | 220 |
12 | 50 | 210 |
13 | 40 | 200 |
14 | 30 | 190 |
15 | 20 | 180 |
16 | 10 | 170 |
Then, randomly sample from each team’s range and assign the win to the higher number (for each match-up). This is a crude method that will roughly correspond to seed. In other words, each seed has a higher probability of winning over each subsequent seed; however, it is theoretically possible, although very unlikely, for a 16th seed to win the tournament. A simulation given purely historical data would not be able to account for this because a 16th seed has never made it past the first round prior to the 2018 tournament.
Below is the code I used to run the simulation. I tried to avoid for loops as much as possible and learn the map()
functions in the purrr package. The code is not as efficient as I would have liked it to be, but it gets the job done and allows for a variety of plots.
# Sets the regions
<- c('South', 'East', 'West', 'Midwest')
regions
# Import the team names and their seeds from 2018
# These data are being pulled from this website's Github repo
<- 'https://raw.githubusercontent.com/mkmiecik14/mkmiecik14.github.io/master/data/ncaa-2018-teams-updated.csv'
link
# from https://stackoverflow.com/questions/60714074/download-csv-file-from-github-using-httr-get-request
<- GET(link)
data .2018 <-
ncaaread_csv(content(data, "raw")) %>%
rename(region = Region, team = Team1, seed = Seed)
# Initializations
<- 1000
iters <- probTableHolder <- vector('list', length = iters)
results
# Sets seed for reprodicible results
set.seed(18)
The simulation was then ran 1000 times:
for(i in 1:iters) {
# Creates whole tournament probability set
<- probTable <- regions %>%
probTableHolder[[i]] map_df(~tibble(seed = c(1:16),
lowNum = seq(160, 10, -10),
highNum = seq(320, 170, -10)) %>%
group_by(seed) %>%
mutate(result = sample(lowNum:highNum, 1)) %>%
ungroup()) %>%
mutate(region = rep(regions, each = 16)) %>%
inner_join(ncaa.2018, by = c('seed', 'region'))
# Round 1
<- regions %>%
r1 map_df(~tibble(match.up = 1:8, t1.seed = 1:8, t2.seed = 16:9) %>%
mutate(t1.res = probTable$result[1:8],
t2.res = probTable$result[16:9],
seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
mutate(region = rep(regions, each = 8)) %>%
inner_join(probTable, by = c('seed', 'region'))
# Round 1 results
<- r1 %>% select(match.up, region, team, seed, result)
r1.res
# Round 2
<- regions %>%
r2 map_df(~tibble(r2.match.up = 1:4,
r2.match.up.1 = 1:4,
r2.match.up.2 = as.integer(seq(8, 5, -1)),
region = .) %>%
left_join(r1.res,
by = c('r2.match.up.1' = 'match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r1.res,
by = c('r2.match.up.2' = 'match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 2 results
<- r2 %>% select(r2.match.up, region, team, seed, result)
r2.res
# Round 3
<- regions %>%
r3 map_df(~tibble(r3.match.up = 1:2,
r3.match.up.1 = 1:2,
r3.match.up.2 = 4:3,
region = .) %>%
left_join(r2.res,
by = c('r3.match.up.1' = 'r2.match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r2.res,
by = c('r3.match.up.2' = 'r2.match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 3 results
<- r3 %>% select(r3.match.up, region, team, seed, result)
r3.res
# Round 4
<- regions %>%
r4 map_df(~tibble(r4.match.up = 1,
r4.match.up.1 = 1,
r4.match.up.2 = 2,
region = .) %>%
inner_join(r3.res,
by = c('r4.match.up.1' = 'r3.match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
inner_join(r3.res, by = c('r4.match.up.2' = 'r3.match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 4 results
<- r4 %>% select(r4.match.up, region, team, seed, result)
r4.res
# Semifinal
<- tibble(t1.region = c('South', 'East'),
semi.final t2.region = c('West', 'Midwest')) %>%
left_join(r4.res, by = c('t1.region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r4.res, by = c('t2.region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(team = if_else(t1.res > t2.res, t1.team, t2.team)) %>%
inner_join(probTable, by = 'team')
# Semifinal results
<- semi.final %>% select(region, team, seed, result)
semi.final.res
# Final
<- tibble(t1.region = semi.final.res$team[1],
final t2.region = semi.final.res$team[2],
t1.team = semi.final.res$team[1],
t2.team = semi.final.res$team[2],
t1.seed = semi.final.res$team[1],
t2.seed = semi.final.res$team[2],
t1.res = semi.final.res$result[1],
t2.res = semi.final.res$result[2]) %>%
mutate(team = if_else(t1.res > t2.res, t1.team, t2.team)) %>%
inner_join(probTable, by = 'team')
<- final %>% select(region, team, seed, result)
final.res
<- this.result <- bind_rows(r1.res, r2.res, r3.res, r4.res,
results[[i]]
semi.final.res, final.res, .id = 'round')
#print(i) # Uncomment to see progress
}
Data Viz
The results of this simulation were saved out and can be imported from this website’s Github data repository like this:
# Loading probability tables from github
<- 'https://github.com/mkmiecik14/mkmiecik14.github.io/blob/master/data/probTableHolder-updated.RData?raw=true'
probTableLink source_data(probTableLink)
# Loading simulation results from github
<- 'https://github.com/mkmiecik14/mkmiecik14.github.io/blob/master/data/results-updated.RData?raw=true'
resultsLink source_data(resultsLink)
First, let’s ensure that the probability table results worked across simulations and regions:
# Converts into a dataframe
<- probTableHolder %>% map_df(bind_rows, .id = 'iter')
probTable.long
# Calculates summary stats
<- probTable.long %>%
probTable.sum group_by(seed, region) %>%
summarize(m = mean(result),
sd = sd(result),
n = n(),
sem = sd/sqrt(n))
# Probability table plot
ggplot(probTable.long, aes(factor(seed), result, color = region)) +
geom_point(position = 'jitter', alpha = 1/3, shape = 1) +
geom_pointrange(
data = probTable.sum,
aes(factor(seed), m, color = region, ymin = m - sd, ymax = m + sd),
color = 'black'
+
) geom_line(
data = probTable.sum,
aes(factor(seed), m, group = 1, color = region),
color = 'black', alpha = 1/2
+
) scale_color_brewer(palette = 'Set2', name = 'Region') +
labs(x = 'Seed', y = 'Randomly Chosen Value', caption = 'SD Error Bars') +
theme_minimal() +
facet_wrap(~region)
This above visualization depicts the potential ranges from which each seed can draw an arbitrary number. This drawing occurs at the beginning of each simulation iteration and determines each team’s tournament fate. Higher seeds have an advantage of a higher possible range, but this overlap gradually decreases with increases in seed.
Now let’s examine the results of the simulations. As a check, better seeds should win with more frequency each round than worse seeds:
# Converts to long dataframe
<- results %>% map_df(bind_rows, .id = 'iter')
results.long
# Win histogram
ggplot(results.long, aes(as.factor(seed), fill = region)) +
geom_histogram(stat = 'count') +
facet_grid(round~region) +
scale_x_discrete(breaks = seq(1, 16, 3)) +
labs(x = 'Seed', y = 'Win Frequency', title = 'Number of Wins Per Round') +
scale_fill_brewer(palette = 'Set2', name = 'Region') +
theme_minimal()
The above plot confirms this; higher seeds win more often than lower seeds for each round.
Now for fun, let’s see which teams won the tournament most often.
# Round 6 summary information
<- results.long %>%
results.long.sum1 filter(round == '6') %>%
group_by(team, region) %>%
summarise(n = n(), seed = unique(seed))
# Winner plot
ggplot(results.long.sum1, aes(reorder(team, -1*n), n, fill = region)) +
geom_bar(stat = 'identity') +
scale_fill_brewer(palette = 'Set2', name = 'Region') +
labs(
x = 'Team', y = 'NCAA Tournament Wins',
caption = paste('Simulation iterations:', iters)
+
) geom_text(aes(label = seed), nudge_y = 4, color = 'grey') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
There are some clear, but expected, patterns that emerge given a simulation based solely on seed. First, #1 seeds tend to win more often than lower seeds. Second, win frequency tracks well with seed, but in a diminishing step-wise way. The margin is closer between #3 vs #4 seeds, than #2 vs #3 seeds, than #1 vs #2. Perhaps this has to do with the NCAA men’s tournament not re-seeding teams on a round-by-round basis (see this video by Jon Bois).
However, this method did allow for some upsets. As we can see, three #6 seeds won the tournament (TCU, Houston, Miami)!
So, does this analysis inform my bracket this year at all? Sort of, but I’d rather just have R do it for me.
I’ll just have the computer pick a random iteration that occured and I’ll use that to fill out my bracket:
# Random selection from the set of simulation iterations
# note: seed was set earlier in script
<- sample(1:iters, 1)
myPick
# Matt's bracket from a random simulation iteration
<- results.long %>%
mattsBracket filter(iter == myPick) %>%
select(round, region, team, seed) %>%
rename(winner = team)
So here’s what the computer drew for me (winners of each round):
round | region | winner | seed |
---|---|---|---|
1 | South | Virginia | 1 |
1 | South | Cincinnati | 2 |
1 | South | Tennessee | 3 |
1 | South | Arizona | 4 |
1 | South | Kentucky | 5 |
1 | South | Loyola Chicago | 11 |
1 | South | Nevada | 7 |
1 | South | Creighton | 8 |
1 | East | Villanova | 1 |
1 | East | Purdue | 2 |
1 | East | Texas Tech | 3 |
1 | East | Wichita State | 4 |
1 | East | West Virginia | 5 |
1 | East | St. Bona | 11 |
1 | East | Arkansas | 7 |
1 | East | Virginia Tech | 8 |
1 | West | Xavier | 1 |
1 | West | North Carolina | 2 |
1 | West | Michigan | 3 |
1 | West | Gonzaga | 4 |
1 | West | Ohio State | 5 |
1 | West | San Diego State | 11 |
1 | West | Texas A&M | 7 |
1 | West | Missouri | 8 |
1 | Midwest | Kansas | 1 |
1 | Midwest | Duke | 2 |
1 | Midwest | Michigan State | 3 |
1 | Midwest | Auburn | 4 |
1 | Midwest | Clemson | 5 |
1 | Midwest | Syracuse | 11 |
1 | Midwest | Rhode Island | 7 |
1 | Midwest | Seton Hall | 8 |
2 | South | Virginia | 1 |
2 | South | Nevada | 7 |
2 | South | Tennessee | 3 |
2 | South | Arizona | 4 |
2 | East | Villanova | 1 |
2 | East | Purdue | 2 |
2 | East | St. Bona | 11 |
2 | East | Wichita State | 4 |
2 | West | Xavier | 1 |
2 | West | North Carolina | 2 |
2 | West | Michigan | 3 |
2 | West | Gonzaga | 4 |
2 | Midwest | Seton Hall | 8 |
2 | Midwest | Duke | 2 |
2 | Midwest | Michigan State | 3 |
2 | Midwest | Clemson | 5 |
3 | South | Arizona | 4 |
3 | South | Tennessee | 3 |
3 | East | Villanova | 1 |
3 | East | Purdue | 2 |
3 | West | Xavier | 1 |
3 | West | North Carolina | 2 |
3 | Midwest | Clemson | 5 |
3 | Midwest | Duke | 2 |
4 | South | Tennessee | 3 |
4 | East | Villanova | 1 |
4 | West | Xavier | 1 |
4 | Midwest | Duke | 2 |
5 | South | Tennessee | 3 |
5 | Midwest | Duke | 2 |
6 | South | Tennessee | 3 |
Wish me luck!
Linh & David’s Method
Packages needed for this method:
library(tidyverse); library(repmis); library(httr)
Similar to Matt’s method, we also decided to go with a sliding range approach; however, we wanted to include historical win probabilities of first round play to determine the width of each team’s number range, as well as the overlap between seeds.
The Basics
We have the most historical data on the matchups in the first round because they happen four times each year. Therefore, it makes the most sense to use these data to predict each matchup, especially for future rounds where certain matchups are rare. To do this, we created a range of values for each team in such a way that when one number is drawn from that range and compared against the number generated for its first round opponent, the odds of that team’s number being higher than the opponent’s match the historical probability of that team winning. In other words, every team has a range of values that can be chosen from, and this range overlaps with every other team by a very specific amount.
Example 1 - Overlap of 2
If team 1 (favorite) has a range of possible values from 5-10, and team 2 (underdog) has a range of 1-6, randomly choosing a number for each team will result in team 1 winning almost every time. The odds of team 2 winning are equal to the odds of team 1 drawing a 5 and team 2 drawing a 6. This equates to a mathematical “and” problem that can be calculated by multiplying the odds of each event happening. Since each team has 6 numbers to draw from, the odds are 1/6 for each, or 1/36 (2.78%) for both happening at the same time.
Example 2 - Overlap of 4
If we use these same dimensions, but want to increase the odds of the underdog winning, we could increase the amount of numbers that overlap by one number. Now the favorite chooses from 4-10 and the underdog chooses from 1-7. Here, the odds of the underdog winning increase to 12.24% because they can win if they draw a 5, 6, or 7 because it is possible for the favorite to draw a 4, 5, or 6.
The Math
Legend for the following equations:
= number of overlapping values
n = Maximum value possible
M = Size of range R
To make this method work for our purposes, we created an algorithm that defines a range of values for both teams in every first round matchup such that the odds of randomly selecting a value within these ranges matches the historical odds of that team winning. To do this, we determined the formula of the underdog winning (i.e., the number of outcomes in which the underdog would win, divided by the formula for the total number of possible outcomes).
\[ \begin{aligned} \text{Probability of UnderdogWin} = \mathbf{\frac{Underdog Wins}{Total Outcomes}} \end{aligned} \]
\[ \begin{eqnarray} \mathbf{P} = \frac{(\frac{n^2 - n}{2})}{R^2} \Leftrightarrow \frac{(\frac{n^2 - n}{2})}{(\frac{M+n}{2})^2} \end{eqnarray} \]
When this equation is rearranged to solve for n (i.e., the range of numbers that needed to overlap for a given probability), you get a quadatratic equation that can be plugged into the quadratic formula.
\[ \begin{align*} & \mathbf{(P-2)n^2 + (2PM + 2)n + P(M^2)} \text{, where} \\ \\ & \mathbf{a = P - 2} \\ & \mathbf{b = 2PM + 2} \\ & \mathbf{c = P(M^2)} \end{align*} \]
Once n was determined, we needed to set the exact range for each team based on this overlap. To do this, we divided m by 2 and set half of n above and below this mid-point. Teams are assigned a value from this range in round 1, and then carry this value for each round to determine the winner of each matchup. Only in the event of a tie is a new number temporarily generated.
\[ \begin{align} \mathbf{UnderdogHigh} = \frac{M+n}{2} \\ \\ \mathbf{Favored Low} = \frac{M-n+2}{2} \end{align} \]
And that’s it. Plug this into Matt’s code for each round.
The Code
<- function(UnderdogWinProb, MaxVal) {
calcRanges <- UnderdogWinProb - 2
a <- (2 * UnderdogWinProb * MaxVal) + 2
b <- UnderdogWinProb * (MaxVal^2)
c
<- (-b - sqrt(b^2 - 4 * a * c)) / (2 * a)
quad_minus <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a)
quad_plus
if (isTRUE(sign(quad_plus) == 1)) {
<- quad_plus
n_overlap else {
} <- quad_minus
n_overlap
}
## Round n_overlap to nearest even integer
<- 2 * round(n_overlap/2)
n_overlap
<- (MaxVal + n_overlap) / 2
UnderdogHigh <- ((MaxVal - n_overlap) / 2) + 1
FavoredLow
cat(paste0("UnderdogHigh Value: ", UnderdogHigh))
cat(paste0(""),sep='\n')
cat(paste0("FavoredLow Value: ", FavoredLow))
}
The Simulations
We’ve included two simulations:
- Implements historical match-up records of the first-round up to, but not including, the 2018 men’s NCAA tournament
- Implements all first-round match-up records that include this year’s (2018) tournament first round performance
Click on the separate tabs below to see each one in turn:
Data Prior to 2018
Here are the low and the high numbers matched with each seed:
<- c(4673, 3947, 3036, 2714, 1351, 1216, 1080, 76, 1, 1, 1, 1, 1, 1, 1, 1)
lowNum <- c(10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,
highNum 9925, 8921, 8785, 8650, 7287, 6965, 6054, 5328)
That correspond to seed like this:
seed | lowNum | highNum |
---|---|---|
1 | 4673 | 10000 |
2 | 3947 | 10000 |
3 | 3036 | 10000 |
4 | 2714 | 10000 |
5 | 1351 | 10000 |
6 | 1216 | 10000 |
7 | 1080 | 10000 |
8 | 76 | 10000 |
9 | 1 | 9925 |
10 | 1 | 8921 |
11 | 1 | 8785 |
12 | 1 | 8650 |
13 | 1 | 7287 |
14 | 1 | 6965 |
15 | 1 | 6054 |
16 | 1 | 5328 |
Simulation
This method’s simulation code is identical to Matt’s, except that we will exchange the arbitrary sliding range with these newly created ranges based on first round historical probabilities:
# Sets the regions
<- c('South', 'East', 'West', 'Midwest')
regions
# Import the team names and their seeds from 2018
# These data are being pulled from this website's Github repo
<- 'https://raw.githubusercontent.com/mkmiecik14/mkmiecik14.github.io/master/data/ncaa-2018-teams-updated.csv'
link
# from https://stackoverflow.com/questions/60714074/download-csv-file-from-github-using-httr-get-request
<- GET(link)
data .2018 <-
ncaaread_csv(content(data, "raw")) %>%
rename(region = Region, team = Team1, seed = Seed)
# Initializations
<- 1000
iters <- probTableHolder <- vector('list', length = iters) results
This simulation was also ran 1000 times:
for(i in 1:iters) {
# Creates whole tournament probability set
<- probTable <- regions %>%
probTableHolder[[i]] map_df(~tibble(seed = c(1:16),
lowNum = lowNum,
highNum = highNum) %>%
group_by(seed) %>%
mutate(result = sample(lowNum:highNum, 1)) %>%
ungroup()) %>%
mutate(region = rep(regions, each = 16)) %>%
inner_join(ncaa.2018, by = c('seed', 'region'))
# Round 1
<- regions %>%
r1 map_df(~tibble(match.up = 1:8, t1.seed = 1:8, t2.seed = 16:9) %>%
mutate(t1.res = probTable$result[1:8],
t2.res = probTable$result[16:9],
seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
mutate(region = rep(regions, each = 8)) %>%
inner_join(probTable, by = c('seed', 'region'))
# Round 1 results
<- r1 %>% select(match.up, region, team, seed, result)
r1.res
# Round 2
<- regions %>%
r2 map_df(~tibble(r2.match.up = 1:4,
r2.match.up.1 = 1:4,
r2.match.up.2 = as.integer(seq(8, 5, -1)),
region = .) %>%
left_join(r1.res,
by = c('r2.match.up.1' = 'match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r1.res,
by = c('r2.match.up.2' = 'match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 2 results
<- r2 %>% select(r2.match.up, region, team, seed, result)
r2.res
# Round 3
<- regions %>%
r3 map_df(~tibble(r3.match.up = 1:2,
r3.match.up.1 = 1:2,
r3.match.up.2 = 4:3,
region = .) %>%
left_join(r2.res,
by = c('r3.match.up.1' = 'r2.match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r2.res,
by = c('r3.match.up.2' = 'r2.match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 3 results
<- r3 %>% select(r3.match.up, region, team, seed, result)
r3.res
# Round 4
<- regions %>%
r4 map_df(~tibble(r4.match.up = 1,
r4.match.up.1 = 1,
r4.match.up.2 = 2,
region = .) %>%
inner_join(r3.res,
by = c('r4.match.up.1' = 'r3.match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
inner_join(r3.res, by = c('r4.match.up.2' = 'r3.match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 4 results
<- r4 %>% select(r4.match.up, region, team, seed, result)
r4.res
# Semifinal
<- tibble(t1.region = c('South', 'East'),
semi.final t2.region = c('West', 'Midwest')) %>%
left_join(r4.res, by = c('t1.region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r4.res, by = c('t2.region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(team = if_else(t1.res > t2.res, t1.team, t2.team)) %>%
inner_join(probTable, by = 'team')
# Semifinal results
<- semi.final %>% select(region, team, seed, result)
semi.final.res
# Final
<- tibble(t1.region = semi.final.res$team[1],
final t2.region = semi.final.res$team[2],
t1.team = semi.final.res$team[1],
t2.team = semi.final.res$team[2],
t1.seed = semi.final.res$team[1],
t2.seed = semi.final.res$team[2],
t1.res = semi.final.res$result[1],
t2.res = semi.final.res$result[2]) %>%
mutate(team = if_else(t1.res > t2.res, t1.team, t2.team)) %>%
inner_join(probTable, by = 'team')
<- final %>% select(region, team, seed, result)
final.res
<- this.result <- bind_rows(r1.res, r2.res, r3.res, r4.res,
results[[i]]
semi.final.res, final.res, .id = 'round')
#print(i) # Uncomment to see progress
}
Data Viz
The results of this simulation were saved out and can be imported from this website’s Github data repository like this:
# Loading probability tables from github
<- 'https://github.com/mkmiecik14/mkmiecik14.github.io/blob/master/data/probTableHolder-linh-david-before-2018.RData?raw=true'
probTableLink2 source_data(probTableLink2)
# Loading simulation results from github
<- 'https://github.com/mkmiecik14/mkmiecik14.github.io/blob/master/data/results-linh-david-before-2018.RData?raw=true'
resultsLink2 source_data(resultsLink2)
First, let’s ensure that the probability table results worked across simulations and regions:
# Converts into a dataframe
<- probTableHolder %>% map_df(bind_rows, .id = 'iter')
probTable.long
# Calculates summary stats
<- probTable.long %>%
probTable.sum group_by(seed, region) %>%
summarize(m = mean(result),
sd = sd(result),
n = n(),
sem = sd/sqrt(n))
# Probability table plot
ggplot(probTable.long, aes(factor(seed), result, color = region)) +
geom_point(position = 'jitter', alpha = 1/3, shape = 1) +
geom_pointrange(data = probTable.sum,
aes(factor(seed), m, color = region,
ymin = m - sd, ymax = m + sd),
color = 'black') +
geom_line(data = probTable.sum,
aes(factor(seed), m, group = 1, color = region),
color = 'black', alpha = 1/2) +
scale_color_brewer(palette = 'Set2', name = 'Region') +
labs(x = 'Seed', y = 'Randomly Chosen Value', caption = 'SD Error Bars') +
theme_minimal() +
facet_wrap(~region)
This above visualization depicts the potential ranges each seed can draw from using first round historical win probabilities. This drawing occurs at the beginning of each simulation iteration and determines each team’s tournament performance.
Now let’s examine the results of the simulations.
# Converts to long dataframe
<- results %>% map_df(bind_rows, .id = 'iter')
results.long
# Win histogram
ggplot(results.long, aes(as.factor(seed), fill = region)) +
geom_histogram(stat = 'count') +
facet_grid(round~region) +
scale_x_discrete(breaks = seq(1, 16, 3)) +
labs(x = 'Seed', y = 'Win Frequency', title = 'Number of Wins Per Round') +
scale_fill_brewer(palette = 'Set2', name = 'Region') +
theme_minimal()
Higher seeds tend to win more often than lower seeds for each round. In contrast to Matt’s simulation, there is more variability towards seed #7 and this variability becomes more pronounced with each subsequent round.
Now for fun, let’s see which teams won the tournament most often.
# Round 6 summary information
<- results.long %>%
results.long.sum1 filter(round == '6') %>%
group_by(team, region) %>%
summarise(n = n(), seed = unique(seed))
# Winner plot
ggplot(results.long.sum1, aes(reorder(team, -1*n), n, fill = region)) +
geom_bar(stat = 'identity') +
scale_fill_brewer(palette = 'Set2', name = 'Region') +
labs(x = 'Team', y = 'NCAA Tournament Wins',
caption = paste('Simulation iterations:', iters)) +
geom_text(aes(label = seed), nudge_y = 4, color = 'grey') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This simulation has Villanova (#1) with the most wins, while Purdue (#2) comes in second.
Interestingly, compared to Matt’s simulation, this method allows for more variability for higher seeds to win the tournament:
ggplot(results.long.sum1, aes(factor(seed), n)) +
geom_bar(stat = 'identity') +
labs(x = 'Seed', y = 'NCAA Tournament Wins',
caption = paste('Simulation iterations:', iters)) +
theme_minimal()
According to this method, #5 and #7 seeds won the tournament fewer times than #6 and #8 seeds. Nevertheless, this method results in several upsets winning the tournament, even as high as #9 seeds.
Data Including 2018
This section was added after observing the first round of the 2018 Men’s NCAA Tournament because, for the first time in history, a #16 seed beat a #1 seed. Therefore, we no longer had to guess this probability because it occured once out of 136 games–a probability of 0.0076.
The ranges for each team were adjusted to include data from the 2018 tournament:
Here are the low and the high numbers matched with each seed:
<- c(4678, 3966, 3077, 2639, 1444, 1182, 1116, 1, 1, 1, 1, 1, 1, 1, 1, 1)
lowNum <- c(10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,
highNum 8885, 8819, 8557, 7363, 6924, 6035, 5323)
That correspond to seed like this:
seed | lowNum | highNum |
---|---|---|
1 | 4678 | 10000 |
2 | 3966 | 10000 |
3 | 3077 | 10000 |
4 | 2639 | 10000 |
5 | 1444 | 10000 |
6 | 1182 | 10000 |
7 | 1116 | 10000 |
8 | 1 | 10000 |
9 | 1 | 10000 |
10 | 1 | 8885 |
11 | 1 | 8819 |
12 | 1 | 8557 |
13 | 1 | 7363 |
14 | 1 | 6924 |
15 | 1 | 6035 |
16 | 1 | 5323 |
Simulation
This method’s simulation code is identical to Matt’s, except that we will exchange the arbitrary sliding range with these newly created ranges based on first round historical probabilities:
# Sets the regions
<- c('South', 'East', 'West', 'Midwest')
regions
# Import the team names and their seeds from 2018
# These data are being pulled from this website's Github repo
<- 'https://raw.githubusercontent.com/mkmiecik14/mkmiecik14.github.io/master/data/ncaa-2018-teams-updated.csv'
link
# from https://stackoverflow.com/questions/60714074/download-csv-file-from-github-using-httr-get-request
<- GET(link)
data .2018 <-
ncaaread_csv(content(data, "raw")) %>%
rename(region = Region, team = Team1, seed = Seed)
# Initializations
<- 1000
iters <- probTableHolder <- vector('list', length = iters) results
This simulation was also ran 1000 times:
for(i in 1:iters) {
# Creates whole tournament probability set
<- probTable <- regions %>%
probTableHolder[[i]] map_df(~tibble(seed = c(1:16),
lowNum = lowNum,
highNum = highNum) %>%
group_by(seed) %>%
mutate(result = sample(lowNum:highNum, 1)) %>%
ungroup()) %>%
mutate(region = rep(regions, each = 16)) %>%
inner_join(ncaa.2018, by = c('seed', 'region'))
# Round 1
<- regions %>%
r1 map_df(~tibble(match.up = 1:8, t1.seed = 1:8, t2.seed = 16:9) %>%
mutate(t1.res = probTable$result[1:8],
t2.res = probTable$result[16:9],
seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
mutate(region = rep(regions, each = 8)) %>%
inner_join(probTable, by = c('seed', 'region'))
# Round 1 results
<- r1 %>% select(match.up, region, team, seed, result)
r1.res
# Round 2
<- regions %>%
r2 map_df(~tibble(r2.match.up = 1:4,
r2.match.up.1 = 1:4,
r2.match.up.2 = as.integer(seq(8, 5, -1)),
region = .) %>%
left_join(r1.res,
by = c('r2.match.up.1' = 'match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r1.res,
by = c('r2.match.up.2' = 'match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 2 results
<- r2 %>% select(r2.match.up, region, team, seed, result)
r2.res
# Round 3
<- regions %>%
r3 map_df(~tibble(r3.match.up = 1:2,
r3.match.up.1 = 1:2,
r3.match.up.2 = 4:3,
region = .) %>%
left_join(r2.res,
by = c('r3.match.up.1' = 'r2.match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r2.res,
by = c('r3.match.up.2' = 'r2.match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 3 results
<- r3 %>% select(r3.match.up, region, team, seed, result)
r3.res
# Round 4
<- regions %>%
r4 map_df(~tibble(r4.match.up = 1,
r4.match.up.1 = 1,
r4.match.up.2 = 2,
region = .) %>%
inner_join(r3.res,
by = c('r4.match.up.1' = 'r3.match.up',
'region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
inner_join(r3.res, by = c('r4.match.up.2' = 'r3.match.up',
'region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(seed = if_else(t1.res > t2.res, t1.seed, t2.seed))) %>%
inner_join(probTable, by = c('region', 'seed'))
# Round 4 results
<- r4 %>% select(r4.match.up, region, team, seed, result)
r4.res
# Semifinal
<- tibble(t1.region = c('South', 'East'),
semi.final t2.region = c('West', 'Midwest')) %>%
left_join(r4.res, by = c('t1.region' = 'region')) %>%
rename(t1.team = team, t1.seed = seed, t1.res = result) %>%
left_join(r4.res, by = c('t2.region' = 'region')) %>%
rename(t2.team = team, t2.seed = seed, t2.res = result) %>%
mutate(team = if_else(t1.res > t2.res, t1.team, t2.team)) %>%
inner_join(probTable, by = 'team')
# Semifinal results
<- semi.final %>% select(region, team, seed, result)
semi.final.res
# Final
<- tibble(t1.region = semi.final.res$team[1],
final t2.region = semi.final.res$team[2],
t1.team = semi.final.res$team[1],
t2.team = semi.final.res$team[2],
t1.seed = semi.final.res$team[1],
t2.seed = semi.final.res$team[2],
t1.res = semi.final.res$result[1],
t2.res = semi.final.res$result[2]) %>%
mutate(team = if_else(t1.res > t2.res, t1.team, t2.team)) %>%
inner_join(probTable, by = 'team')
<- final %>% select(region, team, seed, result)
final.res
<- this.result <- bind_rows(r1.res, r2.res, r3.res, r4.res,
results[[i]]
semi.final.res, final.res, .id = 'round')
#print(i) # Uncomment to see progress
}
Data Viz
The results of this simulation were saved out and can be imported from this website’s Github data repository like this:
# Loading probability tables from github
<- 'https://github.com/mkmiecik14/mkmiecik14.github.io/blob/master/data/probTableHolder-linh-david-including-2018.RData?raw=true'
probTableLink2 source_data(probTableLink2)
# Loading simulation results from github
<- 'https://github.com/mkmiecik14/mkmiecik14.github.io/blob/master/data/results-linh-david-including-2018.RData?raw=true'
resultsLink2 source_data(resultsLink2)
First, let’s ensure that the probability table results worked across simulations and regions:
# Converts into a dataframe
<- probTableHolder %>% map_df(bind_rows, .id = 'iter')
probTable.long
# Calculates summary stats
<- probTable.long %>%
probTable.sum group_by(seed, region) %>%
summarize(m = mean(result),
sd = sd(result),
n = n(),
sem = sd/sqrt(n))
# Probability table plot
ggplot(probTable.long, aes(factor(seed), result, color = region)) +
geom_point(position = 'jitter', alpha = 1/3, shape = 1) +
geom_pointrange(data = probTable.sum,
aes(factor(seed), m, color = region,
ymin = m - sd, ymax = m + sd),
color = 'black') +
geom_line(data = probTable.sum,
aes(factor(seed), m, group = 1, color = region),
color = 'black', alpha = 1/2) +
scale_color_brewer(palette = 'Set2', name = 'Region') +
labs(x = 'Seed', y = 'Randomly Chosen Value', caption = 'SD Error Bars') +
theme_minimal() +
facet_wrap(~region)
This above visualization depicts the potential ranges each seed can draw from using first round historical win probabilities. This drawing occurs at the beginning of each simulation iteration and determines each team’s tournament performance.
Now let’s examine the results of the simulations.
# Converts to long dataframe
<- results %>% map_df(bind_rows, .id = 'iter')
results.long
# Win histogram
ggplot(results.long, aes(as.factor(seed), fill = region)) +
geom_histogram(stat = 'count') +
facet_grid(round~region) +
scale_x_discrete(breaks = seq(1, 16, 3)) +
labs(x = 'Seed', y = 'Win Frequency', title = 'Number of Wins Per Round') +
scale_fill_brewer(palette = 'Set2', name = 'Region') +
theme_minimal()
Higher seeds tend to win more often than lower seeds for each round.
Now for fun, let’s see which teams won the tournament most often.
# Round 6 summary information
<- results.long %>%
results.long.sum1 filter(round == '6') %>%
group_by(team, region) %>%
summarise(n = n(), seed = unique(seed))
# Winner plot
ggplot(results.long.sum1, aes(reorder(team, -1*n), n, fill = region)) +
geom_bar(stat = 'identity') +
scale_fill_brewer(palette = 'Set2', name = 'Region') +
labs(x = 'Team', y = 'NCAA Tournament Wins',
caption = paste('Simulation iterations:', iters)) +
geom_text(aes(label = seed), nudge_y = 4, color = 'grey') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This simulation that incorporated the results from the first round of the 2018 men’s NCAA tournament had Virginia (#1) winning with Xavier (#1) coming in second in overall tournament wins across all 1000 simulations.
Interestingly, compared to Matt’s simulation, this method allows for more variability for higher seeds to win the tournament:
ggplot(results.long.sum1, aes(factor(seed), n)) +
geom_bar(stat = 'identity') +
labs(x = 'Seed', y = 'NCAA Tournament Wins',
caption = paste('Simulation iterations:', iters)) +
theme_minimal()
According to this method, overall wins track well with increasing seed. Lower seeds win more than higher seeds. However, upsets are definitely allowed to occur with #11 seeds winning the tournament! This is most likely a consequence of this year’s (2018) tournament containing a variety of upsets, namely a 16th seed over a 1st seed victory.
Nick’s Method
R packages needed for this method:
library(tidyverse); library(httr); library(repmis)
For my bracket, I wanted to match the probability of any given seed (seed X) beating any other given seed (seed Y) to the actual proportion of times that seed X has beaten seed Y in real life. For example, 15th seed teams have beaten 2nd seed teams 8 times in 132 games, or 6.1% of the time. Luckily for me, someone has already gone through every single game played and calculated the proportions for every match-up.
So, with the help of my friend Giuseppe Miranda, I created a table containing all of these proportions where rows represent winners and columns represent losers. So, if you want the proportion of times that 1st seed teams have beaten 16th seed teams, look at [row 1, column 16] and you will see that 1st seed teams have beaten 16th seed teams 132 times in 132 games, or 100% of the time (132/132 = 1). Likewise, if you go to [row 8, column 9] you can see that 8th seed teams have beaten 9th seed teams 53.2% (or 0.532) of the time.
These data are available on this website’s Github data repository and can be read into R like this:
# Table of probabilites to reference for each seed vs. seed match-up
<- 'https://raw.githubusercontent.com/mkmiecik14/mkmiecik14.github.io/master/data/SeedMatchupProportions.csv'
propLink
# from https://stackoverflow.com/questions/60714074/download-csv-file-from-github-using-httr-get-request
<- GET(propLink)
data <- read_csv(content(data, "raw"))
data_temp
<- data_temp %>% select(-Seed) %>% as.matrix(.)
ProportionsData rownames(ProportionsData) <- data_temp$Seed
The historical proportions data looks like this:
Seed | Seed1 | Seed2 | Seed3 | Seed4 | Seed5 | Seed6 | Seed7 | Seed8 | Seed9 | Seed10 | Seed11 | Seed12 | Seed13 | Seed14 | Seed15 | Seed16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.500 | 0.534 | 0.600 | 0.699 | 0.840 | 0.688 | 0.857 | 0.800 | 0.905 | 0.857 | 0.571 | 1.000 | 1.000 | 0.000 | 0.000 | 1.0 |
2 | 0.466 | 0.500 | 0.629 | 0.444 | 0.200 | 0.722 | 0.710 | 0.444 | 0.500 | 0.607 | 0.867 | 1.000 | 0.000 | 0.000 | 0.939 | 0.0 |
3 | 0.400 | 0.371 | 0.500 | 0.625 | 0.500 | 0.553 | 0.588 | 1.000 | 1.000 | 0.692 | 0.706 | 0.000 | 0.000 | 0.841 | 1.000 | 0.0 |
4 | 0.301 | 0.556 | 0.375 | 0.500 | 0.565 | 0.333 | 0.333 | 0.364 | 0.666 | 1.000 | 0.000 | 0.692 | 0.809 | 0.000 | 0.000 | 0.0 |
5 | 0.160 | 0.800 | 0.500 | 0.435 | 0.500 | 1.000 | 0.000 | 0.250 | 0.333 | 1.000 | 0.000 | 0.671 | 0.800 | 0.000 | 0.000 | 0.0 |
6 | 0.313 | 0.278 | 0.447 | 0.667 | 0.000 | 0.500 | 0.625 | 0.333 | 0.000 | 0.600 | 0.638 | 0.000 | 0.000 | 0.875 | 0.000 | 0.0 |
7 | 0.143 | 0.290 | 0.412 | 0.667 | 0.000 | 0.375 | 0.500 | 1.000 | 0.000 | 0.609 | 0.000 | 0.000 | 0.000 | 1.000 | 0.660 | 0.0 |
8 | 0.200 | 0.556 | 0.000 | 0.636 | 0.750 | 0.750 | 0.500 | 0.500 | 0.532 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.0 |
9 | 0.095 | 0.500 | 0.000 | 0.333 | 0.666 | 0.000 | 0.000 | 0.468 | 0.500 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.0 |
10 | 0.143 | 0.393 | 0.308 | 0.000 | 0.000 | 0.400 | 0.391 | 0.000 | 0.000 | 0.500 | 0.333 | 0.000 | 0.000 | 1.000 | 1.000 | 0.0 |
11 | 0.429 | 0.133 | 0.294 | 0.000 | 0.000 | 0.362 | 1.000 | 0.000 | 0.000 | 0.667 | 0.500 | 0.000 | 0.000 | 1.000 | 0.000 | 0.0 |
12 | 0.000 | 0.000 | 0.000 | 0.308 | 0.329 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.727 | 0.000 | 0.000 | 0.0 |
13 | 0.000 | 0.000 | 0.000 | 0.191 | 0.200 | 0.000 | 0.000 | 0.273 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.000 | 0.000 | 0.0 |
14 | 0.000 | 0.000 | 0.159 | 0.000 | 0.000 | 0.125 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.000 | 0.0 |
15 | 0.000 | 0.061 | 0.000 | 0.000 | 0.000 | 0.000 | 0.333 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.0 |
16 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.5 |
Next, we will need a data frame for each round of the tournament, where each row represents each game played in the tournament. Each row will contain the name of the teams that are playing each other (‘Team1’ and ‘Team2’ columns) and the seed numbers of both of those teams (‘Team1Seed’ and ‘Team2Seed’). Later we will loop through these data frames to simulate each round of the tournament and predict the winner of each game.
For the first round, it was easier to create the data in excel beforehand and is available via .csv download from this website’s Github data repository or you can retrieve from R like this:
# We will load a data frame containing each match-up for the round of 64
<- 'https://raw.githubusercontent.com/mkmiecik14/mkmiecik14.github.io/master/data/NCAA2018Bracket.csv'
r1Link
# from https://stackoverflow.com/questions/60714074/download-csv-file-from-github-using-httr-get-request
<- GET(r1Link)
data <- read_csv(content(data, "raw"))
data_temp
<- data_temp %>% select(-Game) %>% as.data.frame(.)
Round1Data rownames(Round1Data) <- data_temp$Game
# We also need to coerce the Team Name columns into character vectors
$Team1 <- as.character(Round1Data$Team1)
Round1Data$Team2 <- as.character(Round1Data$Team2) Round1Data
Round1Data
looks like this:
Region | Team1 | Team2 | Team1Seed | Team2Seed | |
---|---|---|---|---|---|
1 | South | Virginia | UMBC | 1 | 16 |
2 | South | Creighton | Kansas State | 8 | 9 |
3 | South | Kentucky | Davidson | 5 | 12 |
4 | South | Arizona | Buffalo | 4 | 13 |
5 | South | Miami | Loyola Chicago | 6 | 11 |
6 | South | Tennessee | Wright State | 3 | 14 |
7 | South | Nevada | Texas | 7 | 10 |
8 | South | Cincinnati | Georgia State | 2 | 15 |
25 | West | Xavier | Play in 4 | 1 | 16 |
26 | West | Missouri | Florida State | 8 | 9 |
27 | West | Ohio State | South Dakota State | 5 | 12 |
28 | West | Gonzaga | UNC Greensboro | 4 | 13 |
29 | West | Houston | San Diego State | 6 | 11 |
30 | West | Michigan | Montana | 3 | 14 |
31 | West | Texas A&M | Providence | 7 | 10 |
32 | West | North Carolina | Lipscomb | 2 | 15 |
9 | East | Villanova | Play in 1 | 1 | 16 |
10 | East | Virginia Tech | Alabama | 8 | 9 |
11 | East | West Virginia | Murray State | 5 | 12 |
12 | East | Wichita State | Marshall | 4 | 13 |
13 | East | Florida | Play in 2 | 6 | 11 |
14 | East | Texas Tech | SF Austin | 3 | 14 |
15 | East | Arkansas | Butler | 7 | 10 |
16 | East | Purdue | CS Fullerton | 2 | 15 |
17 | Midwest | Kansas | Penn | 1 | 16 |
18 | Midwest | Seton Hall | NC State | 8 | 9 |
19 | Midwest | Clemson | New Mexico State | 5 | 12 |
20 | Midwest | Auburn | Charleston | 4 | 13 |
21 | Midwest | TCU | Play in 3 | 6 | 11 |
22 | Midwest | Michigan State | Bucknell | 3 | 14 |
23 | Midwest | Rhode Island | Oklahoma | 7 | 10 |
24 | Midwest | Duke | Iona | 2 | 15 |
But we can make the rest of the data frames for the remaining rounds in R:
<- data.frame(game = 1:16,
Round2Data Region = rep(c('South', 'West', 'East', 'Midwest'),
each = 4),
Team1 = character(length = 16),
Team2 = character(length = 16),
Team1Seed = integer(length = 16),
Team2Seed = integer(length = 16),
Winner = character(length = 16),
stringsAsFactors = F)
<- data.frame(game = 1:8,
Round3Data Region = rep(c('South', 'West', 'East', 'Midwest'),
each = 2),
Team1 = character(length = 8),
Team2 = character(length = 8),
Team1Seed = integer(length = 8),
Team2Seed = integer(length = 8),
Winner = character(length = 8),
stringsAsFactors = F)
<- data.frame(game = 1:4,
Round4Data Region = c('South', 'West', 'East', 'Midwest'),
Team1 = character(length = 4),
Team2 = character(length = 4),
Team1Seed = integer(length = 4),
Team2Seed = integer(length = 4),
Winner = character(length = 4),
stringsAsFactors = F)
<- data.frame(game = 1:2,
Round5Data Team1 = character(length = 2),
Team2 = character(length = 2),
Team1Seed = integer(length = 2),
Team2Seed = integer(length = 2),
Winner = character(length = 2),
stringsAsFactors = F)
<- data.frame(game = 1,
Round6Data Team1 = character(length = 1),
Team2 = character(length = 1),
Team1Seed = integer(length = 1),
Team2Seed = integer(length = 1),
Winner = character(length = 1),
stringsAsFactors = F)
Simulation
We are then going to loop through our dataset for round 1 and calculate a score
for each game played by randomly selecting a decimal number between 0 and 1. After we assign our score
we will then refer to the cell of our ProportionsData
, where row = seed of team 1 and column = seed of team 2. This cell should give us the proportion (in decimal form) of times that a team with a seed number equal to the seed of team 1 beat a team with a seed number equal to the seed of team 2. If our score variable is less than or equal to the proportion in this cell then team 1 wins, if not then team 2 wins. In other words, the probability of team 1 beating team 2 should be equivalent to the proportion of times that team 1’s seed beat team 2’s seed in real life.
## Round of 64##
print("Round of 64")
for (match in 1:nrow(Round1Data)){
#Select 1 random decimal between 0 and 1.
<- runif(1,0,1)
score print(score)
#Compare our 'score' variable to the real-life proportion data.
#If score is less than the proportion of times the seed of team 1
#beat the seeed of team 2 then team 1 wins. Otherwise, team 2 wins.
if (score <= ProportionsData[Round1Data$Team1Seed[match],
$Team2Seed[match]]) {
Round1Dataprint(paste(Round1Data$Team1[match], " beats ",
$Team2[match], sep = ""))
Round1Data$Winner[match] <- Round1Data$Team1[match]
Round1Dataelse {
}print(paste(Round1Data$Team2[match], " beats ",
$Team1[match], sep = ""))
Round1Data$Winner[match] <- Round1Data$Team2[match]
Round1Data
} }
Next, we will use the results from round 1 (we now have a winner column in our Round1Data filled in with the winner of each game) to fill in our data frame for round 2.
# Fill in the data frame for the next round using the results of the
# previous round.
for (winner in 1:nrow(Round2Data)) {
<- as.character(Round1Data$Winner[winner*2-1])
team1Winner
if (Round1Data$Winner[winner*2-1] %in% Round1Data$Team1){
<- Round1Data$Team1Seed[winner*2-1]
team1Seed else {
}<- Round1Data$Team2Seed[winner*2-1]
team1Seed
}
<- Round1Data$Winner[winner*2]
team2Winner
if (Round1Data$Winner[winner*2] %in% Round1Data$Team1){
<- Round1Data$Team1Seed[winner*2]
team2Seed else {
}<- Round1Data$Team2Seed[winner*2]
team2Seed
}
# I discovered that the team 1 needs to be the lower seed
# (i.e. higher seed number).
if(team1Seed > team2Seed){
$Team1[winner] <- team1Winner
Round2Data$Team1Seed[winner] <- team1Seed
Round2Data$Team2[winner] <- team2Winner
Round2Data$Team2Seed[winner] <- team2Seed
Round2Dataelse {
} $Team1[winner] <- team2Winner
Round2Data$Team1Seed[winner] <- team2Seed
Round2Data$Team2[winner] <- team1Winner
Round2Data$Team2Seed[winner] <- team1Seed
Round2Data
}
}
And finally, we’re simply going to repeat those last two steps for every remaining round in the tournament.
## Round of 32 ##
print("Round of 32")
for (match in 1:nrow(Round2Data)){
<- runif(1,0,1)
score print(score)
if (score <= ProportionsData[Round2Data$Team1Seed[match],
$Team2Seed[match]]) {
Round2Dataprint(paste(Round2Data$Team1[match], " beats ",
$Team2[match], sep = ""))
Round2Data$Winner[match] <- Round2Data$Team1[match]
Round2Dataelse {
}print(paste(Round2Data$Team2[match], " beats ",
$Team1[match], sep = ""))
Round2Data$Winner[match] <- Round2Data$Team2[match]
Round2Data
}
}
# Fill in the data frame for the next round using the results of the
# previous round
for (winner in 1:nrow(Round3Data)) {
<- as.character(Round2Data$Winner[winner*2-1])
team1Winner
if (Round2Data$Winner[winner*2-1] %in% Round2Data$Team1){
<- Round2Data$Team1Seed[winner*2-1]
team1Seed else {
}<- Round2Data$Team2Seed[winner*2-1]
team1Seed
}
<- Round2Data$Winner[winner*2]
team2Winner
if (Round2Data$Winner[winner*2] %in% Round2Data$Team1){
<- Round2Data$Team1Seed[winner*2]
team2Seed else {
}<- Round2Data$Team2Seed[winner*2]
team2Seed
}
# I discovered that the team 1 needs to be the lower seed
# (i.e. higher seed number)
if(team1Seed > team2Seed){
$Team1[winner] <- team1Winner
Round3Data$Team1Seed[winner] <- team1Seed
Round3Data$Team2[winner] <- team2Winner
Round3Data$Team2Seed[winner] <- team2Seed
Round3Dataelse {
} $Team1[winner] <- team2Winner
Round3Data$Team1Seed[winner] <- team2Seed
Round3Data$Team2[winner] <- team1Winner
Round3Data$Team2Seed[winner] <- team1Seed
Round3Data
}
}
## Round of 16 ##
print("Sweet 16")
for (match in 1:nrow(Round3Data)){
<- runif(1,0,1)
score print(score)
if (score <= ProportionsData[Round3Data$Team1Seed[match],
$Team2Seed[match]]) {
Round3Dataprint(paste(Round3Data$Team1[match], " beats ",
$Team2[match], sep = ""))
Round3Data$Winner[match] <- Round3Data$Team1[match]
Round3Dataelse {
}print(paste(Round3Data$Team2[match], " beats ",
$Team1[match], sep = ""))
Round3Data$Winner[match] <- Round3Data$Team2[match]
Round3Data
}
}
# Fill in the data frame for the next round using the results of the
# previous round
for (winner in 1:nrow(Round4Data)) {
<- as.character(Round3Data$Winner[winner*2-1])
team1Winner
if (Round3Data$Winner[winner*2-1] %in% Round3Data$Team1){
<- Round3Data$Team1Seed[winner*2-1]
team1Seed else {
}<- Round3Data$Team2Seed[winner*2-1]
team1Seed
}
<- Round3Data$Winner[winner*2]
team2Winner
if (Round3Data$Winner[winner*2] %in% Round3Data$Team1){
<- Round3Data$Team1Seed[winner*2]
team2Seed else {
}<- Round3Data$Team2Seed[winner*2]
team2Seed
}
# I discovered that the team 1 needs to be the lower seed
# (i.e. higher seed number)
if(team1Seed > team2Seed){
$Team1[winner] <- team1Winner
Round4Data$Team1Seed[winner] <- team1Seed
Round4Data$Team2[winner] <- team2Winner
Round4Data$Team2Seed[winner] <- team2Seed
Round4Dataelse {
} $Team1[winner] <- team2Winner
Round4Data$Team1Seed[winner] <- team2Seed
Round4Data$Team2[winner] <- team1Winner
Round4Data$Team2Seed[winner] <- team1Seed
Round4Data
}
}
## Round of 8 ##
print("Elite 8")
for (match in 1:nrow(Round4Data)){
<- runif(1,0,1)
score print(score)
if (score <= ProportionsData[Round4Data$Team1Seed[match],
$Team2Seed[match]]) {
Round4Dataprint(paste(Round4Data$Team1[match], " beats ",
$Team2[match], sep = ""))
Round4Data$Winner[match] <- Round4Data$Team1[match]
Round4Dataelse {
}print(paste(Round4Data$Team2[match], " beats ",
$Team1[match], sep = ""))
Round4Data$Winner[match] <- Round4Data$Team2[match]
Round4Data
}
}
# Fill in the data frame for the next round using the results of the
# previous round
for (winner in 1:nrow(Round5Data)) {
<- as.character(Round4Data$Winner[winner*2-1])
team1Winner
if (Round4Data$Winner[winner*2-1] %in% Round4Data$Team1){
<- Round4Data$Team1Seed[winner*2-1]
team1Seed else {
}<- Round4Data$Team2Seed[winner*2-1]
team1Seed
}
<- Round4Data$Winner[winner*2]
team2Winner
if (Round4Data$Winner[winner*2] %in% Round4Data$Team1){
<- Round4Data$Team1Seed[winner*2]
team2Seed else {
}<- Round4Data$Team2Seed[winner*2]
team2Seed
}
# I discovered that the team 1 needs to be the lower seed
# (i.e. higher seed number)
if(team1Seed > team2Seed){
$Team1[winner] <- team1Winner
Round5Data$Team1Seed[winner] <- team1Seed
Round5Data$Team2[winner] <- team2Winner
Round5Data$Team2Seed[winner] <- team2Seed
Round5Dataelse {
} $Team1[winner] <- team2Winner
Round5Data$Team1Seed[winner] <- team2Seed
Round5Data$Team2[winner] <- team1Winner
Round5Data$Team2Seed[winner] <- team1Seed
Round5Data
}
}
## Semi-Finals ##
print("Final Four")
for (match in 1:nrow(Round5Data)){
<- runif(1,0,1)
score print(score)
if (score <= ProportionsData[Round5Data$Team1Seed[match],
$Team2Seed[match]]) {
Round5Dataprint(paste(Round5Data$Team1[match], " beats ",
$Team2[match], sep = ""))
Round5Data$Winner[match] <- Round5Data$Team1[match]
Round5Dataelse {
}print(paste(Round5Data$Team2[match], " beats ",
$Team1[match], sep = ""))
Round5Data$Winner[match] <- Round5Data$Team2[match]
Round5Data
}
}
# Fill in the data frame for the next round using the results of the
# previous round
for (winner in 1:nrow(Round6Data)) {
<- as.character(Round5Data$Winner[winner*2-1])
team1Winner
if (Round5Data$Winner[winner*2-1] %in% Round5Data$Team1){
<- Round5Data$Team1Seed[winner*2-1]
team1Seed else {
}<- Round5Data$Team2Seed[winner*2-1]
team1Seed
}
<- Round5Data$Winner[winner*2]
team2Winner
if (Round5Data$Winner[winner*2] %in% Round5Data$Team1){
<- Round5Data$Team1Seed[winner*2]
team2Seed else {
}<- Round5Data$Team2Seed[winner*2]
team2Seed
}
# I discovered that the team 1 needs to be the lower seed
# (i.e. higher seed number)
if(team1Seed > team2Seed){
$Team1[winner] <- team1Winner
Round6Data$Team1Seed[winner] <- team1Seed
Round6Data$Team2[winner] <- team2Winner
Round6Data$Team2Seed[winner] <- team2Seed
Round6Dataelse {
} $Team1[winner] <- team2Winner
Round6Data$Team1Seed[winner] <- team2Seed
Round6Data$Team2[winner] <- team1Winner
Round6Data$Team2Seed[winner] <- team1Seed
Round6Data
}
}
## Finals ##
print("National Final")
<- runif(1,0,1)
score print(score)
if (score <= ProportionsData[Round6Data$Team1Seed[1],
$Team2Seed[1]]) {
Round6Dataprint(paste(Round6Data$Team1[1], " beats ",
$Team2[1], " in the Final Round!", sep = ''))
Round6Data$Winner[1] <- Round6Data$Team1[1]
Round6Dataelse {
}print(paste(Round6Data$Team2[1], " beats ",
$Team1[1], " in the Final Round!", sep = ''))
Round6Data$Winner[1] <- Round6Data$Team2[1]
Round6Data }
The above steps were repeated 1000 times to simulate performance.
Data Viz
The results from this simulation were saved into a variable named resultsHolder
that can be retrieved from this website’s Github data repository like this:
<- 'https://github.com/mkmiecik14/mkmiecik14.github.io/blob/master/data/resultsHolder.RData?raw=true'
nickResLink source_data(nickResLink)
Let’s see what teams won the tournament most often:
# Converting list to long data frame
<- resultsHolder %>% map_df(bind_rows, .id = 'iter')
resultsLong
# Summary information for results
<- resultsLong %>%
resultsLongSum filter(round == 6) %>%
group_by(Winner) %>%
summarise(n = n(), seed = unique(WinnerSeed))
# Tournament champion histogram
ggplot(resultsLongSum, aes(reorder(Winner, -1*n), n)) +
geom_bar(stat = 'identity') +
labs(x = 'Tournament Champion',
y = 'Wins',
caption = '"Play in #" teams were not available at time of simulation') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
And it looks like this simulation predicted Xavier to win most often!
Conclusion
Of course, much more goes into tournament performance than seed placement and historical win probabilities. At the time that we first wrote this, a #16 seed had never beaten a #1 seed in the first round. Models using purely historical data (such as Nick’s) were unable to account for #16 seed UMBC’s historical first round victory over #1 seed Virginia in this year’s NCAA Men’s tournament. Nevertheless, this was a super fun exercise in R and thinking about modeling tournament style performance. Hope you enjoyed this post and any feedback is always welcome!
Acknowledgments
A huge thanks to my friends Nick Ray, David Hoagey, Linh Lazarus, Giuseppe Miranda, and Ekarin Pongpipat for their enthusiasm and help with this blog post. They helped turn a small idea into an awesome collaboration where we got to share thoughts, work off each other’s ideas, and learn more about R!