Predicting the NCAA Tournament Using Monte Carlo Simulation
I have created a simulation model in Microsoft Excel using Frontline Systems’ Analytic Solver Platform to predict the 2014 NCAA Tournament using the technique I described in my previous post.
Click here to download the spreadsheet.
To try it out, go to solver.com and download a free trial of Analytic Solver Platform by clicking on Products –> Analytic Solver Platform:
Once you’ve installed the trial, open the spreadsheet. You’ll see a filled-out bracket in the “Bracket” worksheet:
Winners are determined by comparing the ratings of each team, using Excel formulas. Basically…a bunch of IF statements:
The magic of simulation is that it accounts for uncertainty in the assumptions we make. In this case, the uncertainty is my crazy rating system: it might be wrong. So instead of a single number that represents the strength of, say, Florida, we actually have a range of possible ratings based on a probability distribution. I have entered these probability distributions for the ratings for each team in column F. Double click on cell F9 (Florida’s rating), and you can see the range of ratings that the simulation considers:
The peak of the bell curve (normal) distribution is at 0.1245, the rating calculated in my previous post. Analytic Solver Platform samples different values from this distribution (and the other 63 teams), producing slightly different ratings, over and over again. As the ratings jiggle around for different trials, different teams win games and there are different champions for these simulated tournaments. In fact, if you hit F9 (or the “Calculate Now” button in the ribbon), you can see that all of the ratings change and the NCAA champion in cell Y14 sometimes changes from Virginia to Florida to Duke and so on.
Click the “play” button on the right hand side to simulate the NCAA tournament 10,000 times:
Now move over to the Results worksheet. In columns A and B you see the number of times each team won the simulated tournament (column B sums to 10,000):
There is a pivot table in columns E and F that summarizes the results. Right click the pivot table and choose Refresh to update it and the nifty chart below:
We see that even though Virginia is predicted to be the most likely winner, Florida and Duke are also frequent winners.
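For readers without Excel, here is a minimal Python sketch of the same idea: jiggle each rating with a normal distribution, play out the bracket, and count champions. The eight-team field, its bracket order, and the standard deviation SIGMA are assumptions made for illustration. The mean ratings are the rounded top-eight values from the methodology post, not the contents of the spreadsheet.

```python
import random
from collections import Counter

# Mean ratings: rounded top-eight values from the methodology post.
RATINGS = {
    "Virginia": 0.1310, "Florida": 0.1285, "Duke": 0.1266, "Kansas": 0.1244,
    "Michigan St": 0.1229, "Arizona": 0.1212, "Wisconsin": 0.1160,
    "Pittsburgh": 0.1149,
}
SIGMA = 0.005  # ASSUMPTION: spread of each rating's normal distribution


def play_tournament(field, ratings):
    """Single elimination: adjacent teams meet, the higher rating advances."""
    while len(field) > 1:
        field = [a if ratings[a] >= ratings[b] else b
                 for a, b in zip(field[0::2], field[1::2])]
    return field[0]


def simulate(trials, seed=0):
    """Count how often each team wins across `trials` simulated tournaments."""
    rng = random.Random(seed)
    field = list(RATINGS)  # ASSUMPTION: hypothetical bracket order
    champions = Counter()
    for _ in range(trials):
        # Sample a fresh rating for every team, as the spreadsheet does per trial.
        sampled = {team: rng.gauss(mean, SIGMA) for team, mean in RATINGS.items()}
        champions[play_tournament(field, sampled)] += 1
    return champions


results = simulate(10_000)
for team, wins in results.most_common():
    print(f"{team:12s} {wins}")
```

With no noise (SIGMA = 0) the highest rated team wins every time. The spread is what lets other teams show up as champions.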
What’s nice about the spreadsheet is that you can change it to do your own simulations. Change the values in columns D and E in the Bracket worksheet to incorporate your own rating system and see who your model predicts will win. The simulation only scratches the surface of what Analytic Solver Platform can do. Go crazy with correlated distributions (perhaps by conference?) or even simulation-optimization models to tune your model. Have fun.
NCAA Tournament Analytics Model 2014: Methodology
I revealed my analytics model’s 2014 NCAA Tournament picks in yesterday’s post. Today, I want to describe how the ratings were determined. (Fair warning: this post will be quite a bit more technical and geeky.)
Click here to download the Python model source code.
My NCAA prediction model computes a numerical rating for each team in the field. Picks are generated by comparing team ratings: the team with the higher rating is predicted to advance. As I outlined in my preview, the initial model combines two ideas:
- A “win probability” model developed by Joel Sokol in 2010 as described on Net Prophet.
- An eigenvalue centrality model based on this post on BioPhysEngr Blog.
The eigenvalue centrality model creates a big network (also called a graph) that links all NCAA teams. The arrows in the network represent games between teams. Eigenvalue centrality analyzes the network to determine which network nodes (that is, which teams) are strongest. The model I described in my preview was pretty decent, but it failed to address two important issues:
- Recently played games should count more than games at the beginning of the season.
- Edge weights should reflect the probability that one team is stronger than another, rather than the probability that one will beat another on a neutral floor.
The first issue is easy to explain. In my initial model, game-by-game results were analyzed to produce edge weights in a giant network linking teams. The weight was simply the formula given by Joel Sokol in his 2010 paper. However, it seems reasonable that more recently played games are more important, from a predictive perspective, than early season games. To account for this factor, I scale the final margin of victory for more recently played games by a “recency” factor R. If one team beats another by K points at the start of the season, we apply the Sokol formula with K. However, if one team beats another by K points at the end of the season, we apply the formula with R*K. If R=2, that means a 10 point victory at the start of the season is worth the same as a 5 point victory at the end. If the game was in the middle of the season, we’d apply half of the adjustment: a factor of 1.5, so a 5 point victory counts as 7.5 points.
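The recency adjustment can be sketched in a few lines of Python. The function name, the season_fraction argument (0.0 for the first game of the season, 1.0 for the last), and the linear interpolation between those endpoints are my assumptions for illustration. The actual source code may measure recency differently.

```python
def recency_scaled_margin(margin, season_fraction, R=1.5):
    """Scale a margin of victory by how late in the season the game was played.

    ASSUMPTION: the factor grows linearly from 1 (start of season) to R (end).
    """
    if not 0.0 <= season_fraction <= 1.0:
        raise ValueError("season_fraction must be between 0 and 1")
    factor = 1.0 + (R - 1.0) * season_fraction
    return margin * factor


# The R = 2 example from the text: a 5 point win at the end, then mid-season.
print(recency_scaled_margin(5, 1.0, R=2))
print(recency_scaled_margin(5, 0.5, R=2))
```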
The second issue – regarding edge weights and team strength – is more subtle. As you saw in the “Top 25” from my preview post, there were some strange results. For example, Canisius was rated #24. The reason is that the Sokol formula is not very sensitive to small margins of victory.
Let’s look at an example. Here is the Sokol formula: phi(0.0189 * x – 0.0756)
If you try the values 1..6 you get the probabilities [0.477, 0.485, 0.492, 0.5, 0.508, 0.515]. This means that the difference between a 1-point home win and a 6-point home win is only 0.515 – 0.477 ≈ 0.038, or roughly 4%. This means that most of the nonzero values in the big adjacency matrix that we create are around 0.5, and consequently our centrality method is determining teams that are influential in the network, rather than teams that are dominant. One way to find teams that are dominant is to scale the margin of victory so that a 6-point victory is worth much more than a 1-point victory. So the hack here is to substitute S*x for x in the formula, where S is a “sensitivity” scaling factor.
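Here is a small Python sketch of the formula and the sensitivity hack, taking phi to be the standard normal cumulative distribution function. The function names are mine and are not taken from the model’s source code.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sokol_probability(margin, S=1.0):
    """Sokol formula with the margin x replaced by S*x (S = 1 is the original)."""
    return phi(0.0189 * (S * margin) - 0.0756)


baseline = [round(sokol_probability(x), 3) for x in range(1, 7)]
scaled = [round(sokol_probability(x, S=2.0), 3) for x in range(1, 7)]
print(baseline)  # the probabilities quoted above for margins 1..6
print(scaled)    # the same margins with S = 2
```

With S = 2 the gap between a 1-point and a 6-point win roughly doubles, which is the point of the hack.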
One last tiny adjustment I made was to pretend that Joel Embiid did not play this year, so that Kansas’s rating reflects their strength without him. Long story short, I subtracted 1.68 points for all games that Joel Embiid appeared in. This post has the details.
My Python code implements everything I described in this post and the preview. I generated the picks by choosing the recency parameter R = 1.5 and sensitivity parameter S = 2. Here is a sample call and output:
scoreNcaa(25, 20, 2, 1.5, 0)
['Virginia', 0.13098760857436742]
['Florida', 0.12852960094006807]
['Duke', 0.12656196253849666]
['Kansas', 0.12443601960952431]
['Michigan St', 0.12290861109638007]
['Arizona', 0.12115701603335856]
['Wisconsin', 0.11603580613955565]
['Pittsburgh', 0.11492421298144373]
['Michigan', 0.11437543620057213]
['Iowa St', 0.1128795675290855]
If you’ve made it this far, and have the source code, you can figure out what most of the other parameters mean. (Or you can ask in the comments!)
The question “why did Virginia come out first” is difficult to answer succinctly. Basically:
- Virginia, Florida, and Duke are all pretty close.
- Virginia had a consistently strong schedule.
- Their losses were generally speaking close games to strong opponents.
- They had several convincing, recent victories over other very strong teams.
In a future post, I will provide an Excel spreadsheet that will allow you to build and simulate your own NCAA tournament models!
NCAA Tournament Analytics Model 2014: Picks
Here are my picks for the 2014 NCAA Tournament, based on the analytics model I described in this post. This post contains the picks and my next post will contain the code and methodology for the geeks among us. I use analytics for my NCAA picks for my own education and enjoyment, and to absolve myself of responsibility for them. No guarantees!
Here is a link to picks for all rounds in PDF format.
Here is a spreadsheet with all picks and ratings.
This year’s model examined every college basketball game played in Division I, II, III, and Canada based on data from Prof. Peter Wolfe and from MasseyRatings.com. The ratings implicitly account for strength of opposition, and explicitly account for neutral site games, recency, and Joel Embiid’s back (it turned out not to matter). I officially deem these picks “not crappy”.
The last four rounds are given at the end – the values next to each team are the scores generated by the model.
The model predicts Virginia, recent winners of the ACC tournament, will win it all in 2014 in a rematch with Duke. Arizona was rated the sixth best team in the field but is projected to make it to the Final Four because it plays in the weakest region (the West). Florida, the second strongest team in the field (juuust behind Virginia) joins them. Wichita State was rated surprisingly low (25th) even though it is currently undefeated, basically due to margin of victory against relatively weaker competition (although the Missouri Valley has been an underrated conference over the past several years). Wichita State was placed in the Midwest region, clearly the toughest region in the bracket, and is projected to lose to underseeded Kentucky in the second round. Here are the average and median strengths of the four regions. The last column is the 75th percentile, which is an assessment of the strength of the elite teams in each bracket. Green means easy:
Region | Avg | Med | Top Q |
South | 0.0824 | 0.0855 | 0.1101 |
East | 0.0816 | 0.0876 | 0.1064 |
West | 0.0752 | 0.0831 | 0.1008 |
Midwest | 0.0841 | 0.0890 | 0.1036 |
The model predicts a few upsets (though not too many). The winners of the “play-in games” are projected to knock off higher seeded Saint Louis and UMass. Kentucky is also projected to beat Louisville, both of whom probably should have been seeded higher. Baylor is projected to knock off Creighton, busting Warren Buffett’s billion dollar bracket in Round 2.
Sweet 16 | Elite 8 | |||
Florida | 0.1285 | Florida | 0.1285 | |
VA Commonwealth | 0.1097 | |||
Syracuse | 0.1111 | Kansas | 0.1244 |
Kansas | 0.1244 | |||
Virginia | 0.1310 | Virginia | 0.1310 |
Michigan St | 0.1229 | |||
Iowa St | 0.1129 | Iowa St | 0.1129 | |
Villanova | 0.1060 | |||
Arizona | 0.1212 | Arizona | 0.1212 | |
Oklahoma | 0.1001 | |||
Baylor | 0.1013 | Wisconsin | 0.1160 | |
Wisconsin | 0.1160 | |||
Kentucky | 0.1081 | Kentucky | 0.1081 | |
Louisville | 0.1065 | |||
Duke | 0.1266 | Duke | 0.1266 | |
Michigan | 0.1144 |
Final Four | Championship | |||
Florida | 0.1285 | Virginia | 0.1310 | |
Virginia | 0.1310 | Duke | 0.1266 | |
Arizona | 0.1212 | |||
Duke | 0.1266 |
Using Analytics to Assess Joel Embiid’s Injury and Kansas’s Chances
Joel Embiid is the starting center of the Kansas Jayhawks and one of the most talented college basketball players in the country. Unfortunately he suffered a stress fracture in his back and is likely to miss at least the first weekend of the upcoming NCAA tournament. Some think that Kansas is headed for an early round exit while others think that Kansas’s seed should not be affected at all. Can we use analytics, even roughly, to assess the impact on Kansas’s NCAA tournament prospects?
How about looking at win shares? A “win share” is a statistical estimate of the number of team wins that can be attributed to an individual’s performance. According to the amazing Iowa-powered basketball-reference.com, Embiid’s win shares per 40 minutes are an impressive 0.212 (an average player is around 0.100). His primary replacement, Tarik Black, is at 0.169. That’s a difference of roughly 0.042 win shares per 40 minutes. I probably can’t technically do what I am about to do, but who cares. Since Kansas averages 80 points a game, the win share difference is 80 x 0.042 = 3.36 points per game. However, Embiid was only playing around 23 minutes a game, and Black isn’t even getting all of his minutes. Certain other teammates (Wiggins!) may simply play more minutes than usual to compensate. So 3.36 is probably on the high side. If we estimate that Embiid’s presence will be missed for only 20 player-minutes per game, an estimate of 1.68 points per game is probably reasonable. I will use this assumption in my upcoming NCAA Tournament model.
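The back-of-the-envelope arithmetic, written out as a small Python sketch. The function and argument names are hypothetical, and converting win shares to points this way is the same rough shortcut taken above, not an established method.

```python
def points_impact(ws_gap_per_40, team_points_per_game, minutes_missed):
    """Rough points-per-game cost of a win share gap, prorated by minutes."""
    full_game = ws_gap_per_40 * team_points_per_game  # cost over a full 40 minutes
    return full_game * minutes_missed / 40.0


# The figures quoted in the text: a 0.042 gap and 80 points per game.
print(round(points_impact(0.042, 80, 40), 2))  # missed for all 40 minutes
print(round(points_impact(0.042, 80, 20), 2))  # missed for 20 player-minutes
```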
If we look at Kansas’s schedule we see that this difference would possibly only have swayed two games (Oklahoma State and Texas Tech). Embiid’s absence should not affect Kansas’s seeding any more than it already has through the loss to Iowa State in the Big 12 tournament. Kansas is a solid 2 seed, but Embiid’s absence, if prolonged, could delay a fifteenth Final Four appearance.
Better to Be Right Than Fast
Mae West said that too much of a good thing is wonderful. For us shipbuilders who write numerical code, that is certainly true of speed and accuracy. How seldom we find ourselves in the happy situation of a piece of code that is both fast enough and accurate enough! A colleague and I were chatting about speed and accuracy today and I realized that when I am building software, I prefer a piece of code that is accurate but slow over one that is less accurate but faster. With profiling and careful thought applied to new code, it’s usually pretty easy to make it faster. Addressing a widespread numerical issue often requires a complete re-think.
If I am simply using the software (rather than building it), then all bets are off; it depends on what I am trying to do.
NCAA Tournament Analytics Model 2014 Preview
My NCAA Tournament Prediction Model posts have traditionally been pretty popular, so I thought I would put in a bit more effort this year. In this post I want to share some “raw materials” that you might find helpful, and describe the methodology behind this year’s model.
Here are some resources that you might find helpful if you want to build your own computer-based model for NCAA picks:
- The Net Prophet blog has useful descriptions of a number of approaches, including Microsoft Research’s TrueSkill ratings.
- Joel Sokol’s LRMC rankings are here.
- This page has the results of every men’s college basketball game played this season.
- Jeff Sagarin’s (the godfather!) college basketball ratings are located here.
- My previous posts are located here. As you can see I did a pretty bad job last year.
This year I am going to combine two ideas to build my model. The first is a “win probability” model developed by Joel Sokol which is described on Net Prophet. As the blog post says, this model estimates the probability that Team A will beat Team B on a neutral site given Team A beat Team B at home by a given number of points. So for example if A loses to B by 40 at home, this probability is close to zero. You can hijack this model to assign a “strength of victory” rating: a blowout win is a greater show of team strength than a one-point thriller.
The second idea is a graph theoretical approach stolen from this excellent post on BioPhysEngr Blog. The idea here is to create a giant network based on the results of individual games. So for example if Iowa beats Ohio State then there are arrows between the Iowa and Ohio State nodes. The weight on the edge is a representation of the strength of the victory (or loss). Given this network we can apply an eigenvalue centrality approach. In English, this means determining the importance of all of the nodes in the network, which in my application means the overall strength of each team. I like this approach because it is easy for me to code: computing the eigenvector that belongs to the largest eigenvalue using the power method is simple enough for even Wikipedia to describe succinctly. (And shockingly enough, according to the inscription on my Numerical Analysis text written by the great Ken Atkinson, I learned it twenty years ago!)
The difference between my approach and the BioPhysEngr approach is that I am using Sokol’s win probability logic to calculate the edge weights. As you’ll see when I post the code, it’s about 150 lines of Python, including all the bits to read in the game data.
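To give a flavor of the power method, here is a minimal pure-Python sketch on a three-team network. The team labels, the edge weights, and the convention that W[i][j] holds the credit team i earns from its game against team j are all assumptions for illustration. The weights are made-up numbers, not outputs of the Sokol formula.

```python
def power_method(W, tol=1e-12, max_iter=1000):
    """Approximate the dominant eigenvector of W, normalized to sum to 1."""
    n = len(W)
    r = [1.0 / n] * n  # start with every team rated equally
    for _ in range(max_iter):
        # Multiply the weight matrix by the current ratings...
        nxt = [sum(W[i][j] * r[j] for j in range(n)) for i in range(n)]
        # ...and rescale so the ratings keep summing to 1.
        total = sum(nxt)
        nxt = [v / total for v in nxt]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


teams = ["A", "B", "C"]  # hypothetical teams
# ASSUMPTION: A beat B and C convincingly, and B beat C.
W = [
    [0.0, 0.8, 0.9],
    [0.2, 0.0, 0.7],
    [0.1, 0.3, 0.0],
]
ratings = power_method(W)
for team, rating in sorted(zip(teams, ratings), key=lambda t: -t[1]):
    print(team, round(rating, 4))
```

A win over a highly rated team is worth more than a win over a weak one, because each team’s rating is a weighted sum of the ratings of the teams it played.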
I ran a preliminary version of my code against all college basketball games up until March 9, and my model’s Top 25 is given below. Mostly reasonable with a few odd results (Manhattan? Canisius? Iona?) I will make a few tweaks and post my bracket after the selection show on Sunday.
Rank | Team |
1 | Wichita St |
2 | Louisville |
3 | Villanova |
4 | Duke |
5 | Kansas |
6 | Florida |
7 | Arizona |
8 | Virginia |
9 | Michigan St |
10 | North Carolina |
11 | Ohio State |
12 | Wisconsin |
13 | Manhattan |
14 | Syracuse |
15 | Iowa |
16 | Kentucky |
17 | Iona |
18 | Pittsburgh |
19 | Creighton |
20 | VA Commonwealth |
21 | Tennessee |
22 | Oklahoma St |
23 | Michigan |
24 | Canisius |
25 | Connecticut |
Beautiful Box Plots in Excel 2013
Box plots are widely used among data scientists and statisticians. They’re useful because they show variation both between and within data series. R, Python’s matplotlib, and many other charting libraries support box plots right out of the…box, but Excel does not. In Excel 2013, with a little bit of imagination you can create nice looking box plots without writing any code.
Read this post to find out how to create box plots that look like this:
Here is a workbook that has the finished product if you don’t want to follow along.
You’ll need to start with a table containing the data you want to plot. I am using the data from the Michelson-Morley experiment:
A box plot shows the median of each data series as a line, with a “box” whose top edge is the third quartile and whose bottom edge is the first quartile. Often we draw “whiskers” at the top and bottom representing the extreme values of each series.
If we create an auxiliary table containing these values and follow my advice from my Error Bars in Excel post, we can create a nice looking box plot.
Step 1: Calculate Quartiles and Extremes.
Create another table with the following rows for each series: min, q1, q2, q3, max. These will be the primary data in your box plot. Min and max are easy – use the =MIN() and =MAX() formulas on each data series (represented as columns A – E in my example). To compute Q1-Q3 use the QUARTILE.INC() function. (INC means “inclusive”. QUARTILE.EXC() would work fine if that’s what you want.) Enter the formulas for the first series and then “fill right”:
Step 2: Calculate box and whisker edges
We are going to create a stacked column chart with error bars, and “hide” the bottommost column in the stack to make the chart look like a box plot. Therefore we have to calculate the tops and bottoms of our boxes and whiskers:
- The bottom of each box is Q1.
- The ‘middle’ of each box is Q2 (the median). Since this is a stacked column chart, we actually want to compute Q2 – Q1.
- The top of each box is Q3. Since we want to represent this as a “slice” in the stacked column chart, we want Q3 – Q2.
- The error going “down” in the chart is Q2 – min, since the whiskers start at the median.
- The error going “up” is max – Q2.
Compute these five quantities as rows and you’ll have this:
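If you want to double check the spreadsheet numbers outside Excel, the same five rows can be sketched in Python. The sample data is a made-up series, not the Michelson-Morley data, and the row names mirror the labels used in this post. To my understanding, statistics.quantiles with method="inclusive" uses the same interpolation as QUARTILE.INC.

```python
import statistics


def box_plot_rows(series):
    """Return the stacked-column and error-bar values for one data series."""
    q1, q2, q3 = statistics.quantiles(series, n=4, method="inclusive")
    return {
        "box lo": q1,                  # hidden bottom column
        "box mid": q2 - q1,            # slice from Q1 up to the median
        "box hi": q3 - q2,             # slice from the median up to Q3
        "err down": q2 - min(series),  # whisker from the median down to the min
        "err up": max(series) - q2,    # whisker from the median up to the max
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # ASSUMPTION: illustrative data only
for name, value in box_plot_rows(sample).items():
    print(f"{name:9s} {value}")
```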
Step 3: Create a stacked column chart.
Go to the INSERT tab and select a stacked column chart:
Now right click on the blank chart, choose Select Data Range and select the “box lo, box mid, box hi” range as your data:
Step 4: Make the chart look like a Box Plot.
This is simple: the bottom bars (the blue ones in my example) need to go away. So right click on a blue bar and change both the outline and fill to nothing.
Step 5: Add Whiskers.
Follow the steps in my celebrated “Add Error Bars” post. Click on the “+” next to the chart, select Error Bars. Choose Series 2 (which corresponds to the median). Click on “More options” in the Error Bars flyout menu next to the “+”. In the task pane on the right, for Error Amount choose Custom and then click the Specify Value button:
For “Positive Error Value” select the “err up” row and for “Negative Error Value” select “err down”. Both rows contain positive values, and that is totally fine. Here’s what mine looks like:
That’s it! You can of course customize the other bars as desired.