Every year since 2010 I have used analytics to make my NCAA picks. Here is a link to the picks made by my model [PDF]: the projected Final Four is Villanova, Duke, North Carolina, and Virginia, with Villanova defeating North Carolina in the final. (I think my model likes Virginia too much, by the way.)
Here’s how these selections were made. First, the ground rules I set for myself:
- The picks should not be embarrassingly bad.
- I shall spend no more than on this activity (and 30 minutes for this post).
- I will share my code and raw data.
Okay: the model. The model combines two concepts:
- A “win probability” model developed by Joel Sokol in 2010 as described on Net Prophet.
- An eigenvalue centrality model based on this post on BioPhysEngr Blog.
The win probability model accounts for margin of victory and serves as preprocessing for the eigenvalue centrality model. I added a few other features to make the model more accurate:
- Home-court advantage is considered: 2.5 points, a rough estimate I made a few years ago that is presumably still reasonable.
- The win probability is scaled by an adjustment factor which has been selected for best results (see below).
- Recency is considered: more recent victories are weighted more strongly.
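To make the three adjustments above concrete, here is a minimal sketch of how a single game could be turned into a weight. This is not my actual code: the logistic form, the exponential recency decay, and the names `win_probability`, `recency_weight`, `game_weight`, `scale`, and `half_life_days` are all assumptions for illustration. Only the 2.5-point home-court figure comes from the model itself.

```python
import math

HOME_COURT_POINTS = 2.5  # the rough home-court estimate mentioned above


def win_probability(adjusted_margin, scale=0.1):
    """Map a point margin to a value in (0, 1) with a logistic curve.

    `scale` plays the role of the adjustment factor; the logistic shape
    is an assumption, not necessarily the form used in the real model.
    """
    return 1.0 / (1.0 + math.exp(-scale * adjusted_margin))


def recency_weight(days_ago, half_life_days=60.0):
    """Assumed exponential decay: a game `half_life_days` old counts half."""
    return 0.5 ** (days_ago / half_life_days)


def game_weight(margin, winner_site, days_ago, scale=0.1, half_life_days=60.0):
    """Weight of one game from the winner's point of view.

    `winner_site` is "home", "away", or "neutral". A home win is discounted
    by the home-court advantage; a road win gets credit for it.
    """
    site_adjustment = {
        "home": -HOME_COURT_POINTS,
        "away": HOME_COURT_POINTS,
        "neutral": 0.0,
    }[winner_site]
    probability = win_probability(margin + site_adjustment, scale)
    return probability * recency_weight(days_ago, half_life_days)
```

With these assumptions, a 2.5-point home win played today is treated as a coin flip, and the same result on the road is worth more. The two tunable knobs, `scale` and `half_life_days`, stand in for the "win probability" and "recency" parameters.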
The eigenvalue centrality model requires game-by-game results. I pulled four years of game results for all divisions from masseyratings.com (holla!) and saved them as CSV. You can get all the data here. It sounds complicated, but it’s not (otherwise I wouldn’t do it) – the model requires fewer than 200 lines of Python, also available here. (The code is poor quality.)
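For a feel of what the centrality step does with those game results, here is a rough sketch using power iteration in plain Python. It is not the code linked above. It assumes each game has already been reduced to a `(winner, loser, weight)` tuple, and it mixes in a small uniform term (the hypothetical `damping` parameter, borrowed from PageRank) so the iteration converges even when some teams never beat anyone.

```python
def rank_teams(games, iterations=100, damping=0.85):
    """Rank teams by the dominant eigenvector of a smoothed win matrix.

    `games` is an iterable of (winner, loser, weight) tuples. Each loser
    passes `weight` times its own rating to the team that beat it, so
    beating a highly rated team is worth more than beating a weak one.
    """
    games = list(games)
    teams = sorted({team for winner, loser, _ in games for team in (winner, loser)})
    n = len(teams)
    rating = {team: 1.0 / n for team in teams}

    for _ in range(iterations):
        # Start every team with a small uniform share (the smoothing term).
        updated = {team: (1.0 - damping) / n for team in teams}
        for winner, loser, weight in games:
            updated[winner] += damping * weight * rating[loser]
        # Renormalize so the ratings always sum to 1.
        total = sum(updated.values())
        rating = {team: value / total for team, value in updated.items()}

    return sorted(rating.items(), key=lambda item: item[1], reverse=True)


# Toy example with made-up teams and weights.
toy_games = [("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.95)]
ranking = rank_teams(toy_games)
```

In the toy example, `A` beats both other teams and ends up first, while `C` never wins and ends up last. The weights would come from the win probability step, which is why that step counts as preprocessing.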
How do I know these picks aren’t crap? I don’t. The future is uncertain. But I did a little bit of backtesting. I fit the model with different “win probability” and “recency” parameters on the 2013-2015 seasons and selected the combination that correctly predicted the highest percentage of NCAA tournament games during those seasons. The best combination got approximately 68% of those games right. I don’t know if that’s good, but it seems to be better than applying either the eigenvalue centrality model or the win probability model separately.
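The backtesting loop is just a grid search, and a sketch of the idea looks something like this. The names `grid_search`, `count_correct`, and `fit` are placeholders I made up for this post, and `fit` stands in for whatever function turns a season of results plus the two parameters into a team-to-rating dictionary.

```python
from itertools import product


def count_correct(ratings, tournament_games):
    """Count games where the higher-rated team was the actual winner.

    `tournament_games` is a list of (winner, loser) pairs; unrated teams
    are treated as having a rating of 0.
    """
    return sum(
        1
        for winner, loser in tournament_games
        if ratings.get(winner, 0.0) > ratings.get(loser, 0.0)
    )


def grid_search(seasons, fit, scales, half_lives):
    """Try every (scale, half_life) pair and keep the most accurate one.

    `seasons` is a list of (regular_season_data, tournament_games) pairs.
    `fit(regular_season_data, scale, half_life)` must return a dict that
    maps team name to rating.
    """
    best_params, best_accuracy = None, -1.0
    for scale, half_life in product(scales, half_lives):
        correct = 0
        total = 0
        for regular_season, tournament_games in seasons:
            ratings = fit(regular_season, scale, half_life)
            correct += count_correct(ratings, tournament_games)
            total += len(tournament_games)
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_params, best_accuracy = (scale, half_life), accuracy
    return best_params, best_accuracy
```

Since the parameters are picked on the same seasons they are scored on, the accuracy that comes out of a search like this is an in-sample number.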
In general, picks produced by my models rank in the upper quartile in pools that I enter. I hope that’s the case this year too.