Every year since 2010 I have used analytics to predict the results of the NCAA Men’s Basketball Tournament. I missed the boat on posting the model prior to the start of this year’s tournament. However, I did build and run a model, and I did submit picks based on the results. Here are my model’s picks – as I write this (before the Final Four) these picks are better than 88% of those submitted to ESPN.
Here are the ground rules I set for myself:
- The picks should not be embarrassingly bad.
- I shall spend no more than one work day on this activity (and 30 minutes for this post).
- I will share my code and raw data. (Here it is.)
I used a combination of game-by-game results and team metrics from 2003-2016 to build the features in my model. Here is a summary:
- Team features
- Game features
- Home court. (1 = home, -1 = away, 0 = neutral)
- (I had ‘day of season’ in there but pulled it. This is implicitly accounted for in some of the team features above.)
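One nice property of the signed home-court encoding is that swapping which team you view the game from just negates the feature. Here is a minimal sketch of that idea. The location codes ("H", "A", "N") and the function names are assumptions for illustration, not necessarily what ncaa.py or the raw data files use.

```python
# Hypothetical location codes; the real data files may label game sites differently.
HOME_COURT = {"H": 1, "A": -1, "N": 0}


def home_court_feature(location_code):
    """Map a location code (from one team's perspective) to 1, -1, or 0."""
    return HOME_COURT[location_code]


def flip_perspective(home_court):
    """Return the same feature as seen by the opposing team."""
    return -home_court


# A home game for one team is an away game for its opponent;
# a neutral-site game looks the same from both sides.
print(home_court_feature("H"), flip_perspective(home_court_feature("H")))
print(home_court_feature("N"), flip_perspective(home_court_feature("N")))
```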
I also performed some post-processing:
- I transformed team ranks to continuous variables using a heuristic created by Jeff Sonas.
- Standard normalization.
- One hot encoding of categorical features.
- Upset generation. I found the results aesthetically displeasing for bracket purposes, so I added a post-processing function that looks for games between Davids and Goliaths (i.e. I compare seeds) where David and Goliath are relatively close in strength. For those games, I go with David.
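The upset generation step can be sketched roughly as follows. This is a minimal illustration of the idea, not the function in ncaa.py: the name `pick_winner` and the thresholds `min_seed_gap` and `close_margin` are hypothetical, and "close in strength" is assumed here to mean that the model's win probability is near 50%.

```python
def pick_winner(seed_a, seed_b, p_a_wins, min_seed_gap=5, close_margin=0.10):
    """Return "a" or "b" as the bracket pick for one game.

    seed_a, seed_b: tournament seeds (1 is the strongest seed).
    p_a_wins: model probability that team a beats team b.
    min_seed_gap, close_margin: hypothetical thresholds chosen for illustration.
    """
    favorite = "a" if p_a_wins >= 0.5 else "b"

    # Only consider games between a David and a Goliath (a large seed gap).
    if abs(seed_a - seed_b) < min_seed_gap:
        return favorite

    # David is the team with the worse (higher-numbered) seed.
    underdog = "a" if seed_a > seed_b else "b"

    # If the model sees the teams as relatively close, go with David.
    if abs(p_a_wins - 0.5) <= close_margin:
        return underdog
    return favorite


print(pick_winner(3, 12, 0.58))  # large seed gap, close game: picks the 12 seed
print(pick_winner(1, 16, 0.97))  # large seed gap, lopsided game: picks the 1 seed
```

Note that this only changes the bracket picks. The win probabilities themselves would stay untouched for anything scored by log-loss.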
I submitted the model to the Kaggle NCAA competition, which asks for win probabilities for all possible tourney games. Submissions are scored by the log-loss of the predicted probabilities against the actual results. This naturally suggests logistic regression, which I used. I also built a fancy pants neural network model using Keras (which means to run my code you’ll need to get TensorFlow and Keras in addition to the usual Anaconda packages). The Keras model produces slightly better results in the log-loss sense. Both models predict something like 78% of past NCAA tournament games correctly.
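For reference, log-loss is the average of -[y log(p) + (1 - y) log(1 - p)] over games, where y is 1 if the first team won and p is the predicted probability of that outcome. Here is a minimal standard-library sketch. The clipping constant `eps` is an assumption to avoid log(0), and the exact clipping used by the competition scorer may differ.

```python
import math


def log_loss(outcomes, probabilities, eps=1e-15):
    """Average binary log-loss.

    outcomes: 1 if the first team won the game, else 0.
    probabilities: predicted probability that the first team wins.
    eps: clipping constant (an assumption) so that log(0) never occurs.
    """
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(outcomes)


# Always predicting 50% scores ln(2), about 0.693, no matter who wins.
print(log_loss([1, 0], [0.5, 0.5]))
# A confident wrong prediction is punished heavily.
print(log_loss([0], [0.99]))
```

This is also why accuracy and log-loss can disagree: two models can pick the same winners while one of them is penalized for being overconfident in the games it gets wrong.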
There are a couple of obvious problems with the model:
- I did not actually train the model on past tournaments, only on regular season games. That’s just because I didn’t take the time.
- I do not account for injuries.
- NCAA games are not purely “neutral site” games because sometimes game sites are closer to one team than another. I have code for this that I will probably use next year.
- I am splitting the difference between trying to create a good Kaggle submission and trying to create a “good” bracket. There are subtle differences between the two but I will spare you the details.
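On the “neutral site” point, one way to quantify how much closer a game site is to one team than the other is a great-circle distance difference. The sketch below is an assumption about how such a feature could look, not the code mentioned above. The function names and the (latitude, longitude) inputs are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Guard against tiny floating-point overshoot before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_advantage_km(site, campus_a, campus_b):
    """Positive when the game site is closer to team A's campus than to team B's.

    Each argument is a hypothetical (latitude, longitude) tuple in degrees.
    """
    return haversine_km(*site, *campus_b) - haversine_km(*site, *campus_a)


# Made-up coordinates: team A is 1 degree from the site, team B is 10 degrees away.
print(distance_advantage_km((0.0, 0.0), (0.0, 1.0), (0.0, 10.0)))
```

Like the home court feature, this value flips sign when the two teams are swapped, so it could serve as a softer, continuous version of the 1 / -1 / 0 encoding.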
I will create a GitHub repo for this code…sometime. For now, you can look at the code and raw data files here. The code is in ncaa.py.