Every year since 2010 I have used data science to predict the results of the NCAA Men’s Basketball Tournament. In this post I will describe the methodology that I used to create my picks (full bracket here). The model has Virginia, Michigan, Villanova, and Michigan State in the Final Four, with Virginia defeating Villanova in the championship game.
Here are my ground rules:
- The picks should not be embarrassingly bad.
- I shall spend no more than one work day on this activity (and 30 minutes for this post). This year I spent two hours cleaning up and running my code from last year.
- I will share my code and raw data. (The data is available on Kaggle. The code is not cleaned up but here it is anyway.)
I used a combination of game-by-game results and team metrics from 2003-2017 to build the features in my model. Here is a summary:
- Team features
  - Massey ratings, obtained here.
  - My own eigenvector centrality model, which tends to do a good job all by itself.
  - Metrics from Ken Pomeroy’s page. (These are public and are therefore fair game.)
  - Dean Oliver’s “Four Factors”, which I calculate from per-game data.
- Game features
  - Home court. (1 = home, -1 = away, 0 = neutral)
  - Year
  - (I did not have time to use site locations – the code is buggy.)
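To make two of the team features concrete, here is a minimal sketch, not my actual code. It assumes the Four Factors are computed with the commonly quoted formulas (the 0.44 free-throw weight is a convention, not something tuned here) and that the centrality rating is a plain power iteration over a win matrix. The function names, argument names, and the small `eps` smoothing term are all hypothetical.

```python
def four_factors(fgm, fga, fg3m, ftm, fta, tov, orb, opp_drb):
    """Dean Oliver's Four Factors from one team's box-score totals.

    Argument names are hypothetical; map them to your own data columns.
    """
    return {
        # Effective field goal percentage: threes count 1.5x.
        "efg_pct": (fgm + 0.5 * fg3m) / fga,
        # Turnovers per estimated possession (0.44 is a common FTA weight).
        "tov_pct": tov / (fga + 0.44 * fta + tov),
        # Share of available offensive rebounds the team collected.
        "orb_pct": orb / (orb + opp_drb),
        # Free throws made per field goal attempt.
        "ft_rate": ftm / fga,
    }


def eigenvector_centrality(wins, iterations=200, eps=0.01):
    """Rate teams by power iteration on a win matrix.

    wins[i][j] is the number of times team i beat team j. A team's score
    is proportional to the summed scores of the teams it beat. The eps
    term (an assumption) keeps every entry positive so the iteration
    converges even when some teams are winless.
    """
    n = len(wins)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [
            sum((wins[i][j] + eps) * scores[j] for j in range(n))
            for i in range(n)
        ]
        total = sum(new_scores)
        scores = [s / total for s in new_scores]  # normalise to sum to 1
    return scores


# Toy example: team 0 beat teams 1 and 2, team 1 beat team 2.
wins = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
ratings = eigenvector_centrality(wins)
factors = four_factors(fgm=30, fga=60, fg3m=8, ftm=15, fta=20,
                       tov=12, orb=10, opp_drb=25)
```

In the toy example the ratings come out ordered team 0, team 1, team 2, which matches the results fed in.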
I also performed some post-processing:
- I transformed team ranks to continuous variables using a heuristic created by Jeff Sonas.
- Standard normalization.
- One hot encoding of categorical features.
- Upset generation. The raw picks were not interesting enough, so I added a post-processing function that looks for games where the win probability for the underdog (a significantly lower seed) is quite close to 0.5. In those cases the model picks the underdog instead.
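The post-processing steps can be sketched roughly as follows. This is an illustration under stated assumptions, not my actual code: "standard normalization" is taken to mean z-scoring, and the `min_seed_gap` and `margin` thresholds in the upset rule are made-up values. All function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a numeric feature: zero mean, unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """One-hot encode a categorical feature, columns in sorted order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def pick_winner(team_a, seed_a, team_b, seed_b, prob_a_wins,
                min_seed_gap=4, margin=0.05):
    """Pick a winner, flipping to the underdog in near coin-flip games.

    min_seed_gap and margin are assumed thresholds: the underdog is picked
    only if it is at least min_seed_gap seeds worse than the favorite and
    its win probability lies in [0.5 - margin, 0.5).
    """
    # The favorite is the team with the numerically smaller seed.
    if seed_a <= seed_b:
        underdog, p_underdog, gap = team_b, 1.0 - prob_a_wins, seed_b - seed_a
    else:
        underdog, p_underdog, gap = team_a, prob_a_wins, seed_a - seed_b

    # Without the upset rule, take whichever team the model favors.
    default = team_a if prob_a_wins >= 0.5 else team_b

    if gap >= min_seed_gap and 0.5 - margin <= p_underdog < 0.5:
        return underdog
    return default


z = standardize([1.0, 2.0, 3.0])
encoded = one_hot([2016, 2017, 2016])
close_game = pick_winner("Favorite", 5, "Underdog", 12, prob_a_wins=0.53)
blowout = pick_winner("Favorite", 1, "Underdog", 16, prob_a_wins=0.95)
```

Here `close_game` flips to the 12 seed because its win probability is within the margin, while `blowout` stays with the 1 seed.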
The model predicts the probability that one team defeats another, for all pairs of teams in the tournament. It is implemented in Python and uses logistic regression. The model usually performs well. Let’s see how it does this year!
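For readers who want to see the shape of the prediction step, here is a minimal sketch of turning logistic regression coefficients into win probabilities for every pair of teams. It assumes each matchup is represented by the difference of the two teams' feature vectors and that there is no intercept, which may not match my real feature layout. The team names, feature values, and weights are invented for illustration.

```python
import math
from itertools import combinations


def win_probability(features_a, features_b, weights):
    """P(team A beats team B) under a logistic model on feature differences."""
    z = sum(w * (a - b) for w, a, b in zip(weights, features_a, features_b))
    return 1.0 / (1.0 + math.exp(-z))


def all_pair_probabilities(team_features, weights):
    """Win probabilities for every ordered pair of teams."""
    probs = {}
    for a, b in combinations(sorted(team_features), 2):
        p = win_probability(team_features[a], team_features[b], weights)
        probs[(a, b)] = p
        probs[(b, a)] = 1.0 - p  # the two outcomes are complementary
    return probs


# Hypothetical standardized features and fitted weights.
team_features = {
    "Team X": [1.0, 0.5],
    "Team Y": [0.0, 0.5],
    "Team Z": [0.0, 0.5],
}
weights = [2.0, 1.0]
probs = all_pair_probabilities(team_features, weights)
```

Because the features enter as differences with no intercept, two teams with identical features get exactly 0.5 against each other.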