2017 NCAA Tournament Picks

Every year since 2010 I have used analytics to predict the results of the NCAA Men’s Basketball Tournament. I missed the boat on posting the model prior to the start of this year’s tournament. However, I did build and run a model, and I did submit picks based on the results. Here are my model’s picks – as I write this (before the Final Four) these picks are better than 88% of those submitted to ESPN.

Here are the ground rules I set for myself:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than one work day on this activity (and 30 minutes for this post).
  • I will share my code and raw data. (Here it is.)

I used a combination of game-by-game results and team metrics from 2003-2016 to build the features in my model. Here is a summary:

I also performed some post-processing:

  • I transformed team ranks to continuous variables using a heuristic created by Jeff Sonas.
  • Standard normalization.
  • One hot encoding of categorical features.
  • Upset generation. I found the results aesthetically displeasing for bracket purposes, so I added a post-processing function that looks for games between Davids and Goliaths (i.e. I compare seeds) where David and Goliath are relatively close in strength. For those games, I go with David.
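The upset generation step can be sketched roughly as follows. This is a minimal illustration, not the code in ncaa.py: the function name `apply_upsets`, the dictionary keys, and the two thresholds (`min_seed_diff`, `max_gap`) are all assumptions chosen for the example.

```python
def apply_upsets(games, max_gap=0.10, min_seed_diff=5):
    """Flip close David-vs-Goliath games to the underdog.

    games: list of dicts with hypothetical keys 'seed_a', 'seed_b', and
    'p_a' (the model's probability that team A wins).
    Returns a list of picks, each either "a" or "b".
    """
    picks = []
    for g in games:
        # Default pick: whichever team the model favors.
        pick = "a" if g["p_a"] >= 0.5 else "b"

        # Only consider games with a large seed gap (David vs. Goliath).
        if abs(g["seed_a"] - g["seed_b"]) >= min_seed_diff:
            # David is the team with the numerically larger (worse) seed.
            david = "a" if g["seed_a"] > g["seed_b"] else "b"
            p_david = g["p_a"] if david == "a" else 1.0 - g["p_a"]

            # "Relatively close in strength" is taken here to mean that
            # David's win probability is within max_gap of a coin flip.
            if p_david >= 0.5 - max_gap:
                pick = david
        picks.append(pick)
    return picks


games = [
    {"seed_a": 5, "seed_b": 12, "p_a": 0.55},  # close 5 vs. 12: take the 12
    {"seed_a": 1, "seed_b": 16, "p_a": 0.97},  # lopsided: keep the 1 seed
    {"seed_a": 8, "seed_b": 9, "p_a": 0.45},   # seeds too close: model pick
]
picks = apply_upsets(games)
```

Raising `max_gap` or lowering `min_seed_diff` produces more upsets, which makes for a more interesting bracket but moves the picks further from what the model actually believes.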

I submitted the model to the Kaggle NCAA competition, which asks for win probabilities for all possible tourney games, where submissions are scored by evaluating the log-loss of actual results and predictions. This naturally suggests logistic regression, which I used. I also built a fancy pants neural network model using Keras (so to run my code you’ll need TensorFlow and Keras in addition to the usual Anaconda packages). The Keras model produces slightly better results in the log-loss sense. Both models predict something like 78% of past NCAA tournament games correctly.
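For readers unfamiliar with the metric, here is a small sketch of binary log-loss using only the standard library. The function name and the clipping constant `eps` are my own choices for illustration; this is not Kaggle's scoring code.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss.

    y_true: 1 if the first team won, else 0.
    p_pred: predicted probability that the first team wins.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that a prediction of exactly 0 or 1 does not hit log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Predicting 50% for every game scores ln(2), about 0.693.
coin_flip = log_loss([1, 0], [0.5, 0.5])

# A confident wrong prediction is penalized much more than a hedged one.
confident_miss = log_loss([0], [0.99])
hedged_miss = log_loss([0], [0.60])
```

Because confident misses are punished so heavily, a good Kaggle submission rewards well-calibrated probabilities rather than bold picks.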

There are a couple of obvious problems with the model:

  • I did not actually train the model on past tournaments, only on regular season games. That’s just because I didn’t take the time.
  • The model does not account for injuries.
  • NCAA games are not purely “neutral site” games because sometimes game sites are closer to one team than another. I have code for this that I will probably use next year.
  • I am splitting the difference between trying to create a good Kaggle submission and trying to create a “good” bracket. There are subtle differences between the two but I will spare you the details.
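One way to turn the “neutral site” issue into a feature is to compare how far each team travels to the game site. The sketch below uses the haversine great-circle formula; the function names and the idea of using the difference in travel distance are assumptions for illustration, not the code I mention above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def travel_advantage_km(team_a, team_b, site):
    """Positive when team A's campus is closer to the game site than team B's.

    Each argument is a (lat, lon) tuple in degrees.
    """
    return haversine_km(*team_b, *site) - haversine_km(*team_a, *site)


# Sanity check: half the Earth's circumference along the equator.
half_way = haversine_km(0.0, 0.0, 0.0, 180.0)
```

A feature like `travel_advantage_km` could then be fed to the model alongside the other team metrics.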

I will create a GitHub repo for this code…sometime. For now, you can look at the code and raw data files here. The code is in ncaa.py.

Author: natebrix

Follow me on twitter at @natebrix.
