2018 NCAA Tournament Picks

Every year since 2010 I have used data science to predict the results of the NCAA Men’s Basketball Tournament. In this post I will describe the methodology that I used to create my picks (full bracket here). The model has Virginia, Michigan, Villanova, and Michigan State in the Final Four with Virginia defeating Villanova in the championship game:

[Screenshot: bracket picks]

Here are my ground rules:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than one work day on this activity (and 30 minutes for this post). This year I spent two hours cleaning up and running my code from last year.
  • I will share my code and raw data. (The data is available on Kaggle. The code is not cleaned up but here it is anyway.)

I used a combination of game-by-game results and team metrics from 2003-2017 to build the features in my model. Here is a summary:

I also performed some post-processing:

  • I transformed team ranks to continuous variables given a heuristic created by Jeff Sonos.
  • Standard normalization.
  • One hot encoding of categorical features.
  • Upset generation. I found the results to be not interesting enough, so I added a post-processing function that looks for games where the win probability for the underdog (a significantly lower seed) is quite close to 0.5. In those cases the model picks the underdog instead.
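The upset-generation step can be sketched roughly as below. This is not the code behind the picks: the function name, the `min_seed_gap` and `toss_up_margin` thresholds, and the team labels are all assumptions chosen for illustration.

```python
def pick_winner(favorite, underdog, p_favorite, seed_favorite, seed_underdog,
                min_seed_gap=4, toss_up_margin=0.05):
    """Return the predicted winner, flipping near toss-ups to the underdog.

    p_favorite is the model's probability that the better-seeded team wins.
    min_seed_gap and toss_up_margin are illustrative values, not the
    thresholds used for the actual picks.
    """
    big_seed_gap = (seed_underdog - seed_favorite) >= min_seed_gap
    near_toss_up = (p_favorite - 0.5) <= toss_up_margin
    if big_seed_gap and near_toss_up:
        return underdog
    # Otherwise follow the model's probability as-is.
    return favorite if p_favorite >= 0.5 else underdog


close_game = pick_winner("Goliath", "David", 0.53, seed_favorite=3, seed_underdog=11)
clear_game = pick_winner("Goliath", "David", 0.80, seed_favorite=3, seed_underdog=11)
```

With these made-up thresholds the 0.53 game flips to the underdog and the 0.80 game does not. Tightening `toss_up_margin` produces fewer upsets.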

The model predicts the probability one team defeats another, for all pairs of teams in the tournament. The model is implemented in Python and uses logistic regression. The model usually performs well. Let’s see how it does this year!

2017 NCAA Tournament Picks

Every year since 2010 I have used analytics to predict the results of the NCAA Men’s Basketball Tournament. I missed the boat on posting the model prior to the start of this year’s tournament. However, I did build and run a model, and I did submit picks based on the results. Here are my model’s picks – as I write this (before the Final Four) these picks are better than 88% of those submitted to ESPN.

Here are the ground rules I set for myself:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than one work day on this activity (and 30 minutes for this post).
  • I will share my code and raw data. (Here it is.)

I used a combination of game-by-game results and team metrics from 2003-2016 to build the features in my model. Here is a summary:

I also performed some post-processing:

  • I transformed team ranks to continuous variables given a heuristic created by Jeff Sonos.
  • Standard normalization.
  • One hot encoding of categorical features.
  • Upset generation. I found the results aesthetically displeasing for bracket purposes, so I added a post-processing function that looks for games between Davids and Goliaths (i.e. I compare seeds) where David and Goliath are relatively close in strength. For those games, I go with David.

I submitted the model to the Kaggle NCAA competition, which asks for win probabilities for all possible tourney games, where submissions are scored by evaluating the log-loss of actual results and predictions. This naturally suggests logistic regression, which I used. I also built a fancy pants neural network model using Keras (which means to run my code you’ll need to get TensorFlow and Keras in addition to the usual Anaconda packages). Keras produces slightly better results in the log-loss sense. Both models predict something like 78% of past NCAA tournament games correctly.
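For readers who have not met log-loss before, here is a minimal sketch of the metric in plain Python. It illustrates the general formula only; the clipping constant `eps` is an assumption, and the competition's exact scoring rules are not reproduced here.

```python
import math


def log_loss(outcomes, probabilities, eps=1e-15):
    """Mean negative log-likelihood of 0/1 outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(outcomes)


coin_flip = log_loss([1, 0], [0.5, 0.5])   # no information either way
confident_wrong = log_loss([1], [0.1])     # team won, model said 10%
hedged_wrong = log_loss([1], [0.4])        # team won, model said 40%
```

Confident misses are punished much harder than hedged ones, which is why a good Kaggle submission and a good bracket are not quite the same thing.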

There are a couple of obvious problems with the model:

  • I did not actually train the model on past tournaments, only on regular season games. That’s just because I didn’t take the time.
  • Not accounting for injuries.
  • NCAA games are not purely “neutral site” games because sometimes game sites are closer to one team than another. I have code for this that I will probably use next year.
  • I am splitting the difference between trying to create a good Kaggle submission and trying to create a “good” bracket. There are subtle differences between the two but I will spare you the details.

I will create a github repo for this code…sometime. For now, you can look at the code and raw data files here. The code is in ncaa.py.

2016 NCAA Tournament Picks

Every year since 2010 I have used analytics to make my NCAA picks. Here is a link to the picks made by my model [PDF]: the projected Final Four is Villanova, Duke, North Carolina, and Virginia with Villanova defeating North Carolina in the final. (I think my model likes Virginia too much, by the way.)

Here’s how these selections were made. First, the ground rules I set for myself:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than on this activity (and 30 minutes for this post).
  • I will share my code and raw data.

Okay: the model. The model combines two concepts:

  1. A “win probability” model developed by Joel Sokol in 2010 as described on Net Prophet.
  2. An eigenvalue centrality model based on this post on BioPhysEngr Blog.

The win probability model accounts for margin of victory and serves as preprocessing for step 2. I added a couple of other features to make the model more accurate:

  • Home-court advantage is considered: 2.5 points, a rough estimate I made a few years ago that is presumably still reasonable.
  • The win probability is scaled by an adjustment factor which has been selected for best results (see below).
  • Recency is considered: more recent victories are weighted more strongly.

The eigenvalue centrality model requires game-by-game results. I pulled four years of game results for all divisions from masseyratings.com (holla!) and saved them as CSV. You can get all the data here. It sounds complicated, but it’s not (otherwise I wouldn’t do it) – the model requires less than 200 lines of Python, also available here. (The code is poor quality.)

How do I know these picks aren’t crap? I don’t. The future is uncertain. But, I did a little bit of backtesting. I trained the model using different “win probability” and “recency” parameters on the 2013-2015 seasons, selecting the combination of parameters that correctly predicted the highest percentage of NCAA tournament games during those seasons, getting approximately 68% of those games right. I don’t know if that’s good, but it seems to be better than applying either the eigenvalue centrality model or the win probability model separately.
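The parameter selection described above amounts to a small grid search. The sketch below shows the shape of that loop, not the actual backtest: `grid_search`, `toy_score`, and the parameter grids are hypothetical, and `toy_score` is a made-up stand-in for "fraction of past tournament games predicted correctly".

```python
from itertools import product


def grid_search(score_fn, scales, recencies):
    """Return (best_score, best_scale, best_recency) over every combination."""
    best = None
    for scale, recency in product(scales, recencies):
        score = score_fn(scale, recency)
        if best is None or score > best[0]:
            best = (score, scale, recency)
    return best


def toy_score(scale, recency):
    # Made-up objective with an arbitrary peak. A real version would rebuild
    # the ratings with these parameters and replay the past tournaments.
    return -((scale - 1.5) ** 2 + (recency - 2.0) ** 2)


best_score, best_scale, best_recency = grid_search(
    toy_score, scales=[1.0, 1.5, 2.0, 2.5], recencies=[1.0, 1.5, 2.0])
```

Because the same seasons are used both to pick the parameters and to report the accuracy, a number obtained this way is likely to be somewhat optimistic.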

In general, picks produced by my models rank in the upper quartile in pools that I enter. I hope that’s the case this year too.

Predicting the NCAA Tournament Using Monte Carlo Simulation

I have created a simulation model in Microsoft Excel using Frontline Systems’ Analytic Solver Platform to predict the 2014 NCAA Tournament using the technique I described in my previous post.

Click here to download the spreadsheet.

To try it out, go to solver.com and download a free trial of Analytic Solver Platform by clicking on Products –> Analytic Solver Platform:


Once you’ve installed the trial, open the spreadsheet. You’ll see a filled-out bracket in the “Bracket” worksheet:


Winners are determined by comparing the ratings of each team, using Excel formulas. Basically…a bunch of IF statements:


The magic of simulation is that it accounts for uncertainty in the assumptions we make. In this case, the uncertainty is my crazy rating system: it might be wrong. So instead of a single number that represents the strength of, say, Florida, we actually have a range of possible ratings based on a probability distribution. I have entered these probability distributions for the ratings for each team in column F. Double click on cell F9 (Florida’s rating), and you can see the range of ratings that the simulation considers:


The peak of the bell curve (normal) distribution is at 0.1245, the rating calculated in my previous post. Analytic Solver Platform samples different values from this distribution (and the other 63 teams), producing slightly different ratings, over and over again. As the ratings jiggle around for different trials, different teams win games and there are different champions for these simulated tournaments. In fact, if you hit F9 (or the “Calculate Now” button in the ribbon), you can see that all of the ratings change and the NCAA champion in cell Y14 sometimes changes from Virginia to Florida to Duke and so on.

Click the “play” button on the right hand side to simulate the NCAA tournament 10,000 times:


Now move over to the Results worksheet. In columns A and B you see the number of times each team won the simulated tournament (the sum of column B adds up to 10,000):


There is a pivot table in columns E and F that summarizes the results. Right-click it and choose Refresh to update both the table and the nifty chart below:



We see that even though Virginia is predicted to be the most likely winner, Florida and Duke are also frequent winners.
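The same idea can be sketched outside Excel in a few lines of Python. This is a toy version, not a port of the spreadsheet: it uses only a four-team field with mean ratings taken from the Final Four table in the picks post, and the standard deviation `RATING_SD` is an assumed value.

```python
import random
from collections import Counter

# Mean ratings in bracket order (Final Four from the picks post).
RATINGS = [("Florida", 0.1285), ("Virginia", 0.1310),
           ("Arizona", 0.1212), ("Duke", 0.1266)]
RATING_SD = 0.005  # assumed spread; not the value used in the spreadsheet


def simulate_once(rng):
    """Draw one noisy rating per team, then play the bracket out."""
    field = [(name, rng.gauss(mean, RATING_SD)) for name, mean in RATINGS]
    while len(field) > 1:
        # Adjacent teams meet; the higher sampled rating advances.
        field = [max(field[i], field[i + 1], key=lambda team: team[1])
                 for i in range(0, len(field), 2)]
    return field[0][0]


def simulate(n_trials=10_000, seed=0):
    """Count how often each team wins across n_trials simulated tournaments."""
    rng = random.Random(seed)
    return Counter(simulate_once(rng) for _ in range(n_trials))


counts = simulate()
```

`counts.most_common()` plays the role of the pivot table: it lists each team with the number of simulated titles it won.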

What’s nice about the spreadsheet is that you can change it to do your own simulations. Change the values in columns D and E in the Bracket worksheet to incorporate your own rating system and see who your model predicts will win. The simulation only scratches the surface of what Analytic Solver Platform can do. Go crazy with correlated distributions (perhaps by conference?) or even simulation-optimization models to tune your model. Have fun.

NCAA Tournament Analytics Model 2014: Methodology

I revealed my analytics model’s 2014 NCAA Tournament picks in yesterday’s post. Today, I want to describe how the ratings were determined. (Fair warning: this post will be quite a bit more technical and geeky.)

Click here to download the Python model source code.

My NCAA prediction model computes a numerical rating for each team in the field. Picks are generated by comparing team ratings: the team with the higher rating is predicted to advance. As I outlined in my preview, the initial model combines two ideas:

  1. A “win probability” model developed by Joel Sokol in 2010 as described on Net Prophet.
  2. An eigenvalue centrality model based on this post on BioPhysEngr Blog.

The eigenvalue centrality model creates a big network (also called a graph) that links all NCAA teams. The arrows in the network represent games between teams. Eigenvalue centrality analyzes the network to determine which network nodes (that is, which teams) are strongest. The model I described in my preview was pretty decent, but it failed to address two important issues:

  • Recently played games should count more than games at the beginning of the season.
  • Edge weights should reflect the probability one team is stronger than another, rather than probability one will beat another on a neutral floor.
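Before digging into those two issues, here is a minimal sketch of the centrality computation itself (usually called eigenvector centrality), done by power iteration on a three-team toy network. It is not the downloadable model: the team names and edge weights are made up, and the convention that row i holds team i's credit against each opponent is an assumption.

```python
def centrality(weights, iterations=200):
    """Power iteration for the dominant eigenvector of a non-negative matrix.

    weights[i][j] is the credit team i earns from its games against team j.
    Returns scores normalised to sum to 1.
    """
    n = len(weights)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # A team's new score is its credit against each opponent,
        # weighted by how strong that opponent currently looks.
        updated = [sum(weights[i][j] * scores[j] for j in range(n))
                   for i in range(n)]
        total = sum(updated)
        scores = [value / total for value in updated]
    return scores


teams = ["A", "B", "C"]
weights = [
    [0.0, 0.8, 0.7],  # A beat both B and C convincingly
    [0.2, 0.0, 0.9],  # B lost to A, beat C
    [0.3, 0.1, 0.0],  # C lost to both
]
scores = centrality(weights)
ranking = [team for _, team in sorted(zip(scores, teams), reverse=True)]
```

Beating a highly rated team raises your own rating more than beating a weak one, which is how strength of schedule enters without being modelled explicitly.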

The first issue is easy to explain. In my initial model, game-by-game results were analyzed to produce edge weights in a giant network linking teams. The weight was simply the formula given by Joel Sokol in his 2010 paper. However, it seems reasonable that more recently played games are more important, from a predictive perspective, than early season games. To account for this factor, I scale the final margin of victory for more recently played games by a “recency” factor R. If one team beats another by K points at the start of the season, we apply the Sokol formula with K. However, if one team beats another by K points at the end of the season, we apply the formula with R*K. If R=2, that means a 10 point victory at the start of the season is worth the same as a 5 point victory at the end. If the game was in the middle of the season, we’d apply half of the adjustment, so the 5 point victory would count as 7.5 points.
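In code, the recency adjustment might look like the sketch below. The linear ramp between the start and end of the season is inferred from the "half of the adjustment" example, and `effective_margin` and `season_fraction` are names made up for illustration.

```python
def effective_margin(margin, recency, season_fraction):
    """Scale a margin of victory by how late in the season the game was played.

    season_fraction runs from 0.0 (first game) to 1.0 (last game); the
    multiplier moves linearly from 1 to `recency` (R) over that range.
    """
    multiplier = 1.0 + (recency - 1.0) * season_fraction
    return margin * multiplier


early = effective_margin(10, recency=2, season_fraction=0.0)  # 10.0
middle = effective_margin(5, recency=2, season_fraction=0.5)  # 7.5
late = effective_margin(5, recency=2, season_fraction=1.0)    # 10.0
```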

The second issue – regarding edge weights and team strength – is more subtle. As you saw in the “Top 25” from my preview post, there were some strange results. For example, Canisius was rated #24. The reason is that the Sokol formula is not very sensitive to small margins of victory.

Let’s look at an example. Here is the Sokol formula: phi(0.0189 * x – 0.0756)

If you try the values 1..6 you get the probabilities [0.477, 0.485, 0.492, 0.5, 0.508, 0.515]. This means that the difference between a 1-point home win and a 6-point home win is only 0.515 – 0.477 ≈ 0.038, or just under 4 percentage points. As a result, most of the nonzero values in the big adjacency matrix that we create are around 0.5, and our centrality method ends up finding teams that are influential in the network rather than teams that are dominant. One way to find teams that are dominant is to scale the margin of victory so that a 6-point victory is worth much more than a 1-point victory. So the hack here is to substitute S*x for x in the formula, where S is a “sensitivity” scaling factor.
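Here is a short sketch that evaluates the formula, reading phi as the standard normal CDF (that reading reproduces the probabilities quoted above). The function names are made up, and the `sensitivity` argument is the S substitution just described.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def win_probability(margin, sensitivity=1.0):
    """phi(0.0189 * S * x - 0.0756); S = 1 gives the unscaled formula."""
    return normal_cdf(0.0189 * sensitivity * margin - 0.0756)


unscaled = [round(win_probability(x), 3) for x in range(1, 7)]
scaled = [round(win_probability(x, sensitivity=2.0), 3) for x in range(1, 7)]

# Gap between a 6-point and a 1-point win, before and after scaling.
unscaled_spread = unscaled[-1] - unscaled[0]
scaled_spread = scaled[-1] - scaled[0]
```

`unscaled` comes out as [0.477, 0.485, 0.492, 0.5, 0.508, 0.515], and `scaled_spread` is larger than `unscaled_spread`, which is the whole point of the hack.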

One last tiny adjustment I made was to pretend that Joel Embiid did not play this year, so that Kansas’s rating reflects their strength without him. Long story short, I subtracted 1.68 points for all games that Joel Embiid appeared in. This post has the details.

My Python code implements everything I described in this post and the preview. I generated the picks by choosing the recency parameter R = 1.5 and sensitivity parameter S = 2. Here is a sample call and output:

scoreNcaa(25, 20, 2, 1.5, 0)
['Virginia', 0.13098760857436742]
['Florida', 0.12852960094006807]
['Duke', 0.12656196253849666]
['Kansas', 0.12443601960952431]
['Michigan St', 0.12290861109638007]
['Arizona', 0.12115701603335856]
['Wisconsin', 0.11603580613955565]
['Pittsburgh', 0.11492421298144373]
['Michigan', 0.11437543620057213]
['Iowa St', 0.1128795675290855]

If you’ve made it this far, and have the source code, you can figure out what most of the other parameters mean. (Or you can ask in the comments!)

The question “why did Virginia come out first?” is difficult to answer succinctly. Basically:

  • Virginia, Florida, and Duke are all pretty close.
  • Virginia had a consistently strong schedule.
  • Their losses were generally speaking close games to strong opponents.
  • They had several convincing, recent victories over other very strong teams.

In a future post, I will provide an Excel spreadsheet that will allow you to build and simulate your own NCAA tournament models!

NCAA Tournament Analytics Model 2014: Picks

Here are my picks for the 2014 NCAA Tournament, based on the analytics model I described in this post. This post contains the picks and my next post will contain the code and methodology for the geeks among us. I use analytics for my NCAA picks for my own education and enjoyment, and to absolve myself of responsibility for them. No guarantees!

Here is a link to picks for all rounds in PDF format.

Here is a spreadsheet with all picks and ratings.

This year’s model examined every college basketball game played in Division I, II, III, and Canada based on data from Prof. Peter Wolfe and from MasseyRatings.com. The ratings implicitly account for strength of opposition, and explicitly account for neutral site games, recency, and Joel Embiid’s back (it turned out not to matter). I officially deem these picks “not crappy”.

The last four rounds are given at the end – the values next to each team are the scores generated by the model.

The model predicts Virginia, recent winners of the ACC tournament, will win it all in 2014 in a rematch with Duke. Arizona was rated the sixth best team in the field but is projected to make it to the Final Four because it plays in the weakest region (the West). Florida, the second strongest team in the field (juuust behind Virginia) joins them. Wichita State was rated surprisingly low (25th) even though it is currently undefeated, basically due to margin of victory against relatively weaker competition (although the Missouri Valley has been an underrated conference over the past several years). Wichita State was placed in the Midwest region, clearly the toughest region in the bracket, and is projected to lose to underseeded Kentucky in the second round. Here are the average and median strengths of the four regions. The last column is the 75th percentile, which is an assessment of the strength of the elite teams in each bracket. Lower values mean an easier region:

Region    Average   Median   75th percentile
South     0.0824    0.0855   0.1101
East      0.0816    0.0876   0.1064
West      0.0752    0.0831   0.1008
Midwest   0.0841    0.0890   0.1036

The model predicts a few upsets (though not too many). The winners of the “play-in games” are projected to knock off higher seeded Saint Louis and UMass. Kentucky is also projected to beat Louisville, both of whom probably should have been seeded higher. Baylor is projected to knock off Creighton, busting Warren Buffett’s billion dollar bracket in Round 2.

Sweet 16                   Elite 8
Florida          0.1285    Florida    0.1285
VA Commonwealth  0.1097
Syracuse         0.1111    Kansas     0.1244
Kansas           0.1244
Virginia         0.1310    Virginia   0.1310
Michigan St      0.1229
Iowa St          0.1129    Iowa St    0.1129
Villanova        0.1060
Arizona          0.1212    Arizona    0.1212
Oklahoma         0.1001
Baylor           0.1013    Wisconsin  0.1160
Wisconsin        0.1160
Kentucky         0.1081    Kentucky   0.1081
Louisville       0.1065
Duke             0.1266    Duke       0.1266
Michigan         0.1144


Final Four                 Championship
Florida          0.1285    Virginia   0.1310
Virginia         0.1310    Duke       0.1266
Arizona          0.1212
Duke             0.1266
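Since every pick comes from comparing two ratings, the later rounds can be reproduced mechanically from the Sweet 16 ratings. Here is a minimal sketch of that (not the model itself); the names `SWEET_16` and `play_round` are made up, and the bracket order is read off the table above.

```python
SWEET_16 = [  # (team, rating) in bracket order, copied from the table above
    ("Florida", 0.1285), ("VA Commonwealth", 0.1097),
    ("Syracuse", 0.1111), ("Kansas", 0.1244),
    ("Virginia", 0.1310), ("Michigan St", 0.1229),
    ("Iowa St", 0.1129), ("Villanova", 0.1060),
    ("Arizona", 0.1212), ("Oklahoma", 0.1001),
    ("Baylor", 0.1013), ("Wisconsin", 0.1160),
    ("Kentucky", 0.1081), ("Louisville", 0.1065),
    ("Duke", 0.1266), ("Michigan", 0.1144),
]


def play_round(field):
    """Pair adjacent teams and advance the one with the higher rating."""
    return [max(field[i], field[i + 1], key=lambda team: team[1])
            for i in range(0, len(field), 2)]


rounds = [SWEET_16]
while len(rounds[-1]) > 1:
    rounds.append(play_round(rounds[-1]))

final_four = [name for name, _ in rounds[2]]
title_game = [name for name, _ in rounds[3]]
champion = rounds[-1][0][0]
```

Running it gives a Final Four of Florida, Virginia, Arizona, and Duke, with Virginia beating Duke in the title game.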

Using Analytics to Assess Joel Embiid’s Injury and Kansas’s Chances

Joel Embiid is the starting center of the Kansas Jayhawks and one of the most talented college basketball players in the country. Unfortunately he suffered a stress fracture in his back and is likely to miss at least the first weekend of the upcoming NCAA tournament. Some think that Kansas is headed for an early round exit while others think that Kansas’s seed should not be affected at all. Can we use analytics, even roughly, to assess the impact on Kansas’s NCAA tournament prospects?

How about looking at win shares? A “win share” is a statistical estimate of the number of team wins that can be attributed to an individual’s performance. According to the amazing Iowa-powered basketball-reference.com, Embiid’s win shares per 40 minutes are an impressive 0.212 (an average player is around .100). His primary replacement, Tarik Black, is at 0.169. That’s a difference of about 0.042 win shares per 40 minutes. I probably can’t technically do what I am about to do, but who cares. Since Kansas averages 80 points a game, the win share difference is 80 x 0.042 = 3.36 points per game. However, Embiid was only playing around 23 minutes a game, and Black isn’t even getting all of his minutes. Certain other teammates (Wiggins!) may simply play more minutes than usual to compensate. So 3.36 is probably on the high side. If we estimate that Embiid’s presence will be missed for only 20 player-minutes per game, an estimate of 1.68 points per game is probably reasonable. I will use this assumption in my upcoming NCAA Tournament model.
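The back-of-the-envelope arithmetic above, written out as a tiny sketch. The function name is made up, the inputs are the figures stated in the paragraph, and the win-shares-to-points conversion is the same admittedly rough shortcut.

```python
def points_missed(team_points_per_game, win_share_gap_per_40, minutes_missed):
    """Rough points-per-game cost of swapping one player for a replacement."""
    full_game_cost = team_points_per_game * win_share_gap_per_40
    # Pro-rate by the share of a 40-minute game the player is actually missed.
    return full_game_cost * (minutes_missed / 40.0)


full_cost = points_missed(80, 0.042, 40)      # replacement covers all 40 minutes
adjusted_cost = points_missed(80, 0.042, 20)  # presence missed for 20 minutes
```

`full_cost` is the 3.36 figure and `adjusted_cost` is the 1.68 points per game carried into the tournament model.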

If we look at Kansas’s schedule we see that this difference would possibly only have swayed two games (Oklahoma State and Texas Tech). Embiid’s loss should not affect his team’s seeding any more than it already has by having lost to Iowa State in the Big 12 tournament. Kansas is a solid 2 seed, but Embiid’s loss, if prolonged, could delay a fifteenth Final Four appearance.