2017 NCAA Tournament Picks

Every year since 2010 I have used analytics to predict the results of the NCAA Men’s Basketball Tournament. I missed the boat on posting the model prior to the start of this year’s tournament. However, I did build and run a model, and I did submit picks based on the results. Here are my model’s picks – as I write this (before the Final Four) these picks are better than 88% of those submitted to ESPN.

Here are the ground rules I set for myself:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than one work day on this activity (and 30 minutes for this post).
  • I will share my code and raw data. (Here it is.)

I used a combination of game-by-game results and team metrics from 2003-2016 to build the features in my model. Here is a summary:

I also performed some post-processing:

  • I transformed team ranks to continuous variables given a heuristic created by Jeff Sonos.
  • Standard normalization.
  • One hot encoding of categorical features.
  • Upset generation. I found the results aesthetically displeasing for bracket purposes, so I added a post-processing function that looks for games between Davids and Goliaths (i.e. I compare seeds) where David and Goliath are relatively close in strength. For those games, I go with David.

I submitted the model to the Kaggle NCAA competition, which asks for win probabilities for all possible tourney games, where submissions are scored by evaluating the log-loss of actual results and predictions. This naturally suggests logistic regression, which I used. I also built a fancy pants neural network model using Keras (which means to run my code you’ll need to get TensorFlow and Keras in addition to the usual Anaconda packages). Keras produces slightly better results in the log-loss sense. Both models predict something like 78% of past NCAA tournament games correctly.

There are a couple of obvious problems with the model:

  • I did not actually train the model on past tournaments, only on regular season games. That’s just because I didn’t take the time.
  • Not accounting for injuries.
  • NCAA games are not purely “neutral site” games because sometimes game sites are closer to one team than another. I have code for this that I will probably use next year.
  • I am splitting the difference between trying to create a good Kaggle submission and trying to create a “good” bracket. There are subtle differences between the two but I will spare you the details.

I will create a github repo for this code…sometime. For now, you can look at the code and raw data files here. The code is in ncaa.py.

2015 NFL Statistics by Player and Team

I have downloaded stats for the recently completed 2015 NFL regular season from yahoo.com, cleaned the data, and saved the data in CSV format. The files are located here. If you prefer a github repository, check here. The column headers should be self-explanatory.

You will find seven CSV files, which you can open in Excel or Google Sheets:

  • QB: quarterback data.
  • RB: running backs.
  • WR: wide receivers.
  • TE: tight ends.
  • K: kickers. I have broken out attempted and made field goals by distance into separate columns for convenience.
  • DEF: defensive stats by team.
  • ST: special teams stats by team.

Enjoy!

2016 NCAA Tournament Picks

Every year since 2010 I have used analytics to make my NCAA picks. Here is a link to the picks made by my model [PDF]: the projected Final Four is Villanova, Duke, North Carolina, and Virginia with Villanova defeating North Carolina in the final. (I think my model likes Virginia too much, by the way.)

Here’s how these selections were made. First, the ground rules I set for myself:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than on this activity (and 30 minutes for this post).
  • I will share my code and raw data.

Okay: the model. The model combines two concepts:

  1. A “win probability” model developed by Joel Sokol in 2010 as described on Net Prophet.
  2. An eigenvalue centrality model based on this post on BioPhysEngr Blog.

The win probability model accounts for margin of victory and serves as preprocessing for step 2. I added a couple of other features to make the model more accurate:

  • Home-court advantage is considered: 2.5 points which was a rough estimate I made a few years ago and presumably is still reasonable.
  • The win probability is scaled by an adjustment factor which has been selected for best results (see below).
  • Recency is considered: more recent victories are weighted more strongly.

The eigenvalue centrality model requires game-by-game results. I pulled four years of game results for all divisions from masseyratings.com (holla!) and saved them as CSV. You can get all the data here. It sounds complicated, but it’s not (otherwise I wouldn’t do it) – the model requires less than 200 lines of Python, also available here. (The code is poor quality.)

How do I know these picks aren’t crap? I don’t. The future is uncertain. But, I did a little bit of backtesting. I trained the model using different “win probability” and “recency” parameters on the 2013-2015 seasons, selecting the combination of parameters that correctly predicted the highest percentage of NCAA tournament games during those seasons, getting approximately 68% of those games right. I don’t know if that’s good, but it seems to be better than applying either the eigenvalue centrality model or the win probability model separately.

In general, picks produced by my models rank in the upper quartile in pools that I enter. I hope that’s the case this year too.

2014 NFL Statistics by Player and Team in Excel

I have downloaded stats for the recently completed 2014 NFL regular season from yahoo.com, cleaned the data, and saved in Excel and CSV formats. The files are located here. The column headers should be self-explanatory.

The Excel workbook has seven worksheets:

  • QB: quarterback data.
  • RB: running backs.
  • WR: wide receivers.
  • TE: tight ends.
  • K: kickers. I have broken out attempted and made field goals by distance into separate columns for convenience.
  • DEF: defensive stats by team.
  • ST: special teams stats by team.

The same folder also has separate CSV files for each position, which may be more helpful if you are a coder.

Predicting the 2014-2015 NBA Season

Over the weekend I created a model in Excel to predict the 2014-2015 NBA season. The model simulates the full 82-game schedule, using player Win Shares from the 2013-2014 season to estimate the strength of each team, accounting for roster changes. This model is not perfect, or even particularly sophisticated, but it is interesting. Here are the model’s predictions as of 10/15/2014, with projected playoff teams in bold.

Eastern W L PCT GB Home Road
Cleveland 61 21 0.744 0 31-10 30-12
Toronto 57 25 0.695 4 30-11 27-14
Chicago 49 33 0.598 12 26-15 23-18
Washington 44 38 0.537 17 24-17 21-20
New York 44 38 0.537 17 23-18 21-20
Miami 42 40 0.512 19 22-19 20-21
Detroit 42 40 0.512 19 22-19 20-21
Charlotte 41 41 0.500 20 22-19 19-22
Atlanta 41 41 0.500 20 22-19 19-22
Indiana 37 45 0.451 24 20-21 17-24
Boston 30 52 0.366 31 16-25 14-27
Brooklyn 26 56 0.317 35 14-27 11-30
Orlando 25 57 0.305 36 14-27 12-29
Milwaukee 24 58 0.293 37 14-27 11-30
Philadelphia 14 68 0.171 47 8-33 6-35
Western W L PCT GB Home Road
LA Clippers 57 25 0.695 0 30-11 27-14
San Antonio 57 25 0.695 0 30-11 27-14
Oklahoma City 54 28 0.646 3 29-14 26-15
Phoenix 53 29 0.646 4 28-13 25-16
Golden State 53 29 0.646 4 28-13 25-16
Portland 49 33 0.598 8 26-15 23-18
Houston 47 35 0.573 10 25-16 22-19
Dallas 47 36 0.566 10 24-17 22-19
Memphis 41 41 0.500 16 22-19 19-22
Denver 40 42 0.488 17 21-20 19-22
Minnesota 37 45 0.451 20 20-21 17-24
Sacramento 32 50 0.390 25 17-24 15-26
LA Lakers 31 51 0.378 26 17-24 14-27
New Orleans 27 55 0.329 30 15-26 12-29
Utah 25 58 0.301 33 14-27 11-30

You can download my full spreadsheet here. It’s complicated but not impossible to follow. It does not exactly match the results presented above because I have a messier version that accounts for recent injuries, e.g. Kevin Durant.

The Cavs, Spurs, and Clips are the favorites to win the title in this model (a previous version of this model also had the Thunder in this class, but Kevin Durant is now injured). Comparing these estimates to over-unders in Vegas, the biggest differences are Brooklyn (lower), Indiana (higher), Memphis (lower), Minnesota (higher), New Orleans (lower), Phoenix (higher). If you take the time to read through the methodology at the end of this post, you may be able to see why some of these differences exist. Some are probably reasonable, others may not be.

How It Works

Many of the ingredients for this model were presented in my previous three posts, where I compiled game-by-game results for the 2013-2014 season, built a simple model to predict rookie performance, and tracked roster changes. Now the task is pretty simple: estimate the strength of each team, figure out how unpredictable games are, and then simulate the season using the strengths, accounting for uncertainty. At the end I discuss weaknesses of this model, which if you are a glass-half-full type of person also suggest areas for improvement.

Step 1: Estimate the strength of each team. Team strengths are estimated by adding up the 2013-2014 Win Shares for the top twelve players on each NBA team. In my last post I gave a spreadsheet with Win Shares for all 2013-2014 NBA players based on data from basketball-reference.com. I made three adjustments to this data for the purposes of this analysis:

  • Added rookies. I estimated projected 2014-2015 Win Shares for rookies using the logarithmic curve given in this post.
  • Accounted for injuries to good players. Kobe Bryant, Derrick Rose, Rajon Rondo, and a couple of other high profile players were injured in 2013-2014. I replaced their Win Share total with the average of the past three seasons, including the season they were injured. Is this reasonable? I don’t know.
  • Trimmed to 12. I manually trimmed rosters so that only the 12 players with the highest Win Shares remained.

Adding Win Shares gives an overall “strength rating” for each team.

Step 2: Estimate the unpredictability of game results. Most of the time, a good team will beat a bad team. Most of the time. Can we quantify this more precisely? Sure. From a previous post, I determined that home court advantage is approximately 2.6 points per game. We also found that although the difference in total season wins is a predictor of who will win in a matchup between two teams, it is a rather weak predictor. in other words, bad teams beat good teams quite often, especially at home. For our prediction model we make another simple assumption: every team’s performance over the course of the season varies according to a normal distribution, with the mean of this distribution corresponding to their overall team strength.

Normal distributions are defined by two parameters: mean and standard deviation. If I know what the normal distribution looks like then I can estimate the probability of the home team winning in a matchup: take the difference of their team strengths, then calculate the cdf of the distribution at –2.6 (the home court advantage). But what is the standard deviation? I can estimate it by “replaying” all of the games in the previous season. If I guess a value for the standard deviation, I can calculate win probabilities for all games. If I add up the win probabilities, for say, Boston, then this should sum to Boston’s win total for the season (sadly, 25). So if I want to estimate the standard deviation, all I have to do is minimize the sum of deviations from estimated and actual 2013-2014 win totals. I can do this using Excel’s Solver: it’s a nonlinear minimization problem involving only one variable (the standard deviation).

It turns out that the resulting estimate does a very good job of matching 2013-2014 results:

  • The estimated win totals for all teams were within 2 wins of their actual values.
  • The estimated home winning percentage matches the actual value quite closely!

Step 3: Determine win totals for the 2014-2015 season. I obtained the 2014-2015 schedule for $5 from nbastuffer.com. Using this schedule, I calculated win probabilities for each game using the team strengths in Step 1 and the standard deviation in Step 2. If I add up the totals for each team, I get their estimated win totals. Voila! Since the prediction is created by looking at each game on the schedule, we also get home and away records, in conference records, and so on. It’s also easy to update the estimate during the season as games are played, players are traded or injured, and so on.

Why this model stinks. The biggest virtue of this model is that it was easy to build. I can think of at least ten potential shortcomings with this model:

  1. Win Shares are probably not the best metric for individual and team strength.
  2. It assumes that individual performance for 2014-2015 will be the same as 2013-2014. Paul Pierce isn’t getting any younger.
  3. It does not account for predictable changes in playing time from season-to-season.
  4. 2014-2015 win shares are not normalized to account for players leaving and entering the league.
  5. 2014-2015 win shares do not account for positive and negative synergies between players.
  6. There is no reason to believe that the standard deviation calculated in Step 2 should be the same for all teams.
  7. I have not given any justification for using a normal distribution at all!
  8. The vagaries of the NBA schedule are not accounted for. For example, teams play worse in their second consecutive road game.
  9. Several teams, including the Philadelphia 76ers, will tank games.
  10. Injuries were handled in an arbitrary and inconsistent fashion.

It will be interesting to see how this model performs, in spite of its shortcomings.

NBA Rosters and Team Changes: 2013-2014

I have downloaded statistics for all NBA players from basketball-reference.com, and accounted for roster changes (as of Sunday, October 5).

The Tm2013 column is a three-letter abbreviation for the player’s 2013-2014 team. For players that played on two or more teams, the entry represents the team for which the player played the most minutes. The Tm2014 is the player’s current NBA team (as of Sunday, October 5) according to nba.com. Rookies are not included in this spreadsheet.

If you’ve read my previous two posts, you may have guessed that I am leading up to a prediction of the upcoming 2014-2015 NBA season. You’d be correct – I will post my model and predictions tomorrow.

In the meantime, you can actually use the spreadsheet above to create a very crude prediction. The last column in this spreadsheet is “win shares”. If you create a pivot table based on Tm2014 and Win Shares, you get the sum of player win shares for current rosters – a crude measure of team strength. Here is what that table looks like:

Team Sum of WS
CLE 60.7
SAS 59.4
TOR 58.6
LAC 58
IND 56.6
OKC 54.1
GSW 53.8
PHO 52.6
POR 50.8
DAL 48.6
HOU 47.9
WAS 46.8
NYK 44.3
DET 43.9
MEM 43.5
MIA 40.8
CHI 40.4
ATL 38.9
DEN 36.9
MIN 34.3
SAC 32.6
CHA 32.1
NOP 31
LAL 26.9
BRK 26.1
BOS 25.9
UTA 23.8
ORL 23.1
MIL 21.3
PHI 7.2

I can think of at least five weaknesses in this “pivot table model”. Can you?

Predicting NBA Rookie Performance By Draft Position

Nate Silver (and others) have tracked how NBA draft position relates to total career performance, see for example this article. But what about first-year performance?

I pulled two sets of data from basketball-reference.com to answer this question:

I then merged them using Power Query and then created a pivot table to calculate the average number of rookie season “win shares” by draft position. You can download my Excel workbook here. Here is what I found:

AverageWinSharesByDraftPosition

The first pick in the draft averages nearly five Win Shares in his rookie season, and while the pattern is irregular, win shares decrease as we get deeper into the draft (duh). (The blip at the end is due to Isaiah Thomas, drafted by the Kings who promptly screwed up by letting him go.) I have drawn a logarithmic trendline which fits the data not-to-shabbily: R^2 of 0.7397. Obviously we could do much better if we considered additional factors related to the player (such as their college performance) and team (the strength of teammates playing the same position, who will compete with the rookie for playing time). Here are the averages for the first 30 draft positions:

Draft POSITION Win Shares
1 4.96
2 2.69
3 2.96
4 4.14
5 2.23
6 1.84
7 3.36
8 1.68
9 2.59
10 1.52
11 0.84
12 1.51
13 1.48
14 1.36
15 1.64
16 1.19
17 2.37
18 1.02
19 0.71
20 1.09
21 1.74
22 2.14
23 1.54
24 2.29
25 0.98
26 1.23
27 1.08
28 0.40
29 0.54
30 0.94
31 0.79