Archive
John Wooden
I’ve got sports on the brain – perhaps you’ve noticed. Recently, without thinking about it too deeply, I posted former UCLA coach John Wooden’s Pyramid of Success on my office door. Wooden’s definition of success – the peace of mind which is a direct result of self-satisfaction in knowing you made the effort to become the best of which you are capable – is meaningful and practical to me. Wooden’s teams not only had unprecedented success on the basketball court (10 national championships), but also got along exceptionally well in a time when social and racial tensions in America ran very high. The successes his teams achieved were enduring and any organization would do well to try to emulate them. Wooden himself was by all accounts an admirable and decent man, in contrast to boorish, small, me-first coaches that are often seen strutting sidelines in fancy suits this time of year.
After sticking the pyramid on my door it dawned on me that my job at Nielsen is to be a coach. I’m a recruiter who tries to attract and retain players who are skilled, coachable, and pleasant to be around. I’m a bench coach who tries to make adjustments in the flow of the game – changing defenses, substituting players, hollering at the refs. I’m a practice coach who tries to teach my team about the game as best I can, requiring me to know what the heck I am talking about. I don’t score a single point myself (at least, not very many) but my team is successful based on the effort that we all put in together, for which I am accountable.
In this TED talk (1) Wooden describes the difference between winning and success. I think it’s pretty amazing that he could give an unscripted talk like this at age 98! A few nice lines include:
- Failing to prepare is preparing to fail.
- There is no progress without change, but change does not guarantee progress.
- Things will work out as they should, provided we do what we should.
If you surf YouTube for other Wooden interviews, you will find many occasions where former players say, “What Coach says sounds corny, but…” An earnest approach like Wooden’s can be disarming and even uncomfortable because the message is so raw. We often wrap ourselves in cynicism and irony to give the appearance that what we’re going through doesn’t matter that much, and can’t hurt us. The genius of Wooden is that he thought carefully about the ultimate meaning of all this dribbling, cutting, boxing out and shooting, and directed his energy to help all those under his watch to find that meaning for themselves. That meaning, rather than the banners hanging from Pauley Pavilion, is Wooden’s lasting legacy.
(1): I read the recent (dead on right) HBR article about TED. The TED brand has become a joke, but this video is worth it, I promise.
NCAA Tournament Prediction Model 2013
I wrote a computer program to rate the strength of every NCAA men’s college basketball team based on the Iterative Strength Rating algorithm. Last post I previewed it, and now I am presenting my picks for the 2013 NCAA Tournament.
Summary
Peter Wolfe at UCLA has graciously provided scores for every single college basketball game (over 21,000), found here. I used this information to produce a rating for each team. I then produced a bracket by simply choosing the team with the higher rating.
My complete bracket is below: click to enlarge. Check my progress or lack thereof once the tournament starts by clicking here.
My Final Four is:
- Midwest: Duke beats Louisville
- West: New Mexico beats Gonzaga
- South: Georgetown beats Kansas
- East: Indiana beats Miami FL
with Duke beating Georgetown in the championship game. Notable upsets include Boise State defeating Arizona and Wisconsin, Bucknell defeating Butler, and Minnesota beating UCLA. The bracket is interesting in the sense that it is reasonable but the higher seed is not always selected.
Methodology
Now, the gory details. I’ve based my rating on the Iterative Strength Rating by Boyd Nation. Here’s how ISR works. First, give each team an equal rating, say 1.0. Next, go through each game and give each team some points. The winning team gets the rating of the losing team plus a “winning bonus” of 0.25. The losing team gets the rating of the winning team minus a penalty of 0.25. Once all of the games have been scored, we can update ratings for each team by dividing the team’s total score by the number of games played. Now, we can rescore the games using the updated ratings again and again until the scores stabilize. The Net Prophet blog shows that this is a pretty good way to rate teams. (By the way: I highly recommend this blog. Scott Turner has done an amazing job evaluating a number of different approaches, all using freely available software. Kudos Scott!)
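Here is a minimal Python sketch of the basic (unmodified) ISR loop just described. It is not the 140-line program itself: the function name, the `(winner, loser)` input format, the iteration cap, and the stopping tolerance are all assumptions made for illustration.

```python
def iterative_strength_rating(games, bonus=0.25, iterations=100, tol=1e-9):
    """Basic ISR. `games` is a list of (winner, loser) tuples (assumed format)."""
    teams = {team for game in games for team in game}
    ratings = {team: 1.0 for team in teams}  # every team starts equal
    for _ in range(iterations):
        totals = {team: 0.0 for team in teams}
        counts = {team: 0 for team in teams}
        for winner, loser in games:
            # Winner earns the loser's rating plus the bonus;
            # loser earns the winner's rating minus the penalty.
            totals[winner] += ratings[loser] + bonus
            totals[loser] += ratings[winner] - bonus
            counts[winner] += 1
            counts[loser] += 1
        updated = {team: totals[team] / counts[team] for team in teams}
        change = max(abs(updated[team] - ratings[team]) for team in teams)
        ratings = updated
        if change < tol:  # ratings have stabilized
            break
    return ratings


# Tiny made-up schedule: A beat B, B beat C, A beat C.
ratings = iterative_strength_rating([("A", "B"), ("B", "C"), ("A", "C")])
```

On this toy schedule the ratings settle with A above B above C. On a real schedule the loop only behaves well if the teams are reasonably connected through common opponents.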
This year, I created my own variant of ISR. There are two main modifications. First, I am accounting for margin of victory. How? In a 2006 paper by Paul Kvam and Joel Sokol, the authors derive an expression for the probability that Team A will defeat Team B, given that Team A beat Team B by x points on Team A’s home court:
RH(x) = exp(0.292 x - 0.6228) / (1 + exp(0.292 x - 0.6228))
This function levels off as the margin gets higher: the values for x=21 and x=20 are almost identical and close to 1. This function is also an indirect measure of strength of victory. Given the score of a game, and taking into account the home floor, we can evaluate this function and scale the “winning bonus” – so a large margin of victory will result in a winning bonus greater than 0.25 and a smaller margin of victory will result in a smaller winning bonus.
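To make the leveling-off concrete, here is a small Python sketch that evaluates RH(x) with the coefficients quoted above. The `scaled_bonus` function is a hypothetical scaling rule, not necessarily the one used in the model: it assumes the bonus is proportional to RH(x) and equals the base 0.25 when RH(x) = 0.5.

```python
import math


def rh(margin):
    """Logistic expression from the text, evaluated at a home margin of x points."""
    z = 0.292 * margin - 0.6228
    return math.exp(z) / (1.0 + math.exp(z))


def scaled_bonus(margin, base_bonus=0.25):
    """Hypothetical scaling (an assumption): bonus proportional to rh(margin),
    equal to base_bonus when rh(margin) is exactly 0.5."""
    return base_bonus * rh(margin) / 0.5


for x in (1, 5, 10, 20, 21):
    print(x, round(rh(x), 4), round(scaled_bonus(x), 4))
```

Under this assumed rule a 20-point win earns nearly double the base bonus, a 1-point win earns a little less than 0.25, and going from 20 to 21 points changes almost nothing.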
The second variation is to weight games differently. I divide the season into three segments:
- The first 10 games,
- The next 10 games,
- The rest of the season.
The segments have weights: [0.8, 1.0, 1.2]. Why? Because I felt like it: games later in the season are probably a better predictor. A better approach is to find optimized weights based on tournament predictive power. After each modified ISR iteration I renormalize team ratings so they are in the range [0, 1]. Effectively this means I compute three scores for each team instead of one, but I don’t think this screws up the predictive power of the model too much given the number of observations per team (around 30).
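A sketch of the two bookkeeping steps in Python, under stated assumptions: segments are counted by each team's own game number (they could equally be defined by calendar date), and "renormalize" is taken to mean min-max rescaling. Both function names are made up for illustration.

```python
SEGMENT_WEIGHTS = [0.8, 1.0, 1.2]  # first 10 games, next 10, rest of season


def game_weight(game_number):
    """Weight for a team's game, where game_number is 1-based (assumed convention)."""
    if game_number <= 10:
        return SEGMENT_WEIGHTS[0]
    if game_number <= 20:
        return SEGMENT_WEIGHTS[1]
    return SEGMENT_WEIGHTS[2]


def renormalize(ratings):
    """Min-max rescale a dict of ratings into [0, 1] (assumed normalization)."""
    lo, hi = min(ratings.values()), max(ratings.values())
    if hi == lo:
        return {team: 0.5 for team in ratings}  # degenerate case: all equal
    return {team: (r - lo) / (hi - lo) for team, r in ratings.items()}
```

Optimizing the weights, as suggested above, would mean treating `SEGMENT_WEIGHTS` as parameters and scoring each candidate set against past tournament results.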
I ran the algorithm on the complete set of 2012-2013 college basketball games, found here courtesy of Peter Wolfe of UCLA. This list is exhaustive and includes NAIA schools, Canadian schools, exhibition games, the Washington Generals, cats and dogs living together, etc. I’m not sure the teams are fully connected, so I do a pass through all of the games once, excluding exhibitions, to identify a cluster of top-tier teams (presumably all Division I and II). The algorithm is about 140 lines of Python including the code to read the data. No fancy stuff. I will post the code later.
I have been doing little NCAA models like this for a few years now, and this is the first one I am proud of. We’ll see how it does. The main difference, of course, is that I am looking at individual games rather than aggregate team statistics over a season. A colleague of mine sometimes quotes the Papa John’s slogan “Better Ingredients, Better Pizza” when referring to the use of more granular data in models. I hope this year’s pizza tastes as good as it smells. (No endorsement implied…)
NCAA Tournament Prediction Model 2013 Preview
It’s NCAA Tournament selection Sunday! As in past years, I am going to write a program to make my picks following two principles: code it fast and do something reasonable. I am ahead of the game this year – I’m done!
The approach I am using is a modified version of the Iterative Strength Rating as described on the Net Prophet blog. I am making two modifications:
- Incorporate margin of victory.
- Weight late-season games more than early-season games.
I ran a preliminary version of the code on games played through March 17 (note: post updated 3/18 to include games from the last week). The model seeds the teams as follows. Let’s see how close these seeds are to the actual ones released this afternoon!
| 1 | Louisville |
| 1 | Duke |
| 1 | Indiana |
| 1 | Miami FL |
| 2 | Kansas |
| 2 | Ohio State |
| 2 | Georgetown |
| 2 | Florida |
| 3 | Michigan |
| 3 | St Louis U. |
| 3 | Michigan St |
| 3 | New Mexico |
| 4 | Oklahoma St |
| 4 | Gonzaga |
| 4 | Marquette |
| 4 | Syracuse |
2012 Fantasy Football Prediction Model: Retrospective
Several months ago I laid out a simple model for forecasting fantasy football performance. My post included a table ranking players by value for draft purposes. Appropriately, in our league at Nielsen I used my own model to draft (selecting Ray Rice in the first round, followed by Victor Cruz and Wes Welker). I screwed up in a later round and accidentally selected Jason Witten instead of Matt Ryan – but other than that I strictly followed the advice of my blog: always choose the highest ranked player from positions that needed to be filled.
So how did I do? Well, I finished third out of twelve – not bad. Let’s take a retrospective look at my model to see how accurately I predicted the 2012 fantasy football season.
My previous post provides actual 2012 statistics for all positions. I scored each player and compared to my projections. Note that my projections did not include players who did not participate in the 2011 season (e.g. Jamaal Charles) or rookies (e.g. RG III), so they are not included in the analysis. If I regard the difference between actual and projected fantasy points as “error”, then I can compute R^2 to summarize fit. I computed R^2 for two projections:
- Project 2012 points as equal to 2011 points (“2011” in the table below)
- Project 2012 points by adjusting 2011 points as described in my post (“Model”)
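The exact R^2 formula is not spelled out here, so the sketch below assumes the standard coefficient of determination: one minus the sum of squared errors over the total sum of squares of the actual values. The function name is illustrative.

```python
def r_squared(actual, projected):
    """Coefficient of determination, treating (actual - projected) as the error."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, projected))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

With this definition a perfect projection scores 1.0, and a projection that does no better than predicting the average for everyone scores 0.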
The results are as follows:
| Pos | N | 2011 | Model |
| DEF | 29 | 0.899414 | 0.899414 |
| K | 26 | 0.961511 | 0.961511 |
| QB | 31 | 0.909683 | 0.887465 |
| RB | 101 | 0.774406 | 0.797133 |
| TE | 78 | 0.81725 | 0.824893 |
| WR | 124 | 0.806109 | 0.810808 |
The overall R^2 for the model is around 0.852. I didn’t really dig into the details too much, but I think we can reasonably conclude that:
- Kickers and Defense are pretty darn stable (probably because they largely depend on the strength of the team and opposition, which does not generally massively change between seasons).
- RB performance was the hardest to predict.
- The model worked in the sense that most positions ended up with better R^2 than if I had simply used 2011 numbers.
- The exception is for the QB position, which perhaps implies that my assumptions about touchdown production do not apply to the QB position.
In the table below I compare the points per game projected by the model (Projected) with the actual number of fantasy points in 2012 (Actual) and the relative error. I highlight a row green when the relative error is less than 10%. If it is less than 25% it is yellow, and red otherwise.
| Name | Pos | Projected | Actual | Rel. Err. |
| Drew Brees | QB | 20.65682 | 21.59875 | 0.043611 |
| Aaron Rodgers | QB | 22.32247 | 21.35625 | 0.045243 |
| Tom Brady | QB | 20.06681 | 21.2675 | 0.056456 |
| Cam Newton | QB | 19.92028 | 20.17875 | 0.012809 |
| Adrian Peterson | RB | 13.7407 | 19.0875 | 0.280121 |
| Matt Ryan | QB | 15.70105 | 19.05375 | 0.17596 |
| Tony Romo | QB | 15.42652 | 17.18875 | 0.102522 |
| Matthew Stafford | QB | 18.70318 | 17.08 | 0.095034 |
| Ben Roethlisberger | QB | 14.58558 | 17.06154 | 0.145119 |
| Arian Foster | RB | 18.69446 | 16.38125 | 0.141211 |
| Andy Dalton | QB | 12.47327 | 15.6725 | 0.20413 |
| Marshawn Lynch | RB | 13.48645 | 15.4125 | 0.124967 |
| Josh Freeman | QB | 14.20246 | 15.40625 | 0.078137 |
| Michael Vick | QB | 21.37533 | 15.168 | 0.409239 |
| Carson Palmer | QB | 20.68852 | 14.688 | 0.408532 |
| Eli Manning | QB | 16.32711 | 14.5575 | 0.12156 |
| Joe Flacco | QB | 12.35432 | 14.555 | 0.151198 |
| Kevin Kolb | QB | 11.33343 | 14.12667 | 0.197728 |
| Matt Schaub | QB | 22.60787 | 13.96375 | 0.61904 |
| Sam Bradford | QB | 16.83711 | 13.92375 | 0.209236 |
| Ray Rice | RB | 17.53886 | 13.88125 | 0.263493 |
| Calvin Johnson | WR | 15.14139 | 13.775 | 0.099193 |
| Brandon Marshall | WR | 9.919735 | 13.55 | 0.267916 |
| Ryan Fitzpatrick | QB | 13.05769 | 13.35625 | 0.022354 |
| C.J. Spiller | RB | 7.19662 | 13.26875 | 0.457626 |
| Philip Rivers | QB | 14.7622 | 13.015 | 0.134245 |
| Dez Bryant | WR | 9.147546 | 13.0125 | 0.297019 |
| Rob Gronkowski | TE | 13.32919 | 13 | 0.025322 |
| Stevan Ridley | RB | 3.349849 | 12.4625 | 0.731206 |
| A.J. Green | WR | 9.735794 | 12.4375 | 0.217223 |
| Demaryius Thomas | WR | 6.994023 | 12.3375 | 0.433109 |
| Jay Cutler | QB | 20.61836 | 12.308 | 0.6752 |
| Frank Gore | RB | 10.8703 | 12.3 | 0.116236 |
| Alex Smith | QB | 12.57711 | 12.268 | 0.025197 |
| LeSean McCoy | RB | 16.10583 | 12.10833 | 0.330144 |
| Christian Ponder | QB | 14.64461 | 12.04375 | 0.215951 |
| Jake Locker | QB | 9.065129 | 12.01273 | 0.245373 |
| Chicago Bears | DE | 8.4375 | 11.875 | 0.289474 |
| Matt Forte | RB | 14.61648 | 11.82667 | 0.235892 |
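The relative error column and the color bands can be reproduced with a few lines of Python. The tabulated values are consistent with dividing the absolute error by the actual figure, so the sketch assumes that; the function names are illustrative.

```python
def relative_error(projected, actual):
    """Absolute projection error as a fraction of the actual value."""
    return abs(projected - actual) / actual


def row_color(rel_err):
    """Color bands described above: under 10% green, under 25% yellow, else red."""
    if rel_err < 0.10:
        return "green"
    if rel_err < 0.25:
        return "yellow"
    return "red"


# Drew Brees, from the first row of the table.
brees = relative_error(20.65682, 21.59875)
print(round(brees, 4), row_color(brees))
```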
2012 NFL Statistics by player and Team in CSV format
I have downloaded stats for the recently completed 2012 NFL regular season from yahoo.com, cleaned the data, and saved in CSV format. The files are located here. The column headers should be self-explanatory.
There are seven files:
- QB: quarterback data.
- RB: running backs.
- WR: wide receivers.
- TE: tight ends.
- K: kickers. For the yardage columns, the first entry indicates FG made, the second attempted. So 4-5 means 4 made out of 5 from the distance range.
- Def: defensive stats by team.
- ST: special teams stats by team.
I have also provided a spreadsheet that combines all of the above.
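If you load the kicker file programmatically, the made-attempted cells need to be split before you can do arithmetic on them. A minimal sketch in Python, assuming every such cell has the "4-5" shape described above (the function name is made up):

```python
def parse_made_attempted(cell):
    """Split a kicker cell like '4-5' into (made, attempted) integers."""
    made, attempted = cell.strip().split("-")
    return int(made), int(attempted)


made, attempted = parse_made_attempted("4-5")  # 4 made out of 5 attempted
```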
Big Ten Expansion: Virginia and Vanderbilt?
Maryland and Rutgers recently joined the Big Ten Conference, bringing the total number of teams to 14. A recent article by Bryce Miller of the Des Moines Register speculates that the Big Ten may be looking to add two more teams from the south to form a 16 team “superconference”.
Miller points out that nearly all Big Ten members are also members of the Association of American Universities. Further, we know that Big Ten commissioner Jim Delany has said that he wants to choose schools in states that are adjacent to existing members. From those two facts we can narrow down the list of schools to 10:
| University | Distance | Need |
| Duke University | 2 | Virginia |
| Iowa State University | 0 | |
| University at Buffalo, The State University of New York | 1 | |
| University of Colorado Boulder | 1 | |
| The University of Kansas | 1 | |
| University of Missouri-Columbia | 1 | |
| The University of North Carolina at Chapel Hill | 2 | Virginia |
| University of Pittsburgh | 0 | |
| University of Virginia | 1 | |
| Vanderbilt University | 2 | Missouri, Virginia |
Duke, Vanderbilt, and North Carolina are not adjacent to current Big Ten members, but they are adjacent to states that are. The neighbors a school needs to be admitted to join are listed in the “Need” column.
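The "Distance" column is just the number of state borders you have to cross to reach a Big Ten state, which is a breadth-first search on the map. Here is a sketch in Python. The border list is a hand-entered subset covering only the candidate states and their links to the Big Ten footprint, so it is an illustration rather than a complete map of the country.

```python
from collections import deque

# States with a Big Ten member once Maryland and Rutgers are included.
BIG_TEN_STATES = {"IL", "IN", "IA", "MD", "MI", "MN", "NE", "NJ", "OH", "PA", "WI"}

# Partial, hand-entered border list (assumption: only the edges needed here).
BORDERS = {
    "NY": {"PA", "NJ"},
    "CO": {"NE", "KS"},
    "KS": {"NE", "MO", "CO"},
    "MO": {"IA", "IL", "NE", "KS", "TN"},
    "VA": {"MD", "NC", "TN"},
    "NC": {"VA", "TN"},
    "TN": {"MO", "VA", "NC"},
}


def distances_from_members(borders, members):
    """Multi-source breadth-first search: border crossings to the nearest member state."""
    graph = {}
    for state, neighbors in borders.items():
        for other in neighbors:  # store each border in both directions
            graph.setdefault(state, set()).add(other)
            graph.setdefault(other, set()).add(state)
    dist = {state: 0 for state in members}
    queue = deque(members)
    while queue:
        state = queue.popleft()
        for other in graph.get(state, ()):
            if other not in dist:
                dist[other] = dist[state] + 1
                queue.append(other)
    return dist


dist = distances_from_members(BORDERS, BIG_TEN_STATES)
```

With this subset, Virginia comes out at distance 1 while North Carolina and Tennessee come out at distance 2, matching the table.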
Some thoughts:
- Missouri recently joined the SEC and is unlikely to leave.
- Colorado recently joined the Pac 12.
- Iowa State and Kansas have already been passed over in previous rounds of expansion, including when Nebraska (a former Big 12 member) joined.
- Buffalo is a bottom-tier FBS team whose athletic program does not meet Big Ten standards. I mean, seriously. Buffalo.
- It seems unlikely that Duke and North Carolina would split up – but both require Virginia to join to be contiguous and there are only two spots.
This means that we’re down to Pittsburgh, Virginia, and Vanderbilt. The previous expansion shored up the northeast – so my speculation is that if the Big Ten looks to expand again, first on the list would be Virginia and Vanderbilt.
Analytics and the Gregg Popovich decision
NBA commissioner David Stern is upset with San Antonio Spurs coach Gregg Popovich for sending four of his top players home on a plane to rest, instead of playing them in last night’s game against the Miami Heat. Popovich rested his players in order to prepare them for a Saturday home game. ESPN notes:
The Spurs were playing their fourth game in five nights, their sixth game of a road trip and their 11th road game in November.
On top of this, the Heat had several days of rest before the game.
Analytics Question 1: How did this happen?
When I read this, I thought that surely this is something the NBA schedule makers should have taken into consideration – the quote above reads like a list of optimization constraints. It turns out they probably did:
We always hear about the "four games in five nights" and back-to-backs. Are there certain limitations or considerations you take into account?
This year, for instance, we had an outer limit of 23 back-to-backs and four "four out of fives." At some points during the process, there were teams with more than that. You look at those things and you correct them before the schedule is final. "Oh, this team has 25 back-to-backs? We can’t go with this. We have to find a way to get them down."
The NBA schedule makers use software to assist them with the scheduling process. But what is the approach? Does it involve optimization? (A quick search did not turn up anything definitive.) Does it consider not only “four games in five nights” situations, but also interactions between teams? The Spurs situation would have been far less egregious had the Heat been coming off a few games themselves – they’d be equally fatigued. Ideally, the schedule creation process should take this into account to create “fair” matchups.
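Whatever the league's software does, checking a finished schedule against these two limits is simple. A sketch in Python, assuming a team's schedule is given as a list of `date` objects with at most one game per day; the dates in the example are hypothetical, not the Spurs' actual schedule.

```python
from datetime import date


def count_back_to_backs(game_dates):
    """Number of times a team plays on consecutive calendar days."""
    dates = sorted(game_dates)
    return sum(1 for a, b in zip(dates, dates[1:]) if (b - a).days == 1)


def count_four_in_five(game_dates):
    """Number of runs of four consecutive games that fit within five nights."""
    dates = sorted(game_dates)
    return sum(
        1
        for i in range(len(dates) - 3)
        if (dates[i + 3] - dates[i]).days <= 4
    )


# Hypothetical stretch of games.
stretch = [
    date(2012, 11, 25),
    date(2012, 11, 26),
    date(2012, 11, 28),
    date(2012, 11, 29),
    date(2012, 12, 1),
]
print(count_back_to_backs(stretch), count_four_in_five(stretch))
```

A fairness check of the kind suggested above would go one step further and compare both teams' recent workloads for each game, rather than looking at one team at a time.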
Analytics Question 2: Did Popovich do the right thing?
How would we evaluate this coaching decision? The analytics question is to forecast the expected number of wins for sitting and playing the four players involved. The factors involved include:
- The increased chances of losing the Heat game. Of course, we now know that the Spurs lost last night’s game. But let’s pretend that it is a couple of weeks ago and we are advising Pop. We’d want to consider not only the comparative value of the players who would play in the place of the resting players, but also the “team effects” as described in this Michael Trick post. Wayne Winston looks at the relative effectiveness of different lineups on his blog all the time. This seems relatively straightforward.
- The short- and long-term value gained from giving his guys a break. We can break this down into two factors. First, the reduced chance of injury. Presumably not too hard. Second, the increased effectiveness due to Pop’s four top players being well rested. This could be estimated by running a regression on a player performance metric with a parameter being the number of days of rest.
I have no idea whether Popovich did the right thing, and running the numbers would not give us an authoritative answer (because all models have assumptions). But wouldn’t it be interesting to see what analytics would tell us?
Touchdowns are lognormally distributed
…well, not exactly. But it’s snappier if I put it that way.
What I really mean is: the number of pass attempts (or receptions, or carries) per touchdown is lognormally distributed, and that fact can be used to produce more stable fantasy football forecasts.
Click here to download the SAS source [estimate2.sas]
In my last two posts, I laid out simple fantasy football forecasting engines in SAS and R. An important component of a fantasy football score is the number of touchdowns scored by each player. Touchdowns can vary considerably among players with otherwise similar performance. For example, let’s look at the top three running backs from my previous post:
| Name | Rush | Rush_Yds | Rush_Avg | Rush_TD | FFPts |
| Ray Rice | 291 | 1364 | 4.7 | 12 | 292.8 |
| LeSean McCoy | 273 | 1309 | 4.8 | 17 | 280.4 |
| Maurice Jones-Drew | 343 | 1606 | 4.7 | 8 | 262 |
LeSean McCoy scored more than twice as many touchdowns as Maurice Jones-Drew, and several more than Ray Rice, yet the three otherwise have very similar stats. The gut instinct that drives this post is that I don’t think LeSean McCoy is going to score that many touchdowns this year!
How can I analyze touchdowns? I could simply draw a histogram of touchdowns per player, but that wouldn’t be very insightful. Players who get the ball more are more likely to score more touchdowns. So let’s control for that by dividing by the number of rushing attempts each player makes: let’s chart the touchdown rate. The histogram of rushing attempts per touchdown for the top 60 running backs in my 2011 dataset is interesting:
To my eye, it looks lognormally distributed. It’s not perfect, but it looks like a very reasonable approximation. A lognormal distribution makes sense – we expect that the distribution would be “heavy tailed” because going towards the left (1 touchdown per rush) is much harder than going to the right. Nobody scores every time they get the ball. Here is the SAS code that produces the histogram and the best fitting lognormal distribution. (I’m not doing this in R because I don’t know how to fit distributions in that environment. I am sure it is easy to do.)
** Plot a histogram, and save the lognormal distribution parameters. **;
proc univariate data=rb(obs=60) noprint;
  var Rush_Per_TD;
  histogram / lognormal nendpoints=15 cfill=blue outhistogram=rb_hist;
  ods output ParameterEstimates=rb_fit;
run;
The options for the “histogram” statement specify the distribution type, chart style, and an output dataset for the bins (which I then copied over to the free Excel 2013 preview to make a less-crappy looking chart). The “ods output” statement is a fancy way to save the lognormal parameters into a dataset for later use.
I can understand why there is a wide variation of values. Off the top of my head:
- Skill of the RB.
- Skill of the offensive line that blocks for the RB.
- How often the player gets carries near the goalline.
- Some teams call more red zone rush plays than others.
- Quality of opposition.
- Luck.
- Stuff like this. (This moment still burns…)
With these reasons in mind, I certainly don’t expect that all RBs will end up with the same rush/TD ratio in the long run. However, I think that players on the ends of the distribution (either way) in 2011 are likely to be closer to the middle in 2012. Here’s what we can do: compute the cumulative distribution function (CDF) of the fitted lognormal distribution at each player’s rush/TD ratio. This is a number between 0 and 1 that indicates “how extreme” the player is – 0 means all the way on the left. For example, LeSean McCoy is 0.0553 and Maurice Jones-Drew is 0.5208. This means that LeSean McCoy is an outlier (close to 0), and MJD is not (close to 1/2).
To project next year’s ratio, I take a weighted average of the player’s lognormal CDF value and the middle of the distribution (0.5), then map the result back to a rush/TD ratio with the lognormal quantile function. I somewhat arbitrarily chose to take 2/3 times the CDF and add 1/3 times 0.5. This means that while I believe that players will regress to the mean somewhat, I also believe that there are significant structural differences between players that will persist from one season to the next.
Once I have the projected rush/TD figures, I can divide rushes by them to get a projected 2012 TD figure that I can use in fantasy scoring. If I take the rather large leap that touchdowns for all positions behave in this way, I can write a generic “normalizing” function that I can use for touchdowns at all positions.
** Recalibrate a variable with the assumption that it is lognormally distributed. **;
** -- position: a dataset with player information. It should have a variable called **;
**    CalibrateVar. **;
** -- obscount: the number of observations to use for analysis. **;
** -- CalibrateVar: the variable under analysis. **;
** The macro will create a new variable ending in _1 with the calibrated values. **;
%macro Recalibrate(position, obscount, CalibrateVar);
  ** Sort the data by the initial score computed in my first post. **;
  proc sort data=&position;
    by descending FFPts0;
  run;
  ** Plot a histogram, and save the lognormal distribution parameters. **;
  proc univariate data=&position(obs=&obscount) noprint;
    var &CalibrateVar;
    histogram / lognormal nendpoints=15 cfill=blue outhistogram=&position._hist;
    ods output ParameterEstimates=&position._fit;
  run;
  ** Get the lognormal parameters into macro variables so I can use them for computation. **;
  data _null_;
    set &position._fit;
    if Parameter = 'Scale' then call symput('Scale', Estimate);
    if Parameter = 'Shape' then call symput('Shape', Estimate);
  run;
  ** Compute the projected values for each player using the distribution. **;
  data &position;
    set &position;
    LogNormCdf = cdf('LOGNORMAL', &CalibrateVar, &Scale, &Shape);
    &CalibrateVar._1 = quantile('LOGNORMAL', 0.67 * LogNormCdf + 0.33 * 0.5, &Scale, &Shape);
  run;
%mend;
A call to this macro looks like this:
%Recalibrate(rb, 60, Rush_Per_TD);
After this call I will have a variable called Rush_Per_TD_1 in my rb dataset.
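For readers without SAS, the recalibration step can be sketched in Python using only the standard library. This is not a port of estimate2.sas. Here `mu` and `sigma` stand in for the fitted Scale and Shape estimates, and the values in the example (3.7 and 0.6) are purely illustrative, not the fitted parameters. Players with zero touchdowns have no ratio and are not handled.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used to build the lognormal


def lognormal_cdf(x, mu, sigma):
    """P(X <= x) when log(X) is normal with mean mu and std dev sigma."""
    return _STD_NORMAL.cdf((math.log(x) - mu) / sigma)


def lognormal_quantile(p, mu, sigma):
    """Inverse of lognormal_cdf."""
    return math.exp(mu + sigma * _STD_NORMAL.inv_cdf(p))


def recalibrate(ratio, mu, sigma, keep=0.67):
    """Pull a ratio toward the median: blend its CDF value with 0.5, then invert."""
    p = lognormal_cdf(ratio, mu, sigma)
    blended = keep * p + (1.0 - keep) * 0.5
    return lognormal_quantile(blended, mu, sigma)


def projected_tds(attempts, tds, mu, sigma):
    """Projected touchdowns: attempts divided by the recalibrated attempts-per-TD."""
    return attempts / recalibrate(attempts / tds, mu, sigma)


# Illustrative parameters only (assumed, not fitted).
MU, SIGMA = 3.7, 0.6
print(projected_tds(273, 17, MU, SIGMA))  # LeSean McCoy's 2011 rushing line
```

The blend happens on the CDF scale rather than on the ratio itself, so a player at the median stays put while players in either tail are pulled toward the middle.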
I have modified the forecasting engine to recalibrate touchdowns for all positions – see estimate2.sas. You can see below how the rankings change when I recalibrate: here are the top 20 running backs. Players in green moved up in the ratings after recalibration; players in red moved down. Unsurprisingly, LeSean McCoy moved down.
| Pos | Name | Team | G | Rush | Rush_Yds | Rush_YG | Rush_Avg | Rush_TD | Rec | Rec_Yds | Rec_YG | Rec_Avg | Rec_Lng | YAC | Rec_1stD | Rec_TD | Fum | FumL | Rush_Per_TD | Rec_Per_TD | FFPts0 | LogNormCdf | Rec_Per_TD_1 | Rush_Per_TD_1 | Rush_TD_1 | Rec_TD_1 | FFPts | FFPtsN | Rank New | Rank Old |
| RB | Ray Rice | BAL | 16 | 291 | 1364 | 85.3 | 4.7 | 12 | 76 | 704 | 44 | 9.3 | 52 | 9.2 | 30 | 3 | 2 | 2 | 24.25 | 25.33333 | 292.8 | 0.183094 | 23.80672 | 29.76091 | 9.777928 | 3.192375 | 280.62182 | 158.7998 | 1 | 1 |
| RB | Maurice Jones-Drew | JAC | 16 | 343 | 1606 | 100.4 | 4.7 | 8 | 43 | 374 | 23.4 | 8.7 | 48 | 9.8 | 18 | 3 | 6 | 1 | 42.88 | 14.33333 | 262 | 0.520781 | 16.43331 | 42.43739 | 8.082496 | 2.616637 | 260.1947976 | 138.3728 | 2 | 3 |
| RB | Arian Foster | HOU | 13 | 278 | 1224 | 94.2 | 4.4 | 10 | 53 | 617 | 47.5 | 11.6 | 78 | 12.1 | 19 | 2 | 5 | 3 | 27.80 | 26.5 | 250.1 | 0.249994 | 24.51103 | 32.10521 | 8.65903 | 2.162292 | 243.0279329 | 121.2059 | 3 | 4 |
| RB | LeSean McCoy | PHI | 15 | 273 | 1309 | 87.3 | 4.8 | 17 | 48 | 315 | 21 | 6.6 | 26 | 8.8 | 18 | 3 | 1 | 1 | 16.06 | 16 | 280.4 | 0.05537 | 17.57991 | 25.27579 | 10.80085 | 2.730389 | 241.587431 | 119.7654 | 4 | 2 |
| RB | Michael Turner | ATL | 16 | 301 | 1340 | 83.8 | 4.5 | 11 | 17 | 168 | 10.5 | 9.9 | 32 | 8.8 | 8 | 0 | 3 | 2 | 27.36 | | 212.8 | 0.241639 | | 31.81053 | 9.462275 | 0 | 203.5736476 | 81.75161 | 5 | 6 |
| RB | Marshawn Lynch | SEA | 15 | 285 | 1204 | 80.3 | 4.2 | 12 | 28 | 212 | 14.1 | 7.6 | 26 | 8.1 | 8 | 1 | 3 | 2 | 23.75 | 28 | 215.6 | 0.173974 | 25.38375 | 29.44301 | 9.679716 | 1.103068 | 202.2967045 | 80.47466 | 6 | 5 |
| RB | Steven Jackson | STL | 15 | 260 | 1145 | 76.3 | 4.4 | 5 | 42 | 333 | 22.2 | 7.9 | 50 | 7.6 | 17 | 1 | 2 | 1 | 52.00 | 42 | 181.8 | 0.646438 | 31.61723 | 48.20039 | 5.394148 | 1.32839 | 186.1352231 | 64.31318 | 7 | 11 |
| RB | Ryan Mathews | SDG | 14 | 222 | 1091 | 77.9 | 4.9 | 6 | 50 | 455 | 32.5 | 9.1 | 42 | 9.3 | 18 | 0 | 5 | 2 | 37.00 | | 186.6 | 0.422678 | | 38.45808 | 5.77252 | 0 | 185.2351183 | 63.41308 | 8 | 8 |
| RB | Michael Bush | OAK | 16 | 256 | 977 | 61.1 | 3.8 | 7 | 37 | 418 | 26.1 | 11.3 | 55 | 9.4 | 14 | 1 | 1 | 1 | 36.57 | 37 | 185.5 | 0.415045 | 29.7855 | 38.16249 | 6.708157 | 1.242215 | 185.2022349 | 63.38019 | 9 | 9 |
| RB | Darren Sproles | NOR | 16 | 87 | 603 | 37.7 | 6.9 | 2 | 86 | 710 | 44.4 | 8.3 | 39 | 8.4 | 35 | 7 | 0 | 0 | 43.50 | 12.28571 | 185.3 | 0.530443 | 15.07624 | 42.85039 | 2.03032 | 5.70434 | 177.7079564 | 55.88591 | 10 | 10 |
| RB | Reggie Bush | MIA | 15 | 216 | 1086 | 72.4 | 5 | 6 | 43 | 296 | 19.7 | 6.9 | 34 | 7.6 | 12 | 1 | 4 | 2 | 36.00 | 43 | 176.2 | 0.404778 | 31.93509 | 37.76765 | 5.719181 | 1.346481 | 176.5939721 | 54.77193 | 11 | 13 |
| RB | Matt Forte | CHI | 12 | 203 | 997 | 83.1 | 4.9 | 3 | 52 | 490 | 40.8 | 9.4 | 56 | 8.8 | 19 | 1 | 2 | 2 | 67.67 | 52 | 168.7 | 0.793148 | 34.17389 | 56.47246 | 3.594673 | 1.521629 | 175.397812 | 53.57577 | 12 | 15 |
| RB | Frank Gore | SFO | 16 | 282 | 1211 | 75.7 | 4.3 | 8 | 17 | 114 | 7.1 | 6.7 | 13 | 6.1 | 5 | 0 | 2 | 2 | 35.25 | | 176.5 | 0.391156 | | 37.24835 | 7.570805 | 0 | 173.92483 | 52.10279 | 13 | 12 |
| RB | Chris Johnson | TEN | 16 | 262 | 1047 | 65.4 | 4 | 4 | 57 | 418 | 26.1 | 7.3 | 34 | 6.8 | 13 | 0 | 3 | 1 | 65.50 | | 168.5 | 0.777213 | | 55.46099 | 4.724041 | 0 | 172.8442456 | 51.0222 | 14 | 16 |
| RB | Fred Jackson | BUF | 10 | 170 | 934 | 93.4 | 5.5 | 6 | 39 | 442 | 44.2 | 11.3 | 49 | 12.8 | 13 | 0 | 2 | 2 | 28.33 | | 169.6 | 0.26023 | | 32.4672 | 5.236054 | 0 | 165.0163236 | 43.19428 | 15 | 14 |
| RB | Adrian Peterson | MIN | 12 | 208 | 970 | 80.8 | 4.7 | 12 | 18 | 139 | 11.6 | 7.7 | 22 | 7 | 5 | 1 | 1 | 0 | 17.33 | 18 | 188.9 | 0.071217 | 18.96808 | 25.84139 | 8.049102 | 0.948963 | 164.8883862 | 43.06634 | 16 | 7 |
| RB | Shonn Greene | NYJ | 16 | 253 | 1054 | 65.9 | 4.2 | 6 | 30 | 211 | 13.2 | 7 | 36 | 7.2 | 6 | 0 | 1 | 0 | 42.17 | | 162.5 | 0.509643 | | 41.96655 | 6.02861 | 0 | 162.6716623 | 40.84962 | 17 | 18 |
| RB | Beanie Wells | ARI | 14 | 245 | 1047 | 74.8 | 4.3 | 10 | 10 | 52 | 3.7 | 5.2 | 10 | 2.2 | 1 | 0 | 4 | 2 | 24.50 | | 165.9 | 0.187692 | | 29.92123 | 8.188167 | 0 | 155.0290026 | 33.20696 | 18 | 17 |
| RB | Willis McGahee | DEN | 15 | 249 | 1199 | 79.9 | 4.8 | 4 | 12 | 51 | 3.4 | 4.3 | 12 | 3.9 | 2 | 1 | 4 | 3 | 62.25 | 12 | 149 | 0.750944 | 14.89466 | 53.86313 | 4.622828 | 0.805658 | 151.5709151 | 29.74887 | 19 | 22 |
| RB | Rashard Mendenhall | PIT | 15 | 228 | 928 | 61.9 | 4.1 | 9 | 18 | 154 | 10.3 | 8.6 | 35 | 9.3 | 5 | 0 | 1 | 1 | 25.33 | | 160.2 | 0.203174 | | 30.46166 | 7.48482 | 0 | 151.1089178 | 29.28688 | 20 | 19 |
I actually used this as draft guidance (I selected Ray Rice with my first pick in a recent draft). Let’s see if it holds water!
Fantasy Football Player Forecasting in R
I rewrote my previous post in R, mainly because Paul Rubin pointed out on Twitter that SAS costs too much.
Click here to download the R source [estimate.r]
The R code is a bit shorter (120 lines instead of 157), but the two versions are of basically the same complexity. I am following exactly the same logic as last time so I won’t go into the football part of it. I will simply highlight a couple of things about the code.
I rely on read.csv to read each input file into a data frame. Once I have a data frame, the primary operation is to “score” the position. Here is the code to score the running back position. The “within” statement makes it easy.
score.rb <- function(rb) {
  rb <- within(rb, {
    FFPts <- (Rush.TD + Rec.TD) * PtsTD + FumL * PtsFum + Rush.Yds / RushYdsPt + Rec.Yds / RecYdsPt
  })
  return(rb)
}
Remember from the last post that the next thing I do is to “normalize” the scores by subtracting the score of the projected worst starter on each team. This means sorting the players by score and picking one in a particular slot. Here’s how that works:
normalize <- function(p, count) {
FFPtsMin <- as.numeric(p[with(p, order(-as.numeric(FFPts))), ][count,"FFPts"])
p$FFPtsN <- apply(p,1,function(row) (as.numeric(row["FFPts"]) - FFPtsMin))
return(p)
}
So by calling score.rb and normalize in sequence, I get final scores for the RB position. Once I score all of the positions, I simply need to join the data frames for each position together into a final data frame (called “players”). I write this out to CSV. I have saved the complete rankings so you can see the results.
Click here to download player ratings [NFL 2011 Ratings.csv]
It was a lot easier for me to write this code in SAS because it’s much more familiar to me. Even taking my prejudice into account, for this task I prefer SAS to R because the use of DATA steps leads to very simple, easy to understand code. R has great libraries, better extensibility, and is free, but the language itself feels a bit clunky.
Fantasy Football Player Forecasting in less than 200 lines of SAS
In my last post I provided data for NFL players and teams for the 2011 season. In this post I develop a simple, pretty darn decent forecasting engine in less than 200 lines of SAS.
Click here to download the SAS source [estimate.sas].
For the uninitiated: fantasy football involves a bunch of 30-something males selecting players from real NFL teams and competing against each other for increasingly high stakes. The score for a fantasy team is computed by applying a set of scoring rules to the real-life performance of each player during each week of the NFL season. For example, if touchdowns are valued at 6 points and throwing an interception is penalized 2 points, and Drew Brees throws 4 TDs and 2 INTs, his score for the week is 4 * 6 – 2 * 2 = 20. There are typically additional scoring rules that involve the number of yards gained by players, as well as the performances of kickers and defensive units based on more esoteric considerations. A fantasy football participant drafts a set of players (and defensive units) and selects a portion of them to “play” on his team each week. Typically you can play only a certain number of players of each position per week: for example 1 quarterback, 2 running backs, etc. Fantasy teams are matched against each other each week – the team with the highest combined team score wins.
So a smart fantasy football player tries to draft a combination of players that will result in the highest projected points per week. The forecasting engine described in this post computes a rating for each player that can be used to prioritize draft selection. The basic assumption behind the forecasting engine is that a player’s (or team’s) performance for the 2012 season will be exactly the same as in 2011. This is obviously incorrect:
- Players improve or decline in ability over time.
- Players suffer injuries.
- Rookies have no performance in 2011 since they didn’t play.
- and so on.
All of these things can be accounted for, but I won’t here. That makes things simpler: all we really want to do is apply the rules of the league to compute the number of fantasy points for each player. Let’s take running backs as an example. In my league, running backs accumulate points as follows:
- 1 point for every 10 rushing yards.
- 1 point for every 10 receiving yards.
- 6 points per touchdown.
- 2 points deducted per fumble.
So the first step is to read the running back data into a SAS dataset. Here’s a macro to do that:
** Read a CSV file into a SAS dataset. **;
%macro ReadCSV(position);
proc import datafile="C:\data\Football\NFL 2011 &position..csv" dbms=csv
out=&position replace;
getnames=yes;
run;
%mend;
The next step is to score each player. That’s easily done using a SAS data step:
** Compute RB ratings. **;
%macro ScoreRB;
%ReadCsv(RB);
data rb;
set rb;
FFPts = (Rush_TD + Rec_TD) * &PtsTD + FumL * &PtsFum + Rush_Yds / &RushYdsPt + Rec_Yds / &RecYdsPt;
run;
%mend;
Now the SAS table RB will have an additional column called FFPts that has the forecasted fantasy points for each player over the course of the season. I have introduced macro variables to represent, e.g. the number of points per touchdown. As you will see in the full code, you can customize those according to the rules for your league.
It’s pretty easy to write similar macros for quarterbacks, kickers, and so on. If you combined all of the resulting datasets and sorted them by FFPts, you’d have a “draft board” that could be used to select players. But this would stink. Why?
The reason is that simply sorting players by expected number of points does not take into account that when drafting players we also care about the variance between players of the same position. Here’s what I mean. By virtue of the scoring rules, quarterbacks usually score more fantasy points than tight ends on average. Consider a league where the average quarterback scores 400 points per year. Now suppose that tight ends score 200 points on average, but the best tight end in the league scores 280 (call him John Doe). Given the choice, it is smarter to draft John Doe over a quarterback that scores 400 because John will outscore his competition at that position by 80 points. 400 point QBs are easy to come by, but 280 point TEs are not.
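The arithmetic in that example can be sketched in a few lines of R. The numbers are the hypothetical ones from the paragraph above, and the position averages stand in for the comparison baseline purely for illustration.

```r
# Hypothetical league from the example: average score at each position
avg <- c(QB = 400, TE = 200)

# Two draft candidates: a typical quarterback and the best tight end
candidates <- data.frame(
  Name = c("Average QB", "John Doe"),
  Pos = c("QB", "TE"),
  FFPts = c(400, 280),
  stringsAsFactors = FALSE
)

# Advantage over a typical player at the same position
candidates$Edge <- candidates$FFPts - unname(avg[candidates$Pos])

# Sorting by Edge instead of FFPts puts John Doe first
print(candidates[order(-candidates$Edge), ])
```

Sorting by raw points would put the quarterback on top. Sorting by the edge over the position flips the order, which is the whole point.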
Therefore I “center” the scores for each position by finding the score for the “worst starter” for each position. In other words, if my league has 12 teams then I find the score of the 12th best quarterback. Then I subtract that value from the scores of all quarterbacks. I now have a “position invariant” metric that I can use to compare players across positions. Computing centered scores is very easy using PROC MEANS:
** Create cross-position value estimates by subtracting the value of the projected **;
** worst starter at that position. The number of league-wide starters for the **;
** position are given by obscount. This value will depend on your league. **;
%macro Normalize(position, obscount);
proc sort data=&position;
by descending FFPts;
run;
proc means data=&position.(obs=&obscount) min noprint;
var FFPts;
output out=&position._summ;
run;
data _null_;
set &position._summ;
if _STAT_='MIN';
call symput('FFPtsMin', FFPts);
run;
data &position;
length Pos $ 8;
set &position;
Pos = upcase("&Position");
FFPtsN = FFPts - &FFPtsMin;
run;
%mend;
We just need to call Normalize after we do the initial scoring. Again, here is the link to the full source.
Once this is done we can combine all of the results and sort. What we get is a perfectly plausible draft board! Here are the first 25 players with both “raw” and “centered” points. Run the code to get ratings for all 640 players and teams. Poor Billy Volek is at the bottom, through no fault of his own.
| Pos | Name | FFPts | FFPtsN |
| QB | Aaron Rodgers | 487.42 | 216.2388 |
| QB | Drew Brees | 449.6625 | 178.4813 |
| RB | Ray Rice | 292.8 | 173.8 |
| RB | LeSean McCoy | 280.4 | 161.4 |
| WR | Calvin Johnson | 262.1 | 146.5 |
| QB | Tom Brady | 416.5313 | 145.35 |
| TE | Rob Gronkowski | 240.9 | 145.3 |
| RB | Maurice Jones-Drew | 262 | 143 |
| RB | Arian Foster | 250.1 | 131.1 |
| QB | Matthew Stafford | 394.9875 | 123.8063 |
| QB | Cam Newton | 379.35 | 108.1688 |
| WR | Jordy Nelson | 216.3 | 100.7 |
| TE | Jimmy Graham | 195 | 99.4 |
| RB | Marshawn Lynch | 215.6 | 96.6 |
| WR | Wes Welker | 210.9 | 95.3 |
| RB | Michael Turner | 212.8 | 93.8 |
| WR | Victor Cruz | 205.6 | 90 |
| WR | Larry Fitzgerald | 189.1 | 73.5 |
| RB | Adrian Peterson | 188.9 | 69.9 |
| RB | Ryan Mathews | 186.6 | 67.6 |
| RB | Michael Bush | 185.5 | 66.5 |
| RB | Darren Sproles | 185.3 | 66.3 |
| RB | Steven Jackson | 181.8 | 62.8 |
| WR | Roddy White | 177.6 | 62 |
| WR | Steve Smith | 177.4 | 61.8 |
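As a quick sanity check on the centering, subtracting FFPtsN from FFPts should give back the same worst-starter score for every player at a position. Here is a small R sketch using two rows per position copied from the table above. The helper names (board, Baseline, spread) are mine, not from the original source files.

```r
# A few rows copied from the draft board above
board <- data.frame(
  Pos = c("QB", "QB", "RB", "RB", "WR", "WR", "TE", "TE"),
  Name = c("Aaron Rodgers", "Drew Brees", "Ray Rice", "LeSean McCoy",
           "Calvin Johnson", "Jordy Nelson", "Rob Gronkowski", "Jimmy Graham"),
  FFPts = c(487.42, 449.6625, 292.8, 280.4, 262.1, 216.3, 240.9, 195),
  FFPtsN = c(216.2388, 178.4813, 173.8, 161.4, 146.5, 100.7, 145.3, 99.4),
  stringsAsFactors = FALSE
)

# Implied worst-starter score for each row
board$Baseline <- board$FFPts - board$FFPtsN

# Spread of the implied baseline within each position (should be near zero)
spread <- tapply(board$Baseline, board$Pos, function(x) max(x) - min(x))

# Average implied baseline per position, rounded for display
print(round(tapply(board$Baseline, board$Pos, mean), 2))
print(spread)
```

For these rows the implied baselines come out to roughly 271.18 for quarterbacks, 119 for running backs, 115.6 for wide receivers, and 95.6 for tight ends, which is why a 240.9-point tight end ranks alongside a 416.5-point quarterback.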