In my last post I provided data for NFL players and teams for the 2011 season. In this post I develop a simple, pretty darn decent forecasting engine in less than 200 lines of SAS.
For the uninitiated: fantasy football involves a bunch of 30-something males selecting players from real NFL teams and competing against each other for increasingly high stakes. The score for a fantasy team is computed by applying a set of scoring rules to the real-life performance of each player during each week of the NFL season. For example, if touchdowns are worth 6 points and an interception costs 2 points, then a week in which Drew Brees throws 4 TDs and 2 INTs earns him 4 * 6 – 2 * 2 = 20 points. There are typically additional scoring rules that involve the number of yards gained by players, as well as the performances of kickers and defensive units based on more esoteric considerations. A fantasy football participant drafts a set of players (and defensive units) and selects a portion of them to “play” on his team each week. Typically you can play only a certain number of players at each position per week: for example 1 quarterback, 2 running backs, etc. Fantasy teams are matched against each other each week, and the team with the highest combined score wins.
So a smart fantasy football player tries to draft a combination of players that will result in the highest projected points per week. The forecasting engine described in this post computes a rating for each player that can be used to prioritize draft selection. The basic assumption behind the forecasting engine is that a player’s (or team’s) performance for the 2012 season will be exactly the same as in 2011. This is obviously incorrect:
- Players improve or decline in ability over time.
- Players suffer injuries.
- Rookies have no 2011 performance to draw on since they didn’t play.
- and so on.
All of these things can be accounted for, but I won’t here. That makes things simpler: all we really want to do is apply the rules of the league to compute the number of fantasy points for each player. Let’s take running backs as an example. In my league, running backs accumulate points as follows:
- 1 point for every 10 rushing yards.
- 1 point for every 10 receiving yards.
- 6 points per touchdown.
- 2 points deducted per fumble.
So the first step is to read the running back data into a SAS dataset. Here’s a macro to do that:

```sas
** Read a CSV file into a SAS dataset. **;
%macro ReadCSV(position);
  proc import datafile="C:\data\Football\NFL 2011 &position..csv"
    dbms=csv out=&position replace;
    getnames=yes;
  run;
%mend;
```
The next step is to score each player. That’s easily done using a SAS data step:
```sas
** Compute RB ratings. **;
%macro ScoreRB;
  %ReadCSV(RB);
  data rb;
    set rb;
    FFPts = (Rush_TD + Rec_TD) * &PtsTD + FumL * &PtsFum
          + Rush_Yds / &RushYdsPt + Rec_Yds / &RecYdsPt;
  run;
%mend;
```
Now the SAS table RB will have an additional column called FFPts that has the forecasted fantasy points for each player over the course of the season. I have introduced macro variables to represent, e.g. the number of points per touchdown. As you will see in the full code, you can customize those according to the rules for your league.
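As a rough sketch, here is how those macro variables might be set for the running back rules listed earlier. The values are my reading of those rules, and the negative sign on PtsFum is an assumption inferred from the fact that ScoreRB adds the fumble term rather than subtracting it.

```sas
** League scoring settings read by ScoreRB. Values assumed from the RB rules above. **;
%let PtsTD     = 6;    /* points per touchdown */
%let PtsFum    = -2;   /* points per fumble lost; negative because ScoreRB adds this term */
%let RushYdsPt = 10;   /* rushing yards needed for 1 point */
%let RecYdsPt  = 10;   /* receiving yards needed for 1 point */
```

With these settings, a hypothetical back with 1,000 rushing yards, 200 receiving yards, 9 total touchdowns and 2 lost fumbles would score 9 * 6 + 2 * (-2) + 1000 / 10 + 200 / 10 = 170 points.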
It’s pretty easy to write similar macros for quarterbacks, kickers, and so on. If you combined all of the resulting datasets and sorted them by FFPts, you’d have a “draft board” that could be used to select players. But this would stink. Why?
The reason is that simply sorting players by expected number of points does not take into account that when drafting players we also care about the spread between players at the same position. Here’s what I mean. By virtue of the scoring rules, quarterbacks usually score more fantasy points than tight ends on average. Consider a league where the average quarterback scores 400 points per year. Now suppose that tight ends score 200 points on average, but the best tight end in the league scores 280 (call him John Doe). Given the choice, it is smarter to draft John Doe over a quarterback who scores 400, because John will outscore his competition at that position by 80 points while the quarterback only matches his. 400-point QBs are easy to come by, but 280-point TEs are not.
Therefore I “center” the scores for each position by finding the score for the “worst starter” at each position. In other words, if my league has 12 teams and each starts one quarterback, then I find the score of the 12th best quarterback. Then I subtract that value from the scores of all quarterbacks. I now have a “position invariant” metric that I can use to compare players across positions. Computing centered scores is very easy using PROC MEANS:
```sas
** Create cross-position value estimates by subtracting the value of the projected **;
** worst starter at that position. The number of league-wide starters for the      **;
** position is given by obscount. This value will depend on your league.           **;
%macro Normalize(position, obscount);
  ** Rank players so the first obscount rows are the projected starters. **;
  proc sort data=&position;
    by descending FFPts;
  run;
  ** Summarize FFPts over the starters only. **;
  proc means data=&position.(obs=&obscount) min noprint;
    var FFPts;
    output out=&position._summ;
  run;
  ** Store the worst starter score in a macro variable. **;
  data _null_;
    set &position._summ;
    if _STAT_='MIN';
    call symput('FFPtsMin', FFPts);
  run;
  ** Tag each row with its position and compute the centered score. **;
  data &position;
    length Pos $ 8;
    set &position;
    Pos = upcase("&Position");
    FFPtsN = FFPts - &FFPtsMin;
  run;
%mend;
```
We just need to call Normalize after we do the initial scoring. Again, here is the link to the full source.
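For running backs, the call sequence might look like the sketch below. It assumes the scoring macro variables (&PtsTD and friends) are already defined, and the obscount of 24 is an assumption: 12 teams each starting 2 running backs. Use whatever matches your league.

```sas
** Score the running backs, then center against the 24th best back. **;
** The 24 is assumed: 12 teams times 2 starting RBs per team.        **;
%ScoreRB;
%Normalize(RB, 24);
```

After this runs, the RB table carries both FFPts (raw) and FFPtsN (centered). The 24th ranked back has FFPtsN equal to 0, and everyone below him is negative.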
Once this is done we can combine all of the results and sort. What we get is a perfectly plausible draft board! Here are the first 25 players with both “raw” and “centered” points. Run the code to get ratings for all 640 players and teams. Poor Billy Volek is at the bottom, through no fault of his own.
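The combine-and-sort step could be sketched roughly as follows. Only the rb table comes from the macros shown in this post; the other dataset names (qb, wr, te) and the output name draftboard are hypothetical and stand in for whatever positions you have scored and normalized. The full source may do this differently.

```sas
** Stack the per-position tables into one draft board.        **;
** qb, wr and te are hypothetical names for tables built with **;
** their own scoring macros and then passed to Normalize.     **;
data draftboard;
  set rb qb wr te;
run;

** Rank by centered points so positions are comparable. **;
proc sort data=draftboard;
  by descending FFPtsN;
run;

** Show the top 25 rows of the board. **;
proc print data=draftboard(obs=25);
run;
```

One thing to watch: columns that exist in only some of the position tables come through as missing for the others, and character columns take their length from the first table listed in the set statement.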