2012 Fantasy Football Prediction Model: Retrospective

Several months ago I laid out a simple model for forecasting fantasy football performance. My post included a table ranking players by value for draft purposes. Appropriately, in our league at Nielsen I used my own model to draft (selecting Ray Rice in the first round, followed by Victor Cruz and Wes Welker). I screwed up in a later round and accidentally selected Jason Witten instead of Matt Ryan, but other than that I strictly followed the advice of my blog: always choose the highest-ranked player from positions that still needed to be filled.

So how did I do? Well, I finished third out of twelve – not bad. Let’s take a retrospective look at my model to see how accurately I predicted the 2012 fantasy football season.

My previous post provides actual 2012 statistics for all positions. I scored each player and compared the results to my projections. Note that my projections did not include players who did not participate in the 2011 season (e.g. Jamaal Charles) or rookies (e.g. RG III), so they are excluded from the analysis. If I regard the difference between actual and projected fantasy points as “error”, then I can compute R^2 to summarize fit. I computed R^2 for two projections:

  • Project 2012 points as equal to 2011 points (“2011” in the table below)
  • Project 2012 points by adjusting 2011 points as described in my post (“Model”)
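
For concreteness, here is a minimal sketch of that R^2 computation, assuming a dataset called projections with one row per player and the (illustrative) variables Pos, Actual, and Projected — not the names from my actual code:

** Per-position mean of the actual scores, remerged onto each row. **;
proc sql;
  create table proj_means as
  select p.*, mean(Actual) as MeanActual
  from projections as p
  group by Pos;
quit;

** R^2 = 1 - SS(error) / SS(total), by position. **;
proc sql;
  create table r2_by_pos as
  select Pos,
         count(*) as N,
         1 - sum((Actual - Projected)**2) / sum((Actual - MeanActual)**2) as R2
  from proj_means
  group by Pos;
quit;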

The results are as follows:

Pos N 2011 Model
DEF 29 0.899414 0.899414
K 26 0.961511 0.961511
QB 31 0.909683 0.887465
RB 101 0.774406 0.797133
TE 78 0.81725 0.824893
WR 124 0.806109 0.810808

The overall R^2 for the model is around 0.852. I didn’t really dig into the details too much, but I think we can reasonably conclude that:

  • Kickers and Defense are pretty darn stable (probably because they largely depend on the strength of the team and the opposition, which generally does not change massively between seasons).
  • RB performance was the hardest to predict.
  • The model worked in the sense that most positions ended up with better R^2 than if I had simply used 2011 numbers.
  • The exception is the QB position, which perhaps implies that my assumptions about touchdown production do not hold for quarterbacks.

In the table below I compare the points per game projected by the model (Projected) with the actual fantasy points per game in 2012 (Actual) and the relative error, computed as |Actual − Projected| / Actual. I highlight a row green when the relative error is less than 10%, yellow when it is less than 25%, and red otherwise.
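
For example, Drew Brees was projected at 20.66 points per game and actually averaged 21.60, so his relative error is |21.60 − 20.66| / 21.60, or about 0.044, the value in the first row below.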

Name               Pos  Projected  Actual    Rel. Err.
Drew Brees         QB   20.65682   21.59875  0.043611
Aaron Rodgers      QB   22.32247   21.35625  0.045243
Tom Brady          QB   20.06681   21.2675   0.056456
Cam Newton         QB   19.92028   20.17875  0.012809
Adrian Peterson    RB   13.7407    19.0875   0.280121
Matt Ryan          QB   15.70105   19.05375  0.17596
Tony Romo          QB   15.42652   17.18875  0.102522
Matthew Stafford   QB   18.70318   17.08     0.095034
Ben Roethlisberger QB   14.58558   17.06154  0.145119
Arian Foster       RB   18.69446   16.38125  0.141211
Andy Dalton        QB   12.47327   15.6725   0.20413
Marshawn Lynch     RB   13.48645   15.4125   0.124967
Josh Freeman       QB   14.20246   15.40625  0.078137
Michael Vick       QB   21.37533   15.168    0.409239
Carson Palmer      QB   20.68852   14.688    0.408532
Eli Manning        QB   16.32711   14.5575   0.12156
Joe Flacco         QB   12.35432   14.555    0.151198
Kevin Kolb         QB   11.33343   14.12667  0.197728
Matt Schaub        QB   22.60787   13.96375  0.61904
Sam Bradford       QB   16.83711   13.92375  0.209236
Ray Rice           RB   17.53886   13.88125  0.263493
Calvin Johnson     WR   15.14139   13.775    0.099193
Brandon Marshall   WR   9.919735   13.55     0.267916
Ryan Fitzpatrick   QB   13.05769   13.35625  0.022354
C.J. Spiller       RB   7.19662    13.26875  0.457626
Philip Rivers      QB   14.7622    13.015    0.134245
Dez Bryant         WR   9.147546   13.0125   0.297019
Rob Gronkowski     TE   13.32919   13        0.025322
Stevan Ridley      RB   3.349849   12.4625   0.731206
A.J. Green         WR   9.735794   12.4375   0.217223
Demaryius Thomas   WR   6.994023   12.3375   0.433109
Jay Cutler         QB   20.61836   12.308    0.6752
Frank Gore         RB   10.8703    12.3      0.116236
Alex Smith         QB   12.57711   12.268    0.025197
LeSean McCoy       RB   16.10583   12.10833  0.330144
Christian Ponder   QB   14.64461   12.04375  0.215951
Jake Locker        QB   9.065129   12.01273  0.245373
Chicago Bears      DEF  8.4375     11.875    0.289474
Matt Forte         RB   14.61648   11.82667  0.235892

Touchdowns are lognormally distributed

…well, not exactly. But it’s snappier if I put it that way.

What I really mean is: the number of pass attempts (or receptions, or carries) per touchdown is lognormally distributed, and that fact can be used to produce more stable fantasy football forecasts.

Click here to download the SAS source [estimate2.sas]

In my last two posts, I laid out simple fantasy football forecasting engines in SAS and R. An important component of a fantasy football score is the number of touchdowns scored by each player. Touchdowns can vary considerably among players with otherwise similar performance. For example, let’s look at the top three running backs from my previous post:

Name Rush Rush_Yds Rush_Avg Rush_TD FFPts
Ray Rice 291 1364 4.7 12 292.8
LeSean McCoy 273 1309 4.8 17 280.4
Maurice Jones-Drew 343 1606 4.7 8 262

LeSean McCoy scored more than twice as many touchdowns as Maurice Jones-Drew and several more than Ray Rice, but the three otherwise have very similar stats. The gut instinct that drives this post is that I don’t think LeSean McCoy is going to score that many touchdowns this year!

How can I analyze touchdowns? I could simply draw a histogram of touchdowns per player, but that wouldn’t be very insightful. Players who get the ball more are more likely to score more touchdowns. So let’s control for that by normalizing against the number of rushing attempts each player makes and chart rushing attempts per touchdown. The histogram of rushing attempts per touchdown for the top 60 running backs in my 2011 dataset is interesting:

[Histogram of rushing attempts per touchdown for the top 60 running backs, with the fitted lognormal distribution overlaid]

To my eye, it looks lognormally distributed. It’s not perfect, but it looks like a very reasonable approximation. A lognormal distribution makes sense – we expect that the distribution would be “heavy tailed” because going towards the left (1 touchdown per rush) is much harder than going to the right. Nobody scores every time they get the ball. Here is the SAS code that produces the histogram and the best fitting lognormal distribution. (I’m not doing this in R because I don’t know how to fit distributions in that environment. I am sure it is easy to do.)

** Plot a histogram, and save the lognormal distribution parameters. **;
proc univariate data=rb(obs=60) noprint;
  var Rush_Per_TD;
  histogram / lognormal nendpoints=15 cfill=blue outhistogram=rb_hist;
  ods output ParameterEstimates=rb_fit;
run;

The options for the “histogram” statement specify the distribution type, chart style, and an output dataset for the bins (which I then copied over to the free Excel 2013 preview to make a less-crappy looking chart). The “ods output” statement is a fancy way to save the lognormal parameters into a dataset for later use.

I can understand why there is a wide variation of values. Off the top of my head:

  • Skill of the RB.
  • Skill of the offensive line that blocks for the RB.
  • How often the player gets carries near the goalline.
  • Some teams call more red zone rush plays than others.
  • Quality of opposition.
  • Luck.
  • Stuff like this. (This moment still burns…)

With these reasons in mind, I certainly don’t expect that all RBs will end up with the same rush/TD ratio in the long run. However, I think that players on either end of the distribution in 2011 are likely to be closer to the middle in 2012. Here’s what we can do: compute the cumulative distribution function (CDF) of the fitted lognormal distribution at each player’s rush/TD ratio. This is a number between 0 and 1 that indicates “how extreme” the player is: 0 means all the way on the left. For example, LeSean McCoy’s value is 0.0553 and Maurice Jones-Drew’s is 0.5208. This means that LeSean McCoy is an outlier (close to 0), and MJD is not (close to 1/2).

To project next year’s ratio, I take a weighted average of the player’s lognormal CDF value and the middle of the distribution (0.5), then map the blended value back to a rush/TD ratio through the inverse CDF (the quantile function). I somewhat arbitrarily chose to take 2/3 times the CDF and add 1/3 times 0.5. This means that while I believe players will regress toward the mean somewhat, I also believe there are significant structural differences between players that will persist from one season to the next.
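
To make this concrete with the LeSean McCoy example: his blended value is 2/3 * 0.0553 + 1/3 * 0.5, or about 0.204, and mapping 0.204 back through the quantile function of the fitted lognormal gives a projected ratio of roughly 25.3 carries per touchdown, versus the 16.1 he actually posted in 2011. Dividing his 273 carries by 25.3 projects about 10.8 touchdowns instead of 17. (These figures match the LogNormCdf, Rush_Per_TD_1, and Rush_TD_1 columns in the table further down.)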

Once I have the projected rush/TD figures, I can divide each player’s rushing attempts by the projected ratio to get a projected 2012 TD figure that I can use in fantasy scoring. If I take the rather large leap that touchdowns for all positions behave in this way, I can write a generic “normalizing” function that I can use for touchdowns at all positions.

** Recalibrate a variable with the assumption that it is lognormally distributed.   **;
** -- position: a dataset with player information. It should have a variable called **;
**              CalibrateVar.                                                       **;
** -- obscount: the number of observations to use for analysis.                     **;
** -- CalibrateVar: the variable under analysis.                                    **;
** The macro will create a new variable ending in _1 with the calibrated values.    **;
%macro Recalibrate(position, obscount, CalibrateVar);
  ** Sort the data by the initial score computed in my first post. **;
  proc sort data=&position;
    by descending FFPts0;
  run;

  ** Plot a histogram, and save the lognormal distribution parameters. **;
  proc univariate data=&position(obs=&obscount) noprint;
    var &CalibrateVar;
    histogram / lognormal nendpoints=15 cfill=blue outhistogram=&position._hist;
    ods output ParameterEstimates=&position._fit;
  run;

  ** Get the lognormal parameters into macro variables so I can use them for computation. **;
  data _null_;
    set &position._fit;
    if Parameter = 'Scale' then call symput('Scale', Estimate);
    if Parameter = 'Shape' then call symput('Shape', Estimate);
  run;

  ** Compute the projected values for each player using the distribution. **;
  data &position;
    set &position;
    LogNormCdf = cdf('LOGNORMAL', &CalibrateVar, &Scale, &Shape);
    &CalibrateVar._1 = quantile('LOGNORMAL', 0.67 * LogNormCdf + 0.33 * 0.5, &Scale, &Shape);
  run;

%mend;

A call to this macro looks like this:

%Recalibrate(rb, 60, Rush_Per_TD);

After this call I will have a variable called Rush_Per_TD_1 in my rb dataset.

I have modified the forecasting engine to recalibrate touchdowns for all positions – see estimate2.sas. You can see below how the rankings change when I recalibrate: here are the top 20 running backs. Players in green moved up in the ratings after recalibration; players in red moved down. Unsurprisingly, LeSean McCoy moved down.

Pos Name Team G Rush Rush_Yds Rush_YG Rush_Avg Rush_TD Rec Rec_Yds Rec_YG Rec_Avg Rec_Lng YAC Rec_1stD Rec_TD Fum FumL Rush_Per_TD Rec_Per_TD FFPts0 LogNormCdf Rec_Per_TD_1 Rush_Per_TD_1 Rush_TD_1 Rec_TD_1 FFPts FFPtsN Rank New Rank Old
RB Ray Rice BAL 16 291 1364 85.3 4.7 12 76 704 44 9.3 52 9.2 30 3 2 2 24.25 25.33333 292.8 0.183094 23.80672 29.76091 9.777928 3.192375 280.62182 158.7998 1 1
RB Maurice Jones-Drew JAC 16 343 1606 100.4 4.7 8 43 374 23.4 8.7 48 9.8 18 3 6 1 42.88 14.33333 262 0.520781 16.43331 42.43739 8.082496 2.616637 260.1947976 138.3728 2 3
RB Arian Foster HOU 13 278 1224 94.2 4.4 10 53 617 47.5 11.6 78 12.1 19 2 5 3 27.80 26.5 250.1 0.249994 24.51103 32.10521 8.65903 2.162292 243.0279329 121.2059 3 4
RB LeSean McCoy PHI 15 273 1309 87.3 4.8 17 48 315 21 6.6 26 8.8 18 3 1 1 16.06 16 280.4 0.05537 17.57991 25.27579 10.80085 2.730389 241.587431 119.7654 4 2
RB Michael Turner ATL 16 301 1340 83.8 4.5 11 17 168 10.5 9.9 32 8.8 8 0 3 2 27.36 212.8 0.241639 31.81053 9.462275 0 203.5736476 81.75161 5 6
RB Marshawn Lynch SEA 15 285 1204 80.3 4.2 12 28 212 14.1 7.6 26 8.1 8 1 3 2 23.75 28 215.6 0.173974 25.38375 29.44301 9.679716 1.103068 202.2967045 80.47466 6 5
RB Steven Jackson STL 15 260 1145 76.3 4.4 5 42 333 22.2 7.9 50 7.6 17 1 2 1 52.00 42 181.8 0.646438 31.61723 48.20039 5.394148 1.32839 186.1352231 64.31318 7 11
RB Ryan Mathews SDG 14 222 1091 77.9 4.9 6 50 455 32.5 9.1 42 9.3 18 0 5 2 37.00 186.6 0.422678 38.45808 5.77252 0 185.2351183 63.41308 8 8
RB Michael Bush OAK 16 256 977 61.1 3.8 7 37 418 26.1 11.3 55 9.4 14 1 1 1 36.57 37 185.5 0.415045 29.7855 38.16249 6.708157 1.242215 185.2022349 63.38019 9 9
RB Darren Sproles NOR 16 87 603 37.7 6.9 2 86 710 44.4 8.3 39 8.4 35 7 0 0 43.50 12.28571 185.3 0.530443 15.07624 42.85039 2.03032 5.70434 177.7079564 55.88591 10 10
RB Reggie Bush MIA 15 216 1086 72.4 5 6 43 296 19.7 6.9 34 7.6 12 1 4 2 36.00 43 176.2 0.404778 31.93509 37.76765 5.719181 1.346481 176.5939721 54.77193 11 13
RB Matt Forte CHI 12 203 997 83.1 4.9 3 52 490 40.8 9.4 56 8.8 19 1 2 2 67.67 52 168.7 0.793148 34.17389 56.47246 3.594673 1.521629 175.397812 53.57577 12 15
RB Frank Gore SFO 16 282 1211 75.7 4.3 8 17 114 7.1 6.7 13 6.1 5 0 2 2 35.25 176.5 0.391156 37.24835 7.570805 0 173.92483 52.10279 13 12
RB Chris Johnson TEN 16 262 1047 65.4 4 4 57 418 26.1 7.3 34 6.8 13 0 3 1 65.50 168.5 0.777213 55.46099 4.724041 0 172.8442456 51.0222 14 16
RB Fred Jackson BUF 10 170 934 93.4 5.5 6 39 442 44.2 11.3 49 12.8 13 0 2 2 28.33 169.6 0.26023 32.4672 5.236054 0 165.0163236 43.19428 15 14
RB Adrian Peterson MIN 12 208 970 80.8 4.7 12 18 139 11.6 7.7 22 7 5 1 1 0 17.33 18 188.9 0.071217 18.96808 25.84139 8.049102 0.948963 164.8883862 43.06634 16 7
RB Shonn Greene NYJ 16 253 1054 65.9 4.2 6 30 211 13.2 7 36 7.2 6 0 1 0 42.17 162.5 0.509643 41.96655 6.02861 0 162.6716623 40.84962 17 18
RB Beanie Wells ARI 14 245 1047 74.8 4.3 10 10 52 3.7 5.2 10 2.2 1 0 4 2 24.50 165.9 0.187692 29.92123 8.188167 0 155.0290026 33.20696 18 17
RB Willis McGahee DEN 15 249 1199 79.9 4.8 4 12 51 3.4 4.3 12 3.9 2 1 4 3 62.25 12 149 0.750944 14.89466 53.86313 4.622828 0.805658 151.5709151 29.74887 19 22
RB Rashard Mendenhall PIT 15 228 928 61.9 4.1 9 18 154 10.3 8.6 35 9.3 5 0 1 1 25.33 160.2 0.203174 30.46166 7.48482 0 151.1089178 29.28688 20 19

I actually used this as draft guidance (I selected Ray Rice with my first pick in a recent draft). Let’s see if it holds water!

Fantasy Football Player Forecasting in less than 200 lines of SAS

In my last post I provided data for NFL players and teams for the 2011 season. In this post I develop a simple, pretty darn decent forecasting engine in less than 200 lines of SAS.

Click here to download the SAS source [estimate.sas].

For the uninitiated: fantasy football involves a bunch of 30-something males selecting players from real NFL teams and competing against each other for increasingly high stakes. The score for a fantasy team is computed by applying a set of scoring rules to the real-life performance of each player during each week of the NFL season. For example, if touchdowns are worth 6 points and each interception costs 2 points, then a week in which Drew Brees throws 4 TDs and 2 INTs scores 4 * 6 - 2 * 2 = 20. There are typically additional scoring rules that involve the number of yards gained by players, as well as the performances of kickers and defensive units based on more esoteric considerations.

A fantasy football participant drafts a set of players (and defensive units) and selects a portion of them to “play” on his team each week. Typically you can play only a certain number of players of each position per week: for example 1 quarterback, 2 running backs, etc. Fantasy teams are matched against each other each week; the team with the highest combined team score wins.

So a smart fantasy football player tries to draft a combination of players that will result in the highest projected points per week. The forecasting engine described in this post computes a rating for each player that can be used to prioritize draft selection. The basic assumption behind the forecasting engine is that a player’s (or team’s) performance for the 2012 season will be exactly the same as in 2011. This is obviously incorrect:

  • Players improve or decline in ability over time.
  • Players suffer injuries.
  • Rookies have no performance in 2011 since they didn’t play.
  • and so on.

All of these things can be accounted for, but I won’t here. That makes things simpler: all we really want to do is apply the rules of the league to compute the number of fantasy points for each player. Let’s take running backs as an example. In my league, running backs accumulate points as follows:

  • 1 point for every 10 rushing yards.
  • 1 point for every 10 receiving yards.
  • 6 points per touchdown.
  • 2 points deducted per fumble lost.

So the first step is to read the running back data into a SAS dataset. Here’s a macro to do that:
** Read a CSV file into a SAS dataset.       **;
%macro ReadCSV(position);
  proc import datafile="C:\data\Football\NFL 2011 &position..csv" dbms=csv
    out=&position replace;
    getnames=yes;
  run;
%mend;

The next step is to score each player. That’s easily done using a SAS data step:

** Compute RB ratings. **;
%macro ScoreRB;
  %ReadCsv(RB);
  data rb;
    set rb;
    FFPts = (Rush_TD + Rec_TD) * &PtsTD + FumL * &PtsFum + Rush_Yds / &RushYdsPt + Rec_Yds / &RecYdsPt;
  run;
%mend;

Now the SAS table RB will have an additional column called FFPts that has the forecasted fantasy points for each player over the course of the season. I have introduced macro variables to represent, e.g. the number of points per touchdown. As you will see in the full code, you can customize those according to the rules for your league.
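
As a quick sanity check of the formula: using Ray Rice’s 2011 totals (12 rushing and 3 receiving touchdowns, 1364 rushing yards, 704 receiving yards, and 2 fumbles lost), the score works out to (12 + 3) * 6 - 2 * 2 + 1364/10 + 704/10 = 292.8 fantasy points, which matches his entry in the draft board below.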

It’s pretty easy to write similar macros for quarterbacks, kickers, and so on. If you combined all of the resulting datasets and sorted them by FFPts, you’d have a “draft board” that could be used to select players. But this would stink. Why?

The reason is that simply sorting players by expected number of points does not take into account that when drafting players we also care about the variance between players of the same position. Here’s what I mean. By virtue of the scoring rules, quarterbacks usually score more fantasy points than tight ends on average. Consider a league where the average quarterback scores 400 points per year. Now suppose that tight ends score 200 points on average, but the best tight end in the league scores 280 (call him John Doe). Given the choice, it is smarter to draft John Doe over a quarterback that scores 400 because John will outscore his competition at that position by 80 points. 400 point QBs are easy to come by, but 280 point TEs are not.

Therefore I “center” the scores for each position by finding the score of the “worst starter” at each position. In other words, if my league has 12 teams then I find the score of the 12th best quarterback. Then I subtract that value from the scores of all quarterbacks. I now have a “position invariant” metric that I can use to compare players across positions. Computing centered scores is very easy using PROC MEANS:

** Create cross-position value estimates by subtracting the value of the projected  **;
** worst starter at that position. The number of league-wide starters for the       **;
** position are given by obscount. This value will depend on your league.           **;
%macro Normalize(position, obscount);
  proc sort data=&position;
    by descending FFPts;
  run;

  proc means data=&position.(obs=&obscount) min noprint;
    var FFPts;
    output out=&position._summ;
  run;

  data _null_;
    set &position._summ;
    if _STAT_='MIN';
    call symput('FFPtsMin', FFPts);
  run;

  data &position;
    length Pos $ 8;
    set &position;
    Pos = upcase("&Position");
    FFPtsN = FFPts - &FFPtsMin;
  run;
%mend;

We just need to call Normalize after we do the initial scoring. Again, here is the link to the full source.

Once this is done then we can combine all of the results and sort. What we get is a perfectly plausible draft board! Here are the first 25 players with both “raw” and “centered” points. Run the code to get ratings for all 640 players and teams. Poor Billy Volek is at the bottom, through no fault of his own.

Pos Name FFPts FFPtsN
QB Aaron Rodgers 487.42 216.2388
QB Drew Brees 449.6625 178.4813
RB Ray Rice 292.8 173.8
RB LeSean McCoy 280.4 161.4
WR Calvin Johnson 262.1 146.5
QB Tom Brady 416.5313 145.35
TE Rob Gronkowski 240.9 145.3
RB Maurice Jones-Drew 262 143
RB Arian Foster 250.1 131.1
QB Matthew Stafford 394.9875 123.8063
QB Cam Newton 379.35 108.1688
WR Jordy Nelson 216.3 100.7
TE Jimmy Graham 195 99.4
RB Marshawn Lynch 215.6 96.6
WR Wes Welker 210.9 95.3
RB Michael Turner 212.8 93.8
WR Victor Cruz 205.6 90
WR Larry Fitzgerald 189.1 73.5
RB Adrian Peterson 188.9 69.9
RB Ryan Mathews 186.6 67.6
RB Michael Bush 185.5 66.5
RB Darren Sproles 185.3 66.3
RB Steven Jackson 181.8 62.8
WR Roddy White 177.6 62
WR Steve Smith 177.4 61.8
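
A quick check of the centering arithmetic: Aaron Rodgers’ 487.42 raw points minus his 216.24 centered points implies a worst-starter value of roughly 271.2 for quarterbacks, and Drew Brees’ row gives the same figure (449.66 - 178.48). Every quarterback in the table is offset by that same amount.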

Simple SAS Performance Profiling Macros

As the old saying goes, the first rule of performance is to measure. My team is trying to tune the performance of a rather large SAS application, but it hasn’t been as easy as I thought it would be because profiling is too hard. Here’s what I understand about what SAS offers in the way of performance profiling (my focus is on CPU time):

First: SAS prints CPU time information to the SAS Log. For example:

17         data hello_world;
18           set temp.sampledata;
19         run;

NOTE: There were 42390 observations read from the data set TEMP.SAMPLEDATA.
NOTE: The data set WORK.HELLO_WORLD has 42390 observations and 37 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds

This is simply too low-level to be useful because I am usually interested in understanding the performance of larger components rather than a single step. (Although once a problem has been narrowed down, it is very helpful to have this information produced automatically.)

Second: SAS also provides a LOGPARSE utility that parses logs to retrieve this information and put it in a more structured format. This is a bit better, but it still does not address the problem of understanding performance at a macro (component) level. I suppose this is due in part to the nature of SAS programming, which does not reduce to a call tree in the same way as many procedural languages do. It also does not address the very real problem of wanting to profile SAS Macro execution time, as well as the time spent by external processes (such as callouts to system utilities or non-SAS components).

Third: for SCL (SAS Component Language) programs there is the SCL Performance Analyzer, which is helpful, but only if you happen to be writing SCL – which is not usually the case. Also the output of the analyzer is a little less than what I am used to.

I am not dissing SAS – I just think I am spoiled. I am not a SAS developer by background. The past decade or so I have been doing my primary development on the .Net Framework using various versions of Visual Studio. Several editions of Visual Studio 2012 include a performance analysis feature that makes it dead simple to profile .Net code. You can access it from the Debug menu:

[Screenshot: the performance analysis option in the Visual Studio Debug menu]

A couple of clicks through a wizard, run your program, and you get all sorts of amazing information about the performance characteristics of the program:

[Screenshot: the performance report produced after running the program]

It even highlights which lines took the most time:

[Screenshot: line-level highlighting of the most expensive lines]

It could hardly be much easier. Perhaps you can understand why I was so disappointed with SAS’s built-in tools.

I decided to be the change I want to see in the world and write a couple of simple, self-contained macros that I could use to profile SAS code. I am not a SAS expert (yet) so it’s likely that there are things that could be improved. Please help me out by posting your comments!

I wrote three macros: one to initialize the timing infrastructure, one to start timing a labeled code block, and one to stop. I have a few guidelines:

  • The macros are self-contained and small.
  • If profiling is disabled, the overhead is as close to zero as possible.
  • If profiling is enabled and the code blocks that are being timed are nontrivial, the overhead is small.
  • Arbitrary blocks of SAS/SAS Macro can be timed.
  • Timers can be nested.
  • Timing blocks can be given sensible names.
  • The results are reported in a SAS data set.

Here is an example of how to use the library. It times a small block of SAS named “hello”:

%timer_init(Y);
** … **;

%timer_start(hello);
data hello_world;
  set temp.sampledata;
run;
%timer_stop(hello);

The results are stored in a dataset called “_profile”, with one column per timed label holding the accumulated elapsed time in seconds.
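
Since timers can be nested, a usage sketch might look like this (the proc means step is just a stand-in for real work); the outer “overall” timer wraps both inner blocks, and each label becomes its own column in _profile:

%timer_init(Y);

%timer_start(overall);

%timer_start(load);
data hello_world;
  set temp.sampledata;
run;
%timer_stop(load);

%timer_start(summarize);
** A hypothetical second block of work to time. **;
proc means data=hello_world noprint;
  output out=hello_summary;
run;
%timer_stop(summarize);

%timer_stop(overall);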

Here is the source for the macros. “timer_init” takes a yes/no argument. If profiling is enabled, it creates a dataset called _tick. _tick holds the start times for all of the active timers.

************************************************************;
** timer_init(profilingEnabled): initialize timers.       **;
**   profilingEnabled: Y means profiling is on.           **;
************************************************************;
%macro timer_init(profilingEnabled);
  %global timerEnabled;
  %let timerEnabled=&profilingEnabled;
  %if &timerEnabled=Y %then %do;
  data _tick;
    root = datetime();
  run;
  %end;
%mend;

“timer_start” starts a new timer with the specified name. We rely on the datetime() method, which determines the timer resolution. Starting a new timer just means sticking a new variable in _tick.

************************************************************;
** timer_start(desc): start timing a code block.          **;
**   desc: a description of the block.                    **;
************************************************************;

%macro timer_start(label);
  %if &timerEnabled=Y %then %do;
  data _tick;
    set _tick;
    &label = datetime();
  run;
  %end;
%mend;

“timer_stop” stops a timer. It’s slightly more complicated than you might think because I want to allow for the possibility of the same block being timed more than once, for example if it appears inside a SAS macro %do loop. I suspect a better SAS programmer could simplify this code.

************************************************************;
** timer_stop(desc): stop timing a code block.            **;
**   desc: a description of the block.                    **;
************************************************************;

%macro timer_stop(label);
  %if &timerEnabled=Y %then %do;
  data _tick;
    set _tick;
    _now = datetime();
    _duration = _now - &label;
    &label = _now;
    call symputx('_duration', _duration);
    drop _now _duration;
  run;

  %if %sysfunc(exist(_profile)) %then %do;
    data _profile;
      set _profile;
      &label = sum(&label,0) + input(&_duration, best20.);
    run;
  %end;
  %else %do;
    data _profile;
      &label = input(&_duration, best20.);
    run;
  %end;
  %end;
%mend;

Enjoy!

NCAA Tournament Wrapup

Hail to the champions, the University of Kentucky. I only got one of the four Final Four picks correct, but luckily that turned out to be the champion. My SAS-generated picks turned out to be in the 71st percentile of all those who participated in the Yahoo NCAA challenge. Not bad.

Careful readers may have noticed that my code had a somewhat serious bug in it: the exponent in the “pythagorean” formula that I relied upon was supposed to be 11.5, but I had 1.5 in my code. Oh well!

NCAA tournament prediction using SAS

I don’t watch as much basketball as I used to, so the past few years I have been writing programs to do my picks. Two years ago I did very well, and last year was a complete disaster. This year since I am in a new role at a new company, I chose a new language and a different approach.

I gave myself two hours to write SAS code to generate picks. This was something of a challenge because I am an inexperienced SAS programmer! As in past years I am using simple season-level statistics for all of the teams in the tournament. I obtained the following data:

  • Win-loss records for each team.
  • RPI rating.
  • Strength of schedule.
  • Points scored per game.
  • Points allowed per game.
  • Number of games.

This year’s twist is that I am also incorporating the “wisdom of the crowds” by using the percentage that each team was selected to win by all of the entries on ESPN.com.

For each round, I compute a score for each team. The team with the higher score advances. Here’s how I arrive at a score:

  • Compute the “adjusted points per game” and “adjusted points allowed per game” by scaling points for and against by the average points per game in all NCAA games this year. This means that if a team scores on average 50 and allows 45, their adjusted PPG will be something closer to 75-67.
  • Using adjusted PPG, apply the “pythagorean formula” with an exponent of 11.5 (see the worked example after this list).
  • Add this to the RPI. I call this the “base score”. Everything up until this point is similar to my entry from two years ago.
  • When computing the “score” for a team in a round, if the team was judged to be in the top K of brackets on ESPN.com then we multiply the score by 100. We start K at 16 for the first round (with 64 teams) and decrease it by round. This means that if the base score for a team A is lower than its opponent B’s, but the public strongly believes that A will advance, we will ignore the base score and choose A.
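
To make the pythagorean step concrete: a team whose adjusted scoring works out to 75 points for and 67 against per game gets 75^11.5 / (75^11.5 + 67^11.5), which is roughly 0.79, so stronger scoring margins push the base score toward 1.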

Once the tournament starts you should be able to view my picks here: http://tournament.fantasysports.yahoo.com/t1/2239853

Unfortunately, these picks are INCREDIBLY BORING: all of the #1 and #2 seeds advance to the Elite Eight. I am somewhat disappointed it turned out that way, but that’s the way the code worked out. Final Four is Kentucky, Michigan State, Syracuse, North Carolina with Kentucky beating Syracuse in the final. A couple of other observations:

  • Most of the code is data preparation. The raw data files referenced in the code are screen scraped from espn.com and ncaa.com. You can download them below – rename the extensions to csv to use. I did a little bit of massaging of the data files, but I tried to do most of that in SAS.
  • I intentionally tried to avoid macro language. I didn’t worry about trying to factor the code nicely (as you can see by the repeated “team canonicalization” code).
  • I found the code to read in the more unstructured data files a bit tedious to write. I could have coded it much faster in C#.
  • I probably should have left out the bit about RPI. That would have made the picks more interesting. RPI is already a prediction model so we essentially have a “model of a model”.
  • I wanted to add a bit more code at the end that parses bracket.csv (which is a list of all the teams with opponents grouped next to each other as in the bracket) and computes the winners for each round. This is not that hard to do (I don’t think), but I ran out of time. A SAS person could whip this up easily, but as a C/C++/C# guy I still tend to think about this procedurally (and would want to just write a few for-loops).
  • PROC MEANS is awesome.
  • In order to continue to make this interesting I will need to consider per-game data rather than per-season data. Unfortunately that takes more time and I generally don’t have a lot of that to spare.

Here’s the code:

%macro DoPicks;

  data espnpicksraw;
    array r{6} $ 50;
    infile 'c:\temp\ncaa\espnpicks.csv' dlm=',';
    input r{1} r{2} r{3} r{4} r{5} r{6};
  run;

  data espnpicks;
    length Pct 8 Round 3;
    set espnpicksraw;
    array r{6} r1-r6;
    do Round = 1 to 6;
      * example: "1 Kentucky - 99.4%" *;
      *Seed = scan(r{i}, 1, ' ');
      Pct = compress(scan(r{Round}, -1, ' '), '%');
      first = find(r{Round}, ' ', 1);
      last = find(r{Round}, '-', -length(r{Round}));
      Team = substr(r{Round}, first, last - first);
      Team = compress(strip(tranwrd(tranwrd(upcase(Team), "SAINT", "ST"), "STATE", "ST")), '.');
      output;
    end;
    keep Team Round Pct;
  run;

  proc sort data=espnpicks;
    by Team Round;
  run;

  proc transpose data=espnpicks out=espnscores(drop=_Name_) prefix=Round;
    by Team;
    var Pct;
    id Round;
  run;

  %local r;
  %do r = 1 %to 6;
    proc sort data=espnscores;
      by descending Round&r.;
    run;

    data espnscores;
      set espnscores;
      Rank&r. = _N_;
    run;
  %end;

  data espnscores;
    retain Team Round1-Round6 Rank1-Rank6;
    set espnscores;
  run;

  proc sort data=espnscores;
    by Team;
  run;

  data rpi;
    infile 'c:\temp\ncaa\rpi.csv' dlm=',' dsd;
    length Rank 3 Team $ 30 Wins 4 Losses 4 RPI 8;
    length SOS 8 SOS_Projected 8 SOS_Rank 8 SOS_Projected_Rank 8 Conference $ 25 Schedule $25;
    input Rank Team Wins Losses RPI SOS SOS_Projected SOS_Rank SOS_Projected_Rank Conference Schedule;
    Team = compress(strip(tranwrd(tranwrd(upcase(Team), "SAINT", "ST"), "STATE", "ST")), '.');
    keep Team Wins Losses RPI;
  run;

  data points_off;
    infile 'c:\temp\ncaa\points_off.csv' dlm=',' dsd;
    length Rank 3 Team $ 30 Games 4 WinsLosses $12 PointsOff 8 PPGOff 8;
    input Rank Team Games WinsLosses PointsOff PPGOff;
    Team = compress(strip(tranwrd(tranwrd(upcase(Team), "SAINT", "ST"), "STATE", "ST")), '.');
    keep Team Games PointsOff PPGOff;
  run;

  data points_def;
    infile 'c:\temp\ncaa\points_def.csv' dlm=',' dsd;
    length Rank 3 Team $ 30 Games 4 WinsLosses $12 PointsDef 8 PPGDef 8;
    input Rank Team Games WinsLosses PointsDef PPGDef;
    Team = compress(strip(tranwrd(tranwrd(upcase(Team), "SAINT", "ST"), "STATE", "ST")), '.');
    keep Team PointsDef PPGDef;
  run;

  data bracket;
    infile 'c:\temp\ncaa\bracket.csv';
    input Line & $50.;
    first = find(Line, ' ', 1);
    last = find(Line, '(', first);
    Team = substr(Line, first, last - first);
    Team = compress(strip(tranwrd(tranwrd(upcase(Team), "SAINT", "ST"), "STATE", "ST")), '.');
    ID = _N_;
    keep ID Team;
  run;

  *****;
  proc sort data=rpi;
    by Team;
  run;

  proc sort data=points_off;
    by Team;
  run;

  proc sort data=points_def;
    by Team;
  run;

  proc sort data=bracket;
    by Team;
  run;

  * Get data for the 68 tournament teams. *;
  data team_stats;
    merge bracket rpi points_off points_def espnscores (in=in_espn);
    by Team;
    if in_espn;
    if ID;
  run;

  * Find the average number of points per game over all teams. *;
  proc means noprint data=team_stats n mean max min range std sum;
    var PointsOff Games;
    output out=ncaa_points(keep=_Points _Games)
      sum(PointsOff Games)= _Points _Games;
  run;

  * The average is in _PPG. I use the value 72.8 below. *;
  data ncaa_points;
    set ncaa_points;
    _PPG = _Points/_Games;
  run;

  data team_stats;
    set team_stats;
    array s{6} s1-s6;
    array r{6} Rank1-Rank6;
    PointsTotal = PointsOff + PointsDef;
    * Scale points for and against based on NCAA average. *;
    AdjPointsOff = (72.8 * 2 / PointsTotal) * PointsOff;
    AdjPointsDef = (72.8 * 2 / PointsTotal) * PointsDef;
    Pythag = AdjPointsOff ** 1.5 / (AdjPointsOff ** 1.5 + AdjPointsDef ** 1.5);
    * The base score. *;
    Score = Pythag + RPI;
    do Round = 1 to 6;
      * If this team was very popular on ESPN.com, jack up the score. *;
      if r{Round} <= max(2**(5-Round),4) then s{Round} = 100 * Score;
        else s{Round} = Score;
    end;
  run;

  proc sort data=team_stats;
    by id;
  run;

%mend;

%DoPicks;
