Touchdowns are lognormally distributed

…well, not exactly. But it’s snappier if I put it that way.

What I really mean is: the number of pass attempts (or receptions, or carries) per touchdown is lognormally distributed, and that fact can be used to produce more stable fantasy football forecasts.

Click here to download the SAS source [estimate2.sas]

In my last two posts, I laid out simple fantasy football forecasting engines in SAS and R. An important component of a fantasy football score is the number of touchdowns scored by each player. Touchdowns can vary considerably among players with otherwise similar performance. For example, let’s look at the top three running backs from my previous post:

Name                Rush  Rush_Yds  Rush_Avg  Rush_TD  FFPts
Ray Rice             291      1364       4.7       12  292.8
LeSean McCoy         273      1309       4.8       17  280.4
Maurice Jones-Drew   343      1606       4.7        8    262

LeSean McCoy scored more than twice as many touchdowns as Maurice Jones-Drew, and five more than Ray Rice, yet the three have otherwise very similar stats. The gut instinct that drives this post is that I don’t think LeSean McCoy is going to score that many touchdowns this year!

How can I analyze touchdowns? I could simply draw a histogram of touchdowns per player, but that wouldn’t be very insightful. Players who get the ball more are more likely to score more touchdowns. So let’s control for that by relating touchdowns to the number of rushing attempts each player makes: let’s chart rushing attempts per touchdown (the inverse of the touchdown rate). The histogram of rushing attempts per touchdown for the top 60 running backs in my 2011 dataset is interesting:

[Image: histogram of rushing attempts per touchdown for the top 60 running backs in 2011]

To my eye, it looks lognormally distributed. It’s not perfect, but it looks like a very reasonable approximation. A lognormal distribution makes sense – we expect that the distribution would be “heavy tailed” because going towards the left (1 touchdown per rush) is much harder than going to the right. Nobody scores every time they get the ball. Here is the SAS code that produces the histogram and the best fitting lognormal distribution. (I’m not doing this in R because I don’t know how to fit distributions in that environment. I am sure it is easy to do.)

** Plot a histogram, and save the lognormal distribution parameters. **;
proc univariate data=rb(obs=60) noprint;
  var Rush_Per_TD;
  histogram / lognormal nendpoints=15 cfill=blue outhistogram=rb_hist;
  ods output ParameterEstimates=rb_fit;
run;

The options for the “histogram” statement specify the distribution type, chart style, and an output dataset for the bins (which I then copied over to the free Excel 2013 preview to make a less-crappy looking chart). The “ods output” statement is a fancy way to save the lognormal parameters into a dataset for later use.
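For readers without SAS, here is a minimal Python sketch of the same kind of fit. It assumes a two-parameter lognormal (threshold fixed at 0) estimated by maximum likelihood, which I believe corresponds to the “Scale” and “Shape” rows SAS reports, but treat that correspondence as an assumption. The function name fit_lognormal is made up for illustration. The sample is the Rush_Per_TD column for the 20 running backs in the results table further down, not the 60 used in the post, so the estimates will not match the post’s fit.

```python
import math

# Rush_Per_TD for the 20 running backs shown in this post's results table.
# The post fits the top 60, so these estimates are only illustrative.
rush_per_td = [
    24.25, 42.88, 27.80, 16.06, 27.36, 23.75, 52.00, 37.00, 36.57, 43.50,
    36.00, 67.67, 35.25, 65.50, 28.33, 17.33, 42.17, 24.50, 62.25, 25.33,
]


def fit_lognormal(values):
    """Maximum-likelihood fit of a lognormal with threshold 0.

    Returns (scale, shape): the mean and the standard deviation
    (n in the denominator) of the natural logs of the values.
    """
    logs = [math.log(v) for v in values]
    n = len(logs)
    scale = sum(logs) / n
    shape = math.sqrt(sum((x - scale) ** 2 for x in logs) / n)
    return scale, shape


scale, shape = fit_lognormal(rush_per_td)
print(f"scale (mean of logs) = {scale:.3f}, shape (sd of logs) = {shape:.3f}")
```

Taking logs first is the whole trick: if the ratio is lognormal, its log is normal, so the fit reduces to a mean and a standard deviation.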

I can understand why there is a wide variation of values. Off the top of my head:

  • Skill of the RB.
  • Skill of the offensive line that blocks for the RB.
  • How often the player gets carries near the goal line.
  • Some teams call more red zone rush plays than others.
  • Quality of opposition.
  • Luck.
  • Stuff like this. (This moment still burns…)

With these reasons in mind, I certainly don’t expect that all RBs will end up with the same rush/TD ratio in the long run. However, I think that players on the ends of the distribution (either way) in 2011 are likely to be closer to the middle in 2012. Here’s what we can do: compute the cumulative distribution function (CDF) of the fitted lognormal distribution at each player’s rush/TD ratio. This is a number between 0 and 1 that indicates “how extreme” the player is: 0 means all the way on the left, 1 all the way on the right. For example, LeSean McCoy is 0.0553 and Maurice Jones-Drew is 0.5208. This means that LeSean McCoy is an outlier (close to 0), and MJD is not (close to 1/2).
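The CDF step can be sketched in Python as below. The lognormal CDF is just the normal CDF applied to the log of the ratio. SCALE and SHAPE here are hypothetical values that I picked so the output lands close to the two figures quoted above; they are not the fitted estimates from the post, which are not printed anywhere in it.

```python
import math
from statistics import NormalDist

# Hypothetical parameters, chosen only so the results land near the CDF
# values quoted in the post. They are NOT the author's fitted estimates.
SCALE = 3.73  # mean of log(rushes per TD)
SHAPE = 0.60  # standard deviation of log(rushes per TD)


def lognormal_cdf(x, scale, shape):
    """P(X <= x) for a lognormal X: the normal CDF evaluated at log(x)."""
    return NormalDist(mu=scale, sigma=shape).cdf(math.log(x))


# Rushes and rushing touchdowns from the table at the top of the post.
for name, rushes, tds in [("LeSean McCoy", 273, 17), ("Maurice Jones-Drew", 343, 8)]:
    ratio = rushes / tds
    value = lognormal_cdf(ratio, SCALE, SHAPE)
    print(f"{name}: {ratio:.2f} rushes/TD, CDF = {value:.3f}")
```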

To project next year’s ratio, I take a weighted average of the player’s lognormal CDF value and the middle of the distribution (0.5), then convert that blended value back into a rush/TD ratio using the quantile (inverse CDF) function of the fitted distribution. I somewhat arbitrarily chose to take 2/3 times the CDF and add 1/3 times 0.5 (rounded to 0.67 and 0.33 in the code). This means that while I believe that players will regress to the mean somewhat, I also believe that there are significant structural differences between players that will persist from one season to the next.
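As a worked example of the blending step, here is a small Python sketch. The helper names are invented for illustration, and SCALE and SHAPE are the same kind of hypothetical stand-ins for the fitted parameters, so the projected ratio is approximate. The only number taken from the post is LeSean McCoy’s CDF value of 0.0553.

```python
import math
from statistics import NormalDist

# Hypothetical lognormal parameters (not the author's fitted values).
SCALE = 3.73
SHAPE = 0.60


def shrink_toward_median(cdf_value, weight=0.67):
    """Blend a player's CDF value with the median (0.5)."""
    return weight * cdf_value + (1 - weight) * 0.5


def lognormal_quantile(p, scale, shape):
    """Inverse lognormal CDF: the ratio below which a fraction p of players fall."""
    return math.exp(scale + shape * NormalDist().inv_cdf(p))


mccoy_cdf = 0.0553  # quoted in the post
blended = shrink_toward_median(mccoy_cdf)
projected_ratio = lognormal_quantile(blended, SCALE, SHAPE)

print(f"blended CDF value: {blended:.4f}")
print(f"projected rushes per TD: {projected_ratio:.1f}")
```

The blended value sits between the player’s own CDF value and 0.5, so an extreme player like McCoy is projected to need noticeably more rushes per touchdown than his 16.06 in 2011, while a player already near the median barely moves.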

Once I have the projected rush/TD figures, I can divide each player’s rushes by them to get a projected 2012 TD figure that I can use in fantasy scoring. If I take the rather large leap that touchdowns for all positions behave in this way, I can write a generic “normalizing” function that I can use for touchdowns at all positions.

** Recalibrate a variable with the assumption that it is lognormally distributed.   **;
** -- position: a dataset with player information. It should have a variable called **;
**              CalibrateVar.                                                       **;
** -- obscount: the number of observations to use for analysis.                     **;
** -- CalibrateVar: the variable under analysis.                                    **;
** The macro will create a new variable ending in _1 with the calibrated values.    **;
%macro Recalibrate(position, obscount, CalibrateVar);
  ** Sort the data by the initial score computed in my first post. **;
  proc sort data=&position;
    by descending FFPts0;
  run;

  ** Plot a histogram, and save the lognormal distribution parameters. **;
  proc univariate data=&position(obs=&obscount) noprint;
    var &CalibrateVar;
    histogram / lognormal nendpoints=15 cfill=blue outhistogram=&position._hist;
    ods output ParameterEstimates=&position._fit;
  run;

  ** Get the lognormal parameters into macro variables so I can use them for computation. **;
  data _null_;
    set &position._fit;
    if Parameter = 'Scale' then call symput('Scale', Estimate);
    if Parameter = 'Shape' then call symput('Shape', Estimate);
  run;

  ** Compute the projected values for each player using the distribution. **;
  data &position;
    set &position;
    LogNormCdf = cdf('LOGNORMAL', &CalibrateVar, &Scale, &Shape);
    &CalibrateVar._1 = quantile('LOGNORMAL', 0.67 * LogNormCdf + 0.33 * 0.5, &Scale, &Shape);
  run;

%mend;

A call to this macro looks like this:

%Recalibrate(rb, 60, Rush_Per_TD);

After this call I will have a variable called Rush_Per_TD_1 in my rb dataset.
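To turn the recalibrated ratio back into touchdowns, divide a player’s carries by it. The Python sketch below uses the Rush and Rush_Per_TD_1 values from the results table in this post and assumes, as that table appears to, that each player gets the same number of carries as in 2011. The function name projected_tds is made up for illustration.

```python
# Rush and Rush_Per_TD_1 values copied from the results table in this post.
players = {
    "Ray Rice": {"rush": 291, "rush_per_td_1": 29.76091},
    "LeSean McCoy": {"rush": 273, "rush_per_td_1": 25.27579},
    "Maurice Jones-Drew": {"rush": 343, "rush_per_td_1": 42.43739},
}


def projected_tds(rushes, rush_per_td):
    """Projected touchdowns = carries / (carries per touchdown)."""
    return rushes / rush_per_td


projections = {
    name: projected_tds(p["rush"], p["rush_per_td_1"]) for name, p in players.items()
}

for name, tds in projections.items():
    print(f"{name}: {tds:.2f} projected rushing TDs")
```

These agree with the Rush_TD_1 column of the table to rounding.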

I have modified the forecasting engine to recalibrate touchdowns for all positions – see estimate2.sas. You can see below how the rankings change when I recalibrate: here are the top 20 running backs. Players in green moved up in the ratings after recalibration; players in red moved down. Unsurprisingly, LeSean McCoy moved down.

Pos Name Team G Rush Rush_Yds Rush_YG Rush_Avg Rush_TD Rec Rec_Yds Rec_YG Rec_Avg Rec_Lng YAC Rec_1stD Rec_TD Fum FumL Rush_Per_TD Rec_Per_TD FFPts0 LogNormCdf Rec_Per_TD_1 Rush_Per_TD_1 Rush_TD_1 Rec_TD_1 FFPts FFPtsN Rank_New Rank_Old
RB Ray Rice BAL 16 291 1364 85.3 4.7 12 76 704 44 9.3 52 9.2 30 3 2 2 24.25 25.33333 292.8 0.183094 23.80672 29.76091 9.777928 3.192375 280.62182 158.7998 1 1
RB Maurice Jones-Drew JAC 16 343 1606 100.4 4.7 8 43 374 23.4 8.7 48 9.8 18 3 6 1 42.88 14.33333 262 0.520781 16.43331 42.43739 8.082496 2.616637 260.1947976 138.3728 2 3
RB Arian Foster HOU 13 278 1224 94.2 4.4 10 53 617 47.5 11.6 78 12.1 19 2 5 3 27.80 26.5 250.1 0.249994 24.51103 32.10521 8.65903 2.162292 243.0279329 121.2059 3 4
RB LeSean McCoy PHI 15 273 1309 87.3 4.8 17 48 315 21 6.6 26 8.8 18 3 1 1 16.06 16 280.4 0.05537 17.57991 25.27579 10.80085 2.730389 241.587431 119.7654 4 2
RB Michael Turner ATL 16 301 1340 83.8 4.5 11 17 168 10.5 9.9 32 8.8 8 0 3 2 27.36 . 212.8 0.241639 . 31.81053 9.462275 0 203.5736476 81.75161 5 6
RB Marshawn Lynch SEA 15 285 1204 80.3 4.2 12 28 212 14.1 7.6 26 8.1 8 1 3 2 23.75 28 215.6 0.173974 25.38375 29.44301 9.679716 1.103068 202.2967045 80.47466 6 5
RB Steven Jackson STL 15 260 1145 76.3 4.4 5 42 333 22.2 7.9 50 7.6 17 1 2 1 52.00 42 181.8 0.646438 31.61723 48.20039 5.394148 1.32839 186.1352231 64.31318 7 11
RB Ryan Mathews SDG 14 222 1091 77.9 4.9 6 50 455 32.5 9.1 42 9.3 18 0 5 2 37.00 . 186.6 0.422678 . 38.45808 5.77252 0 185.2351183 63.41308 8 8
RB Michael Bush OAK 16 256 977 61.1 3.8 7 37 418 26.1 11.3 55 9.4 14 1 1 1 36.57 37 185.5 0.415045 29.7855 38.16249 6.708157 1.242215 185.2022349 63.38019 9 9
RB Darren Sproles NOR 16 87 603 37.7 6.9 2 86 710 44.4 8.3 39 8.4 35 7 0 0 43.50 12.28571 185.3 0.530443 15.07624 42.85039 2.03032 5.70434 177.7079564 55.88591 10 10
RB Reggie Bush MIA 15 216 1086 72.4 5 6 43 296 19.7 6.9 34 7.6 12 1 4 2 36.00 43 176.2 0.404778 31.93509 37.76765 5.719181 1.346481 176.5939721 54.77193 11 13
RB Matt Forte CHI 12 203 997 83.1 4.9 3 52 490 40.8 9.4 56 8.8 19 1 2 2 67.67 52 168.7 0.793148 34.17389 56.47246 3.594673 1.521629 175.397812 53.57577 12 15
RB Frank Gore SFO 16 282 1211 75.7 4.3 8 17 114 7.1 6.7 13 6.1 5 0 2 2 35.25 . 176.5 0.391156 . 37.24835 7.570805 0 173.92483 52.10279 13 12
RB Chris Johnson TEN 16 262 1047 65.4 4 4 57 418 26.1 7.3 34 6.8 13 0 3 1 65.50 . 168.5 0.777213 . 55.46099 4.724041 0 172.8442456 51.0222 14 16
RB Fred Jackson BUF 10 170 934 93.4 5.5 6 39 442 44.2 11.3 49 12.8 13 0 2 2 28.33 . 169.6 0.26023 . 32.4672 5.236054 0 165.0163236 43.19428 15 14
RB Adrian Peterson MIN 12 208 970 80.8 4.7 12 18 139 11.6 7.7 22 7 5 1 1 0 17.33 18 188.9 0.071217 18.96808 25.84139 8.049102 0.948963 164.8883862 43.06634 16 7
RB Shonn Greene NYJ 16 253 1054 65.9 4.2 6 30 211 13.2 7 36 7.2 6 0 1 0 42.17 . 162.5 0.509643 . 41.96655 6.02861 0 162.6716623 40.84962 17 18
RB Beanie Wells ARI 14 245 1047 74.8 4.3 10 10 52 3.7 5.2 10 2.2 1 0 4 2 24.50 . 165.9 0.187692 . 29.92123 8.188167 0 155.0290026 33.20696 18 17
RB Willis McGahee DEN 15 249 1199 79.9 4.8 4 12 51 3.4 4.3 12 3.9 2 1 4 3 62.25 12 149 0.750944 14.89466 53.86313 4.622828 0.805658 151.5709151 29.74887 19 22
RB Rashard Mendenhall PIT 15 228 928 61.9 4.1 9 18 154 10.3 8.6 35 9.3 5 0 1 1 25.33 . 160.2 0.203174 . 30.46166 7.48482 0 151.1089178 29.28688 20 19

A “.” marks a missing value: players with no receiving touchdowns have no Rec_Per_TD or Rec_Per_TD_1.

I actually used this as draft guidance (I selected Ray Rice with my first pick in a recent draft). Let’s see if it holds water!

Author: natebrix

Follow me on twitter at @natebrix.

4 thoughts on “Touchdowns are lognormally distributed”

  1. Great post! I am just confused about one part though. Where does “average of the player’s binomial CDF and the middle ” binomial cdf come from? I guess I missed that point…

    And also, I would be glad if you could state the prior and posterior distributions in that case. I believe prior distribution was log-normal. Then the observations are distributed in such a way that the posterior of rush/td ratio is a scaled log-normal ( with X being log-normal – posterior dist is 2/3 *X + 0.5 ). Is this correct?

    Again, great post and explanation..

    1. Hi Roark,

      The “average of the player’s binomial CDF and the middle” I pulled out of my…hat. I thought it was a reasonable way to give some credit for a player’s distinctive qualities, but not too much.

      Correct regarding the second point.

      Nate

  2. Thanks Nate for making my day!
    I might have missed something, but before reading your post I couldn’t find a way to get the parameter estimates into a dataset without doing a dummy regression or even purchasing additional components.
    It’s still strange to take a detour through ODS, but I don’t care anymore😉
