Touchdowns are lognormally distributed

…well, not exactly. But it’s snappier if I put it that way.

What I really mean is: the number of pass attempts (or receptions, or carries) per touchdown is lognormally distributed, and that fact can be used to produce more stable fantasy football forecasts.

Click here to download the SAS source [estimate2.sas]

In my last two posts, I laid out simple fantasy football forecasting engines in SAS and R. An important component of a fantasy football score is the number of touchdowns scored by each player. Touchdowns can vary considerably among players with otherwise similar performance. For example, let’s look at the top three running backs from my previous post:

Name	Rush	Rush_Yds	Rush_Avg	Rush_TD	FFPts
Ray Rice	291	1364	4.7	12	292.8
LeSean McCoy	273	1309	4.8	17	280.4
Maurice Jones-Drew	343	1606	4.7	8	262

LeSean McCoy scored more than twice as many touchdowns as Maurice Jones-Drew, and several more than Ray Rice, yet the three otherwise have very similar stats. The gut instinct that drives this post is that I don’t think LeSean McCoy is going to score that many touchdowns this year!

How can I analyze touchdowns? I could simply draw a histogram of touchdowns per player, but that wouldn’t be very insightful. Players who get the ball more are more likely to score more touchdowns. So let’s control for that by relating touchdowns to the number of rushing attempts each player makes: let’s chart rushing attempts per touchdown, the inverse of the touchdown rate. The histogram of rushing attempts per touchdown for the top 60 running backs in my 2011 dataset is interesting:

To my eye, it looks lognormally distributed. It’s not perfect, but it looks like a very reasonable approximation. A lognormal distribution makes sense – we expect that the distribution would be “heavy tailed” because going towards the left (1 touchdown per rush) is much harder than going to the right. Nobody scores every time they get the ball. Here is the SAS code that produces the histogram and the best fitting lognormal distribution. (I’m not doing this in R because I don’t know how to fit distributions in that environment. I am sure it is easy to do.)

** Plot a histogram, and save the lognormal distribution parameters. **;
proc univariate data=rb(obs=60) noprint;
  var Rush_Per_TD;
  histogram / lognormal nendpoints=15 cfill=blue outhistogram=rb_hist;
  ods output ParameterEstimates=rb_fit;
run;

The options for the “histogram” statement specify the distribution type, chart style, and an output dataset for the bins (which I then copied over to the free Excel 2013 preview to make a less-crappy looking chart). The “ods output” statement is a fancy way to save the lognormal parameters into a dataset for later use.
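The histogram step assumes the rb dataset already contains Rush_Per_TD. As a minimal sketch of how that ratio could be derived (the dataset name rb_demo and the three hard-coded rows are only for illustration; the real engine reads a full stats file), something like this would do:

```sas
** Hypothetical input: Rush and Rush_TD for three players from this post. **;
data rb_demo;
  length Name $ 24;
  infile datalines dsd dlm=',';
  input Name $ Rush Rush_TD;
  ** Leave the ratio missing when a player has no touchdowns. **;
  if Rush_TD > 0 then Rush_Per_TD = Rush / Rush_TD;
  datalines;
Ray Rice,291,12
LeSean McCoy,273,17
Maurice Jones-Drew,343,8
;
run;

proc print data=rb_demo noobs;
run;
```

Leaving the ratio missing for players with zero touchdowns avoids a division by zero, and missing values are simply skipped when the distribution is fitted.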

I can understand why there is a wide variation of values. Off the top of my head:

Skill of the RB.
Skill of the offensive line that blocks for the RB.
How often the player gets carries near the goalline.
Some teams call more red zone rush plays than others.
Quality of opposition.
Luck.
Stuff like this. (This moment still burns…)

With these reasons in mind, I certainly don’t expect that all RBs will end up with the same rush/TD ratio in the long run. However, I think that players on the ends of the distribution (either way) in 2011 are likely to be closer to the middle in 2012. Here’s what we can do: compute the cumulative distribution function (CDF) of the fitted lognormal distribution at each player’s rush/TD ratio. This is a number between 0 and 1 that indicates “how extreme” the player is – 0 means all the way on the left. For example, LeSean McCoy is 0.0553 and Maurice Jones-Drew is 0.5208. This means that LeSean McCoy is an outlier (close to 0), and MJD is not (close to 1/2).
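To make the CDF step concrete, here is a small sketch. The Scale and Shape values are illustrative placeholders that I picked by hand, not the estimates from the fit, so the numbers it prints will not match the ones quoted above. It also computes the same quantity by hand to show what the lognormal CDF is doing: take the log of the ratio, standardize it, and look it up in the standard normal CDF.

```sas
** Scale and Shape below are illustrative placeholders, not the fitted estimates. **;
%let Scale = 3.7;
%let Shape = 0.6;

data cdf_demo;
  input Rush_Per_TD;
  ** Built-in lognormal CDF. **;
  LogNormCdf = cdf('LOGNORMAL', Rush_Per_TD, &Scale, &Shape);
  ** The same value written out via the standard normal CDF. **;
  ByHand = probnorm((log(Rush_Per_TD) - &Scale) / &Shape);
  datalines;
16.06
24.25
42.88
;
run;

proc print data=cdf_demo noobs;
run;
```

The two computed columns should agree, and smaller ratios (more frequent touchdowns) should land closer to 0.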

To project next year’s ratio, I take a weighted average of the player’s lognormal CDF value and the middle of the distribution (0.5), and then convert the result back into a rush/TD ratio with the inverse CDF (the quantile function). I somewhat arbitrarily chose to take 2/3 times the CDF and add 1/3 times 0.5 (rounded to 0.67 and 0.33 in the code). This means that while I believe that players will regress to the mean somewhat, I also believe that there are significant structural differences between players that will persist from one season to the next.
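The blending step can be sketched in isolation as follows. As before, Scale and Shape are placeholder values rather than fitted estimates, and the variable names Weight and Blended are mine, introduced only for this illustration.

```sas
data shrink_demo;
  ** Placeholder parameters and an exact 2/3 weight, assumed for illustration. **;
  Scale = 3.7;
  Shape = 0.6;
  Weight = 2/3;
  input Rush_Per_TD;
  ** Where the player sits in the fitted distribution. **;
  LogNormCdf = cdf('LOGNORMAL', Rush_Per_TD, Scale, Shape);
  ** Pull that position part of the way toward the median. **;
  Blended = Weight * LogNormCdf + (1 - Weight) * 0.5;
  ** Map the blended position back to a rush/TD ratio. **;
  Rush_Per_TD_1 = quantile('LOGNORMAL', Blended, Scale, Shape);
  datalines;
16.06
24.25
42.88
;
run;

proc print data=shrink_demo noobs;
run;
```

A player sitting exactly at the median (CDF of 0.5) is left unchanged, while everyone else moves one third of the way toward the median in CDF terms. Lowering Weight would regress players more aggressively.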

Once I have the projected rush/TD figures, I can divide each player’s rushes by his projected rush/TD ratio and get a projected 2012 TD figure that I can use in fantasy scoring. If I take the rather large leap that touchdowns for all positions behave in this way, I can write a generic “normalizing” function that I can use for touchdowns at all positions.
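Because the ratio is rushes per touchdown, the projected touchdown count is rushes divided by the projected ratio. Here is a short sketch of that arithmetic, using the Rush and Rush_Per_TD_1 values for three players from the results table later in this post (the dataset name td_demo is just for this example):

```sas
** Rush and Rush_Per_TD_1 are copied from three rows of the results table. **;
data td_demo;
  length Name $ 24;
  infile datalines dsd;
  input Name $ Rush Rush_Per_TD_1;
  ** Attempts divided by attempts per touchdown gives touchdowns. **;
  if Rush_Per_TD_1 > 0 then Rush_TD_1 = Rush / Rush_Per_TD_1;
  datalines;
Ray Rice,291,29.76091
LeSean McCoy,273,25.27579
Maurice Jones-Drew,343,42.43739
;
run;

proc print data=td_demo noobs;
run;
```

This gives roughly 9.78, 10.80 and 8.08 projected rushing touchdowns, in line with the Rush_TD_1 column of the table. Note that it assumes each player gets the same number of carries as in 2011.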

** Recalibrate a variable with the assumption that it is lognormally distributed.   **;
** -- position: a dataset with player information. It should have a variable called **;
**              CalibrateVar.                                                       **;
** -- obscount: the number of observations to use for analysis.                     **;
** -- CalibrateVar: the variable under analysis.                                    **;
** The macro will create a new variable ending in _1 with the calibrated values.    **;
%macro Recalibrate(position, obscount, CalibrateVar);
  ** Sort the data by the initial score computed in my first post. **;
  proc sort data=&position;
    by descending FFPts0;
  run;

  ** Plot a histogram, and save the lognormal distribution parameters. **;
  proc univariate data=&position(obs=&obscount) noprint;
    var &CalibrateVar;
    histogram / lognormal nendpoints=15 cfill=blue outhistogram=&position._hist;
    ods output ParameterEstimates=&position._fit;
  run;

  ** Get the lognormal parameters into macro variables so I can use them for computation. **;
  data _null_;
    set &position._fit;
    if Parameter = 'Scale' then call symput('Scale', Estimate);
    if Parameter = 'Shape' then call symput('Shape', Estimate);
  run;

  ** Compute the projected values for each player using the distribution. **;
  data &position;
    set &position;
    LogNormCdf = cdf('LOGNORMAL', &CalibrateVar, &Scale, &Shape);
    &CalibrateVar._1 = quantile('LOGNORMAL', 0.67 * LogNormCdf + 0.33 * 0.5, &Scale, &Shape);
  run;

%mend;

A call to this macro looks like this:

%Recalibrate(rb, 60, Rush_Per_TD);

After this call I will have a variable called Rush_Per_TD_1 in my rb dataset.

I have modified the forecasting engine to recalibrate touchdowns for all positions – see estimate2.sas. You can see below how the rankings change when I recalibrate: here are the top 20 running backs. In the original chart, players in green moved up in the ratings after recalibration and players in red moved down; here, compare the Rank New and Rank Old columns. Unsurprisingly, LeSean McCoy moved down.

Pos	Name	Team	G	Rush	Rush_Yds	Rush_YG	Rush_Avg	Rush_TD	Rec	Rec_Yds	Rec_YG	Rec_Avg	Rec_Lng	YAC	Rec_1stD	Rec_TD	Fum	FumL	Rush_Per_TD	Rec_Per_TD	FFPts0	LogNormCdf	Rec_Per_TD_1	Rush_Per_TD_1	Rush_TD_1	Rec_TD_1	FFPts	FFPtsN	Rank New	Rank Old
RB	Ray Rice	BAL	16	291	1364	85.3	4.7	12	76	704	44	9.3	52	9.2	30	3	2	2	24.25	25.33333	292.8	0.183094	23.80672	29.76091	9.777928	3.192375	280.62182	158.7998	1	1
RB	Maurice Jones-Drew	JAC	16	343	1606	100.4	4.7	8	43	374	23.4	8.7	48	9.8	18	3	6	1	42.88	14.33333	262	0.520781	16.43331	42.43739	8.082496	2.616637	260.1947976	138.3728	2	3
RB	Arian Foster	HOU	13	278	1224	94.2	4.4	10	53	617	47.5	11.6	78	12.1	19	2	5	3	27.80	26.5	250.1	0.249994	24.51103	32.10521	8.65903	2.162292	243.0279329	121.2059	3	4
RB	LeSean McCoy	PHI	15	273	1309	87.3	4.8	17	48	315	21	6.6	26	8.8	18	3	1	1	16.06	16	280.4	0.05537	17.57991	25.27579	10.80085	2.730389	241.587431	119.7654	4	2
RB	Michael Turner	ATL	16	301	1340	83.8	4.5	11	17	168	10.5	9.9	32	8.8	8	0	3	2	27.36		212.8	0.241639		31.81053	9.462275	0	203.5736476	81.75161	5	6
RB	Marshawn Lynch	SEA	15	285	1204	80.3	4.2	12	28	212	14.1	7.6	26	8.1	8	1	3	2	23.75	28	215.6	0.173974	25.38375	29.44301	9.679716	1.103068	202.2967045	80.47466	6	5
RB	Steven Jackson	STL	15	260	1145	76.3	4.4	5	42	333	22.2	7.9	50	7.6	17	1	2	1	52.00	42	181.8	0.646438	31.61723	48.20039	5.394148	1.32839	186.1352231	64.31318	7	11
RB	Ryan Mathews	SDG	14	222	1091	77.9	4.9	6	50	455	32.5	9.1	42	9.3	18	0	5	2	37.00		186.6	0.422678		38.45808	5.77252	0	185.2351183	63.41308	8	8
RB	Michael Bush	OAK	16	256	977	61.1	3.8	7	37	418	26.1	11.3	55	9.4	14	1	1	1	36.57	37	185.5	0.415045	29.7855	38.16249	6.708157	1.242215	185.2022349	63.38019	9	9
RB	Darren Sproles	NOR	16	87	603	37.7	6.9	2	86	710	44.4	8.3	39	8.4	35	7	0	0	43.50	12.28571	185.3	0.530443	15.07624	42.85039	2.03032	5.70434	177.7079564	55.88591	10	10
RB	Reggie Bush	MIA	15	216	1086	72.4	5	6	43	296	19.7	6.9	34	7.6	12	1	4	2	36.00	43	176.2	0.404778	31.93509	37.76765	5.719181	1.346481	176.5939721	54.77193	11	13
RB	Matt Forte	CHI	12	203	997	83.1	4.9	3	52	490	40.8	9.4	56	8.8	19	1	2	2	67.67	52	168.7	0.793148	34.17389	56.47246	3.594673	1.521629	175.397812	53.57577	12	15
RB	Frank Gore	SFO	16	282	1211	75.7	4.3	8	17	114	7.1	6.7	13	6.1	5	0	2	2	35.25		176.5	0.391156		37.24835	7.570805	0	173.92483	52.10279	13	12
RB	Chris Johnson	TEN	16	262	1047	65.4	4	4	57	418	26.1	7.3	34	6.8	13	0	3	1	65.50		168.5	0.777213		55.46099	4.724041	0	172.8442456	51.0222	14	16
RB	Fred Jackson	BUF	10	170	934	93.4	5.5	6	39	442	44.2	11.3	49	12.8	13	0	2	2	28.33		169.6	0.26023		32.4672	5.236054	0	165.0163236	43.19428	15	14
RB	Adrian Peterson	MIN	12	208	970	80.8	4.7	12	18	139	11.6	7.7	22	7	5	1	1	0	17.33	18	188.9	0.071217	18.96808	25.84139	8.049102	0.948963	164.8883862	43.06634	16	7
RB	Shonn Greene	NYJ	16	253	1054	65.9	4.2	6	30	211	13.2	7	36	7.2	6	0	1	0	42.17		162.5	0.509643		41.96655	6.02861	0	162.6716623	40.84962	17	18
RB	Beanie Wells	ARI	14	245	1047	74.8	4.3	10	10	52	3.7	5.2	10	2.2	1	0	4	2	24.50		165.9	0.187692		29.92123	8.188167	0	155.0290026	33.20696	18	17
RB	Willis McGahee	DEN	15	249	1199	79.9	4.8	4	12	51	3.4	4.3	12	3.9	2	1	4	3	62.25	12	149	0.750944	14.89466	53.86313	4.622828	0.805658	151.5709151	29.74887	19	22
RB	Rashard Mendenhall	PIT	15	228	928	61.9	4.1	9	18	154	10.3	8.6	35	9.3	5	0	1	1	25.33		160.2	0.203174		30.46166	7.48482	0	151.1089178	29.28688	20	19

I actually used this as draft guidance (I selected Ray Rice with my first pick in a recent draft). Let’s see if it holds water!

Author: natebrix

Follow me on twitter at @natebrix.

4 thoughts on “Touchdowns are lognormally distributed”

Roark says:

11 October 2012 at 10:31 pm

Great post! I am just confused about one part though. Where does the “average of the player’s binomial CDF and the middle” come from? I guess I missed that point…

And also, I would be glad if you could state the prior and posterior distributions in that case. I believe prior distribution was log-normal. Then the observations are distributed in such a way that the posterior of rush/td ratio is a scaled log-normal ( with X being log-normal – posterior dist is 2/3 *X + 0.5 ). Is this correct?

Again, great post and explanation..

1. natebrix says:
  
  15 October 2012 at 1:00 am
  
  Hi Roark,
  
  The “average of the player’s binomial CDF and the middle” I pulled out of my…hat. I thought it was a reasonable way to give some credit for a player’s distinctive qualities, but not too much.
  
  Correct regarding the second point.
  
  Nate
  
Andreas says:

18 February 2014 at 3:43 pm

Thanks Nate for making my day!
I might have missed something, but before reading your post I couldn’t find a way to get the parameter estimates into a dataset without doing a dummy regression or even purchasing additional components.
It’s still strange to take a detour through ODS, but I don’t care anymore 😉

Pingback: Fantasy Football Ratings 2014 | Nathan Brixius
