Hole Hawg: OR’s PR Problem

Analytics is an anonymously collaborative discipline. Raw materials come from blogs, MOOCs, online tutorials, GitHub, Stack Overflow. They come from research papers on paid repositories through university accounts, or from free sites like optimization-online.org. They come from pip and install.packages and NuGet. It used to be different, of course. I remember combing the stacks in MacLean Hall, Pappajohn, Argonne, tracing citations with red pen, turning nickels into gold on the Xerox. I curated hard copies of good papers for years, long after the need had faded – perhaps a story for another time. Those days were good, but now the center of gravity of analytics lies in the digital world.

This shift in how we share and learn data science came to mind during a recent episode involving a longtime friend of this blog, the Traveling Salesman Problem (TSP). Randy Olson, a computer science graduate research assistant at Michigan State, blogged about producing optimal shortest-path tours between interesting USA landmarks. On his site you can see interactive graphics that display the landmarks and the routes considered by his approach. Before I continue, I want to be clear: Randy’s contributions are admirable. He provided full code for his solution, carefully explained the problem, and connected problem to solution with visuals and context. It’s a great example of analytics storytelling combined with repeatable processes, and it happens to be fun and engaging. Olson’s post and subsequent work have attracted considerable attention not only from the broader analytics community but from general-interest outlets as well, for example the New York Times and the Washington Post. It is not often that one is hailed as a genius without buying or selling something. High praise!

I shall now take a deep breath and address the operations research (aka optimization) community. We are generous folk, but this sticks in our collective craw, doesn’t it? We know we should place Dr. Evil-style “air quotes” around “optimal” in the previous paragraph, because the approach does not necessarily produce the best possible tour. What’s more, Olson says:

If you read my Where’s Waldo article, you’re already aware of how difficult it can be to solve route optimization problems like this one. With 50 landmarks to put in order, we would have to exhaustively evaluate 3 x 10^64 possible routes to find the shortest one.

The operations research community knows the last statement is not the case (geeks: this is not to deny TSP’s computational complexity, just that enumeration is not mandatory for provably optimal solutions). This was old news back when I was listening to Depeche Mode on my Discman, and when you said “tableau” you were talking about simplex. The Where’s Waldo link leads to another, similar fun example. In both cases, Olson used a hand-coded genetic algorithm to find short tours between points. Operations researchers know that the USA Landmark problem is a classic TSP, and there are a number of highly tuned approaches for solving really big instances, the best of which is Concorde. Bill Cook, TSP master, explains the operations research approach behind TSP and applies it to the Olson problem here. You can read Cook’s TSP book “In Pursuit of the Traveling Salesman” for more. The point is that you can get what is indisputably, provably, the best tour in fractions of a second using Concorde, even on a phone. In other words, Olson’s TSP approach is not guaranteed to find the best possible answer, and it is not even faster than the exact alternative.
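As a tiny illustration that provable optimality does not require brute-force enumeration, here is the textbook Held-Karp dynamic program in plain Python, with a made-up four-city distance matrix. This is emphatically not Concorde, which uses far more sophisticated branch-and-cut machinery to handle instances with tens of thousands of cities; it is just a sketch of the principle.

```python
from itertools import combinations

def held_karp(dist):
    """Provably optimal TSP tour via Held-Karp dynamic programming.

    Runs in O(n^2 * 2^n) time -- hopeless for 50 cities, but vastly
    smaller than enumerating all (n-1)!/2 tours, which is the point:
    exact methods do not require brute-force enumeration.
    """
    n = len(dist)
    # dp[(mask, j)] = (cost, prev): cheapest path that starts at city 0,
    # visits exactly the cities in bitmask `mask`, and ends at city j.
    dp = {(1 << j, j): (dist[0][j], 0) for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev_mask = mask ^ (1 << j)
                dp[(mask, j)] = min(
                    (dp[(prev_mask, k)][0] + dist[k][j], k)
                    for k in subset if k != j
                )
    full = (1 << n) - 2  # every city except the start
    cost, last = min((dp[(full, j)][0] + dist[j][0], j) for j in range(1, n))
    # Walk the prev pointers backwards to recover the tour itself.
    tour, mask, j = [], full, last
    while j:
        tour.append(j)
        mask, j = mask ^ (1 << j), dp[(mask, j)][1]
    tour.append(0)
    return cost, tour[::-1]

# A made-up four-city instance; the optimal tour has length 80.
D = [[0, 10, 15, 20],
     [10, 0, 35, 25],
     [15, 35, 0, 30],
     [20, 25, 30, 0]]
print(held_karp(D))  # (80, [0, 2, 3, 1])
```

Concorde goes much further, with cutting planes and branch-and-bound, which is how it proves optimality on instances orders of magnitude larger than anything a dynamic program or a genetic algorithm can touch.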

The point of Olson’s blog is to produce interesting visualizations and experiments with data. It is not a production system, and it does not claim to solve grand challenge problems. We shouldn’t shoot a guy because he rolls his own solver for a blog post (though I will qualify this statement in a future post). I applaud him because more people should be doing cool stuff like this. Even so, there are lessons here for both the operations research and the broader data science communities. In the remainder of this post I will focus on the operations research side, saving the data science lessons for another day.

Operations research is an indispensable tool of industry but has never really connected with the technology community in the way that it deserves. This is the fault of the OR community. Neal Stephenson’s 1999 essay “In the Beginning was the Command Line”(1) is an extended meditation on computer user interfaces, including command line interfaces such as those found in Linux. Much of what it says applies to analytics in 2015, particularly an analogy between power drills and user interfaces, which I will quote at length since it’s worth it:

The Hole Hawg is a drill made by the Milwaukee Tool Company. If you look in a typical hardware store you may find smaller Milwaukee drills but not the Hole Hawg, which is too powerful and too expensive for homeowners. The Hole Hawg does not have the pistol-like design of a cheap homeowner’s drill. It is a cube of solid metal with a handle sticking out of one face and a chuck mounted in another. The cube contains a disconcertingly potent electric motor. You can hold the handle and operate the trigger with your index finger, but unless you are exceptionally strong you cannot control the weight of the Hole Hawg with one hand; it is a two-hander all the way. In order to fight off the counter-torque of the Hole Hawg you use a separate handle (provided), which you screw into one side of the iron cube or the other depending on whether you are using your left or right hand to operate the trigger. This handle is not a sleek, ergonomically designed item as it would be in a homeowner’s drill. It is simply a foot-long chunk of regular galvanized pipe, threaded on one end, with a black rubber handle on the other. If you lose it, you just go to the local plumbing supply store and buy another chunk of pipe.

During the Eighties I did some construction work. One day, another worker leaned a ladder against the outside of the building that we were putting up, climbed up to the second-story level, and used the Hole Hawg to drill a hole through the exterior wall. At some point, the drill bit caught in the wall. The Hole Hawg, following its one and only imperative, kept going. It spun the worker’s body around like a rag doll, causing him to knock his own ladder down. Fortunately he kept his grip on the Hole Hawg, which remained lodged in the wall, and he simply dangled from it and shouted for help until someone came along and reinstated the ladder.

But I never blamed the Hole Hawg; I blamed myself. The Hole Hawg is dangerous because it does exactly what you tell it to. It is not bound by the physical limitations that are inherent in a cheap drill, and neither is it limited by safety interlocks that might be built into a homeowner’s product by a liability-conscious manufacturer. The danger lies not in the machine itself but in the user’s failure to envision the full consequences of the instructions he gives to it.

A large part of the operations research community has focused on building and using Hole Hawgs. Not everyone needs one. Stephenson recognizes this; his view is balanced and insightful. He says of graphical user interfaces:

It simply is the case that we are way too busy, nowadays, to comprehend everything in detail. And it’s better to comprehend it dimly, through an interface, than not at all.

Continuing the drill metaphor, many of us want the Black and Decker at Home Depot (or to just pay a contractor). The “new analytics” community understands this better: they speak of learners, predicting, and classifying. Many OR folks sneer when someone (an MBA?) goes so far as to say “goal seeking”. I haven’t heard anyone simply say “deciding”, although that’s what solvers do. Data scientists see a problem, think about whether they need a predictor, a classifier, or whatever, grab their favorite one, and see if it works; if it doesn’t, they try another. Practitioners will pick up OR tools when they are easier to use and general purpose, even if raw power is sacrificed. Most of this community does not understand the staggering power that certain operations research tools (such as mixed integer programming) provide. Who could believe that mixed integer programming solvers have sped up by a factor of 5,600,000,000 since 1990? Concorde is incredible, but even a far paler imitation of Concorde, modeled as a plain mixed integer program, is almost unspeakably powerful compared to a genetic algorithm. The giant caveat is that for this community, if the phrase “subtour elimination” is involved, or arguably even if the practitioner has to recognize the damn thing as a TSP, OR loses. People like Bill Cook have more than done their part by building such great tools and showing how they can be used. We need more of that, and to push it even farther, lest the narrative remain unchanged. As Matthew Saltzman rightly points out:

Are we going to expect people to know these things in order to realize OR’s benefits? I hope not.

(That said, I don’t want to let the broader analytics community off the hook – I will examine that topic next time.)


(1) Read the whole thing if you’ve got time. Otherwise you will miss stuff like “This is exactly how the World Wide Web works: the HTML files are the pithy description on the paper tape, and your Web browser is Ronald Reagan.”

2014 NFL Statistics by Player and Team in Excel

I have downloaded stats for the recently completed 2014 NFL regular season from yahoo.com, cleaned the data, and saved it in Excel and CSV formats. The files are located here. The column headers should be self-explanatory.

The Excel workbook has seven worksheets:

  • QB: quarterback data.
  • RB: running backs.
  • WR: wide receivers.
  • TE: tight ends.
  • K: kickers. I have broken out attempted and made field goals by distance into separate columns for convenience.
  • DEF: defensive stats by team.
  • ST: special teams stats by team.

The same folder also has separate CSV files for each position, which may be more helpful if you are a coder.
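If you go the CSV route, loading a position file into a data frame is a one-liner. A minimal sketch with pandas, assuming a hypothetical file name like QB.csv for the quarterback file:

```python
import pandas as pd

# Hypothetical file name; substitute whichever position file you downloaded.
qb = pd.read_csv("QB.csv")
print(qb.columns.tolist())  # inspect the column headers
print(qb.head())            # first few rows
```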

2014 In Review: Five Data Science Trends

2014 was another transformative, exciting year for data science. Summarizing even one measly year of progress is difficult! Here, in no particular order, are five trends that caught my attention. They are likely to continue in 2015.

Adoption of higher productivity analytics programming environments. Traditional languages and environments such as C, C++, and SAS are diminishing in importance as R, Python, and Scala ascend. It is not that data scientists are dumping the old stuff; it is that a flood of new data scientists have entered the fray, overwhelmingly choosing more modern environments. These newer systems provide language conveniences as well as a rich set of built-in (or easy-to-install) libraries that handle higher abstraction analytics-related tasks. Modern data scientists don’t want to write CSV read routines, JSON parsers, SQL INSERTs, or logging systems. R is notable in the sense that its productivity gains come from its packages and community support, not from the language itself. R’s clunkiness will be its downfall as Python, Scala, and other languages gain greater traction within the analytics community and narrow the library gap.

Machine learning is king. Data science, however you define it, is a broad discipline, but the thunder belongs to machine learning. More than that, machine learning is becoming synonymous with data science. Optimization, for example, is increasingly described as a backend algorithm supporting prediction or classification. Objective functions are described as “learning functions”, and so on. We are collectively inventing definitions and classifications as we go along, so in some sense this is merely semantics. There is a real problem, however: we risk ignoring the collective wisdom of past masters, throwing broad but shallow techniques at models with rather particular structure. Somewhere in a coffee shop right now, somebody is using a genetic algorithm to solve a network model.

Visualization is everywhere. To the point that it’s starting to get annoying. Whether in solutions such as Tableau or TIBCO Spotfire, add-ons such as Excel Power View or Frontline’s XLMiner Visualization app (plug!), or programming libraries such as matplotlib or d3.js, the machinery to build good visualizations is maturing. The explosion of infographics in popular media has raised expectations: users expect visualizations to guide and inform them at all stages of the analytics lifecycle. As you have no doubt seen, the problem is that it is so easy to build shitty viz: bar charts with no content; furious heat maps signifying nothing. We’ll start to see broader, if unarticulated, consensus on appropriate visualizations and visual metaphors for the results of quantitative analysis. I hope.

Spark is supplanting Hadoop. Apache Spark wins on performance, has well-designed streaming, text analytics, machine learning, and data access libraries, and has huge community momentum. This was all true at the beginning of 2014, but now, at the end of 2014, we are starting to see a breakthrough in industry investment. Hadoop isn’t going anywhere, but in 2015 many new, “big time” big data projects will be built on Spark. The more flexible graph-based pipeline at the heart of Spark is begging to be exploited by great data scientists; what will 2015 bring?

Service oriented architectures are coming to analytics. The ubiquity of REST-based endpoints in web development, combined with a new culture of code sharing, has engendered a new “mixtape” development paradigm. A kid (or even an older guy…) can whip out their MacBook, create a web service in Django, deploy it on AWS using Elastic Beanstalk, connect it to interactive visualizations in an afternoon, and submit an app that night. Amazon has built a $1B+ side business on the strength of cloud-only services, and these same forces will drive analytics forward. The prime mover in data science is not big data. It is the cloud. RESTful analytics services will explode.

Optimization In Google Sheets

Good news for those of you who use spreadsheets to do analytics: Google recently announced a Linear Optimization add-on for Google Sheets, and now Frontline Systems has released a free Solver add-on for Google Sheets that solves not only linear optimization problems but nonlinear ones as well. It has roughly the same capabilities as the Solver App for Excel Online. If you know how to use Excel’s Solver, then you know how to use this. (Disclaimer: I participated in the development of both the Google Sheets and Excel Online apps during my tenure as CTO of Frontline. I think they are great.)

Here’s how to get started with the Solver add-on for Google Sheets.

Step 1: Insert the Add-on. Create a new Google Sheet (for example by going to drive.google.com and clicking “New”). Then, under the Add-ons menu, click “Get add-ons…”. Search for “solver” and you will see both the Google and Frontline apps:

[Screenshot: add-on search results showing the Google and Frontline solver add-ons]

Click on the button next to Solver. (Hi Edwin!) Now “Solver” will appear under the Add-ons menu. When you click on it, a pane will show up on the right-hand side of your screen.

[Screenshot: the Solver task pane on the right-hand side of the sheet]

Step 2: Create an optimization model. You can use the task pane to define the variables, objective, and constraints of your optimization model. Clicking on the “Insert Example” button will paste a sample problem into your sheet. Here’s what it looks like: it’s a production planning problem where we want to determine the number of TVs, Stereos, and Speakers to build in order to maximize profit.

[Screenshot: the sample production planning model pasted into the sheet]

In the task pane on the right you can see that the profit cell (F13) has been selected as the objective we are maximizing. Similar to the Excel solver, you can define the constraints by clicking on them in the “Subject To” section.

[Screenshot: the objective and constraints defined in the Solver task pane]

Step 3: Solve. Clicking on Solve will call Frontline’s Simplex solver to solve your model in the cloud (specifically, Windows Azure…). The variables B3:D3 will be updated, as will any formulas that depend on those values. As you can see, profit goes up:

[Screenshot: the solved model, with updated variable values and a higher profit]

WINNING. If you fool around with the app you will see that you can solve models with arbitrary formulas, not just linear models. And it’s free! Go check it out.
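For the curious, the same flavor of model can be written down outside a spreadsheet in a few lines. Here is a sketch of a generic product-mix linear program using scipy’s linprog; the profit and parts numbers below are made up for illustration and are not the values in Frontline’s sample sheet.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: profit per unit of TVs, stereos, and speakers,
# and how many of each part one unit of each product consumes.
profit = np.array([75.0, 50.0, 35.0])
use = np.array([[1, 1, 0],    # chassis
                [1, 0, 0],    # picture tubes
                [2, 2, 1],    # speaker cones
                [1, 1, 0],    # power supplies
                [2, 1, 1]])   # electronics boards
stock = np.array([450, 250, 800, 450, 600])  # parts on hand

# linprog minimizes, so negate the profit vector to maximize it.
res = linprog(-profit, A_ub=use, b_ub=stock,
              bounds=[(0, None)] * 3, method="highs")
print(res.x)     # units of each product to build
print(-res.fun)  # total profit
```

The add-on is doing essentially this kind of computation, except that it reads the objective, variables, and constraints straight from your cells and formulas (and, as noted above, it can also handle nonlinear formulas, which linprog cannot).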

Predicting the 2014-2015 NBA Season

Over the weekend I created a model in Excel to predict the 2014-2015 NBA season. The model simulates the full 82-game schedule, using player Win Shares from the 2013-2014 season to estimate the strength of each team, accounting for roster changes. This model is not perfect, or even particularly sophisticated, but it is interesting. Here are the model’s predictions as of 10/15/2014, with projected playoff teams in bold.

Eastern W L PCT GB Home Road
Cleveland 61 21 0.744 0 31-10 30-12
Toronto 57 25 0.695 4 30-11 27-14
Chicago 49 33 0.598 12 26-15 23-18
Washington 44 38 0.537 17 24-17 21-20
New York 44 38 0.537 17 23-18 21-20
Miami 42 40 0.512 19 22-19 20-21
Detroit 42 40 0.512 19 22-19 20-21
Charlotte 41 41 0.500 20 22-19 19-22
Atlanta 41 41 0.500 20 22-19 19-22
Indiana 37 45 0.451 24 20-21 17-24
Boston 30 52 0.366 31 16-25 14-27
Brooklyn 26 56 0.317 35 14-27 11-30
Orlando 25 57 0.305 36 14-27 12-29
Milwaukee 24 58 0.293 37 14-27 11-30
Philadelphia 14 68 0.171 47 8-33 6-35
Western W L PCT GB Home Road
LA Clippers 57 25 0.695 0 30-11 27-14
San Antonio 57 25 0.695 0 30-11 27-14
Oklahoma City 54 28 0.646 3 29-14 26-15
Phoenix 53 29 0.646 4 28-13 25-16
Golden State 53 29 0.646 4 28-13 25-16
Portland 49 33 0.598 8 26-15 23-18
Houston 47 35 0.573 10 25-16 22-19
Dallas 47 36 0.566 10 24-17 22-19
Memphis 41 41 0.500 16 22-19 19-22
Denver 40 42 0.488 17 21-20 19-22
Minnesota 37 45 0.451 20 20-21 17-24
Sacramento 32 50 0.390 25 17-24 15-26
LA Lakers 31 51 0.378 26 17-24 14-27
New Orleans 27 55 0.329 30 15-26 12-29
Utah 25 58 0.301 33 14-27 11-30

You can download my full spreadsheet here. It’s complicated but not impossible to follow. It does not exactly match the results presented above because I have a messier version that accounts for recent injuries, e.g. Kevin Durant.

The Cavs, Spurs, and Clips are the favorites to win the title in this model (a previous version of this model also had the Thunder in this class, but Kevin Durant is now injured). Comparing these estimates to over-unders in Vegas, the biggest differences are Brooklyn (lower), Indiana (higher), Memphis (lower), Minnesota (higher), New Orleans (lower), Phoenix (higher). If you take the time to read through the methodology at the end of this post, you may be able to see why some of these differences exist. Some are probably reasonable, others may not be.

How It Works

Many of the ingredients for this model were presented in my previous three posts, where I compiled game-by-game results for the 2013-2014 season, built a simple model to predict rookie performance, and tracked roster changes. Now the task is pretty simple: estimate the strength of each team, figure out how unpredictable games are, and then simulate the season using the strengths, accounting for uncertainty. At the end I discuss weaknesses of this model, which, if you are a glass-half-full type of person, also suggest areas for improvement.

Step 1: Estimate the strength of each team. Team strengths are estimated by adding up the 2013-2014 Win Shares for the top twelve players on each NBA team. In my last post I gave a spreadsheet with Win Shares for all 2013-2014 NBA players based on data from basketball-reference.com. I made three adjustments to this data for the purposes of this analysis:

  • Added rookies. I estimated projected 2014-2015 Win Shares for rookies using the logarithmic curve given in this post.
  • Accounted for injuries to good players. Kobe Bryant, Derrick Rose, Rajon Rondo, and a couple of other high profile players were injured in 2013-2014. I replaced their Win Share total with the average of the past three seasons, including the season they were injured. Is this reasonable? I don’t know.
  • Trimmed to 12. I manually trimmed rosters so that only the 12 players with the highest Win Shares remained.

Adding Win Shares gives an overall “strength rating” for each team.
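If you prefer code to spreadsheet formulas, the Step 1 calculation is a short pandas expression. A sketch, assuming a data frame built from the roster spreadsheet in the previous post, with a hypothetical file name and the Win Shares column abbreviated as WS:

```python
import pandas as pd

roster = pd.read_csv("nba_players_2014.csv")  # hypothetical file name

# Keep each team's 12 highest Win Share players, then sum by team.
strength = (roster.sort_values("WS", ascending=False)
                  .groupby("Tm2014")
                  .head(12)
                  .groupby("Tm2014")["WS"]
                  .sum()
                  .sort_values(ascending=False))
print(strength.head())
```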

Step 2: Estimate the unpredictability of game results. Most of the time, a good team will beat a bad team. Most of the time. Can we quantify this more precisely? Sure. In a previous post, I determined that home court advantage is approximately 2.6 points per game. We also found that although the difference in total season wins is a predictor of who will win in a matchup between two teams, it is a rather weak predictor. In other words, bad teams beat good teams quite often, especially at home. For our prediction model we make another simple assumption: every team’s performance over the course of the season varies according to a normal distribution, with the mean of this distribution corresponding to its overall team strength.

Normal distributions are defined by two parameters: mean and standard deviation. If I know what the normal distribution looks like, then I can estimate the probability of the home team winning in a matchup: take the difference of the two teams’ strengths, then calculate the cdf of the corresponding distribution at –2.6 (the home court advantage). But what is the standard deviation? I can estimate it by “replaying” all of the games in the previous season. If I guess a value for the standard deviation, I can calculate win probabilities for all games. If I add up the win probabilities for, say, Boston, then this should sum to Boston’s win total for the season (sadly, 25). So if I want to estimate the standard deviation, all I have to do is minimize the sum of deviations between estimated and actual 2013-2014 win totals. I can do this using Excel’s Solver: it’s a nonlinear minimization problem involving only one variable (the standard deviation).
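Here is a sketch of Step 2 in Python, under one reading of the setup above (the home team’s margin is Normal with mean equal to the strength difference, and the home team wins when that margin exceeds –2.6). The games list and the strength and win-total lookups are placeholders for the 2013-2014 data; scipy’s one-variable minimizer stands in for Excel’s Solver.

```python
from collections import defaultdict
from scipy.stats import norm
from scipy.optimize import minimize_scalar

HOME_EDGE = 2.6  # home court advantage, in points

def home_win_prob(home_strength, away_strength, sigma):
    # P(home margin > -HOME_EDGE), where the margin is
    # Normal(home_strength - away_strength, sigma).
    diff = home_strength - away_strength
    return 1.0 - norm.cdf(-HOME_EDGE, loc=diff, scale=sigma)

def fit_sigma(games, strength, actual_wins):
    """games: list of (home_team, away_team) pairs for 2013-2014.
    strength: team -> strength rating; actual_wins: team -> 2013-14 wins."""
    def total_error(sigma):
        expected = defaultdict(float)
        for home, away in games:
            p = home_win_prob(strength[home], strength[away], sigma)
            expected[home] += p
            expected[away] += 1.0 - p
        return sum(abs(expected[t] - actual_wins[t]) for t in actual_wins)
    return minimize_scalar(total_error, bounds=(1.0, 30.0), method="bounded").x
```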

It turns out that the resulting estimate does a very good job of matching 2013-2014 results:

  • The estimated win totals for all teams were within 2 wins of their actual values.
  • The estimated home winning percentage matches the actual value quite closely!

Step 3: Determine win totals for the 2014-2015 season. I obtained the 2014-2015 schedule for $5 from nbastuffer.com. Using this schedule, I calculated win probabilities for each game using the team strengths from Step 1 and the standard deviation from Step 2. If I add up the totals for each team, I get their estimated win totals. Voila! Since the prediction is created by looking at each game on the schedule, we also get home and away records, in-conference records, and so on. It’s also easy to update the estimate during the season as games are played, players are traded or injured, and so on.
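Step 3 is then just a loop over the 2014-2015 schedule, reusing home_win_prob from the sketch above (the schedule structure is again a placeholder):

```python
from collections import defaultdict

def expected_wins(schedule, strength, sigma):
    """schedule: list of (home_team, away_team) pairs for 2014-2015.
    Uses home_win_prob from the Step 2 sketch above."""
    wins = defaultdict(float)
    for home, away in schedule:
        p = home_win_prob(strength[home], strength[away], sigma)
        wins[home] += p        # expected wins credited to the home team
        wins[away] += 1.0 - p  # and the complement to the road team
    return dict(wins)
```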

Why this model stinks. The biggest virtue of this model is that it was easy to build. I can think of at least ten potential shortcomings:

  1. Win Shares are probably not the best metric for individual and team strength.
  2. It assumes that individual performance for 2014-2015 will be the same as 2013-2014. Paul Pierce isn’t getting any younger.
  3. It does not account for predictable changes in playing time from season-to-season.
  4. 2014-2015 win shares are not normalized to account for players leaving and entering the league.
  5. 2014-2015 win shares do not account for positive and negative synergies between players.
  6. There is no reason to believe that the standard deviation calculated in Step 2 should be the same for all teams.
  7. I have not given any justification for using a normal distribution at all!
  8. The vagaries of the NBA schedule are not accounted for. For example, teams play worse in their second consecutive road game.
  9. Several teams, including the Philadelphia 76ers, will tank games.
  10. Injuries were handled in an arbitrary and inconsistent fashion.

It will be interesting to see how this model performs, in spite of its shortcomings.

NBA Rosters and Team Changes: 2013-2014

I have downloaded statistics for all NBA players from basketball-reference.com, and accounted for roster changes (as of Sunday, October 5).

The Tm2013 column is a three-letter abbreviation for the player’s 2013-2014 team. For players that played on two or more teams, the entry represents the team for which the player played the most minutes. The Tm2014 column is the player’s current NBA team (as of Sunday, October 5) according to nba.com. Rookies are not included in this spreadsheet.

If you’ve read my previous two posts, you may have guessed that I am leading up to a prediction of the upcoming 2014-2015 NBA season. You’d be correct – I will post my model and predictions tomorrow.

In the meantime, you can actually use the spreadsheet above to create a very crude prediction. The last column in this spreadsheet is “win shares”. If you create a pivot table based on Tm2014 and Win Shares, you get the sum of player win shares for current rosters – a crude measure of team strength. Here is what that table looks like:

Team Sum of WS
CLE 60.7
SAS 59.4
TOR 58.6
LAC 58
IND 56.6
OKC 54.1
GSW 53.8
PHO 52.6
POR 50.8
DAL 48.6
HOU 47.9
WAS 46.8
NYK 44.3
DET 43.9
MEM 43.5
MIA 40.8
CHI 40.4
ATL 38.9
DEN 36.9
MIN 34.3
SAC 32.6
CHA 32.1
NOP 31
LAL 26.9
BRK 26.1
BOS 25.9
UTA 23.8
ORL 23.1
MIL 21.3
PHI 7.2

I can think of at least five weaknesses in this “pivot table model”. Can you?

Predicting NBA Rookie Performance By Draft Position

Nate Silver (and others) have tracked how NBA draft position relates to total career performance; see, for example, this article. But what about first-year performance?

I pulled two sets of data from basketball-reference.com to answer this question: draft results (player and draft position) and rookie-season Win Shares.

I merged them using Power Query and created a pivot table to calculate the average number of rookie-season Win Shares by draft position. You can download my Excel workbook here. Here is what I found:

[Chart: average rookie-season Win Shares by draft position, with a logarithmic trendline]

The first pick in the draft averages nearly five Win Shares in his rookie season, and while the pattern is irregular, Win Shares decrease as we get deeper into the draft (duh). (The blip at the end is due to Isaiah Thomas, drafted by the Kings, who promptly screwed up by letting him go.) I have drawn a logarithmic trendline which fits the data not too shabbily: R^2 of 0.7397 (a quick way to reproduce this kind of fit is sketched after the table below). Obviously we could do much better if we considered additional factors related to the player (such as college performance) and the team (the strength of teammates playing the same position, who will compete with the rookie for playing time). Here are the averages for the first 30 draft positions:

Draft Position Win Shares
1 4.96
2 2.69
3 2.96
4 4.14
5 2.23
6 1.84
7 3.36
8 1.68
9 2.59
10 1.52
11 0.84
12 1.51
13 1.48
14 1.36
15 1.64
16 1.19
17 2.37
18 1.02
19 0.71
20 1.09
21 1.74
22 2.14
23 1.54
24 2.29
25 0.98
26 1.23
27 1.08
28 0.40
29 0.54
30 0.94
31 0.79
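For what it’s worth, the trendline is easy to reproduce outside Excel. A sketch with numpy, fitting y = a*ln(x) + b (an assumption about the trendline form Excel uses) to the averages in the table above; since the table only covers the first 31 picks, the fitted R^2 may not exactly match the 0.7397 quoted for the full chart.

```python
import numpy as np

# Average rookie-season Win Shares by draft position, from the table above.
ws = np.array([4.96, 2.69, 2.96, 4.14, 2.23, 1.84, 3.36, 1.68, 2.59, 1.52,
               0.84, 1.51, 1.48, 1.36, 1.64, 1.19, 2.37, 1.02, 0.71, 1.09,
               1.74, 2.14, 1.54, 2.29, 0.98, 1.23, 1.08, 0.40, 0.54, 0.94,
               0.79])
picks = np.arange(1, len(ws) + 1)

# A logarithmic trendline y = a*ln(x) + b is just a linear fit in ln(x).
a, b = np.polyfit(np.log(picks), ws, 1)
fitted = a * np.log(picks) + b
ss_res = np.sum((ws - fitted) ** 2)
ss_tot = np.sum((ws - ws.mean()) ** 2)
print(f"y = {a:.3f}*ln(x) + {b:.3f}, R^2 = {1 - ss_res / ss_tot:.3f}")
```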