Analytics is an anonymously collaborative discipline. Raw materials come from blogs, MOOCs, online tutorials, github, stackoverflow. They come from research papers on paid repositories through university accounts, or free sites like optimization-online.org. They come from pip and install.package and nuget. It used to be different, of course. I remember combing the stacks in MacLean Hall , Pappajohn, Argonne, tracing citations with red pen, turning nickels into gold on the Xerox. I curated hard copies of good papers for years, long after the need had faded – perhaps a story for another time. Those days were good, but now the center of gravity of analytics lies in the digital world.
This shift in how we share and learn data science came to mind during a recent episode involving a long time friend of this blog, the Traveling Salesman Problem (TSP). Randy Olson, a computer science graduate research assistant at Michigan State, blogged about producing optimal shortest path tours between interesting USA landmarks. On his site you can see interactive graphics that display the landmarks and the routes considered by his approach. Before I continue, I want to be clear: Randy’s contributions are admirable. He provided full code for his solution, carefully explained the problem, and connected problem to solution with visuals and context. It’s a great example of analytics storytelling combined with repeatable processes, that happens to be fun and engaging. Olson’s post, and subsequent work, has attracted considerable attention not only from the broader analytics community, but general interest outlets as well, for example the New York Times and the Washington Post. It is not often that one is hailed as a genius without buying or selling something. High praise!
I shall now take a deep breath and address the operations research (aka optimization) community. We are generous folk but this sticks in our collective craws, doesn’t it? We know we should place Dr. Evil style “air quotes” around “optimal” in the previous paragraph because the approach does not necessarily produce the best possible tour. What’s more, Olson says:
If you read my Where’s Waldo article, you’re already aware of how difficult it can be to solve route optimization problems like this one. With 50 landmarks to put in order, we would have to exhaustively evaluate 3 x 1064 possible routes to find the shortest one.
The operations research community knows the last statement is not the case (geeks: this is not to deny TSP’s computational complexity, just that enumeration is not mandatory for provably optimal solutions). This was old news back when I was listening to Depeche Mode on my Discman and when you said “tableau” you were talking about simplex. The Where’s Waldo link reveals another fun example that is similar. In both cases, Olson used a hand-coded genetic algorithm to find shortest tours between points. Operations researchers know that the USA Landmark problem is a classic TSP, and there are a number of highly tuned approaches for solving really big instances, the best of which is Concorde. Bill Cook, TSP master, explains the operations research approach behind TSP and applies it to the Olson problem here. You can read Cook’s TSP book “In Pursuit of the Traveling Salesman” for more. The point is that you can get what is indisputably, provably, the best tour in fractions of a second using Concorde, even on a phone. In other words, Olson’s TSP approach is not guaranteed to find the best possible answer and is not even as efficient (in a comparative sense).
The point of Olson’s blog is to produce interesting visualizations and experiments with data. It is not a production system, and it does not claim to solve grand challenge problems. We shouldn’t shoot a guy because he rolls his own solver for a blog post (though I will qualify this statement in a future post). I applaud him because more people should be doing cool stuff like this. Even so, there are lessons here for both the operations research and the broader data science communities. In the remainder of this post I will focus on the operations research side, saving the data science lessons for another day.
Operations research is an indispensible tool of industry but has never really connected with the technology community in the way that it deserves. This is the fault of the OR community. Neal Stephenson’s 1999 essay “In the Beginning was the Command Line”(1) is an extended meditation on computer user interfaces, including command line interfaces such as seen in Linux. Much of what is said applies to analytics in 2015, particularly an analogy between power drills and user interfaces, which I will quote at length since it’s worth it:
The Hole Hawg is a drill made by the Milwaukee Tool Company. If you look in a typical hardware store you may find smaller Milwaukee drills but not the Hole Hawg, which is too powerful and too expensive for homeowners. The Hole Hawg does not have the pistol-like design of a cheap homeowner’s drill. It is a cube of solid metal with a handle sticking out of one face and a chuck mounted in another. The cube contains a disconcertingly potent electric motor. You can hold the handle and operate the trigger with your index finger, but unless you are exceptionally strong you cannot control the weight of the Hole Hawg with one hand; it is a two-hander all the way. In order to fight off the counter-torque of the Hole Hawg you use a separate handle (provided), which you screw into one side of the iron cube or the other depending on whether you are using your left or right hand to operate the trigger. This handle is not a sleek, ergonomically designed item as it would be in a homeowner’s drill. It is simply a foot-long chunk of regular galvanized pipe, threaded on one end, with a black rubber handle on the other. If you lose it, you just go to the local plumbing supply store and buy another chunk of pipe.
During the Eighties I did some construction work. One day, another worker leaned a ladder against the outside of the building that we were putting up, climbed up to the second-story level, and used the Hole Hawg to drill a hole through the exterior wall. At some point, the drill bit caught in the wall. The Hole Hawg, following its one and only imperative, kept going. It spun the worker’s body around like a rag doll, causing him to knock his own ladder down. Fortunately he kept his grip on the Hole Hawg, which remained lodged in the wall, and he simply dangled from it and shouted for help until someone came along and reinstated the ladder.
But I never blamed the Hole Hawg; I blamed myself. The Hole Hawg is dangerous because it does exactly what you tell it to. It is not bound by the physical limitations that are inherent in a cheap drill, and neither is it limited by safety interlocks that might be built into a homeowner’s product by a liability-conscious manufacturer. The danger lies not in the machine itself but in the user’s failure to envision the full consequences of the instructions he gives to it.
A large part of the operations research community has focused on building and using Hole Hawgs. Not everyone needs one. Stephenson recognizes this; his view is balanced and insightful. He says of graphical user interfaces:
It simply is the case that we are way too busy, nowadays, to comprehend everything in detail. And it’s better to comprehend it dimly, through an interface, than not at all.
Continuing the drill metaphor, many of us want the Black and Decker at Home Depot (or just pay a contractor). The “new analytics” community understand this better: they speak of learners, predicting, and classifying. Many OR folks sneer when someone (an MBA?) goes so far as to say “goal seeking”. I haven’t heard anyone simply say “deciding”, although that’s what solvers do. Data scientists see a problem and think about whether they need a predictor, or a classifier, or whatever, and go and grab their favorite one and see if it works, and if it doesn’t they try another. Practitioners will pick up OR tools when they are easier to use and general purpose – even if raw power is sacrificed. Most of this community does not understand the staggering power that certain operations research tools (such as mixed integer programming) provide. Who could believe that mixed integer programming solvers have sped up by a factor of 5,600,000,000 since 1990? Concorde is incredible, but a far paler variant of Concorde as modeled using mixed integer programming is almost unspeakably powerful compared to a genetic algorithm. The giant caveat is that for this community, if the phrase “subtour elimination” is involved, or arguably even if the practitioner has to recognize the damn thing as a TSP, OR loses. People like Bill Cook have more than done their part by building such great tools and showing how they can be used. We need more of that, and to push it even farther, lest the narrative remain unchanged. As Matthew Saltzman rightly points out:
— Matthew Saltzman (@SaltzmanMJ) March 13, 2015
Are we going to expect people to know these things in order to realize OR’s benefits? I hope not.
(That said, I don’t want to let the broader analytics community off the hook – I will examine that topic next time.)
(1) Read the whole thing if you’ve got time. Otherwise you will miss stuff like “This is exactly how the World Wide Web works: the HTML files are the pithy description on the paper tape, and your Web browser is Ronald Reagan.”