The Only Interesting Thing About the Burrito Bracket Challenge

All models are wrong, but some are useful. Even when it comes to burritos. I had been successfully ignoring the seemingly interminable Burrito Bracket Challenge(*) on Five Thirty Eight until this retweet by Carl Bialik grabbed my attention:

Am I supposed to take any of this seriously? I don't know. Nevertheless, curiosity got the better of me and I read the Vox article. Matthew Yglesias asserts that Chipotle is better because, unlike La Taqueria, most people can actually find a Chipotle burrito when they want one; La Taqueria is available only to the lucky few within a five-mile radius of the Mission District. In the words of Yglesias:

The best burrito is the burrito you actually want to have in real life, a burrito that is both tasty and available.

OK, so now we're discussing which definition of "best burrito" is best. Of course, that's subjective, as I pointed out (in a wildly different context) some time ago. In any decision-making process, especially one involving analytics(**), it is critical to openly discuss the criteria for ranking one decision (or burrito) over another. That is obvious. Less obvious is the fact that we often choose our decision criterion based on how easy the criterion is to evaluate, rather than how relevant it is to the matter at hand.

Let’s return to burritos. The Five Thirty Eight criterion was “best tasting burrito”. Here’s what they had to do to figure that out:

  1. Decide on a list of attributes to consider, and their weighting. 
  2. Trim the list of all burrito joints in the US to a reasonable number using readily available data.
  3. Visit each location and sample at least one burrito.
  4. Take detailed notes on the experience and produce ratings based on the attributes.
  5. Review and calibrate the ratings.
  6. Determine the winner by sorting the ratings (and getting Silver’s blessing).

That’s real work! The Vox criterion was “tasty and available” (aka “scalable”). Here’s what they had to do:

  1. Get a list of chain restaurants that serve burritos. Remove Taco Bell from this list, because it is not tasty.
  2. Get the number of locations for each.
  3. Sort the list.
Much easier. A child, or possibly an intern, could do this in about thirty minutes, and you can do it informally in one second by whispering “Chipotle”. A more advanced version of the “scalability” study would be to replace step 2 with:
  • Calculate the total reach for each chain by finding the number of people who live within five miles of chain locations.
Throw in a D3 map with burrito reach for the top three chains, and you’ve got yourself a nice little post that will get a bunch of views and retweets.
 
But that would be pointless(***). This global analysis based on a scalability criterion is useless, and it’s useless even though we all know that the “best burrito criterion” question is subjective. The Five Thirty Eight criterion is certainly useful, even if you don’t agree with it: you’ve got an idea of where to find the best tasting burrito in America, based on a reasonable standard. If you’re looking for the best burrito in your region, you can find that too (Northeast? Head to the Bronx). The Vox criterion as evaluated above is not useful because you can’t use it to make decisions. It only tells you that there are tons of Chipotle locations, which you already knew. Such an analysis might be useful for burrito chains themselves, but not for consumers.   
 
If you are looking for a burrito you can actually eat, like, now, then a useful analysis would consist of opening the Yelp app, searching for "burrito" using "Current Location", and sorting by rating. Of course, your analysis and my analysis based on this criterion would yield different results because we live in different places; the results of my analysis apply only to me.
 
If we wanted an analysis that would actually be useful in decision making, we’d need to modify our methodology to something like:
  1. Divide the United States into little squares, say 5 x 5 miles.
  2. In each square, determine the best "tasty and available" burrito restaurant based on some combination of Yelp reviews and distance.
  3. Create an interactive map based on these results.
This is much harder to do, but helpful: if you tell me where you are then I can tell you where you should get a burrito. The answer isn’t always Chipotle! Therefore while the general point that “the best burrito is one that you can actually eat” is quite reasonable, a naive global analysis based on this criterion is quite useless. 
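
To make step 2 of that methodology concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the hard-coded restaurants list stands in for data you would actually pull from the Yelp API, and the scoring rule (highest rating within five miles) is just one way to combine "tasty" and "available".

```python
import math

# Hypothetical data: (name, latitude, longitude, Yelp rating). A real
# version would pull these fields from the Yelp API for each grid square.
restaurants = [
    ("Chipotle #1203", 41.88, -87.63, 3.5),
    ("La Pasadita",    41.91, -87.67, 4.5),
]

def miles_between(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in miles (haversine formula)."""
    earth_radius_miles = 3959.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

def best_available_burrito(square_lat, square_lon, radius_miles=5.0):
    """For one grid square: the highest-rated burrito joint within reach."""
    nearby = [
        (rating, name)
        for name, lat, lon, rating in restaurants
        if miles_between(square_lat, square_lon, lat, lon) <= radius_miles
    ]
    return max(nearby, default=None)

print(best_available_burrito(41.90, -87.65))  # (4.5, 'La Pasadita')
```

Applied to every square in the grid, this becomes the interactive map; the point of the exercise is that the answer changes as you move around.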
 
Choosing decision criteria based on what is simplest rather than what is most relevant is a fundamental flaw of many analytics applications. I've seen it at every company I've worked for! For example, it's common to instrument websites and apps to see how frequently different features, buttons, and pages are used. Data can be collected and we can see, for example, that 70% of the time users choose to leave a product download page rather than register. If we collect additional information we can infer demographics, the device they are using, and so on. If you're trying to figure out how to modify the website, or which features to add to a product, you may turn to this data. But this data tells you only how users use what you've got, not what they would like to see. You need different data to answer that question, data obtained from A/B testing, surveys, competitive analysis, and so on. Sorting whatever data you have nearby and making a cool chart out of it is not good analytics.
 
Software doesn't really help you determine what data you'll need to answer the questions you care about. It can help you access, process, visualize, and summarize, but that's it. Our current emphasis on data visualization and infographics obscures this point. In the burrito case, it's easier to make pretty pictures out of the "most scalable" data than the "best tasting" data, even though the former is less useful for decision making. Software vendors aren't helping either: the emphasis on "storytelling" with data skims over the fact that "analytics stories" must have a "moral", otherwise they merely entertain rather than inform.

(*) Congratulations, La Taqueria.
(**) because the process for making the decision itself is understood by fewer people, and if you change your mind you will have to ask a geek to go change their code.
(***) other than to attract page views…

Confirmation Bias in Data Science

I thought it would be interesting to talk about a few dangers of data science. Here’s one: confirmation bias.

As a data scientist, you have a client. That client may be a colleague, a customer, or another piece of analytics. If you trace the paths far enough, you are going to find a group of people who have a vested interest in the results of your analysis. If you’re in the business world it’s all about the Benjamins, and if you’re in the academic world it revolves around tenure.

If you are training to become a data scientist, I hope somebody has told you that you will be pressured to produce a specific result by someone with a special interest. I guarantee it. At Nielsen, my group built the analytics systems that were used to carry out marketing return on investment studies for Nielsen clients – big Fortune 500 companies who spent millions or billions of dollars on advertising. A big part of an ROI study was to decompose sales by sales driver, for example: how much of the sales was due to TV? To pricing? To Facebook? Always, without exception, clients had expectations about what these numbers would be. If they decided to make a big push into digital during the previous year, you bet your ass they're looking for high digital ROI. Or they expect the ROI for "diet" advertising to be higher than for "classic". Or that macroeconomic factors were what dragged down sales, and so on. Moreover, there were always specific expectations from the team carrying out the analysis! The biggest one was that the results would be different, but not too different, from the results of the study carried out the year before! Different, so that the client would feel like they got their money's worth, but not so different that the data, methodology, or modeling would be questioned or dismissed.

Let's be clear: this is not a Nielsen-specific problem, and in fact our team and all of the modeling teams at Nielsen took great pains to make sure that these kinds of biases did not affect our findings, from incorporating these concerns into our training, to reporting in our software systems, to automation that would prevent even the temptation to fudge the numbers. We ran into cases where competitors were clearly putting a thumb on the scale to produce the result that was expected, and by holding firm we sometimes lost deals. You may find yourself in the same situation. By the way, the more complicated your data and model are, the easier it is to fudge! Sometimes people make things complicated to avoid accountability.

If you are going to call yourself a data scientist, you are going to have to have a strong spine and not let these pressures get to you, or your models. You have an obligation to listen to those with domain expertise, to be realistic, and to understand that your model is just that: a model. A model is a representation of reality, and not reality itself. Human factors can and should affect how you build and reason about your models. Here’s the point: you need to do this in a level-headed, honest way, even if your client, or their client, doesn’t like it. Don’t just do what people tell you to do, or say what you are expected to say. Use your brain and use your data.

Software Is Not Eating Data Science

I am growing weary of what I will call data science exceptionalism: the assumption that anything associated with data science, analytics, or big data is completely new or different. (Data science exceptionalism is perhaps a form of what Evgeny Morozov calls "solutionism".)

Our example for today is "Software is Eating Data Science". A couple of representative quotes:

“Automation is sending the data scientist the same way as the switchboard operator.”

and

“Over the next five years, someone will say ‘why am I spending $500,000 on people to do this work, when I can do it with software?’ So in the same way you see software starting to do what people are doing,” Weiss told hosts John Furrier and Dave Vellante. “Increasingly, companies like HP are figuring out how to automate that, but we’re still at the very early stages and there’s so much exciting work to do.”

We must be generous in our interpretations of these quotes; it would be unfair to assume that the author believes that data science was conducted using pen and paper, or perhaps an abacus, to this point. The argument is that automated processes, powered by software, will eventually make data scientists obsolete. Manual data science processes, the argument goes, can be automated because “tools and techniques are similar”. This automation then makes data scientists obsolete because software solutions can be reused and deployed on clusters of computers on premises or in the cloud.

The flaw in this reasoning comes at the end: contrary to the article, automation will result in a greater need for data scientists. Data science is, has always been, and will for the foreseeable future be a collaboration between man and machine. The best chess player in the world is a grandmaster with a computer, and the best analytics in the world will come from trained data scientists with computers. Automation will allow data scientists to work more productively, making data science more valuable, increasing the demand for and benefit of data science applications, and in turn generating more demand for data scientists.

The trends described in the article are no different than those in web development at the beginning of the (first) dot-com era: namely, the combination of established and nascent technologies and processes being embraced by a larger audience, leading to progressively higher levels of abstraction and productivity. Simply put: what it meant to be a web developer in 2004 was different from what it meant in 1994, and from what it means today. So it is with data science: you won't see PhDs writing bespoke Python to mine web comments for sentiment ten, or even five, years from now. For many organizations, the frontier of data science will move up the value chain from descriptive and predictive to prescriptive analytics (i.e. decisions), and from low-level data munging and model building to more componentized, automated processes. This will not eliminate the need for a data scientist role – it will change how data scientists spend their time.

Software is one of the fundamental tools of analytics, along with mathematical and domain expertise. Better tools may change how practitioners do their work, but they usually do not obviate the need for those skilled in the art. John Deere's plow did not "eat farming", so let's stop the silly talk.

Programming for Data Scientists – Types of Programming

This post is part of a series that discusses programming for data scientists. Let's discuss the different types of programming tasks a data scientist is likely to encounter: scripting, applications programming, and systems programming. These are general purpose terms that have specific meaning when applied to data science.

Data scientists often start their careers doing a lot of scripting. Scripting generally involves issuing a sequence of commands to link data and external processes. In other words, it is the glue that holds a larger system together. For example, a script may retrieve poll results from a number of different websites, consolidate key fields into a single table, then push a CSV version of this table to a Python script for analysis. In another system, a script may generate AMPL input files for an optimization model, invoke AMPL, and kick off a database upload script on completion. Python, R, and Matlab all provide REPL environments that can be used for scripting computationally oriented tasks. Good scripters are highly skilled and highly productive; however, a formal education in software engineering principles is not required. Most scripts are intended to have a finite (and short) lifespan and to be used by only a few people, though it is often the case that these assumptions turn out to be incorrect!
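
To make the flavor of this glue work concrete, here is a minimal sketch of the poll-consolidation script described above, assuming each source already exposes a CSV. The URLs and field names are invented for illustration.

```python
import csv
import urllib.request

# Hypothetical poll sources; a real script would need per-site parsing logic.
SOURCES = {
    "pollster_a": "https://example.com/polls_a.csv",
    "pollster_b": "https://example.com/polls_b.csv",
}

def fetch_rows(url):
    """Download a CSV file and return its rows as dictionaries."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(text.splitlines()))

def main():
    consolidated = []
    for source, url in SOURCES.items():
        for row in fetch_rows(url):
            # Keep only the fields the downstream analysis script expects.
            consolidated.append({
                "source": source,
                "date": row.get("date", ""),
                "candidate": row.get("candidate", ""),
                "pct": row.get("pct", ""),
            })
    # Write the consolidated table for the analysis script to pick up.
    with open("polls.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["source", "date", "candidate", "pct"])
        writer.writeheader()
        writer.writerows(consolidated)

if __name__ == "__main__":
    main()
```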

Applications programming involves producing a system to be used by a client such as a paying customer, a colleague in another department, or another analyst. Some applications programming scenarios start as scripts. For example, the AMPL "connector script" described above may be turned into a full Java module installed on a server that handles requests from a web page to invoke Gurobi. In many cases, the "model building" phase in an analytics project is essentially an applications programming task. An important aspect of applications programming is that the client has a different set of skills than you do, and therefore a primary focus is to present a user interface (be it web, command-line, or app-based) that is productive for the client to use. Unlike a script, an application may have several tiers of functionality that are responsible for different aspects of the system. For example, a three-tiered architecture is so named for its principal components: the user interface component, the data access component, and the business rules component that mediates between the user and the data required to carry out the task at hand. Each of these components, in turn, may have subcomponents. In our Java/Gurobi example above, there may be subcomponents for interpreting and checking user input, preparing the data necessary for the optimization model, creating and solving the optimization model, and producing a solution report in a user-friendly form.
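
A rough sketch of what that three-tier split might look like in Python (the class names, data, and "planning" logic are all hypothetical; a real implementation would call a solver such as Gurobi in the business tier):

```python
# Illustrative three-tier layout for the optimization service described
# above. All names and data are hypothetical.

class DataAccess:
    """Data tier: load inputs and persist results."""
    def load_demand(self):
        return {"widget": 100, "gadget": 250}  # stand-in for a database query

class PlanningRules:
    """Business tier: validate inputs and produce a production plan."""
    def __init__(self, data_access):
        self.data_access = data_access

    def plan(self):
        demand = self.data_access.load_demand()
        # Placeholder logic; a real tier would build and solve a model here.
        return {product: quantity for product, quantity in demand.items()}

class WebHandler:
    """Presentation tier: turn a request into a user-friendly report."""
    def __init__(self, rules):
        self.rules = rules

    def handle_request(self):
        plan = self.rules.plan()
        return "\n".join(f"{product}: produce {qty}" for product, qty in plan.items())

print(WebHandler(PlanningRules(DataAccess())).handle_request())
```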

Systems programming deliverables are intended for other programmers to use. They provide a set of programming interfaces or services that allow a programmer to carry out a related set of tasks. Traditionally, systems programming has been associated with computer operating systems such as Windows and Linux. However, systems programming has long been a critical component of analytics solutions. A classic example is the LAPACK linear algebra library, designed by Jack Dongarra and others. LAPACK has in turn served as the basis for countless other libraries, ranging from signal processing to quadratic programming. NumPy is a more recent example. If you are building a numerical library of your own, you are undertaking a systems programming task. Hallmarks of well-designed systems libraries are consistent naming, coherent API design, and high performance. While such qualities are also important for applications, they are not as critical because the aim is different.
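
NumPy itself illustrates the layering nicely: its linear algebra routines wrap LAPACK, so code a couple of levels up never sees the systems-level work. For example (assuming NumPy is installed):

```python
import numpy as np

# Solve the linear system Ax = b. Under the hood numpy.linalg.solve
# dispatches to a LAPACK routine; none of that machinery is visible here.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)  # [2. 3.]
```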

Designers of systems libraries often have deep software engineering experience, as well as technical depth in the area for which the library is intended. It's easy to fall into the trap of believing that systems programming is therefore inherently more challenging or superior. It's not that simple. An amusing entry in the Tao of Programming that contrasts the challenges of applications and systems programming makes this point well! Systems programmers often make awful applications programmers, and vice versa. Both camps have the potential to do both types of programming; it is simply that they often have not developed the muscles required for the other discipline, similar to sprinters and long-distance runners. As a data scientist, you may be asked to sprint, run a marathon, or something in between, so get used to it!

Domain Expertise and the Data Scientist

When it comes to analytics, domain expertise matters. If you are going to be an effective data scientist doing, say, marketing analytics, you’re going to need to know something about marketing. Supply chain optimization? Supply chain. Skype? Audio and networking.

The point is so obvious that it is surprising that it is so often overlooked. Perhaps this is because in the decades before the terms “data science” and “analytics” entered common usage, the programmers, statisticians, and operations researchers who filled data science roles were simply known as “analysts” or “quants”. They were associated with their industry rather than their job function. Now that the broad “data scientist” label has entered general usage, it is difficult to speak to the domain-specific skills and knowledge required by all data scientists, because they are so industry-specific. I can give marketing analytics data scientists all kinds of advice, but what good would that do most of you?

Many new data scientists have math, stats, hard science, or analytics degrees and do not have deep training in the industries where they are hired. This was common in the 90s when investment firms hired physics grads to become quants. At Nielsen, all of my college graduate hires were trained in something other than media and advertising. The challenge for these newbies is to learn the domain skills they need – on the job! A few words of advice:

Do your homework, but not too much. You may be provided with some intro reading, for example PowerPoint training decks, books, or research papers. It’s obviously a good idea to read these materials, but don’t get your hopes up. I find that these materials often suffer from two flaws: 1) they are organization- rather than industry-specific (for example, describing how a marketing mix model is executed at Nielsen, rather than how marketing mix models work generally), and 2) they are too deep (for example, an academic paper describing a particular type of ARIMA analysis). In the beginning you will want to get the lay of the land, so seek out “for dummies” materials such as undergraduate texts or even general purpose books for laypeople.

Seek out experts in other job functions. Unless you are an external consultant, you will usually have coworkers whose job it is to have tons of domain expertise. For example, at Nielsen, Analytics Development team members worked in the same office as analysts and consultants, whose job it was to carry out projects for clients (rather than building the underlying models and systems). In another organization, it may be a software developer who is building a user interface. Or it may be the client themselves. These colleagues will understand the underlying business problems to be addressed, and hopefully be able to describe them in plain language. They may also be well acquainted with the practical difficulties of delivering projects in your line of work. Finally, they are likely to have lots of their own resources for learning.

Teach someone else. The best way to learn something is to have to explain it to someone else, so a great technique is to prepare a presentation or whitepaper on a process or model that is underdocumented, or to write an "executive summary" of something complicated. Even better is to write a "getting started" guide for someone in your role. Even if it is never used, it is a good way to crystallize the domain-specific information you need to learn to do your job.

Programming for Data Scientists – Guidelines

As a data scientist, you are going to have to do a lot of coding, even if you or your supervisor do not think that you will. The nature of your coding will depend greatly on your role, which will change over time. For the foreseeable future, an important ingredient for success in your data science career will be writing good code. The problem is that many otherwise prepared data scientists have not been formally trained in software engineering principles, or even worse, have been trained to write crappy code. If this sounds like your situation, this post is for you!

You are not going to have the time, and may not have the inclination, to go through a full crash course in computer science. You don’t have to. Your goal should be to become proficient and efficient at building software that your peers can understand and your clients can effectively use. The strategy for achieving this goal is purposeful practice by writing simple programs.

For most budding data scientists, it is a good idea to try to develop an understanding of the basic principles of software engineering, so that you will be prepared to succeed no matter what is thrown at you. More important than the principles themselves is putting them into practice as soon as you can. You need to write lots of code in order to learn to write good code, and a great way to start is to write programs that are meaningful to you. These programs may relate to a work assignment, to an area of analytics that you know and love, or to a "just for fun" project. Many of my blog posts are the result of my trying to develop skills in a language that I am learning or am not very good at. A mistake many beginners make is to be too ambitious in their "warm up" programs. Your warm up programs should not be intellectually challenging, at least in terms of what they are actually trying to do. They should focus your attention on how to effectively write solutions in your programming environment. Doing a professional-grade job on a simple problem is a stepping-stone to greater works. Once you nail a simple problem, choose another one that exercises different muscles, rather than a more complicated version of the problem you just solved. If your warm up was a data access task, try charting some data, or writing a portion of an algorithm, or connecting to an external library that does something cool.

There are many places where you can find summaries of software engineering basics: books, online courses, and blogs such as Joel Spolsky’s (here’s one example). Here I will try to summarize a few guidelines that I think are particularly important for data scientists. Many data scientists already have formal training in mathematics, statistics, economics, or a hard science. For this group, the basic concepts of computing (memory, references, functions, control flow, and so on) are easily understood. The challenge is engineering: building programs that will stand the test of time.

Be clear. A well-organized mathematical paper states its assumptions, uses good notation, has a logical flow of connected steps, organizes its components into theorems, lemmas, corollaries, and so on, succinctly reports its conclusions, and produces a valid result! It’s amazing how often otherwise talented mathematicians and statisticians forget this once they start writing code. Gauss erased the traces of the inspiration and wandering that led him to his proofs, preferring instead to present a seamless, elegant whole with no unnecessary pieces. We needn’t always go that far, whether in math or in programming, but it’s good to keep this principle in mind: don’t be satisfied with stream-of-consciousness code that runs without an error. A number of excellent suggestions for writing clear code are given here [pdf], in particular to use sensible names and to structure your code logically into small pieces. 
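
As a small, invented illustration of those two suggestions:

```python
# Stream-of-consciousness version: it runs, but the reader must decode it.
def f(xs, t):
    return [x for x in xs if abs(x - sum(xs) / len(xs)) > t]

# Clearer version: sensible names, one small piece per idea.
def mean(values):
    return sum(values) / len(values)

def outliers(values, threshold):
    """Return the values farther than `threshold` from the sample mean."""
    center = mean(values)
    return [v for v in values if abs(v - center) > threshold]

print(outliers([1.0, 2.0, 2.5, 40.0], threshold=20.0))  # [40.0]
```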

Keep it simple. Don’t write code you think you will need, write the code you actually need. Fancy solutions are usually wrong, or at least they can be broken up into simpler pieces. If you have made it too simple, you will figure it out soon enough, whereas if you have made things too complicated you will be so busy trying to fix your code that you may never realize it.

Pretend that you are your own customer. If you are writing a library for others to use, you can start by writing example programs that use the library. Of course at the beginning these examples won’t work – the point is to force yourself to understand how your solution will be used so that you can make it as easy and fun to use as possible. It also forces you to think about what should happen in the case of user or system errors. You may also discover additional tasks that your solution should carry out in order to make life simple. These examples may pertain not only to the entire solution, but to small portions of your solution. By writing tests and examples early – that is, by practicing unit testing and test driven development – you can ensure high quality from the start.
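
Here is a minimal sketch of that example-first habit using Python's built-in unittest module. The library and its fantasy_points function are hypothetical; the stand-in implementation exists only so the example runs.

```python
import unittest

# Written before the library exists: these examples pin down how the
# (hypothetical) scoring function should behave and what its inputs mean.
# from season_scoring import fantasy_points

def fantasy_points(touchdowns, turnovers, passing_yards):
    """Stand-in so the example runs; the real library would replace this."""
    return 6 * touchdowns - 2 * turnovers + passing_yards / 25.0

class TestFantasyPoints(unittest.TestCase):
    def test_typical_game(self):
        # 2 TDs, 1 turnover, 250 passing yards -> 12 - 2 + 10 = 20
        self.assertAlmostEqual(fantasy_points(2, 1, 250), 20.0)

    def test_blank_stat_line_scores_zero(self):
        self.assertAlmostEqual(fantasy_points(0, 0, 0), 0.0)

if __name__ == "__main__":
    unittest.main()
```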

Learn how to debug. Most modern languages are associated with development environments that have sophisticated debuggers. They allow you to step through your code bit by bit, inspecting the data and control flow along the way. Learn how to set breakpoints, inspect variables, step in and out of functions, and all of the keyboard shortcuts associated with each. Computational errors in particular can be very hard to catch by simply reviewing code, so you’ll want to become adept at using the debugger efficiently.
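
If you work in Python outside a full IDE, the built-in debugger covers the same ground; a quick sketch of the workflow:

```python
# Drop into the debugger just before the suspicious computation.
def normalize(values):
    total = sum(values)
    breakpoint()  # Python 3.7+; on older versions: import pdb; pdb.set_trace()
    return [v / total for v in values]

# At the (Pdb) prompt you can inspect state and control execution:
#   p total   # print a variable
#   n         # run the next line
#   s         # step into a function call
#   c         # continue until the next breakpoint
```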

Write high performance code. I wrote about this subject in a previous post. The key is to measure. Rico Mariani does an awesome job of describing why measurement is so important in this post. Rookie data scientists frequently spend too much time tuning their code without measuring.
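
A minimal example of measuring before tuning, using the standard library's timeit module (the two candidate implementations are just illustrations):

```python
import timeit

setup = "import random; xs = [random.random() for _ in range(100000)]"

# Candidate 1: accumulate in an explicit Python loop.
loop_version = """
total = 0.0
for x in xs:
    total += x * x
"""

# Candidate 2: a generator expression passed to the built-in sum().
builtin_version = "total = sum(x * x for x in xs)"

# Take the minimum of several runs to reduce noise from the machine.
print("loop:   ", min(timeit.repeat(loop_version, setup, number=100)))
print("builtin:", min(timeit.repeat(builtin_version, setup, number=100)))
```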

Add logging and tracing to your code. Your code may run as part of a web application, a complicated production system, or may be self-contained, but it's always a good idea to add logging and tracing statements. Logging is intended for others to follow the execution of your code, whereas tracing is for your own benefit. Tracing is important because analytics code often has complicated control flow and is more computationally intensive than typical code. When users run into problems, such as incorrect results, sluggish performance, or "hangs", often the only thing you have to go on are the trace logs. Most developers add too little tracing to their code, and many add none at all. Just like the rest of your code, your trace statements should be clear, easy to follow, and tuned for the application. For example, if you are writing an optimization algorithm then you may wish to trace the current iterate, the error, the iteration number, and so on. Sometimes the amount of tracing information can become overwhelming, so add switches that help you control how much is actually traced.
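
Here is a minimal sketch using Python's standard logging module, with the level acting as the switch that controls how much is traced. The solver loop is a stand-in for real analytics code.

```python
import logging

# INFO is the log (for others); DEBUG is the trace (for you). Flipping the
# level to logging.DEBUG is the switch that turns detailed tracing on.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("optimizer")

def solve(max_iterations=100, tolerance=1e-6):
    log.info("starting solve: max_iterations=%d tolerance=%g",
             max_iterations, tolerance)
    error = 1.0
    for iteration in range(max_iterations):
        error *= 0.5  # stand-in for a real update step
        log.debug("iteration=%d error=%g", iteration, error)  # tracing
        if error < tolerance:
            log.info("converged after %d iterations", iteration + 1)
            return error
    log.warning("did not converge; final error=%g", error)
    return error

solve()
```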

In future posts in this series, I will talk about developing as part of a team, choice of language and toolset, and the different types of programming tasks a data scientist is likely to encounter over the course of their career.

Fantasy Football Ratings 2014

I have prepared fantasy football ratings for the 2014 NFL season based on the data from last year’s season. I hope you will find them useful! You can download the ratings here.

These ratings are reasonable but flawed. The strengths of the ratings are:

  • They are based on player performance from the 2013 season, using a somewhat standard fantasy scoring system (6 points for touchdowns, -2 for turnovers, 1 point per 25 passing yards, 1 point per 10 rushing or receiving yards, and reasonable scoring for kickers).
  • The ratings are comparable across positions because the rating represents the expected number of fantasy points that a player will score above a "replacement level" player at that position. I call this "Fantasy Points Over Replacement": FPOR.
  • Touchdowns are a key contributor to fantasy performance, but they are fickle: they often vary dramatically between players of the same overall skill level, and even between seasons for the same player. In a previous post I showed that passing and rushing touchdowns are lognormally distributed against the yards per attempt. I have accounted for this phenomenon in the rankings. Loosely speaking, this means that a player who scored an unexpectedly high number of touchdowns in 2013 will be projected to score fewer in 2014.
  • The ratings do a rough correction for minor injuries. Players that play in 10 or more games in a season are rated according to the number of points they score per game. Therefore a player who missed, say, two games in 2013 due to injury is not disadvantaged compared to one that did not.

There are several weaknesses:

  • I have data for several previous seasons but do not use it. This would stabilize the ratings, and we could probably account for maturation / aging of players from season-to-season.
  • Rookies are not rated.
  • We do not account for team changes. This factor is often very important as a backup for one team may end up as a starter for another, dramatically affecting fantasy performance. (I actually have a pretty good heuristic for accounting for this, but I have not implemented it in Python…only SAS and I no longer have access to a SAS license.)
  • Players who missed a large portion of the 2013 season are essentially penalized for 2014, even if they are expected to return fully.
  • I have not rated defense/special teams.

You may want to adjust the rankings accordingly. Here are the top 25 rated players (again, the full ratings are here):

Name | Position | RawPts | AdjPts | FPOR
LeSean McCoy | RB | 278.6 | 199.3125 | 85.96875
Jamaal Charles | RB | 308 | 194 | 80.65625
Josh Gordon | WR | 218.6 | 176.3571429 | 71.57142857
Matt Forte | RB | 261.3 | 177.46875 | 64.125
Calvin Johnson | WR | 219.2 | 157.7142857 | 52.92857143
DeMarco Murray | RB | 205.1 | 155.4642857 | 42.12053571
Reggie Bush | RB | 185.2 | 153.4285714 | 40.08482143
Jimmy Graham | TE | 217.5 | 113.90625 | 35.90625
Antonio Brown | WR | 197.9 | 140.53125 | 35.74553571
Knowshon Moreno | RB | 236.6 | 148.6875 | 35.34375
Adrian Peterson | RB | 203.7 | 147.5357143 | 34.19196429
Marshawn Lynch | RB | 239.3 | 145.59375 | 32.25
Le’Veon Bell | RB | 171.9 | 142.9615385 | 29.61778846
Demaryius Thomas | WR | 227 | 134.0625 | 29.27678571
A.J. Green | WR | 208.6 | 133.6875 | 28.90178571
Eddie Lacy | RB | 207.5 | 141.5 | 28.15625
Andre Johnson | WR | 170.7 | 131.90625 | 27.12053571
Alshon Jeffery | WR | 182.1 | 131.34375 | 26.55803571
Peyton Manning | QB | 519.98 | 172.48125 | 22.18125
Stephen Gostkowski | K | 176 | 176 | 22
Drew Brees | QB | 435.68 | 172.2 | 21.9
Ryan Mathews | RB | 184.4 | 133.5 | 20.15625
DeSean Jackson | WR | 187.2 | 124.875 | 20.08928571
Pierre Garcon | WR | 162.6 | 124.3125 | 19.52678571
Jordy Nelson | WR | 179.4 | 123.1875 | 18.40178571

FPOR is the adjusted, cross-position score described earlier. RawPts is simply 2013 fantasy points. AdjPts are the points once touchdowns have been “corrected” and injuries accounted for.
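
For readers who want to see the mechanics, here is a rough sketch of how an FPOR-style number can be computed from adjusted points. The draft-depth cutoffs that define "replacement level" are assumptions for illustration and are not necessarily the ones behind the ratings above.

```python
# Hypothetical draft-depth cutoffs: roughly how many players at each
# position are taken in a standard league before "replacement level".
DRAFT_DEPTH = {"QB": 12, "RB": 24, "WR": 24, "TE": 12, "K": 12}

def fpor(players):
    """players: list of (name, position, adjusted_points) tuples.
    Returns a dict mapping each name to adjusted points over the
    replacement-level player at the same position."""
    by_position = {}
    for name, position, adjusted in players:
        by_position.setdefault(position, []).append(adjusted)

    replacement = {}
    for position, scores in by_position.items():
        scores = sorted(scores, reverse=True)
        cutoff = DRAFT_DEPTH.get(position, 12)
        # Replacement level: the best player still available past the cutoff.
        replacement[position] = scores[min(cutoff, len(scores) - 1)]

    return {name: adjusted - replacement[position]
            for name, position, adjusted in players}
```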

We will see how the ratings work out! If I have time I will post a retrospective once the season is done.