Thanksgiving Analytics Reading

Here are a few retail analytics links for your reading enjoyment. Happy Thanksgiving, everyone.

The Data Types And Sources That Drive Grocery Retail

This article was originally posted at Retail TouchPoints. Click here to read it.

Analytics, the use of computer models to produce insight for decision making, is increasingly the key to competitive advantage in grocery retail. In this new world, data is the fuel that powers the engines of data science. How can consumer goods suppliers and retailers identify and use data to produce business insight?

It all starts with sales, of course. Ideally, suppliers and retailers would have access to every line item in every single sales transaction, for all store locations, for the entire history of the retailer. Any other view of sales, for example divisional-level sales of baked goods, can be derived from these tiny atoms. Unfortunately, it is impractical or impossible for many firms to collect and maintain data at this level of granularity, so they often have to get by with incomplete or summarized sales data. The good news is that an analysis of even imperfect sales data often reveals insights about products, store locations, and general trends. Start with the fundamentals.
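To make the “tiny atoms” point concrete, here is a minimal sketch in SQL, assuming a hypothetical line-item schema (the table and column names are illustrative, and DATE_TRUNC is PostgreSQL-flavored), that derives divisional weekly sales of baked goods directly from transaction line items:

  -- Hedged sketch with an illustrative schema: roll transaction line items
  -- up to divisional weekly sales of baked goods.
  SELECT st.division,
         DATE_TRUNC('week', t.transaction_time) AS sales_week,
         SUM(li.extended_price)                 AS baked_goods_sales
  FROM transaction_line_items li
  JOIN transactions t  ON t.transaction_id = li.transaction_id
  JOIN stores       st ON st.store_id      = t.store_id
  JOIN products     p  ON p.product_id     = li.product_id
  WHERE p.category = 'Baked Goods'
  GROUP BY st.division, DATE_TRUNC('week', t.transaction_time);

Any coarser view of sales is one query like this away, which is exactly what is lost when only summarized data is retained.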

The next step is to connect sales with other relevant data sets to answer business questions. It’s easy to become overwhelmed by the sheer number of potentially relevant data sources: deliveries, store audits, advertising, price, promotions, social media, weather, and so on. The savvy consumer goods executive always returns to business value, keeping basic questions in mind: who, what, when, where, why, and how.

Often “how” boils down to supply chain considerations. Understanding and optimizing the path to purchase is a goal of every consumer goods player, but this path is often long and complicated, involving suppliers, distributors, category managers, and ultimately consumers. The supply chain suggests several interesting data sets: product information down to the scan or SKU level, distribution center receipts and shipments, store deliveries, stocking, and order information. Decisions regarding such information are at the heart of the collaborative process that drives results for suppliers and retailers. Yet despite the proven benefits, effective collaboration and the value it creates remain elusive for most. Supply chain data, along with appropriate processes and software tools, is key to collaborative activities such as lost sales assessment, out-of-stock measurement, and fulfillment planning.

Supply chain data describes how products move from production to sale, but says little about why sales happen. The standard levers of price, promotion, and advertising are all obvious data sources, and can often be collected through collaborative effort between consumer goods retailers and suppliers. Looking beyond the obvious leads to additional data sources that relate to longer-term forces affecting sales in subtle but important ways. For example, category trends such as the sudden rise of Greek yogurt and the continued growth of organic and natural foods can be tracked through online sources such as Google Trends and various third-party outlets. Brand awareness is also known to be an important sales driver. For example, the long-term effects of television advertising were recently estimated by Nielsen Catalina Solutions to be twice those of the immediate short-term bounce.

The measurement of brand equity is difficult, but not impossible, especially with sentiment information that can be mined from online sources such as Facebook, Twitter, and online forums. Sophisticated machine learning tools can be used to understand how consumer disposition changes over time, and how it affects the bottom line.

While each source provides interesting insights in its own right, the full promise of data is realized for consumer goods suppliers and retailers when these sources are marshaled together. For example, weather data is, strictly speaking, useless to a retailer when viewed in isolation. That’s why at Market6 we’ve joined weather data with store sales to quantify how sales rise in the days before a big snowstorm and decrease as it hits. Retailers see for themselves that two inches of snow yield very different behaviors in Chicago, Charlotte, and Seattle. Further, joining with item level sales yields insights about “stock up” items before big storms, beyond the obvious ones such as bread and milk.
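As a rough illustration of that kind of join (the tables and the two-inch threshold here are hypothetical, not Market6’s actual schema), snowfall observations can be attached to daily item-level sales so that snowy-day and clear-day demand can be compared item by item:

  -- Hedged sketch with an illustrative schema: join daily weather observations
  -- to item-level store sales and compare snowy-day and clear-day demand.
  SELECT s.item_id,
         SUM(CASE WHEN w.snowfall_inches >= 2 THEN s.units_sold ELSE 0 END) AS snowy_day_units,
         SUM(CASE WHEN w.snowfall_inches = 0  THEN s.units_sold ELSE 0 END) AS clear_day_units
  FROM daily_item_sales s
  JOIN daily_store_weather w
    ON w.store_id         = s.store_id
   AND w.observation_date = s.sales_date
  GROUP BY s.item_id
  ORDER BY snowy_day_units DESC;

Items whose snowy-day totals run far ahead of their clear-day totals are candidate “stock up” items, beyond the obvious bread and milk.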

The result of more relevant data sources joined together (increasingly streamed in real-time) is consumer goods innovation. Armed with the insights provided by smart data, retailers and suppliers are more profitable and consumers are more satisfied than ever before.

Optimization Using SQL: Not Crazy

My sanity was recently called into question, not for the first time:

Friend Of The Blog Erwin Kalvelagen asserts that it makes no sense to express optimization models in SQL. (I wrote about this idea here.) Erwin makes a great point: different environments offer different advantages when carrying out different kinds of data manipulations. Optimization can be viewed as a very complex data manipulation, and SQL is often not the best way to express such manipulations.

That said, it’s a leap to say it never makes sense. I’m not asserting that everyone should do it all the time – just that it’s reasonable. If I wanted to write all of my models in SQL, then I would have just written a system to let me do that (which you can more-or-less do right now by reading this). I have no plans to do so, but it’s still a reasonable idea because:

  • lots of people know SQL (*),
  • very few people know about optimization or operations research as a discipline, even within the “analytics community” (**),
  • conceptual knowledge of how SQL queries are constructed can easily be leveraged to define full optimization models, and
  • in certain cases it’s a pain (or not possible (***)) to pop out of one computing environment to express a model in another. (****)

The parts of the machine learning community that think about optimization at all tend to think of it as fancy search, and SQL is, after all, a querying language.

Further, part of our future (or at least mine) involves scenarios where you need to solve a very large number of relatively simple optimization models with similar structure but different data, as part of a larger data transformation pipeline. Such models will likely be solved in computational environments like Spark, Hadoop, or Flink, where SQL syntax will be resident. It’s totally not crazy to imagine expressing such models in this way. Will such cases actually use SQL, as opposed to Python, Scala, R, or whatever (*****)? I don’t know. I do think that somebody will use SQL at the very least as inspiration for enabling concise expression of optimization models.
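As a toy version of that scenario (the candidate_promotions table is hypothetical, not part of any real pipeline), imagine choosing, for each store independently, the promotion price that maximizes predicted margin subject to the price covering unit cost. Thousands of tiny models, one per store, collapse into a single SQL query:

  -- Hedged sketch: one trivial optimization model per store, all expressed and
  -- solved in a single query. Each store picks the feasible candidate promotion
  -- price with the highest predicted margin.
  SELECT store_id, promo_price, predicted_margin
  FROM (
    SELECT store_id,
           promo_price,
           predicted_margin,
           ROW_NUMBER() OVER (PARTITION BY store_id
                              ORDER BY predicted_margin DESC) AS rnk
    FROM candidate_promotions
    WHERE promo_price >= unit_cost          -- feasibility constraint
  ) ranked
  WHERE rnk = 1;                            -- per-store optimum

Harder per-group models would still need a solver behind the syntax, but the expression of such models fits naturally in this kind of pipeline.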

* Here is the proof. Not on this list: AMPL, GAMS, AIMMS. They never will be. Ever.
** Paco Nathan’s Just Enough Math and John Foreman’s Data Smart are refreshing exceptions.
*** A reality of working in a team environment is that you do not always have full control over your computing environment.
**** Which is why, for example, we may choose to express PageRank in GAMS….
***** Again, there is zero – ZERO – chance that AMPL, GAMS, AIMMS, or your favorite optimization modeling language will be used in these scenarios. I don’t mean to offend proponents of what are great systems, just stating a wider industry reality that may not be fully appreciated by those outside industry.

SQL as an Optimization Modeling Language

Several years ago, a former (awesome) colleague of mine at Microsoft, Bart de Smet, and I discussed the expressibility of optimization problems using SQL syntax. Most formulations carry over in a straightforward way; for example, if we want to solve:

minimize 2x + y
subject to
  x^2 + y^2 <= 1,
  x >= 0.

Then we can express the constraints as

  POWER(X, 2) + POWER(Y, 2) <= 1 AND X >= 0

Through suitable rewriting, such a specification could easily be sent to a solver. You get the idea; a range of problem types, and even concepts like warm starting, are easily supported. I suppose even column generation could be supported via triggers.
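For concreteness, here is one way the whole specification might look: a sketch in T-SQL-flavored syntax, with a hypothetical VARIABLES relation standing in for the continuous decision variables X and Y (none of this is from the original discussion). The WHERE clause carries the constraints, the ORDER BY carries the objective, and a rewriter would translate the statement into solver input rather than execute it against a database.

  -- Hedged sketch: the model above written as a SQL query, intended for a
  -- rewriter to translate into solver input, not for a database engine to
  -- execute over the reals.
  SELECT TOP 1 X, Y
  FROM VARIABLES                    -- hypothetical relation of decision variables
  WHERE POWER(X, 2) + POWER(Y, 2) <= 1
    AND X >= 0                      -- constraints
  ORDER BY 2 * X + Y ASC;           -- objective: minimize 2x + y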

Update 10/1/2015: Friends Of The Blog Jeff, Alper, and colleagues thought of this long before I did. See this paper for more.

Installing TopoJSON and GDAL on Windows

Here are instructions for installing topojson and GDAL binaries for Windows 8. These tools are helpful if you want to mess around with map visualizations in d3.js. Parts of these instructions are taken from this helpful stackoverflow thread.

  1. Install GDAL and MapServer binaries from this location. This was the latest stable version as of 8/8/2015.
  2. On my machine, GDAL was installed to <location of unzipped archive>\bin\gdal\apps. The batch file in the root of the install should set paths correctly.
  3. Install nodejs from the official Node.js site.
  4. Install npm using the directions on the npm website.
  5. From the command prompt run npm install -g topojson
  6. On my machine, this installed topojson to C:\Users\Nathan\AppData\Roaming\npm\node_modules\topojson\bin

Interesting shapefile data can be found online, as explained in this excellent tutorial.

Your Model / Their Model

In data science, building a model means creating a representation of reality using code and equations. In a predictive model we specify the factors that influence a dependent quantity (e.g. how genre/lead actors/director impact box office revenue), and the nature of the relationships between them: linear, multiplicative, logarithmic. In an agent-based model we invent agents that follow rules that make sense for them in a larger ecosystem, like little people in a game of SimCity. In an optimization model we create variables that represent decisions that need to be made, and write equations that govern the restrictions on supply, or demand, or capacity, or flow, or volatility, or budget, or adstock, or when it is legal to castle. In each case we have a view of how some little part of the world works, and we are trying to represent this view in a way that is as simple and realistic as possible.

The problem is that we expect our clients to simply accept our models and use them as we intended. Here is what you should buy. Here is what your sales will look like this year. Buy this stock. Click to add to cart. Pawn to B5. They don’t. The reason, of course, is that our clients are often people, and people bring with them their own mental models of how the world works. I say this like it’s a bad thing. Sometimes it is, and other times it is not: clients know that models are often too simplistic, or that our models don’t have data they do (it’s going to snow!), or they simply don’t trust what they don’t understand. As a result, our clients form another model around our own; theirs is the model that is operationalized. In the world before data science, these client models were, in most cases, the only kind that existed: it’s called “going with your gut.” The dream of fully automated decision making based on our models is often just that, a dream. The reality usually sits somewhere in between: their model wrapped around ours.


I should give you an example. I’m not one to brag, but our retail store forecasting at Market6 is as good as it gets. We take in an insane amount of data and make literally billions of forecasts a year in a fully automated fashion, incorporating a couple of decades’ worth of retail knowledge. These forward-looking forecasts of sales are interesting, of course, but so are the metrics derived from them. For example, we’re able to estimate “out of stock” (OOS) conditions for items in stores based on a mix of inventory information, orders, actual sales, and our sales forecasts. Abnormally low actual sales of Diet Coke relative to its forecast over a period of time, for instance, may indicate that it has not been adequately stocked. Rightly or wrongly, people can and do wrap their own models around our OOS model:

  • They may combine the OOS model results with another OOS indicator, for example spot checks by store employees.
  • They may “cheat up” or “cheat down” the number of OOS flagged for particular stores based on tribal knowledge. 
  • They may use OOS as an indicator for when to order.

Our model is wrapped in theirs. In the first two cases, if we as the model builders had access to this additional information, we could simply incorporate it into our models, eliminating the ad hoc ensemble method. In the third case, the client is creating a brand new model using our model as an input: a store ordering model. Here, doing what is nominally our job (predicting when an item is out of stock) may actually be counterproductive to our client’s purposes (figuring out what to order). Ordering a case of an item that sells at a rate of one per month, simply because the item is out of stock, is not going to be helpful. “Out of stock” is not necessarily the same as “should we order.”
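To make this concrete, here is a hedged sketch of the kind of OOS heuristic described above, with an illustrative table name and threshold rather than Market6’s actual logic: flag a possible out-of-stock when trailing actual sales fall far below the trailing forecast.

  -- Hedged sketch: compare trailing 7-day actual sales to the trailing 7-day
  -- forecast and flag a possible out-of-stock when actuals fall below 20% of
  -- the forecast.
  SELECT store_id,
         item_id,
         sales_date,
         SUM(actual_units)   OVER w AS actual_7d,
         SUM(forecast_units) OVER w AS forecast_7d,
         CASE WHEN SUM(actual_units) OVER w
                 < 0.2 * SUM(forecast_units) OVER w
              THEN 1 ELSE 0 END     AS possible_oos
  FROM daily_sales_and_forecasts
  WINDOW w AS (PARTITION BY store_id, item_id
               ORDER BY sales_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW);

A client who “cheats up” or “cheats down” particular stores is, in effect, substituting their own threshold for the 0.2 used here.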

The point is a relatively simple one: we should continually reassess the assumptions made and questions answered by our models to ensure they remain appropriate for our clients. This is another reason why good data scientists learn about the domains in which they work.

Spreadsheets Are Alive, Spreadsheets Are Dead

It’s hard to think of something relied upon so much and loved so little as the spreadsheet, the Comcast of business analytics. Daniel Lemire’s article “You Shouldn’t Use a Spreadsheet for Important Work (I Mean It)” examines difficulties with the data and conclusions drawn by Thomas Piketty’s “Capital in the Twenty-First Century”. Lemire says “spreadsheets are good for quick and dirty work, but they are not designed for serious and reliable work.” Others agree. Cogent criticism is found in online articles like this one. Another article referred to Microsoft Excel as “the most dangerous software on the planet,” and an article in the Harvard Business Review compared spreadsheets to fast food: a cheap indulgence that is bad for your health. Yet french fries have their virtues.

Analytics is about using computer models to learn from data. The recipe is to get some data, turn it into a form suitable for analysis, conduct a computational analysis, and present the results in a way that people can understand. Critics find flaws with spreadsheets at nearly every step of the way. You can’t test their correctness. It is difficult to formally review their assumptions and methodology with colleagues. They encourage “copy and paste” programming and ad hoc fudging. “Friends don’t let friends use Excel for statistics” because its computations are unreliable. Changes are hard to track and impossible to manage. Reliable deployment is a pipe dream. These claims are serious; if true, they call into question the results of any analysis conducted using spreadsheets.

Lemire and others propose instead to build analytics solutions in software, using established engineering practices and systems. While this proposal sounds reasonable, it’s not particularly realistic. I will make the case that both spreadsheets and software have their place. I will further claim that notebook interfaces may supplant both in many cases. In order to do so, I first need to review some history.

Spreadsheets were the killer app that helped launch the personal computer revolution of the 1980s. VisiCalc, the Cro-Magnon spreadsheet, possessed the basic elements that define a spreadsheet: free-form tabular entry of connected data and formulas. VisiCalc’s genius lay in the combination of simplicity with extensibility: building blocks that could be combined in limitless ways for nearly any purpose. During this time, the scientific programming landscape was dominated first by FORTRAN and later by C and C++. Self-contained analytics environments such as SAS and SPSS were frequently used both for one-off engagements and for deployed applications, first on mainframes and later on personal computers.

During the 1990s, Microsoft Excel dominated the spreadsheet market as Windows established a de facto monopoly on personal computer operating systems. The simultaneous rise of Windows and Office was no coincidence. Graphical user interfaces made possible new application experiences such as data-bound charts and pivot tables, which in turn justified the interfaces provided by the operating system. Moreover, the integration of Excel into the broader Microsoft Office suite transformed the spreadsheet from a fancy calculator into an application platform in its own right. The introduction of macros, followed by Visual Basic for Applications and “add-ins” meant that spreadsheets could be extended and customized in previously unforeseen ways. These customizations became the backbone of many mission-critical processes for thousands of companies, from payroll to operations to planning. Frontline Systems and Lindo Systems introduced the first spreadsheet optimizers, bringing predictive analytics capabilities to non-programmers for the first time. These add-on products have since been extended to cover the full range of analytics scenarios.

Excel’s market dominance reduced the incentive to innovate, and consequently few core improvements were seen in the following decade. The first widespread criticism of spreadsheets arose during this time. Long-standing, unaddressed flaws in Excel’s statistical functions damaged its credibility among academics and researchers. The rise of email exposed weaknesses in sharing and collaboration. At the same time, software engineering innovations made it far easier for non-specialist developers to use analytics in their applications. In particular, numerical libraries such as BLAS, LAPACK, and CPLEX provided the bedrock upon which vertical-specific applications involving risk analysis, decision support, and simulation could be built. Spreadsheet improvements such as expanded worksheet sizes and more attractive charts were of marginal interest to the analytics world. The tide was shifting.

More recently, spreadsheet software has been reinvigorated by the web’s emergence as an application platform. Google Sheets, the first widely adopted online spreadsheet, facilitates sharing, permits simultaneous edits by multiple users, and enables access from web-enabled devices. Google’s effect on Excel has been similar to that of Firefox and Chrome on Internet Explorer. In response, Microsoft introduced Excel Online, offering many of the same features as the desktop version but available on the web and mobile devices. Microsoft also introduced Excel 2013, analogous to Internet Explorer 7: a major update after a long gap in innovation in response to competitive pressure. Microsoft’s current push to bring Excel to all modern computing platforms represents the next step. Two other developments of note to the analytics community are Power Pivot, bringing SQL Server’s data processing capacity to spreadsheets, and Power BI, which allows Excel to import, join, clean, and transform data from relational databases, Hadoop, and online feeds such as OData and Facebook. Lastly, Google Docs has introduced Google Apps Script, the web equivalent of Visual Basic for Applications, which may end up powering a new generation of business applications inside of spreadsheets.

Even so, the conventional wisdom in the analytics community remains that spreadsheets are staid, underpowered, and dangerous for production use. Many people familiar with the terms “data scientist” or “predictive analytics” avoid the use of spreadsheets for their core tasks, or at least do not admit it.

The primary alternative is to write programs, whether in a self-contained environment such as SAS, Matlab, or R, or using a general-purpose programming language such as Python, C++, Julia, or Scala. Programming languages have become progressively more expressive, allowing developers to describe their intent in fewer lines of code. Programming languages themselves are often one component in a larger development platform including standard libraries for handling data processing, visualization, and numerics. Integrated Development Environments (IDEs) provide autocompletion (itself a predictive model), powerful source code analysis, and conveniences for efficient editing. Tools for securing, managing, and deploying source code and applications mean that it is easier than ever to make sure the right code is deployed without a significant setup or support burden. Even more important are the continued advances of software engineering as a formal discipline. Test-driven development, agile project management, and formal methods aim to ensure that correct solutions are delivered efficiently. Open source is embraced not only by academics and startups, but by corporate powers as well. Online resources such as Stack Exchange and GitHub have transformed how we share and learn to code, turning analytics into an anonymously collaborative discipline, as I wrote about recently.

Given all of this, haven’t spreadsheets earned their reputation? According to those who think that spreadsheets are a problem, the solution is simple: “write sane software”. The assumption behind this statement would seem to be that the author of a bad spreadsheet will be able to write good software. If not that, then something worse: that authors of bad spreadsheets should not practice analytics at all.

Most people find spreadsheets easier to build and use than a custom software solution. The primary reason is that the sheet itself is a canvas for data and computation, where all the moving parts can be seen by anyone who has access to it. The author of the spreadsheet has the freedom to lay out the formulas, the charts, and the tables in any way they see fit. Software is much more opaque. Let us, for a brief moment, breathe in and savor the reckless, liberating freedom provided by the spreadsheet. Spreadsheet users of modest talent can hack their way to an answer in cases where there is no other hope, because spreadsheets provide immediate feedback. So-called “REPL” based environments such as R and Python are similarly interactive, but require a programmer’s experience to use effectively. Spreadsheets reward persistence and facilitate experimentation.

Objections about spreadsheet source control, code review, and deployment are overstated. Many analytics projects are undertaken by one, two, or three contributors and are intended to be completed in days or weeks. This does not mean they are not “serious”; they are scoped. In any case, software is also susceptible to flaws due to insufficient review, for example the “Heartbleed” flaw in open source code. (We are aware the Heartbleed bug was quickly fixed after it was reported. Piketty can and should do the same with his spreadsheets.) No analytics solution is more easily deployed than a spreadsheet. Nearly every computing device has a spreadsheet program (or a web browser), so “deployment” simply means emailing a document or sharing a link. What’s more, the recipient will actually know what to do with it. The closest alternative, package systems such as those in R and Python, are far too difficult for non-experts to use. Simply put, time to solution is usually superior for spreadsheets. This is the consideration that overrides all others for many real-world projects. Spreadsheet critics are often far removed from such realities.

Writing code does not make the flaws attributed to spreadsheets magically disappear. While the methodologies and technologies to write good analytics software are there for the taking, they are infrequently used together, making it insanely hard to write sane software. Even in 2015, analytics software is often written in an imperative, comparatively low-level language such as C++ to realize the perceived runtime performance benefits of coding “close to the metal”, often at the cost of developer time and solution robustness. Even when a language like R or Python is used, the undocumented chains of library calls are often reminiscent of poorly structured spaghetti spreadsheets. The users of these solutions often have it worse than the authors. Analytics solutions intended to be used by non-experts often present terrible user experiences, neglecting best practices for human-centered design and presentation. Finally, and perhaps most frustratingly, modern tools widely accepted by the software industry lie unused by analytics practitioners. Somewhere, right now, three data scientists are debating which version of Python to use, two statisticians are arguing about vi versus emacs, and a graduate student is using printf to debug. Lemire preaches correctness, but what percentage of production analytics systems have repeatable unit tests? What percentage of published R packages ship with tests? How many open source analytics projects have more than three active developers or reviewers? How many research papers whose findings are based on new code actually share a working version? Is the answer to any of these questions more than 10%? The fault lies not in the sheet or the code, but in ourselves.

Code has its place. In a code-based solution, deployment, user experience, and level of customization are under the full control of the developer. These properties make code wonderful for high-complexity, high-reuse, or high-interactivity scenarios. When the analytics developer is a skilled engineer, the resulting software meets the requirements precisely in a way that a spreadsheet, due to its generality, cannot. However, spreadsheets often still sneak back into the picture. They are the Swiss Army knife that non-developers use to perform the next level of analysis, or produce the killer chart for the presentation; cases where new requirements emerge and even well-designed software is not extensible enough to meet them.

While there is room in the world for both spreadsheet and software solutions, the formalization of analytics as a discipline will inspire a “third way” for analytics development that will increasingly replace both. The recent growth of analytics graduate programs, certifications, maturity models, and buzzwords reminds old-timers of similar developments in the computer industry in the sixties. The following decade witnessed the first systems truly ready for the mass market, and we will see the same for business analytics within the next five years. These changes will come because the advent of the “app” era, increased reliance on visualizations, and the diversification of computing device form factors are changing the scenarios and requirements for business analytics. Spreadsheets will simply be inadequate in such situations, as will programming, at least in their present incarnations.

The seeds for this third way may lie in notebook environments such as those provided by Mathematica and Databricks Cloud, and now adopted by Microsoft Azure ML. (Notebooks contain code and results in a “linear” form that can be read and formatted like documents, splitting the difference between code and spreadsheets.) Regardless of the form of this new analytics development environment, we know what it will enable. It will provide an expressive environment for specifying analytical calculations and models. It will connect and integrate data from any connected device, including the cloud. It will provide relevant, beautiful visualizations at any stage in the process. The solution and its source will be easy to share and publish, and in time the environment will itself use analytics to guess the intent of the developer. It will be usable by those who are familiar with analytics and software but are not themselves data scientists or software engineers. The third way will combine the benefits of both spreadsheets and programming, and it will be usable by anyone with the skills to clearly express their intent to the system.

Until this day arrives, both spreadsheets and software development will remain indispensable for analytics. The data science movement should pause to consider the fact that if spreadsheets vanished today, the number of people building predictive and prescriptive analytics would decrease by an order of magnitude. This community should not be so easily dismissed. However, organizations should consider the often unaccounted costs of locking their wisdom in unstructured, unreviewed, isolated spreadsheets, and consider the benefits of treating their analytics systems with the same care and discipline as their IT systems. Analytics is a human endeavor, and its successful practice requires overcoming human failings: laziness, overgeneralization, and cloudy judgment. Our technology should help us to overcome these failings, and while much has been accomplished, much remains to be done.