Optimization Using SQL: Not Crazy

My sanity was recently called into question, not for the first time:

Friend Of The Blog Erwin Kalvelagen asserts that it makes no sense to expression optimization models in SQL. (I wrote about this idea here.) Erwin makes a great point: different environments offer different advantages when carrying out different kinds of data manipulations. Optimization can be viewed as a very complex data manipulation, and SQL is often not the best way to express such manipulations.

That said, it’s a leap to say it never makes sense. I’m not asserting that everyone should do it all the time – just that it’s reasonable. If I wanted to write all of my models in SQL, then I would have just written a system to let me do that (which you can more-or-less do right now by reading this). I have no plans to do so, but it’s still a reasonable idea because:

  • Lots of people know SQL (*),
  • very few people know about optimization or operations research as a discipline, even within the “analytics community”, (**)
  • conceptual knowledge of how SQL queries are constructed can easily be leveraged to define full optimization models, and
  • in certain cases it’s a pain (or not possible (***)) to pop out of one computing environment to express a model in another.(****)

The parts of the machine learning community that think about optimization think of it as fancy search. SQL is a querying language.

Further, part of our future (or at least mine) are scenarios where you need to solve a very large number of relatively simple optimization models with similar structure but different data, as part of a larger data transformation pipeline. Such models will likely be solved in computational environments like Spark, Hadoop, or Flink where SQL syntax will be resident. It’s totally not crazy to imagine expressing such models in this way. Will such cases actually use SQL, as opposed to Python, Scala, R, or whatever(*****)? I don’t know. I do think that somebody will use SQL at the very least as inspiration in enabling concise expression of optimization models.

* Here is the proof. Not on this list: AMPL, GAMS, AIMMS. They never will be. Ever.
** Paco Nathan’s Just Enough Math and John Foreman’s Data Smart are refreshing exceptions.
*** A reality of working in a team environment is that you do not always have full control over your computing environment.
**** Which is why, for example, we may choose to express PageRank in GAMS….
***** Again, there is zero – ZERO – chance that AMPL, GAMS, AIMMS, or your favorite optimization modeling language will be used in these scenarios. I don’t mean to offend proponents of what are great systems, just stating a wider industry reality that may not be fully appreciated by those outside industry.

SQL as an Optimization Modeling Language

Several years ago, a former (awesome) colleague of mine at Microsoft, Bart de Smet, and I discussed the expressibility of optimization problems using SQL syntax. Most formulations carry over in a straightforward way, for example if we want to solve:

minimize 2 x + y
subject to 
x^2 + y^2 <= 1,
x >= 0.

Then we can express this as

  POWER(X, 2) + POWER(Y, 2) <= 1 AND X >= 0

Through suitable rewriting such a specification could be easily sent to a solver. You get the idea; a range of problem types, and even concepts like warm starting are easily supported. I suppose even column generation could be supported via triggers.

Update 10/1/2015: Friends Of The Blog Jeff, Alper, and colleagues thought of this long before I did. See this paper for more.

Installing TopoJSON and GDAL on Windows

Here are instructions for installing topojson and GDAL binaries for Windows 8. These tools are helpful if you want to mess around with map visualizations in d3.js. Parts of these instructions are taken from this helpful stackoverflow thread.

  1. Install GDAL and MapServer binaries from this location at gisinternals.com. This was the latest stable version as of 8/8/2015.
  2. On my machine, gdal was installed to <location of unzipped archive>\bin\gdal\apps. The batch file in the root of the install should set paths correctly.
  3. Install nodejs from nodejs.org.
  4. Install npm using the directions on npmjs.org.
  5. From the command prompt run npm install –g topojson
  6. On my machine, this installed topojson to C:\Users\Nathan\AppData\Roaming\npm\node_modules\topojson\bin

Interesting shapefile data can be found at naturalearthdata.com, as explained in this excellent tutorial.

Your Model / Their Model

In data science, building a model means creating a representation of reality using code and equations. In a predictive model we specify the factors that influence a dependent quantity (e.g. how genre/lead actors/director impact box office revenue), and the nature of the relationships between them: linear, multiplicative, logarithmic. In an agent-based model we invent agents that follow rules that make sense for them in a larger ecosystem, like little people in a game of SimCity. In an optimization model we create variables that represent decisions that need to be made, and write equations that govern the restrictions on supply, or demand, or capacity, or flow, or volatility, or budget, or adstock, or when it is legal to castle. In each case we have a view of how some little part of the world works, and we are trying to represent this view in a way that is as simple and realistic as possible.

The problem is that we expect our clients to simply accept our models and use them as we intended. Here is what you should buy. Here is what your sales will look like this year. Buy this stock. Click to add to cart. Pawn to B5. They don’t. The problem, of course, is that our clients are often people, and people bring with them their own mental models of how the world works. I say this like it’s a bad thing. Sometimes it is, and other times it is not: we know that models are often too simplistic, or that they don’t have data we do (it’s going to snow!), or we don’t trust what we don’t understand. As a result, our clients form another model around our own; theirs is the model that is operationalized. In the world before data science, these were the only kinds of models that existed in most cases: it’s called “going with your gut” (left). The dream of automated decision making based on our models is often that, a dream (right). The situation is often like the picture in the middle.


I should give you an example. I’m not one to brag, but our retail store forecasting at Market6 is as good as it gets. We take in an insane amount of data and make literally billions of forecasts a year in a fully automated fashion, incorporating a couple of decades worth of retail knowledge. These forward-looking forecasts of sales are interesting, of course, but so are derived metrics based on these forecasts. For example, we’re able to estimate “out of stock” (OOS) conditions for items in stores based on a mix of inventory information, orders, actual sales, and our sales forecasts. For example, abnormally low actual sales of Diet Coke compared to the sales forecast over a period of time may indicate that it has not been adequately stocked. Rightly or wrongly, people can and do wrap their own models around our own OOS model:

  • They may combine the OOS model results with another OOS indicator, for example spot checks by store employees.
  • They may “cheat up” or “cheat down” the number of OOS flagged for particular stores based on tribal knowledge. 
  • They may use OOS as an indicator for when to order.

Our model is wrapped in theirs. In the first two cases if we as the model builders had access to this additional information, we could simply incorporate it into our models, eliminating the ad hoc ensemble method. In the third case, the client is creating a brand new model using our model as an input: a store ordering model. In this case, doing what is nominally our job (predicting when an item is out of stock), may actually be counterproductive to our client’s purposes (figuring out what to order) – for example ordering a case of an item that sells at a rate of one per month, based on the fact that the item is out of stock, is not going to be helpful. “Out of stock” is not necessarily the same as “should we order”.

The point is a relatively simple one: we should continually reassess the assumptions made and questions answered by our models to ensure they remain appropriate for our clients. This is another reason why good data scientists learn about the domains in which they work.

Spreadsheets Are Alive, Spreadsheets are Dead

It’s hard to think of something relied upon so much and loved as little as the spreadsheet, the Comcast of business analytics. Daniel Lemire’s article “You Shouldn’t Use a Spreadsheet for Important Work (I Mean It)” examines difficulties with the data and conclusions drawn by Thomas Piketty’s “Capital in the Twenty-First Century“. Lemire says “spreadsheets are good for quick and dirty work, but they are not designed for serious and reliable work.” Others agree. Cogent criticism is found in online articles like this one. Another article referred to Microsoft Excel as “the most dangerous software on the planet,” and an article in the Harvard Business Review compared spreadsheets to fast food: a cheap indulgence that is bad for your health. Yet french fries have their virtues.

Analytics is about using computer models to learn from data. The recipe is to get some data, turn it into a form suitable for analysis, conduct a computational analysis, and present the results in a way that people can understand. Critics find flaws with spreadsheets nearly every step of the way. You can’t test their correctness. It is difficult to formally review their assumptions and methodology with colleagues. They encourage “copy and paste” programming and ad-hoc fudging. “Friends don’t let friends use Excel for statistics” because their computations are unreliable. Changes are hard to track and impossible to manage. Reliable deployment is a pipe dream. These claims are serious; if true they call into question the results of any analysis conducted using spreadsheets.

Lemire and others propose instead to build analytics solutions in software using established engineering practices and systems. While this claim sounds reasonable, it’s not particularly realistic. I will make the case that both spreadsheets and software have their place. I will further claim that notebook interfaces may supplant both in many cases. In order to do so, I first need to review some history.

Spreadsheets were the killer app that helped launch the personal computer revolution of the 1980s. VisiCalc, the Cro-Magnon spreadsheet, possessed the basic elements that define a spreadsheet: free-form tabular entry of connected data and formulas. VisiCalc’s genius lie in the combination of simplicity with extensibility: building blocks that can be combined in limitless ways for nearly any purpose. During this time the scientific programming landscape was dominated first by FORTRAN and later by C and C++. Self-contained analytics environments such as SAS and SPSS were frequently used both for one-off engagements as well as for deployed applications, first on mainframes and later on personal computers.

During the 1990s, Microsoft Excel dominated the spreadsheet market as Windows established a de facto monopoly on personal computer operating systems. The simultaneous rise of Windows and Office was no coincidence. Graphical user interfaces made possible new application experiences such as data-bound charts and pivot tables, which in turn justified the interfaces provided by the operating system. Moreover, the integration of Excel into the broader Microsoft Office suite transformed the spreadsheet from a fancy calculator into an application platform in its own right. The introduction of macros, followed by Visual Basic for Applications and “add-ins” meant that spreadsheets could be extended and customized in previously unforeseen ways. These customizations became the backbone of many mission-critical processes for thousands of companies, from payroll to operations to planning. Frontline Systems and Lindo Systems introduced the first spreadsheet optimizers, bringing predictive analytics capabilities to non-programmers for the first time. These add-on products have since been extended to cover the full range of analytics scenarios.

Excel’s market dominance reduced the incentive to innovate, and consequently few core improvements were seen in the following decade. The first widespread criticism of spreadsheets arose during this time. Long-standing, unaddressed flaws in Excel’s statistical functions damaged its credibility among academics and researchers. The rise of email exposed weaknesses in sharing and collaboration. At the same time, software engineering innovations made it far easier for non-specialist developers to use analytics in their applications. In particular, numerical libraries such as BLAS, LAPACK, and CPLEX provided the bedrock upon which vertical-specific applications involving risk analysis, decision support, and simulation could be built. Spreadsheet improvements such as expanded worksheet sizes and more attractive charts were of marginal interest to the analytics world. The tide was shifting.

More recently, spreadsheet software has been reinvigorated by the web’s emergence as an application platform. Google Sheets, the first widely adopted online spreadsheet, facilitates sharing, permits simultaneous edits by multiple users, and enables access from web-enabled devices. Google’s effect on Excel has been similar to that of Firefox and Chrome on Internet Explorer. In response, Microsoft introduced Excel Online, offering many of the same features as the desktop version but available on the web and mobile devices. Microsoft also introduced Excel 2013, analogous to Internet Explorer 7: a major update after a long gap in innovation in response to competitive pressure. Microsoft’s current push to bring Excel to all modern computing platforms represents the next step. Two other developments of note to the analytics community are Power Pivot, bringing SQL Server’s data processing capacity to spreadsheets, and Power BI, which allows Excel to import, join, clean, and transform data from relational databases, Hadoop, and online feeds such as OData and Facebook. Lastly, Google Docs has introduced Google Apps Script, the web equivalent of Visual Basic for Applications, which may end up powering a new generation of business applications inside of spreadsheets.

Even so, the conventional wisdom in the analytics community remains that spreadsheets are staid, underpowered, and dangerous for production use. Many people familiar with the terms “data scientist” or “predictive analytics” avoid the use of spreadsheets for their core tasks, or at least do not admit it.

The primary alternative is to write programs, whether in a self-contained environment such as SAS, Matlab, or R, or using a general purpose programming language such as Python, C++, Julia, or Scala. Programming languages have become progressively more expressive, allowing developers to describe their intent in fewer lines of code. Programming languages themselves are often one component in a larger development platform including standard libraries for handling data processing, visualization, and numerics. Interactive Development Environments (IDEs) provide autocompletion (itself a predictive model), powerful source code analysis, and conveniences for efficient editing. Tools for securing, managing, and deploying source code and applications mean that it is easier than ever to make sure the right code is deployed without a significant setup or support burden. Even more important are the continued advances of software engineering as a formal discipline. Test driven development, agile project management, and formal methods aim to ensure that correct solutions are delivered efficiently. Open source is embraced not only by academics and startups, but corporate powers as well. Online resources such as Stack Exchange and GitHub have transformed how we share and learn to code, turning analytics into an anonymously collaborative discipline, as I wrote about recently.

Given all of this, haven’t spreadsheets earned their reputation? According to those who think that spreadsheets are a problem, the solution is simple: “write sane software”. The assumption behind this statement would seem to be that the author of a bad spreadsheet will be able to write good software. If not that, then something worse: that authors of bad spreadsheets should not practice analytics at all.

Most people find spreadsheets easier to build and use than a custom software solution. The primary reason is that the sheet itself is a canvas for data and computation, where all the moving parts can be seen by anyone who has access to it. The author of the spreadsheet has the freedom to layout the formulas, the charts, and the tables in any way they see fit. Software is much more opaque. Let us, for a brief moment, breathe in and savor the reckless, liberating freedom provided by the spreadsheet. Spreadsheet users of modest talent can hack their way to an answer in cases where there is no other hope because spreadsheets provide immediate feedback. So-called “REPL” based environments such as R and Python are similarly interactive, but require a programmer’s experience to effectively use. Spreadsheets reward persistence and facilitate experimentation.

Objections about spreadsheet source control, code review, and deployment are overstated. Many analytics projects are undertaken by one, two, or three contributors and are intended to be completed in days or weeks. This does not mean they are not “serious”; they are scoped. In any case, software is also susceptible to flaws due to insufficient review, for example the “Heartbleed” flaw in open source code. (We are aware the Heartbleed bug was quickly fixed after it was reported. Piketty can and should do the same with his spreadsheets.) No analytics solution is more easily deployed than a spreadsheet. Nearly every computing device has a spreadsheet program (or a web browser), so “deployment” simply means emailing a document or sharing a link. What’s more, the recipient will actually know what to do with it. The closest alternative, package systems such as those in R and Python, are far too difficult for non-experts to use. Simply put, time to solution is usually superior for spreadsheets. This is the consideration that overrides all others for many real-world projects. Spreadsheet critics are often far removed from such realities.

Writing code does not make the flaws attributed to spreadsheets magically disappear. While the methodologies and technologies to write good analytics software are there for the taking, they are infrequently used together, making it insanely hard to write sane software. Even in 2015, analytics software is often written in an imperative, comparatively low-level language such as C++ to realize the perceived runtime performance benefits of coding “close to the metal”, often at the cost of developer time and solution robustness. Even when a language like R or Python is used, the undocumented chains of library calls are often reminiscent of poorly structured spaghetti spreadsheets. The users of these solutions often have it worse than the authors. Analytics solutions intended to be used by non-experts often present terrible user experiences, neglecting best practices for human-centered design and presentation. Finally, and perhaps most frustratingly, modern tools widely accepted by the software industry lie unused by analytics practitioners. Somewhere, right now, three data scientists are debating which version of Python to use, two statisticians are arguing about vi versus emacs, and a graduate student is using printf to debug. Lemire preaches correctness, but what percentage of production analytics systems have repeatable unit tests? What percentage of published R packages ship with tests? How many open source analytics projects have more than three active developers or reviewers? How many research papers whose findings are based on new code actually share a working version? Is the answer to any of these questions more than 10%? The fault lies not in the sheet or the code, but in ourselves.

Code has it place. In a code-based solution, deployment, user experience, and level of customization is under full control of the developer. These properties make code wonderful for high complexity, high re-use, or high interactivity scenarios. When the analytics developer is a skilled engineer, the resulting software meets the requirements precisely in a way that a spreadsheet cannot due to its generality. However, spreadsheets often still sneak back into the picture. They are the Swiss Army knife that non-developers use to perform the next level of analysis, or produce the killer chart for the presentation; cases where new requirements emerge and even well-designed software is not extensible enough to meet them.

While there is room in the world for both spreadsheet and software solutions, the formalization of analytics as a discipline will inspire a “third way” for analytics development that will increasingly replace both. The recent growth of analytics graduate programs, certifications, maturity models, and buzzwords remind old-timers of similar developments in the computer industry in the sixties. The following decade witnessed the first systems truly ready for the mass market, and we will see the same for business analytics within the next five years. These changes will come because the advent of the “app” era, increased reliance on visualizations, and the diversification of computing device form factors are changing the scenarios and requirements for business analytics. Spreadsheets will be simply inadequate in such situations, as will programming, at least in their present incarnations. The seeds for this third way may lie in notebook environments such as those provided by Mathematica and Databricks Cloud, and now adopted by Microsoft Azure ML. (Notebooks contain code and results in a “linear” form that can be read and formatted like documents, splitting the difference between code and spreadsheets. Regardless of the form of this new analytics development environment, we know what it will enable. It will provide an expressive environment for specifying analytical calculations and models. It will connect and integrate data from any connected device, including the cloud. It will provide relevant, beautiful visualizations at any stage in the process. The solution and its source will be able to be easily shared and published, and in time the environment will itself use analytics to guess the intent of the developer. It will be able to be successfully used by those who are familiar with analytics and software, but are not themselves data scientists or software engineers. The third way will combine the benefits of both spreadsheets and programming but will be able to be used by anyone with the skills to clearly express their intent to the system.

Until this day arrives, both spreadsheets and software development will remain indispensable for analytics. The data science movement should pause to consider the fact that if spreadsheets vanished today, the number of people building predictive and prescriptive analytics would decrease by an order of magnitude. This community should not be so easily dismissed. However, organizations should consider the often unaccounted costs of locking their wisdom in unstructured, unreviewed, isolated spreadsheets, and consider the benefits of treating their analytics systems with the same care and discipline as their IT systems. Analytics is a human endeavor, and its successful practice requires overcoming human failings: laziness, overgeneralization, and cloudy judgment. Our technology should help us to overcome these failings, and while much has been accomplished, much remains to be done.

Free Advice on Building Your Data Science Career

This article is for data scientists who have already been in the field for at least a year, and are thinking about their long term aspirations. I want to share with you some possible directions for continued progression. A word for the newbies: it’s a good idea to think about long term professional development from the very start of your career in data science. It’s also important not to get too hung up on specific paths early on. Your focus should be clear: high productivity and performance as an individual data scientist. Learning the basics of the craft is a subject for another post. (There’s a lot of advice out there for new data scientists: some good and some bad.)

Once you’re comfortable with the day-to-day work of data science, you’ll start to think about what’s next. There are many ways to progress as a data scientist, but a simple fact unifies them all: there comes a time in everyone’s career where you can no longer add additional value (or get promoted) by working longer hours, becoming a better coder, or reading more research papers. Eventually the way to contribute more is by working with and through a team. This doesn’t mean that you need to become a flashy extrovert or give up data science, but it does mean that you need to become comfortable communicating with others and sharing knowledge.

A managerial path is one option. This usually starts by becoming responsible for another data scientist’s day-to-day priorities, either in a mentoring or formal management role. Supervising an intern is often a good first step. The scope of responsibility may encompass project management responsibilities, which is about managing work and priorities, or it may also include career development discussions and performance reviews. Either way, in order to be successful in a managerial role you will need to have a strong understanding of the business domain in which you operate so that you can help your team make appropriate tradeoffs and set strategic direction. You’ll need to understand the role of data science, and the data science team, within your organization. In other words, you have to be able to ask and answer “why” questions. The role of data science varies widely within organizations, serving in a consultant support role, as a department within software engineering, or as technical business analysts. You’ll need to learn to effectively delegate, check in on progress, provide air cover for your team, and serve as an advocate. Management takes time and effort to do right. It’s a real change that should not be taken lightly. Once you become a manager, when people talk about “management” or “leadership”, they’re talking about you.

It’s a common mistake to think that career advancement means being somebody’s boss. In a good organization there is plenty of room for all kinds of leaders. Another path is to establish oneself as a domain expert: perhaps the expert on machine learning, or optimization, or the application of data science to a particular major project or initiative. In other words, the “domain” may refer to either a technical or business component. In other words, you become a “go-to” person to answer difficult questions, bring people together to address previously unidentified roadblocks, establish best practices or “ways of doing things”. You lead through sharing and spreading knowledge. A domain expert by definition has deep knowledge in important discipline, but often is also someone with very broad knowledge: they can make connections and provide pointers. A domain expert is not simply a super productive individual contributor.

Another path is consultative in nature. This does not necessarily mean leaving your organization to become the next John D. Cook (but it could)! In this role you will join new or in-progress initiatives to advance them to a particular next step (or to completion). In this role you will draw on your past experience in theoretical and applied knowledge to new situations. You’ll build a personal toolbox of code snippets, models, techniques, pointers, and confidantes. You’ll get used to tight deadlines, changing scope, dealing with jerks and cool people, and making new connections. You’ll routinely learn new techniques, programming languages, and project management idioms.

Still another path is to change disciplines. Some data scientists can become excellent product or project managers by virtue of their talent for breaking down problems into understandable pieces. Others may shift over into a pure software engineering role. Still others may embrace a direct client-facing role in technical sales or consulting.

Everyone’s journey is different, and there’s no reason why you need to follow a single path. Think about what considerations drive satisfaction in your work, and also think about your context. Sometimes opportunities do not present themselves when you are looking for them. Find a mentor, manager, or trusted friend with whom you can discuss possibilities and options.

A Missing Data Challenge

Be careful with missing values. I have heard advice recently that it’s often okay to just ignore missing values. Sure, sometimes…but be careful! We were recently given some data that looked like this – let’s say that it represents the number of shoppers visiting eight different retail stores over the course of a week. (I have anonymized the data.)

  5/4/2015 5/5/2015 5/6/2015 5/7/2015 5/8/2015 5/9/2015 5/10/2015
1   1150 1065 1155 1091   1104
3 1167 1328 1189 1151 828 800 1110
4 2130 1853 1064        
5 2041 2014 1461   1578 1346  
6 3016 2699 2043 2757 2414 2268  
7   1282 893 1197 1243    
8 2752 2001   2071     1511
Average 2221.2 1761.0 1285.8 1666.2 1430.8 1471.3 1241.7

Let’s say our team had developed a forecasting method and were asked to compare our results against this data. What sorts of problems could you encounter if you simply ignored the missing values? Are the averages trustworthy? If you wanted to fill in the missing values above, how would you do that? And what the heck do you do about store 2?

Data acquisition is not as fun to think about as building cool machine learning or optimization models, but is every bit as important, and the issues are often subtle.