The Future of Spreadsheets in Analytics

It’s hard to think of something relied upon so much and loved as little as the spreadsheet, the Comcast of business analytics. Daniel Lemire’s article “You Shouldn’t Use a Spreadsheet for Important Work (I Mean It)” examines difficulties with the data and conclusions in Thomas Piketty’s “Capital in the Twenty-First Century”. Lemire says “spreadsheets are good for quick and dirty work, but they are not designed for serious and reliable work.” Others agree, and cogent criticism can be found in many other online articles. One referred to Microsoft Excel as “the most dangerous software on the planet,” and an article in the Harvard Business Review compared spreadsheets to fast food: a cheap indulgence that is bad for your health. Yet French fries have their virtues.

Analytics is about using computer models to learn from data. The recipe is to get some data, turn it into a form suitable for analysis, conduct a computational analysis, and present the results in a way that people can understand. Critics find flaws with spreadsheets at nearly every step of the way. You can’t test their correctness. It is difficult to formally review their assumptions and methodology with colleagues. They encourage “copy and paste” programming and ad hoc fudging. “Friends don’t let friends use Excel for statistics” because its computations are unreliable. Changes are hard to track and impossible to manage. Reliable deployment is a pipe dream. These claims are serious; if true, they call into question the results of any analysis conducted using spreadsheets.
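
To make the recipe concrete, here is a minimal sketch of those four steps in Python with pandas. The input file and column names are hypothetical placeholders, not a prescription:

```python
# A minimal sketch of the analytics recipe. "sales.csv" and its columns
# ("date", "revenue") are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Get some data.
df = pd.read_csv("sales.csv")

# 2. Turn it into a form suitable for analysis.
df["date"] = pd.to_datetime(df["date"])
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()

# 3. Conduct a computational analysis.
growth = monthly.pct_change()

# 4. Present the results in a way that people can understand.
growth.plot(kind="bar", title="Month-over-month revenue growth")
plt.tight_layout()
plt.show()
```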

Lemire and others propose instead to build analytics solutions in software, using established engineering practices and systems. While this proposal sounds reasonable, it’s not particularly realistic. I will make the case that both spreadsheets and software have their place. I will further claim that notebook interfaces may supplant both in many cases. To do so, I first need to review some history.

Spreadsheets were the killer app that helped launch the personal computer revolution of the 1980s. VisiCalc, the Cro-Magnon spreadsheet, possessed the basic elements that define a spreadsheet: free-form tabular entry of connected data and formulas. VisiCalc’s genius lay in the combination of simplicity with extensibility: building blocks that can be combined in limitless ways for nearly any purpose. During this time the scientific programming landscape was dominated first by FORTRAN and later by C and C++. Self-contained analytics environments such as SAS and SPSS were frequently used both for one-off engagements and for deployed applications, first on mainframes and later on personal computers.

During the 1990s, Microsoft Excel dominated the spreadsheet market as Windows established a de facto monopoly on personal computer operating systems. The simultaneous rise of Windows and Office was no coincidence. Graphical user interfaces made possible new application experiences such as data-bound charts and pivot tables, which in turn justified the interfaces provided by the operating system. Moreover, the integration of Excel into the broader Microsoft Office suite transformed the spreadsheet from a fancy calculator into an application platform in its own right. The introduction of macros, followed by Visual Basic for Applications and “add-ins,” meant that spreadsheets could be extended and customized in previously unforeseen ways. These customizations became the backbone of many mission-critical processes for thousands of companies, from payroll to operations to planning. Frontline Systems and Lindo Systems introduced the first spreadsheet optimizers, bringing optimization capabilities to non-programmers for the first time. These add-on products have since been extended to cover the full range of analytics scenarios.

Excel’s market dominance reduced the incentive to innovate, and consequently few core improvements appeared in the following decade. The first widespread criticism of spreadsheets arose during this time. Long-standing, unaddressed flaws in Excel’s statistical functions damaged its credibility among academics and researchers. The rise of email exposed weaknesses in sharing and collaboration. At the same time, software engineering innovations made it far easier for non-specialist developers to use analytics in their applications. In particular, numerical and optimization libraries such as BLAS, LAPACK, and CPLEX provided the bedrock upon which vertical-specific applications involving risk analysis, decision support, and simulation could be built. Spreadsheet improvements such as expanded worksheet sizes and more attractive charts were of marginal interest to the analytics world. The tide was shifting.
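
That bedrock persists today: even a single high-level call typically bottoms out in these libraries. A minimal sketch with NumPy, whose least-squares solver delegates to the LAPACK gelsd driver; the data is synthetic and purely illustrative:

```python
# One high-level call that bottoms out in LAPACK: numpy.linalg.lstsq
# delegates to the gelsd driver. Synthetic, illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # predictors
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=100)  # noisy response

coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # close to [1.5, -2.0, 0.5]
```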

More recently, spreadsheet software has been reinvigorated by the web’s emergence as an application platform. Google Sheets, the first widely adopted online spreadsheet, facilitates sharing, permits simultaneous edits by multiple users, and enables access from web-enabled devices. Google’s effect on Excel has been similar to that of Firefox and Chrome on Internet Explorer. In response, Microsoft introduced Excel Online, offering many of the same features as the desktop version but available on the web and mobile devices. Microsoft also introduced Excel 2013, analogous to Internet Explorer 7: a major update after a long gap in innovation in response to competitive pressure. Microsoft’s current push to bring Excel to all modern computing platforms represents the next step. Two other developments of note to the analytics community are Power Pivot, bringing SQL Server’s data processing capacity to spreadsheets, and Power BI, which allows Excel to import, join, clean, and transform data from relational databases, Hadoop, and online feeds such as OData and Facebook. Lastly, Google Docs has introduced Google Apps Script, the web equivalent of Visual Basic for Applications, which may end up powering a new generation of business applications inside of spreadsheets.

Even so, the conventional wisdom in the analytics community remains that spreadsheets are staid, underpowered, and dangerous for production use. Many people who identify with terms like “data scientist” or “predictive analytics” avoid using spreadsheets for their core tasks, or at least do not admit to it.

The primary alternative is to write programs, whether in a self-contained environment such as SAS, Matlab, or R, or in a general-purpose programming language such as Python, C++, Julia, or Scala. Programming languages have become progressively more expressive, allowing developers to describe their intent in fewer lines of code. Languages are often one component of a larger development platform that includes standard libraries for data processing, visualization, and numerics. Integrated Development Environments (IDEs) provide autocompletion (itself a predictive model), powerful source code analysis, and conveniences for efficient editing. Tools for securing, managing, and deploying source code and applications make it easier than ever to ensure the right code is deployed without a significant setup or support burden. Even more important are the continued advances of software engineering as a formal discipline. Test-driven development, agile project management, and formal methods aim to ensure that correct solutions are delivered efficiently. Open source is embraced not only by academics and startups, but by corporate powers as well. Online resources such as Stack Exchange and GitHub have transformed how we share and learn to code, turning analytics into an anonymously collaborative discipline, as I wrote about recently.
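
As a small illustration of that expressiveness, consider a grouped aggregation that would once have taken a page of FORTRAN or C++: in a modern data stack it is a single expression. The data and column names below are hypothetical:

```python
# Expressiveness in a modern analytics stack: intent in one expression.
# The columns ("region", "sales") are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales":  [100, 150, 120, 90],
})

# Average sales per region, largest first.
summary = df.groupby("region")["sales"].mean().sort_values(ascending=False)
print(summary)
```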

Given all of this, haven’t spreadsheets earned their reputation? According to those who think that spreadsheets are a problem, the solution is simple: “write sane software”. The assumption behind this statement would seem to be that the author of a bad spreadsheet will be able to write good software. If not that, then something worse: that authors of bad spreadsheets should not practice analytics at all.

Most people find spreadsheets easier to build and use than a custom software solution. The primary reason is that the sheet itself is a canvas for data and computation, where all the moving parts can be seen by anyone who has access to it. The author of the spreadsheet has the freedom to lay out the formulas, the charts, and the tables in any way they see fit. Software is much more opaque. Let us, for a brief moment, breathe in and savor the reckless, liberating freedom provided by the spreadsheet. Spreadsheet users of modest talent can hack their way to an answer in cases where there is no other hope, because spreadsheets provide immediate feedback. So-called REPL (read-eval-print loop) environments such as those of R and Python are similarly interactive, but require a programmer’s experience to use effectively. Spreadsheets reward persistence and facilitate experimentation.

Objections about spreadsheet source control, code review, and deployment are overstated. Many analytics projects are undertaken by one, two, or three contributors and are intended to be completed in days or weeks. This does not mean they are not “serious”; it means they are scoped. In any case, software is also susceptible to flaws due to insufficient review, for example the “Heartbleed” flaw in the open source OpenSSL library. (We are aware the Heartbleed bug was quickly fixed after it was reported. Piketty can and should do the same with his spreadsheets.) No analytics solution is more easily deployed than a spreadsheet. Nearly every computing device has a spreadsheet program (or a web browser), so “deployment” simply means emailing a document or sharing a link. What’s more, the recipient will actually know what to do with it. The closest alternatives, the package systems of R and Python, are far too difficult for non-experts to use. Simply put, time to solution is usually superior for spreadsheets, and this is the consideration that overrides all others for many real-world projects. Spreadsheet critics are often far removed from such realities.

Writing code does not make the flaws attributed to spreadsheets magically disappear. While the methodologies and technologies to write good analytics software are there for the taking, they are infrequently used together, making it insanely hard to write sane software. Even in 2015, analytics software is often written in an imperative, comparatively low-level language such as C++ to realize the perceived runtime performance benefits of coding “close to the metal”, often at the cost of developer time and solution robustness. Even when a language like R or Python is used, the undocumented chains of library calls are often reminiscent of poorly structured spaghetti spreadsheets. The users of these solutions often have it worse than the authors. Analytics solutions intended to be used by non-experts often present terrible user experiences, neglecting best practices for human-centered design and presentation. Finally, and perhaps most frustratingly, modern tools widely accepted by the software industry lie unused by analytics practitioners. Somewhere, right now, three data scientists are debating which version of Python to use, two statisticians are arguing about vi versus emacs, and a graduate student is using printf to debug. Lemire preaches correctness, but what percentage of production analytics systems have repeatable unit tests? What percentage of published R packages ship with tests? How many open source analytics projects have more than three active developers or reviewers? How many research papers whose findings are based on new code actually share a working version? Is the answer to any of these questions more than 10%? The fault lies not in the sheet or the code, but in ourselves.
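
For the record, such tests are cheap to write. A minimal sketch of a repeatable unit test for an analytics computation, runnable with pytest; weighted_mean is a hypothetical stand-in for a real model component:

```python
# A repeatable unit test for an analytics computation (run with pytest).
# weighted_mean is a hypothetical stand-in for a real model component.
import math

def weighted_mean(values, weights):
    """Return the weighted arithmetic mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def test_weighted_mean_known_answer():
    # Fixed input with a hand-computed expected result: repeatable anywhere.
    assert math.isclose(weighted_mean([1, 2, 3], [1, 1, 2]), 2.25)

def test_weighted_mean_uniform_weights_is_plain_mean():
    assert math.isclose(weighted_mean([4, 8], [1, 1]), 6.0)
```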

Code has its place. In a code-based solution, deployment, user experience, and the level of customization are under the full control of the developer. These properties make code wonderful for high-complexity, high-reuse, or high-interactivity scenarios. When the analytics developer is a skilled engineer, the resulting software meets the requirements precisely, in a way that a spreadsheet, by its very generality, cannot. However, spreadsheets often still sneak back into the picture. They are the Swiss Army knife that non-developers use to perform the next level of analysis, or to produce the killer chart for the presentation: cases where new requirements emerge and even well-designed software is not extensible enough to meet them.

While there is room in the world for both spreadsheet and software solutions, the formalization of analytics as a discipline will inspire a “third way” for analytics development that will increasingly replace both. The recent growth of analytics graduate programs, certifications, maturity models, and buzzwords reminds old-timers of similar developments in the computer industry in the sixties. The following decade witnessed the first systems truly ready for the mass market, and we will see the same for business analytics within the next five years. These changes will come because the advent of the “app” era, increased reliance on visualizations, and the diversification of computing device form factors are changing the scenarios and requirements for business analytics. Spreadsheets will simply be inadequate in such situations, as will programming, at least in their present incarnations. The seeds for this third way may lie in notebook environments such as those provided by Mathematica and Databricks Cloud, and now adopted by Microsoft Azure ML. (Notebooks contain code and results in a “linear” form that can be read and formatted like documents, splitting the difference between code and spreadsheets.)

Regardless of the form of this new analytics development environment, we know what it will enable. It will provide an expressive environment for specifying analytical calculations and models. It will connect and integrate data from any connected device, including the cloud. It will provide relevant, beautiful visualizations at any stage in the process. The solution and its source will be easy to share and publish, and in time the environment will itself use analytics to guess the intent of the developer. It will be usable by those who are familiar with analytics and software, but are not themselves data scientists or software engineers. The third way will combine the benefits of both spreadsheets and programming, open to anyone with the skills to clearly express their intent to the system.
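
To make that “linear” form concrete, here is a rough sketch of how a short notebook reads, with cell boundaries shown as comments; in Jupyter or Databricks each cell’s output renders inline beneath it. The data is hypothetical:

```python
# A notebook's linear form, sketched as cells. In a real notebook each
# cell's output (a table, a chart) renders inline directly below the cell.

# --- Cell 1 (markdown): "Monthly revenue trend" -- prose framing the question.

# --- Cell 2: load and shape the data.
import pandas as pd
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "revenue": [100, 120, 90]})

# --- Cell 3: compute; the result appears immediately below the cell.
df["growth"] = df["revenue"].pct_change()
print(df)
```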

Until this day arrives, both spreadsheets and software development will remain indispensable for analytics. The data science movement should pause to consider that if spreadsheets vanished today, the number of people building predictive and prescriptive analytics would decrease by an order of magnitude. This community should not be so easily dismissed. However, organizations should weigh the often-unaccounted costs of locking their wisdom in unstructured, unreviewed, isolated spreadsheets, and consider the benefits of treating their analytics systems with the same care and discipline as their IT systems. Analytics is a human endeavor, and its successful practice requires overcoming human failings: laziness, overgeneralization, and cloudy judgment. Our technology should help us overcome these failings, and while much has been accomplished, much remains to be done.

Author: natebrix

Follow me on Twitter at @natebrix.
