2016 NCAA Tournament Picks

Every year since 2010 I have used analytics to make my NCAA picks. Here is a link to the picks made by my model [PDF]: the projected Final Four is Villanova, Duke, North Carolina, and Virginia with Villanova defeating North Carolina in the final. (I think my model likes Virginia too much, by the way.)

Here’s how these selections were made. First, the ground rules I set for myself:

  • The picks should not be embarrassingly bad.
  • I shall spend no more than on this activity (and 30 minutes for this post).
  • I will share my code and raw data.

Okay: the model. The model combines two concepts:

  1. A “win probability” model developed by Joel Sokol in 2010 as described on Net Prophet.
  2. An eigenvector centrality model based on this post on BioPhysEngr Blog.

The win probability model accounts for margin of victory and serves as preprocessing for step 2. I added a couple of other features to make the model more accurate (a rough sketch follows the list below):

  • Home-court advantage is considered: 2.5 points, a rough estimate I made a few years ago that is presumably still reasonable.
  • The win probability is scaled by an adjustment factor which has been selected for best results (see below).
  • Recency is considered: more recent victories are weighted more strongly.
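
Roughly speaking, the scoring of a single game looks something like the sketch below. The logistic squashing, the function names, and the 60-day half-life are my own stand-ins for illustration; the 2.5-point home-court adjustment, the scale factor, and the recency weighting follow the list above.

import math

HOME_COURT_POINTS = 2.5  # rough home-court estimate from the list above

def win_probability(margin, winner_at_home, scale=0.1):
    # Adjust the margin of victory for home court, then squash it into (0, 1).
    # The logistic form and the default scale are illustrative assumptions.
    adjusted = margin - HOME_COURT_POINTS if winner_at_home else margin + HOME_COURT_POINTS
    return 1.0 / (1.0 + math.exp(-scale * adjusted))

def recency_weight(days_ago, half_life=60.0):
    # Weight recent games more heavily; exponential decay is an assumption.
    return 0.5 ** (days_ago / half_life)

# Example: a 10-point home win played 30 days before the tournament
print(win_probability(10, winner_at_home=True) * recency_weight(30))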

The eigenvector centrality model requires game-by-game results. I pulled four years of game results for all divisions from masseyratings.com (holla!) and saved them as CSV. You can get all the data here. It sounds complicated, but it’s not (otherwise I wouldn’t do it) – the model requires less than 200 lines of Python, also available here. (The code is poor quality.)
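
If you’re curious what the centrality step boils down to, here is a minimal sketch (not the linked code, just the idea), assuming the weighted results have already been assembled into a nonnegative team-by-team matrix A:

import numpy as np

def eigen_centrality(A, iterations=100, tol=1e-9):
    # Rate teams by the leading eigenvector of A, found by power iteration.
    n = A.shape[0]
    r = np.ones(n) / n
    for _ in range(iterations):
        r_next = A @ r
        r_next /= np.linalg.norm(r_next, 1)
        if np.linalg.norm(r_next - r, 1) < tol:
            return r_next
        r = r_next
    return r

# Toy example with three teams; A[i, j] holds team i's weighted results against team j
A = np.array([[0.0, 0.8, 0.9],
              [0.2, 0.0, 0.6],
              [0.1, 0.4, 0.0]])
print(eigen_centrality(A))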

How do I know these picks aren’t crap? I don’t. The future is uncertain. But I did a little bit of backtesting: I trained the model using different “win probability” and “recency” parameters on the 2013-2015 seasons and selected the combination that correctly predicted the highest percentage of NCAA tournament games during those seasons, which turned out to be approximately 68%. I don’t know if that’s good, but it seems to be better than applying either the eigenvector centrality model or the win probability model separately.
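
The parameter search itself is nothing fancy; in outline it is just a grid search like the sketch below. The grids and the stand-in scoring function are hypothetical; the real evaluation rebuilds the model and replays the 2013-2015 tournaments.

from itertools import product

def pick_best_parameters(evaluate, scales, half_lives):
    # Return the (scale, half-life) pair with the highest backtest score, where
    # evaluate(scale, half_life) would be the fraction of 2013-2015 tournament
    # games the resulting model predicts correctly.
    return max(product(scales, half_lives), key=lambda p: evaluate(*p))

# Stand-in scoring function, purely for illustration.
print(pick_best_parameters(lambda s, h: -abs(s - 0.1) - abs(h - 60) / 1000,
                           [0.05, 0.10, 0.15], [30, 60, 90]))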

In general, picks produced by my models rank in the upper quartile in pools that I enter. I hope that’s the case this year too.

Blogs, Research Papers, and Operations Research

There’s an interesting thread on Twitter this morning about making Operations Research accessible:

I think everyone is right! I have three types of readers for my analytics posts: 

  • Active researchers or experts
  • Technically oriented readers who aren’t experts in an analytics-related discipline, e.g. software engineers
  • Everyone else

These groups roughly correspond to “shipbuilders”, “sailors”, and “passengers” using the analogy in this post. A single blog post may not satisfy all these parties, even if well written! Experts may well prefer a research paper. Developers may well prefer a link to github. General interest readers may prefer a one paragraph overview, an interactive visual, or simply “the answer”. All of these things are good, and I have found that all three groups can sometimes benefit from content intended for only one.

You can supplement a blog post with any or all of these additional materials, or break a post into two: one that explains the problem and the answer, and another that describes the solution methodology. (Here is an example from a few years ago: problem and methodology.) Consider writing blog posts for your research papers or projects before the project is complete. This will give you practice explaining the topic to an audience, and provides the opportunity for early feedback.


Collaborating via Data Fusion and Analytics

This article was originally published in Chain Store Age. Click here to read it.

Baseball’s great accidental philosopher Yogi Berra once said, “If you don’t know where you are going, you might wind up someplace else.” In the retail business, knowing where you’re going means understanding sales, inventory, promotions, pricing, and assortment. How do these considerations change by product? By store? On Black Friday? It’s hard enough for a bodega shopkeeper to keep track of all of this information accurately and efficiently, let alone a regional, national, or global retail business. Retailers and suppliers alike need a complete view of the forces that shape their businesses, so they can harness those they control and manage the ones they do not.

Many firms have adopted an inside-out approach to leveraging data and analytics, often starting with organizing their own information through data warehousing. Big data technologies may be employed for transactional or shopper data. Summarizing and reporting on this data often yields many interesting insights, but don’t stop there! Having a diverse set of relevant data, coming from both inside and outside an organization, often matters more than sheer volume. Start with your partners. Critical business decisions, such as supply chain and promotional considerations, are often made collaboratively. After all, suppliers and retailers need each other. Fusing supplier and retailer information together, for example for budgeting and planning purposes, can be an effective way of discovering and executing high-impact changes through collaborative effort.

This leaves the rest of the world: the economy, the weather, social trends, the competition, and the billions of people who are currently not your customers. An integrated, shared view of the broader retail environment will help you and your business partners make sound strategic decisions. A complete off-the-shelf solution for this kind of 360-degree view does not really exist, because every business is different. Don’t despair, however: you don’t have to start from scratch. Many useful data sources, such as census and macroeconomic information, are available for free and easy for analysts to use. Fused social media data, including engagement and sentiment information, are also available. Finally, a number of data analytics companies offer custom solutions for retail. The partnership between Target, Ideo, and MIT Media Lab is a fascinating recent example of this kind of collaborative analysis.

Having worked from the inside out, you are ready for the final step: turning data into competitive advantage through sound decision making. Focus on key business problems, which may involve coordinated action with partners, and leverage your data by combining analytics with human wisdom. Many retail businesses are not seeking fully automated systems; rather, they want computer-assisted processes. A store manager can use suggested orders from an analytics system as a guide and then adjust based on local conditions that the big brain in the sky has no knowledge of: the traffic jam, the high school football game, the store appearance, prom season.

When you and your partners have a shared view of the competitive environment, you can focus on the issues that matter. Analytics on a diverse set of data produces the trends, insights, and forecasts that enable better collaboration, better strategic decisions, and better operations. It doesn’t have to be complicated: start with the business decisions that matter and work from the inside out. A broad set of data, unified by analytics, is a winning combination.

Thanksgiving Analytics Reading

Here are a few retail analytics links for your reading enjoyment. Happy Thanksgiving, everyone.

The Data Types And Sources That Drive Grocery Retail

This article was originally posted at Retail TouchPoints. Click here to read it.

Analytics, the use of computer models to produce insight for decision making, is increasingly the key to competitive advantage in grocery retail. In this new world, data is the fuel that powers the engines of data science. How can consumer goods suppliers and retailers identify and use data to produce business insight?

It all starts with sales, of course. Ideally, suppliers and retailers would have access to every line item in every sales transaction for all store locations for the entire history of the retailer. Any other view of sales, for example divisional-level sales of baked goods, can be derived from these tiny atoms. Unfortunately, it is impractical or impossible for many firms to collect and maintain data at this level of granularity, so they often have to get by with incomplete or summarized sales data. The good news is that an analysis of even imperfect sales data often reveals insights about products, store locations, and general trends. Start with the fundamentals.
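
To make the “tiny atoms” point concrete: any summary view is just an aggregation over line-item rows. A hypothetical sketch, with invented columns and values:

import pandas as pd

# Hypothetical line-item transactions: one row per item per basket
lines = pd.DataFrame({
    "store":    ["S1", "S1", "S2", "S2"],
    "division": ["bakery", "bakery", "dairy", "bakery"],
    "week":     ["2015-11-23"] * 4,
    "sales":    [3.49, 2.99, 1.89, 4.29],
})

# Divisional sales by store and week fall out of a simple aggregation of the atoms
divisional = lines.groupby(["division", "store", "week"], as_index=False)["sales"].sum()
print(divisional)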

The next step is to connect sales with other relevant data sets to answer business questions. It’s easy to become overwhelmed by the sheer number of potentially relevant data sources: deliveries, store audits, advertising, price, promotions, social media, weather, and so on. The savvy consumer goods executive always returns to business value, keeping basic questions in mind: who, what, when, where, why, and how.

Often "how" boils down to supply chain considerations. Understanding and optimizing the path to purchase is a goal of every consumer goods player, but this path is often long and complicated, involving suppliers, distributors, category managers, and ultimately consumers. The supply chain suggests several interesting data sets: product information down to the scan or SKU level, distribution center receipts and shipments, store deliveries, stocking, and order information. Decisions regarding such information are at the heart of the collaborative process that drives results for suppliers and retailers. Yet despite the proven benefits, effective collaboration and the value it creates remain elusive for most. Supply chain data, along with appropriate process and software tools, are key to collaborative tools such as lost sales assessment, out-of-stock measurement, and fulfillment planning.

Supply chain data describes how products move from production to sale, but says little about why sales happen. The standard levers of price, promotion, and advertising are all obvious data sources, and can often be collected through collaborative effort between consumer goods retailers and suppliers. Looking beyond the obvious leads to additional data sources that relate to longer-term forces affecting sales in subtle but important ways. For example, category trends such as the sudden rise of Greek yogurt and the continued growth of organic and natural foods can be tracked through online sources such as Google Trends and various third-party outlets. Brand awareness is also known to be an important sales driver. For example, the long-term effects of television advertising were recently estimated by Nielsen Catalina Solutions to be twice the size of the immediate short-term bounce.

The measurement of brand equity is difficult, but not impossible, especially with sentiment information that can be mined from online sources such as Facebook, Twitter, and online forums. Sophisticated machine learning tools can be used to understand how consumer disposition changes over time, and how it affects the bottom line.

While each source provides interesting insights in its own right, the full promise of data is realized for consumer goods suppliers and retailers when these sources are marshaled together. For example, weather data is, strictly speaking, useless to a retailer when viewed in isolation. That’s why at Market6 we’ve joined weather data with store sales to quantify how sales rise in the days before a big snowstorm and decrease as it hits. Retailers see for themselves that two inches of snow yield very different behaviors in Chicago, Charlotte, and Seattle. Further, joining with item level sales yields insights about “stock up” items before big storms, beyond the obvious ones such as bread and milk.
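
The shape of that kind of join is simple, even if the production pipeline is not. A sketch with invented numbers (this is not the actual Market6 pipeline):

import pandas as pd

# Invented daily figures, keyed by store and date, to show the shape of the join
sales = pd.DataFrame({
    "store": ["CHI01", "CHI01", "SEA01", "SEA01"],
    "date":  ["2015-01-05", "2015-01-06", "2015-01-05", "2015-01-06"],
    "units": [1200, 1850, 900, 880],
})
weather = pd.DataFrame({
    "store":       ["CHI01", "CHI01", "SEA01", "SEA01"],
    "date":        ["2015-01-05", "2015-01-06", "2015-01-05", "2015-01-06"],
    "snow_inches": [0.0, 2.0, 0.0, 2.0],
})

joined = sales.merge(weather, on=["store", "date"])
# Average units sold on snowy vs. clear days, by store
print(joined.groupby(["store", joined["snow_inches"] > 0])["units"].mean())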

The result of more relevant data sources joined together (increasingly streamed in real-time) is consumer goods innovation. Armed with the insights provided by smart data, retailers and suppliers are more profitable and consumers are more satisfied than ever before.

Optimization Using SQL: Not Crazy

My sanity was recently called into question, not for the first time:

Friend Of The Blog Erwin Kalvelagen asserts that it makes no sense to express optimization models in SQL. (I wrote about this idea here.) Erwin makes a great point: different environments offer different advantages when carrying out different kinds of data manipulations. Optimization can be viewed as a very complex data manipulation, and SQL is often not the best way to express such manipulations.

That said, it’s a leap to say it never makes sense. I’m not asserting that everyone should do it all the time – just that it’s reasonable. If I wanted to write all of my models in SQL, then I would have just written a system to let me do that (which you can more-or-less do right now by reading this). I have no plans to do so, but it’s still a reasonable idea because:

  • Lots of people know SQL (*),
  • very few people know about optimization or operations research as a discipline, even within the “analytics community”, (**)
  • conceptual knowledge of how SQL queries are constructed can easily be leveraged to define full optimization models, and
  • in certain cases it’s a pain (or not possible (***)) to pop out of one computing environment to express a model in another.(****)

The parts of the machine learning community that think about optimization think of it as fancy search; SQL is a querying language, so the two mindsets are not as far apart as they might seem.

Further, part of our future (or at least mine) are scenarios where you need to solve a very large number of relatively simple optimization models with similar structure but different data, as part of a larger data transformation pipeline. Such models will likely be solved in computational environments like Spark, Hadoop, or Flink where SQL syntax will be resident. It’s totally not crazy to imagine expressing such models in this way. Will such cases actually use SQL, as opposed to Python, Scala, R, or whatever(*****)? I don’t know. I do think that somebody will use SQL at the very least as inspiration in enabling concise expression of optimization models.
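
As a sketch of the "many small models in a pipeline" idea, with made-up data and a made-up model: group a table, then solve one tiny linear program per group. On Spark, the same per-group function could be applied to grouped data.

import pandas as pd
from scipy.optimize import linprog

# Hypothetical per-store data: a cost coefficient for each item decision
data = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "item":  ["a", "b", "a", "b"],
    "cost":  [2.0, 1.0, 1.5, 3.0],
})

def solve_store(group):
    # A tiny made-up LP per store: minimize cost subject to total quantity >= 10
    c = group["cost"].to_numpy()
    res = linprog(c, A_ub=[[-1.0] * len(c)], b_ub=[-10.0], bounds=[(0, None)] * len(c))
    return pd.Series(res.x, index=group["item"])

print(data.groupby("store").apply(solve_store))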


* Here is the proof. Not on this list: AMPL, GAMS, AIMMS. They never will be. Ever.
** Paco Nathan’s Just Enough Math and John Foreman’s Data Smart are refreshing exceptions.
*** A reality of working in a team environment is that you do not always have full control over your computing environment.
**** Which is why, for example, we may choose to express PageRank in GAMS….
***** Again, there is zero – ZERO – chance that AMPL, GAMS, AIMMS, or your favorite optimization modeling language will be used in these scenarios. I don’t mean to offend proponents of what are great systems, just stating a wider industry reality that may not be fully appreciated by those outside industry.

SQL as an Optimization Modeling Language

Several years ago, a former (awesome) colleague of mine at Microsoft, Bart de Smet, and I discussed the expressibility of optimization problems using SQL syntax. Most formulations carry over in a straightforward way. For example, if we want to solve:

minimize 2 x + y
subject to 
x^2 + y^2 <= 1,
x >= 0.

Then we can express this as

-- A conceptual table of decision variables
CREATE TABLE VARS (
  X FLOAT,
  Y FLOAT
);
-- The WHERE clause is the feasible region; ORDER BY ... ASC is the objective
SELECT TOP 1 X, Y
FROM VARS
WHERE
  POWER(X, 2) + POWER(Y, 2) <= 1 AND X >= 0
ORDER BY 2*X + Y ASC;

Through suitable rewriting, such a specification could easily be sent to a solver. You get the idea: a range of problem types, and even concepts like warm starting, are easily supported. I suppose even column generation could be supported via triggers.
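
One possible rewriting of the query above into a solver call looks like this (the mapping is mine, using scipy purely for illustration): the WHERE clause becomes constraints and bounds, and the ORDER BY ... ASC clause becomes the objective to minimize.

from scipy.optimize import minimize

def objective(v):
    # ORDER BY 2*X + Y ASC  ->  minimize 2x + y
    return 2 * v[0] + v[1]

# POWER(X, 2) + POWER(Y, 2) <= 1  ->  an inequality constraint g(v) >= 0
constraints = [{"type": "ineq", "fun": lambda v: 1 - v[0] ** 2 - v[1] ** 2}]

# X >= 0  ->  a simple bound on x
bounds = [(0, None), (None, None)]

result = minimize(objective, x0=[0.5, 0.0], bounds=bounds, constraints=constraints)
print(result.x)  # approximately (0, -1)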

Update 10/1/2015: Friends Of The Blog Jeff, Alper, and colleagues thought of this long before I did. See this paper for more.