In data science, building a model means creating a representation of reality using code and equations. In a predictive model we specify the factors that influence a dependent quantity (e.g., how genre, lead actors, and director affect box office revenue) and the nature of the relationships between them: linear, multiplicative, logarithmic. In an agent-based model we invent agents that follow rules that make sense for them in a larger ecosystem, like little people in a game of SimCity. In an optimization model we create variables that represent decisions to be made, and write equations that govern the restrictions on supply, or demand, or capacity, or flow, or volatility, or budget, or adstock, or when it is legal to castle. In each case we have a view of how some little part of the world works, and we are trying to represent that view as simply and realistically as possible.
The problem is that we expect our clients to simply accept our models and use them as we intended. Here is what you should buy. Here is what your sales will look like this year. Buy this stock. Click to add to cart. Pawn to B5. They don’t. The reason, of course, is that our clients are often people, and people bring with them their own mental models of how the world works. I say this like it’s a bad thing. Sometimes it is, and sometimes it isn’t: models are often too simplistic, they may lack data we have (it’s going to snow!), and we don’t trust what we don’t understand. As a result, our clients form another model around ours; theirs is the model that is operationalized. In the world before data science, these were in most cases the only kinds of models that existed: it’s called “going with your gut” (left). The dream of automated decision making based on our models is often just that, a dream (right). The situation is often like the picture in the middle.
I should give you an example. I’m not one to brag, but our retail store forecasting at Market6 is as good as it gets. We take in an insane amount of data and make literally billions of forecasts a year in a fully automated fashion, incorporating a couple of decades’ worth of retail knowledge. These forward-looking sales forecasts are interesting in their own right, but so are the metrics derived from them. For example, we’re able to estimate “out of stock” (OOS) conditions for items in stores based on a mix of inventory information, orders, actual sales, and our sales forecasts. For instance, abnormally low actual sales of Diet Coke compared to the sales forecast over a period of time may indicate that it has not been adequately stocked. Rightly or wrongly, people can and do wrap their own models around our OOS model:
- They may combine the OOS model results with another OOS indicator, for example spot checks by store employees.
- They may “cheat up” or “cheat down” the number of OOS flagged for particular stores based on tribal knowledge.
- They may use OOS as an indicator for when to order.
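The OOS estimate itself can be sketched as a toy statistical test: if actual sales over a window are implausibly low given the forecast, suspect an empty shelf. This is a minimal illustration, not the Market6 method; the Poisson assumption, the one-week window, and the `alpha` threshold are all inventions for the example.

```python
import math

def oos_flag(actual_units, forecast_units, alpha=0.01):
    """Flag a possible out-of-stock: is the observed sales total
    implausibly low relative to the forecast over the same window?

    Assumes unit sales are roughly Poisson with mean equal to the
    forecast total -- a simplification chosen for this sketch.
    """
    lam = sum(forecast_units)
    k = sum(actual_units)
    # P(X <= k) for X ~ Poisson(lam): if selling this little is very
    # unlikely under the forecast, suspect an empty shelf.
    p = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))
    return p < alpha

# A week where Diet Coke was forecast to sell 6-9 units/day but barely moved:
print(oos_flag(actual_units=[1, 0, 0, 1, 0, 0, 0],
               forecast_units=[6, 6, 6, 7, 8, 9, 6]))  # True
```

Any reasonable “surprisingly low” test would do here; the point is only that the flag is a statistical inference from forecast versus actuals, not a direct observation of the shelf, which is exactly why clients feel entitled to adjust it.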
Our model is wrapped in theirs. In the first two cases, if we as the model builders had access to this additional information, we could simply incorporate it into our models, eliminating the ad hoc ensemble. In the third case, the client is creating a brand new model that uses our model as an input: a store ordering model. Here, doing what is nominally our job (predicting when an item is out of stock) may actually be counterproductive to our client’s purpose (figuring out what to order): ordering a case of an item that sells at a rate of one per month, simply because it is out of stock, is not going to be helpful. “Out of stock” is not the same question as “should we order”.
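To make the slow-mover point concrete, here is a hypothetical ordering rule wrapped around the OOS flag. The case size and the velocity threshold are made-up parameters for illustration, not anything from a real ordering system.

```python
def should_order(oos_flagged, units_per_month, case_size=24,
                 min_monthly_velocity=4):
    """Decide whether to order a case, treating the OOS flag as one input.

    A bare OOS flag is not enough: a case of 24 for an item that sells
    one unit per month would sit on the shelf for two years. Both the
    case size and the velocity cutoff are illustrative assumptions.
    """
    if not oos_flagged:
        return False
    # Only reorder when the item moves fast enough to justify a full case.
    return units_per_month >= min_monthly_velocity

print(should_order(oos_flagged=True, units_per_month=1))   # False
print(should_order(oos_flagged=True, units_per_month=30))  # True
```

The OOS flag answers “is the shelf probably empty?”; the wrapper answers “is filling it worth a case?”, which is the question the client actually cares about.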
The point is a relatively simple one: we should continually reassess the assumptions made and questions answered by our models to ensure they remain appropriate for our clients. This is another reason why good data scientists learn about the domains in which they work.