> “Bixby says” there is no better tool than building a model to force you to understand your data.
> — Jeff Linderoth (@JeffLinderoth) March 7, 2014
This has been my experience as well. Here’s an example. At Nielsen we often created predictive models of the sales of our clients’ products, typically consumer packaged goods. The approach, called marketing mix, was to collect weekly sales data for all of the products and regions of interest and match it with data representing all of the factors we believed impacted sales, such as advertising, price and weather. Then we’d run a big regression model to estimate the impact of each causal factor on sales, so that we could confidently make statements such as “8% of your sales come from your TV advertising.” Read more about marketing mix here.

Understanding where sales come from is important, of course, but even more important is using that information to predict future sales. We’d use the results of a marketing mix model to produce “sales response curves”. A sales response curve predicts the expected sales lift given some amount of causal activity. For example, in the response curve below, 50 units of TV advertising produces 150 units of sales lift. The orange points represent historical activity levels and sales lifts, and the blue curve is the projection used to predict future results:
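Fitting the “blue curve” to the “orange points” is easy to sketch. Here is a minimal, hypothetical example, assuming a saturating diminishing-returns curve of the form a·x/(b+x) (the actual functional form we used isn’t shown here, and the points are synthetic): the curve is fit by linearized least squares, and the fit reproduces the 50-units-of-TV, 150-units-of-lift prediction.

```python
import numpy as np

# Hypothetical "orange points": historical activity levels and the sales
# lifts the marketing mix regression attributed to them. Generated here
# from a known saturating curve so the fit is checkable.
def true_lift(x, a=300.0, b=50.0):
    # Diminishing-returns response curve: lift = a*x / (b + x)
    return a * x / (b + x)

activity = np.array([10.0, 20.0, 40.0, 60.0, 80.0, 100.0])
lift = true_lift(activity)

# Fit the "blue curve". This curve family linearizes:
#   1/lift = (b/a) * (1/activity) + 1/a
slope, intercept = np.polyfit(1.0 / activity, 1.0 / lift, 1)
a_hat = 1.0 / intercept
b_hat = slope * a_hat

# Predict the lift for a future activity level, e.g. 50 units of TV.
predicted = a_hat * 50.0 / (b_hat + 50.0)
print(round(predicted, 1))  # 150.0 units of lift at 50 units of activity
```

With noisy real-world points you would use a nonlinear least-squares routine instead of the linearizing trick, but the shape of the procedure is the same: fit once on history, then read predictions off the curve.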
We often found that the best way to “debug” a predictive model of sales, besides looking at basic summary statistics and charts of the output, was to try to use the predictive model to make decisions. That is, we’d build an optimization model that would search for future budgets for all of the causal factors (e.g. TV, radio, Facebook advertising, price, coupons, and so on) that would result in the highest possible profit, given budgetary constraints. When we used the results of a “tested, verified” predictive model inside our optimization model, we’d often find that its recommendations were completely nuts. For example, the optimization model might recommend moving millions of dollars from national TV advertising to direct mail advertising in Toledo. After checking that we hadn’t screwed up the optimization model itself, we’d often find that the cause was “garbage in, garbage out”. For example:
- The sales data for Toledo was incorrect (or only partially correct).
- The units for the direct mail time series were improperly specified (perhaps in 1s rather than 1000s).
- The cost of direct mail activity was incorrect.
- The direct mail data was improperly scaled when fed into the marketing mix regression model.
- The analyst gave improper weights or priors for direct mail or Toledo-related coefficients.
- The procedure that produced response curves from the marketing mix results (the part that fits the blue line to the orange points) was run incorrectly.
- An outlier in direct mail activity in Toledo influenced the response curve fitting procedure.
Any of these data problems could have caused an implausibly high sales response curve for Toledo direct mail activity, which would influence the optimizer to dump tons of money there. Given the volume of data involved in a real-world predictive model, this stuff can be hard to find even with modern tools and careful analysts. Often the best way to learn something is to teach someone else, and the best way to validate data is to use it in a model.
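This failure mode is easy to reproduce in a few lines. In the sketch below (all channel names, numbers, and curve shapes are made up for illustration), a simple grid-search optimizer splits a fixed budget between two channels with saturating response curves. When one channel’s curve is inflated by a factor of 1,000, as a mis-scaled input series might do, the optimizer dumps the entire budget into it:

```python
import numpy as np

def lift(x, a, b):
    # Saturating response curve: expected sales lift for spend x
    return a * x / (b + x)

budget = 100.0
splits = np.linspace(0.0, 1.0, 1001)  # fraction of budget to channel 1

def best_split(a1, b1, a2, b2):
    # Brute-force search for the split that maximizes total lift
    total = lift(splits * budget, a1, b1) + lift((1 - splits) * budget, a2, b2)
    return splits[np.argmax(total)]

# Plausible curves (say, national TV vs. Toledo direct mail):
# the optimizer finds a balanced allocation.
sane = best_split(300.0, 50.0, 200.0, 50.0)

# Garbage in: the direct-mail curve is 1000x too steep, so the
# optimizer moves essentially the whole budget into that channel.
garbage = best_split(300.0, 50.0, 200000.0, 50.0)

print(round(sane, 2), round(garbage, 2))
```

Real marketing mix optimizers are far more sophisticated than a grid search, but the sensitivity is the same: a single implausibly steep response curve dominates the allocation, which is exactly why the optimizer’s output is such an effective data-quality check.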