Why Did AOL Buy Convertro for $100 Million?

Hey, so AOL bought marketing analytics firm Convertro for $100 million. Convertro does something called attribution modeling. What the hell is that?

Attribution modeling uses statistics on individual purchases to determine the credit that each marketing channel should receive for each sale. Giving proper credit where it is due leads to insights about which forms of advertising are more effective, which leads to better results for marketers. This leads to $100 million for Convertro.

Here’s an example. Suppose you are on espn.com’s golf page and you see an ad for Contoso golf clubs. A link on a posting takes you to a blog where a banner ad is shown at the top of the page. The article makes some ridiculous assertions about Steve Stricker that cannot go unchallenged, so you post the link on Facebook. As you scroll through your news feed, a promoted ad for Contoso shows up based on your obvious interest in golf clubs. The next day on your long commute to work, you pass a Contoso billboard, and the sports radio talk show host drops in three (paid) product placements for Contoso. Looking out your window, the verdant fairways of Royal Oaks golf course beckon. At work, you hop onto a golf equipment reviews site and see that Contoso clubs are rated highly. Finally, that night when you get home you see a link in your Twitter feed to a hilarious Contoso video starring Steve Stricker. You click on the link and laugh like crazy. That seals it: you hop onto Contoso’s website and purchase new irons for $350.

All told, you received seven advertising impressions prior to purchase:

  1. ESPN.com ad
  2. Blog banner ad
  3. Paid Facebook news feed ad
  4. Outdoor advertising
  5. Golf reviews site (unpaid)
  6. Twitter feed (unpaid)
  7. Video (viral)

How much credit from the $350 sale goes to each of the impressions? This is a question of attribution. In the old days, a survey could have asked you, “How did you hear about us?” You would have said ESPN.com, and perhaps those conducting the survey would have given ESPN.com all of the credit. This is called first touch attribution. If the operator of Contoso’s website looked at the referrer to their site, they’d see that the video brought you there. If you give the video all the credit, that’s called last touch attribution. Finally, you may decide that all seven impressions played a role in your purchase decision, and that the fair thing to do is give each source credit for one seventh of the sale: fifty bucks apiece. That’s called linear, multi-touch, or equal weight attribution.
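
To make the three simple schemes concrete, here is a minimal Python sketch (my own illustration, not anything Convertro ships) that splits the $350 sale across the seven impressions under first touch, last touch, and linear attribution.

```python
# Toy illustration of first touch, last touch, and linear attribution.
impressions = [
    "ESPN.com ad", "Blog banner ad", "Facebook news feed ad",
    "Outdoor advertising", "Golf reviews site", "Twitter feed", "Video",
]
sale = 350.0

def first_touch(path, amount):
    return {path[0]: amount}            # all credit to the first impression

def last_touch(path, amount):
    return {path[-1]: amount}           # all credit to the last impression

def linear(path, amount):
    return {channel: amount / len(path) for channel in path}  # equal split

print(first_touch(impressions, sale))   # {'ESPN.com ad': 350.0}
print(last_touch(impressions, sale))    # {'Video': 350.0}
print(linear(impressions, sale))        # fifty bucks to each impression
```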

Attribution modeling attempts to provide weights for each impression using statistics. Here is how:

  1. Collect information on every online impression seen by as many people as possible, including the anonymized identity of the viewer, the exact time the impression was delivered, and the device it was delivered to.
  2. Create an analytics model that predicts the probability of purchase based on the entire history of advertising impressions for that individual (the individual’s “user path”).
  3. Turn these purchase probabilities into attributions by comparing cases where a particular type of advertising (say, Facebook) did and did not appear in user paths. For example, if it turned out that purchase probabilities are the same whether or not Facebook appears in a user path, then Facebook should not get any credit. (A minimal sketch of this comparison follows the list.)
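
Step 3 can be sketched in a few lines of Python. The DataFrame layout, column names, and the simple lift comparison below are illustrative assumptions on my part, not Convertro’s actual method, which presumably handles many more channels and controls for many confounds.

```python
# Minimal sketch: compare purchase rates for user paths with and without a
# channel, and turn the lifts into normalized credit shares.
import pandas as pd

def channel_lift(paths: pd.DataFrame, channel: str) -> float:
    """Difference in purchase rate between paths with and without the channel."""
    with_ch = paths.loc[paths[channel], "purchased"].mean()
    without_ch = paths.loc[~paths[channel], "purchased"].mean()
    # If purchase probability is the same either way, the channel gets no credit.
    return max(with_ch - without_ch, 0.0)

# Toy data: one row per user, boolean exposure columns plus a purchase flag
paths = pd.DataFrame({
    "facebook":  [True, False, True],
    "espn":      [True, True, False],
    "purchased": [True, False, True],
})

lifts = {ch: channel_lift(paths, ch) for ch in ["facebook", "espn"]}
total = sum(lifts.values()) or 1.0
weights = {ch: lift / total for ch, lift in lifts.items()}  # credit shares
print(weights)
```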

Attribution models look at data at a very granular level – down to the individual, as opposed to the metropolitan area (as in Google AdWords) or the store (as in a marketing mix model). This means there is the potential for greater accuracy and more targeted conclusions. In a world where marketing messages are becoming more and more individualized, attribution modeling is attractive to advertisers and advertising platforms alike. This potential is what AOL is paying $100 million for.

On the other hand, like any model, attribution models need to adequately represent reality to be of much use. This isn’t easy. If you think about it, there are a few potential pitfalls of trying to apply attribution modeling:

  1. Not all impressions are delivered online. In fact, many are not: TV, radio, billboards, word of mouth, and so on. Attribution modelers are aware of this, of course, but they need to account for it through an “out of band” process that may be prone to error: for example, by running another model for traditional media (such as a marketing mix model), or by entering projected (“fake,” as my previous boss used to say) information about offline media. This amounts to modeling with previously modeled data, which casts doubt on the confidence intervals for the results.
  2. Not all sales are captured online. This is highly dependent on category: music is primarily purchased online, soup is not. Attribution modeling started in online-only categories, but as its popularity grows this issue can no longer be avoided.
  3. Not everyone purchases. You have to reckon not only with cases where a sale occurs, but cases where a sale does not occur. More specifically, cases where a category sale occurs even though the target product was not selected. In other words, Contoso impressions may have led to a Fabrikam sale.
  4. What about synergy? Even if all of the above considerations are accounted for, there is still the question of whether the combined effect of impressions from multiple channels exceeds their individual impact.

Even if all of these traps are avoided, there are still the issues of expense and complexity. Marshaling all of the detailed information required to carry out an attribution modeling analysis is hard. Making sure it is right is even harder. Will AOL be able to make attribution modeling cost effective and trustworthy enough to put it within the reach of all of its advertising customers? We shall see.


Build Models To Understand Your Data

I am not at the 2014 INFORMS Optimization Society Conference, but good friend and human pyramid obsessive Jeff Linderoth reports that Gurobi’s Bob Bixby made the point that building models is a great way to understand your data.

This has been my experience as well. Here’s an example. At Nielsen we often created predictive models of the sales of our clients’ products, typically consumer packaged goods. The approach, called marketing mix modeling, was to collect sales data for all of the products and regions of interest on a weekly basis and match it with data representing all of the factors that we believed impacted sales, such as advertising, price, and weather. Then we’d run a big regression model to understand the impact of each causal factor on sales, so that we could confidently make statements such as “8% of your sales come from your TV advertising.” (I describe marketing mix modeling in more detail in the posts that follow.)

Understanding where sales come from is important, of course, but even more important is using that information to predict future sales. We’d use the results of a marketing mix model to produce “sales response curves”. A sales response curve predicts the expected sales lift for a given amount of causal activity. For example, in the response curve below, 50 units of TV advertising produces 150 units of sales lift. The orange points represent historical activity levels and sales lifts, and the blue curve is the projection used to predict future results:

[Figure: Sales Response Curve]
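
As a rough illustration of how a curve like this might be fit, here is a minimal Python sketch that fits a saturating (diminishing-returns) functional form to a handful of historical activity/lift points. The functional form, the use of scipy’s curve_fit, and the numbers are my assumptions; the actual fitting procedure we used may differ.

```python
# Fit a diminishing-returns sales response curve to historical points
# (the "orange points") and project lift at an arbitrary activity level.
import numpy as np
from scipy.optimize import curve_fit

def response(activity, a, b):
    """Saturating response: lift approaches a as activity grows."""
    return a * (1.0 - np.exp(-b * activity))

# Historical (activity, lift) observations -- illustrative numbers only
activity = np.array([10.0, 20.0, 35.0, 50.0, 80.0])
lift     = np.array([40.0, 75.0, 115.0, 150.0, 190.0])

params, _ = curve_fit(response, activity, lift, p0=[200.0, 0.02])
a, b = params
print(f"Predicted lift at 50 units of activity: {response(50.0, a, b):.0f}")
```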

We often found that the best way to “debug” a predictive model of sales, besides looking at basic summary statistics and charts of the output, was to try to use the predictive model to make decisions. That is, we’d build an optimization model that would try to find future budgets for all of the causal factors (e.g. TV, radio, Facebook advertising, price, coupons, and so on) that result in the highest possible profit, given budgetary constraints.

When we used the results of a “tested, verified” predictive model inside our optimization model, we’d often find that the results of the optimization model were completely nuts. For example, the optimization model might recommend moving millions of dollars from national TV advertising to direct mail advertising in Toledo. After we checked that we didn’t screw up the optimization model, we’d often find that the cause was “garbage in, garbage out”. For example:

  • The sales data for Toledo was incorrect (or partially correct).
  • The units for the direct mail time series were improperly specified (perhaps in 1s rather than 1000s).
  • The cost of direct mail activity was incorrect.
  • The direct mail data was improperly scaled when fed into the marketing mix (MMM) regression model.
  • The analyst gave improper weights or priors for direct mail or Toledo-related coefficients.
  • The procedure that produced response curves from the marketing mix results (the part that fits the blue line to the orange points) was run incorrectly.
  • An outlier in direct mail activity in Toledo influenced the response curve fitting procedure.

Any of these data problems could have caused an implausibly high sales response curve for Toledo direct mail activity, which would influence the optimizer to dump tons of money there. Given the volume of data involved in a real-world predictive model, this stuff can be hard to find even with modern tools and careful analysts. Often the best way to learn something is to teach someone else, and the best way to validate data is to use it in a model.
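
To make that debugging loop concrete, here is a minimal sketch of the kind of budget optimization described above: pick spend per channel to maximize total predicted lift from fitted response curves, subject to a fixed budget. The channel names, curve parameters (including a deliberately inflated Toledo direct mail curve), and the use of scipy’s SLSQP solver are illustrative assumptions, not the models we actually ran.

```python
# Budget optimization against response curves: lift = a * (1 - exp(-b * spend)).
import numpy as np
from scipy.optimize import minimize

curves = {
    "national_tv":        (800.0, 1.5e-6),
    "radio":              (200.0, 3.0e-6),
    "toledo_direct_mail": (900.0, 1.0e-5),  # implausibly steep: garbage in
}
channels = list(curves)
budget = 1_000_000.0

def total_lift(spend):
    return sum(a * (1.0 - np.exp(-b * s))
               for s, (a, b) in zip(spend, curves.values()))

result = minimize(
    lambda s: -total_lift(s),                       # maximize lift
    x0=np.full(len(channels), budget / len(channels)),
    bounds=[(0.0, budget)] * len(channels),
    constraints=[{"type": "eq", "fun": lambda s: s.sum() - budget}],
    method="SLSQP",
)

# The inflated Toledo curve pulls an outsized share of the budget,
# which is exactly the kind of "completely nuts" plan that flags bad data.
for name, spend in zip(channels, result.x):
    print(f"{name:>20}: ${spend:,.0f}")
```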

Marketing Mix Analytics II – Modeling

In my previous post I discussed the challenges in obtaining data to measure marketing effectiveness in marketing mix models. Getting good data fast is hard because of comprehensiveness and correctness concerns. In this post I want to address modeling challenges. The heart of most MMM estimation is multivariate regression: a statistical means for predicting one quantity (referred to as the dependent variable) in terms of others (the independent variables). MMM systems rely on prebuilt multivariate regression packages from SAS, R, SPSS, or somebody else.
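
For readers who want something concrete, here is a minimal sketch of that kind of regression using Python’s statsmodels OLS on synthetic weekly data. The variable names and data are made up; real MMM specifications involve many more drivers and transformations.

```python
# Toy multivariate regression: predict weekly volume from a few drivers.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
weeks = 104  # two years of weekly data
data = pd.DataFrame({
    "tv_grps": rng.uniform(0, 100, weeks),
    "price":   rng.uniform(2.5, 3.5, weeks),
    "display": rng.uniform(0, 1, weeks),
})
# Synthetic dependent variable: equivalized volume driven by the factors above
data["volume"] = (1000 + 3.0 * data["tv_grps"] - 200.0 * data["price"]
                  + 150.0 * data["display"] + rng.normal(0, 25, weeks))

X = sm.add_constant(data[["tv_grps", "price", "display"]])
model = sm.OLS(data["volume"], X).fit()
print(model.params)  # estimated impact of each driver on volume
```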

The dependent variable in an MMM is typically related to brand volume: cases of green beans, for example. Sales may seem like a more natural choice, but there are difficulties. For a global product, this would require currency conversion rates, and even for single currency projects, inflation comes into play. More importantly, regular and promoted price are often independent variables in an MMM, so using sales gets confusing. So brand volume is often the way to go. As I noted in the previous post, a single MMM may model an entire brand – which may consist of UPCs with different-sized packages or units: 12 oz, 16 oz, 12 pack, 2 liter, for example. This means that to model with brand volume it’s necessary to convert the sales volume of each UPC into what are referred to as “equivalized units”. For example, if equivalized units are expressed in terms of 24 count cases, then a 12 pack counts as 0.5.
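
A tiny sketch of equivalization, using made-up UPC sizes and volumes and the 24-count base from the example above:

```python
# Convert UPC-level volume into equivalized units (here, 24-count cases).
EQUIV_BASE = 24  # one equivalized unit = a 24-count case

upc_sales = [
    {"upc": "12-pack", "units_per_package": 12, "packages_sold": 400},
    {"upc": "24-case", "units_per_package": 24, "packages_sold": 150},
    {"upc": "6-pack",  "units_per_package": 6,  "packages_sold": 900},
]

equivalized = sum(
    row["packages_sold"] * row["units_per_package"] / EQUIV_BASE
    for row in upc_sales
)
print(f"{equivalized:.1f} equivalized units")  # a 12-pack counts as 0.5
```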

You’ve got to mess with the independent variables too. The reason why is that you want to scale or transform them so that they are as useful as possible for predicting volume. There are tricks won through experience that depend on the quantity. As an example, weather often affects sales. Papa Murphy’s pizza offers discounts in the summer based on daily high temperature, in recognition of this fact. When average weekly temperature is used as an independent variable, it is often mean centered – that is, the average is subtracted from each week. Variation from average is more useful than a string of values in the 70s. A log transform may be applied in other cases, and so on.
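
Here is a minimal sketch of two such transforms, mean-centering and a log transform, on made-up weekly data (the column names are my own):

```python
# Prepare independent variables: mean-center temperature, log-transform a
# series that varies over orders of magnitude.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "avg_temp_f": [71.0, 74.0, 78.0, 76.0, 72.0],
    "circulars":  [10.0, 40.0, 160.0, 640.0, 2560.0],
})

# Mean-center: variation from the average is what matters, not the raw 70s
df["temp_centered"] = df["avg_temp_f"] - df["avg_temp_f"].mean()

# Log transform: compresses a series spanning several orders of magnitude
df["log_circulars"] = np.log1p(df["circulars"])

print(df)
```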

Once the data is prepared, the regression is run and the model produces estimated volumes. It’s tempting to simply compare the estimated and actual volumes: if they’re close, the model’s good. This can be done using standard statistical measures such as R^2 or MAPE. But evaluating a model solely on fit is a very bad idea. The biggest reason is that you risk overfitting. Overfitting happens when there are too many independent variables in the model, so that you are no longer modeling the underlying phenomena causing sales. I can get an amazing R^2 for any marketing mix model by simply adding a bunch of independent variables containing random noise. Random bumps in those series will happen to line up with portions of sales, and just like a big room full of monkeys accidentally banging out Shakespeare, out comes great fit. Overfitting is often more subtle than that, for example when trying to account for differences in regions, channels, stores, and so on. Each new variable by itself may seem reasonable, but collectively the model becomes overspecified.

The guard against this is to take holdout samples: randomly set aside some percentage of the observations before estimating. Then measure fit on both the modeled data and the holdout sample. If the fit on the modeled data is great but poor on the holdout sample, you’re overfitting.
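
Here is a minimal sketch of that holdout check on synthetic data (statsmodels OLS, with a made-up 20% holdout): a big gap between the two R^2 values is the overfitting warning sign. In this toy case the model is not overfit, so the two numbers come out close.

```python
# Fit on ~80% of the observations, then compare R^2 on modeled vs. held-out data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 156
X = sm.add_constant(rng.uniform(0, 1, size=(n, 5)))     # constant + 5 drivers
y = 100 + X[:, 1:] @ np.array([5.0, -3.0, 0.0, 0.0, 0.0]) + rng.normal(0, 1, n)

holdout = rng.random(n) < 0.2                            # ~20% held out
model = sm.OLS(y[~holdout], X[~holdout]).fit()

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

print("modeled R^2:", round(model.rsquared, 3))
print("holdout R^2:", round(r_squared(y[holdout], model.predict(X[holdout])), 3))
```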

Another important consideration is the level of data aggregation. A simple rule of thumb is to try to get all of the data at the level of aggregation where it occurs in real life, and model at the lowest common level. If you can’t do that, aggregate up until you can. This implies that for MMM it would be great if I could get individual sales data for everyone in my modeling universe, along with information about all of their media exposure, the grocery store features they were exposed to, the coupons they received, and so on. Not bloody likely, even with NSA assistance. And even if I could obtain this data, it might be difficult to clean, prepare, and model at this level. Grocery store scanners yield very accurate store level sales data, so store level models are frequently used in the US for brands that are sold in grocery stores. In other situations a market level model is more appropriate.

The danger of modeling at a higher level of aggregation is that we lose variation in the data, and therefore predictive power. This is easiest to see in the time dimension. Consider a TV ad campaign where we run ads for the first two weeks of a month, and then pull them for the last two. When viewed at the biweekly level, TV activity zigzags up and down in predictable fashion. When viewed at the monthly level, TV activity is uniform and would therefore be useless in an MMM.
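
A tiny numeric sketch of that TV example (made-up numbers): the same 12-week flight has plenty of variation at the biweekly level and none at the monthly level.

```python
# Aggregating an on/off TV flight destroys the variation the model needs.
import numpy as np

weekly_tv = np.array([100, 100, 0, 0] * 3)        # 12 weeks: on 2, off 2

biweekly = weekly_tv.reshape(-1, 2).sum(axis=1)   # [200, 0, 200, 0, 200, 0]
monthly  = weekly_tv.reshape(-1, 4).sum(axis=1)   # [200, 200, 200]

print("biweekly:", biweekly, "std =", biweekly.std())
print("monthly: ", monthly,  "std =", monthly.std())  # zero variation left
```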

A last (underrated) consideration is reasonableness. Does the result make sense? This seems obvious, but given the volume of inputs and outputs involved, checking it can be laborious and tricky. Looking at different “pivots” of the output results is often helpful. Something is reasonable only with respect to a convention. The convention in this case can come from past models, industry norms, or even the opinion of the client. The latter is dangerous because there can be considerable pressure to bend the model so that the result is exactly what the client expects. Modeling is complicated and it’s usually pretty easy to second guess details at any step in the process, so the safe bet is simply to tell the client what they want to hear. Don’t do that! And don’t tell them what you want them to hear – tell them what they need to hear, based on facts. Reasonableness assessments are intended to ferret out flaws in data preparation or modeling, not to reject uncomfortable truths.

Marketing Mix Analytics I – Data Acquisition

At last spring’s INFORMS Analytics Conference I was invited to speak about Marketing Mix Analytics at Nielsen. I thought I would (belatedly) summarize my talk for those who were not able to attend.

Nielsen’s mission is to provide the most complete understanding of what consumers watch and buy. My team builds analytics solutions that use watch and buy information to help advertisers understand where their sales come from. Our primary analytical tool to do this is called marketing mix modeling. This first post summarizes what marketing mix is about, and how modeling teams assemble the data necessary for a mix model.

Marketing mix models measure the impact of marketing (and other drivers) on sales. Simply put, we go get sales data in partnership with our clients, and find matching time series data for everything that we believe affects sales: their advertising, whether TV, radio, or online; their trade activity, such as features and displays in grocery stores; their pricing and discounts; events, holidays, and industry trends. Once we obtain all of this data, we build a big regression model that predicts sales based on all of these factors. This has the effect of attributing the dips and spikes in sales to corresponding dips and spikes in activity. A big ad appears in the paper: sales spike. We assume some portion of the spike is because of the ad. When we run the regression model we obtain a decomposition of sales according to the various factors in the model, based on their coefficients. This allows us to make statements such as “7% of your sales are due to your TV advertising,” or, “you lost 3% of your sales due to your competitor’s pricing strategy.”
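
As a rough illustration of how a statement like “7% of your sales are due to your TV advertising” falls out of the coefficients, here is a minimal Python sketch on made-up coefficients and activity levels; it mirrors the idea of a decomposition, not our actual methodology.

```python
# Decompose volume into per-driver contributions using fitted coefficients.
import pandas as pd

# Weekly activity for each driver (illustrative)
activity = pd.DataFrame({
    "tv_grps":     [50.0, 60.0, 0.0, 40.0],
    "feature_ads": [1.0, 0.0, 1.0, 1.0],
})
coefs = {"tv_grps": 2.0, "feature_ads": 80.0}
base_volume = 900.0  # intercept: weekly volume expected with no activity

contributions = {d: (activity[d] * c).sum() for d, c in coefs.items()}
total = base_volume * len(activity) + sum(contributions.values())

for driver, contrib in contributions.items():
    print(f"{driver}: {100 * contrib / total:.1f}% of volume")
```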

These kinds of statements are useful by themselves, but they’re even better when you turn them into decisions that affect the future. This is done by chaining models together to provide additional insight. A marketing mix model produces coefficients and decomps, which characterize past sales at historical levels of, for example, TV advertising. We can turn those into sales response curves, which predict sales for any level of activity – even levels of advertising that were not conducted historically. These curves are the basis for forecasting and optimization models for media planning. Moving from raw sales, advertising, and pricing data to a coordinated, targeted media plan is a huge leap, but not without challenges.

Textbooks and websites will tell you that marketing mix modeling is old hat, but doing it right is hard work. First of all, getting the data is difficult. The point is for the analyst team and client to dream up everything that can impact sales…and then obtain matching, correct time series data for up to three years in duration. Some data, like TV or radio advertising, can be sourced from within Nielsen. Sales, revenue, and margin data come from a combination of client and MMM vendor sources. Other data such as industry trends, macroeconomic data, and so on may come from third parties.

Cleaning and verifying data is always hard, but it’s particularly hard in marketing mix because of its dimensionality. The modeled product dimension may be at the brand, sub-brand, or even the PPG (price promoted group) level – a collection of UPCs. Sales data is sometimes modeled down to the store level via grocery store scanner reports. The variety and intricacy of the data used for a “straightforward” mix necessitates a data review between the analyst team and client before modeling even begins. This step alone – getting the data – sometimes takes half of the total cycle time in a mix engagement. Time is money, so defining workflows and procedures that result in quick, accurate, repeatable data acquisition is good for vendor and client alike.