Box plots are widely used among data scientists and statisticians. They’re useful because they show variation both between and within data series. R, Python’s matplotlib, and many other charting libraries support box plots right out of the…box, but Excel does not. In Excel 2013, with a little bit of imagination you can create nice looking box plots without writing any code.
Read this post to find out how to create box plots that look like this:
You’ll need to start with a table containing the data you want to plot. I am using the data from the Michelson-Morley experiment:
A box plot shows the median of each data series as a line, with a “box” whose top edge is the third quartile and whose bottom edge is the first quartile. Often we draw “whiskers” at the top and bottom representing the extreme values of each series.
If we create an auxiliary table containing these values and follow the advice from my Error Bars in Excel post, we can create a nice looking box plot.
Step 1: Calculate Quartiles and Extremes.
Create another table with the following rows for each series: min, q1, q2, q3, max. These will be the primary data in your box plot. Min and max are easy – use the =MIN() and =MAX() formulas on each data series (represented as columns A – E in my example). To compute Q1-Q3 use the QUARTILE.INC() function. (INC means “inclusive”. QUARTILE.EXC() would work fine if that’s what you want.) Enter the formulas for the first series and then “fill right”:
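If you want to cross-check the worksheet formulas outside Excel, here is a minimal Python sketch. The sample values are illustrative (the first ten speed observations from the classic Michelson-Morley dataset); numpy's default linear percentile interpolation matches Excel's QUARTILE.INC.

```python
import numpy as np

# Illustrative data: first ten Michelson-Morley speed observations.
speeds = [850, 740, 900, 1070, 930, 850, 950, 980, 980, 880]

# np.percentile's default (linear) interpolation matches QUARTILE.INC.
summary = {
    "min": np.min(speeds),
    "q1":  np.percentile(speeds, 25),
    "q2":  np.percentile(speeds, 50),   # the median
    "q3":  np.percentile(speeds, 75),
    "max": np.max(speeds),
}
print(summary)
```

If you prefer exclusive quartiles, `np.percentile(..., method="weibull")` is closer to QUARTILE.EXC, but the inclusive version above is what the worksheet formulas in this post use.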
Step 2: Calculate box and whisker edges
We are going to create a stacked column chart with error bars, and “hide” the bottommost column in the stack to make the chart look like a box plot. Therefore we have to calculate the tops and bottoms of our boxes and whiskers:
- The bottom of each box is Q1.
- The ‘middle’ of each box is Q2 (the median). Since this is a stacked column chart, we actually want to compute Q2 – Q1.
- The top of each box is Q3. Since we want to represent this as a “slice” in the stacked column chart, we want Q3 – Q2.
- The error going “down” in the chart is Q2 – min, since the whiskers start at the median.
- The error going “up” is max – Q2.
Compute these five quantities as rows and you’ll have this:
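The arithmetic in the five bullets above is simple enough to sketch as a helper function. The numbers in the example call are a hypothetical five-number summary, not the post's actual worksheet values.

```python
# Derive the stacked-chart rows from a five-number summary (min, Q1, Q2, Q3, max).
def box_rows(mn, q1, q2, q3, mx):
    return {
        "box lo":   q1,        # hidden bottom slice of the stack
        "box mid":  q2 - q1,   # slice from Q1 up to the median
        "box hi":   q3 - q2,   # slice from the median up to Q3
        "err down": q2 - mn,   # whisker below, anchored at the median
        "err up":   mx - q2,   # whisker above, anchored at the median
    }

# Hypothetical summary values for one series:
rows = box_rows(740, 857.5, 915.0, 972.5, 1070)
print(rows)
```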
Step 3: Create a stacked column chart.
Go to the INSERT tab and select a stacked column chart:
Now right click on the blank chart, choose Select Data and select the "box lo, box mid, box hi" range as your data:
Step 4: Make the chart look like a Box Plot.
This is simple: the bottom bars (the blue ones in my example) need to go away. So right click on a blue bar and change both the outline and fill to none.
Step 5: Add Whiskers.
Follow the steps in my celebrated “Add Error Bars” post. Click on the “+” next to the chart, select Error Bars. Choose Series 2 (which corresponds to the median). Click on “More options” in the Error Bars flyout menu next to the “+”. In the task pane on the right, for Error Amount choose Custom and then click the Specify Value button:
For “Positive Error Value” select the “err up” row and for “Negative Error Value” select “err down”. Both rows contain positive values, and that is totally fine. Here’s what mine looks like:
That’s it! You can of course customize the other bars as desired.
“Bixby says” there is no better tool than building a model to force you to understand your data.
— Jeff Linderoth (@JeffLinderoth) March 7, 2014
This has been my experience as well. Here’s an example. At Nielsen we created predictive models of the sales of our clients’ products, typically consumer packaged goods. The way we did this, called marketing mix, was to collect sales data for all of the products and regions of interest on a weekly basis and match it with data representing all of the factors that we believed impacted sales, such as advertising, price and weather. Then we’d run a big regression model to understand the impact of each causal factor on sales, so that we could confidently make statements such as “8% of your sales come from your TV advertising.” Read more about marketing mix here. Understanding where sales come from is important of course, but even more important is trying to use that information to predict future sales. We’d use the results of a marketing mix model to produce “sales response curves”. A sales response curve predicts the expected sales lift given some amount of causal activity. For example, in the response curve below, 50 units of TV advertising produces 150 units of sales lift. The orange points represent historical activity levels and sales lifts, and the blue curve is the projection used to predict future results:
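The post doesn't say which functional form the blue curve takes, so as a hedged sketch, assume a simple diminishing-returns curve, lift = a·(1 − e^(−b·activity)), fit to the historical (orange) points. All numbers here are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed functional form: diminishing-returns saturation curve.
def response(x, a, b):
    return a * (1.0 - np.exp(-b * x))

# Hypothetical historical activity levels and lifts (the "orange points").
activity = np.array([0.0, 10.0, 20.0, 30.0, 50.0, 80.0])
lift = response(activity, 200.0, 0.03)   # noiseless here, for illustration

# Fit the curve; the result can then predict lift at untried activity levels.
(a_hat, b_hat), _ = curve_fit(response, activity, lift, p0=[100.0, 0.01])
predicted = response(60.0, a_hat, b_hat)
```

The fitted curve extrapolates beyond historical spend, which is exactly what makes it useful for planning and exactly why bad input data makes it dangerous, as the next section shows.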
We often found that the best way to “debug” a predictive model of sales, besides looking at basic summary statistics and charts of the output, was to try and use the predictive model to make decisions. That is, we’d build an optimization model that would try to find future budgets for all of the causal factors (e.g. TV, Radio, Facebook advertising, price, coupons, and so on) that result in the highest possible profit, given budgetary constraints. When we used the results of a “tested, verified” predictive model inside our optimization model, we’d often find that the results of the optimization model were completely nuts. For example, the optimization model might recommend moving millions of dollars from national TV advertising to direct mail advertising in Toledo. After we checked that we didn’t screw up the optimization model, we’d often find that the cause was “garbage in, garbage out”. For example:
- The sales data for Toledo was incorrect (or only partially correct).
- The units for the direct mail time series were improperly specified (perhaps in 1s rather than 1000s).
- The cost of direct mail activity was incorrect.
- The direct mail data was improperly scaled when fed into the MMM regression model.
- The analyst gave improper weights or priors for direct mail or Toledo-related coefficients.
- The procedure that produced response curves from the marketing mix results (the part that fits the blue line to the orange points) was run incorrectly.
- An outlier in direct mail activity in Toledo influenced the response curve fitting procedure.
Any of these data problems could have caused an implausibly high sales response curve for Toledo direct mail activity, which would influence the optimizer to dump tons of money there. Given the volume of data involved in a real-world predictive model, this stuff can be hard to find even with modern tools and careful analysts. Often the best way to learn something is to teach someone else, and the best way to validate data is to use it in a model.
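The failure mode described above can be sketched concretely. Below is a hypothetical two-channel budget allocation where one response curve's scale is off by 1000x (mimicking the "1s rather than 1000s" units error). Channel names and parameters are invented; this is not Nielsen's actual optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical response curves: lift = a * (1 - exp(-b * spend)).
# "direct_mail" has an implausibly large scale -- garbage in from a units error.
curves = {
    "national_tv": (300.0, 0.02),
    "direct_mail": (300000.0, 0.02),  # scale off by 1000x upstream
}

def total_lift(x):
    return sum(a * (1.0 - np.exp(-b * xi))
               for xi, (a, b) in zip(x, curves.values()))

budget = 100.0
res = minimize(lambda x: -total_lift(x), x0=[50.0, 50.0],
               bounds=[(0.0, budget)] * 2,
               constraints={"type": "eq", "fun": lambda x: sum(x) - budget})
tv_spend, dm_spend = res.x
# The optimizer dutifully shifts nearly the whole budget to direct mail --
# a nonsensical plan whose real value is flagging the upstream data problem.
```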
I have downloaded stats for the recently completed 2013 NFL regular season from yahoo.com, cleaned the data, and saved it in Excel format. The files are located here. The column headers should be self-explanatory.
There are seven worksheets:
- QB: quarterback data.
- RB: running backs.
- WR: wide receivers.
- TE: tight ends.
- K: kickers. For the yardage columns, the portion before the dash indicates field goals made, the portion after, attempts. So 4-5 means 4 made out of 5 attempts from that distance range.
- Def: defensive stats by team.
- ST: special teams stats by team.
I will split these out into CSV files soon.
I am pleased to share that I am joining Frontline Systems as Chief Technology Officer. Frontline Systems, developer of the Excel Solver, has been a vanguard analytics company for over two decades. Frontline’s analytics leadership has been recognized by the INFORMS Impact Prize, and is reflected in the hundreds of millions of computers with Frontline software.
I believe in the power of analytics to help us learn about the world, and make it better. Frontline’s unwavering focus on democratizing analytics is what sets it apart, bringing the cutting edge of operations research, data mining, and simulation to everyone. This focus, combined with the strength of Frontline’s people and product line, is what drew me to this exciting opportunity.
As we begin 2014, analytics, optimization, and data mining are on the minds of more people than ever before. I look forward to learning from the team, and working with customers and partners to build solutions that take Frontline to the proverbial next level. I hope to share what I learn with you through this forum, and many others, in the coming months.
Opening one door means closing another. I am grateful for the opportunity to have led a fantastic analytics development team at Nielsen. Our systems carry out more marketing effectiveness projects – at global scale – than anyone else out there, and I couldn’t be prouder of the team’s effort and attitude. Most gratifying of all is seeing so many team members grow professionally, whether through deepening their analytics knowledge, becoming great engineers, collaborating, or stepping up to take on big challenges. I look forward to seeing continued innovation and success from the team going forward.
In my previous post I discussed the challenges in obtaining data to measure marketing effectiveness in marketing mix models. Getting good data fast is hard because of comprehensiveness and correctness concerns. In this post I want to address modeling challenges. The heart of most MMM estimation is multivariate regression: a statistical means for predicting one quantity (referred to as the dependent variable) in terms of others (the independent variables). MMM systems rely on prebuilt multivariate regression packages from SAS, R, SPSS, or somebody else.
The dependent variable in an MMM is typically related to brand volume: cases of green beans, for example. Sales may seem like a more natural choice, but there are difficulties. For a global product, this would require conversion rates, and even for single currency projects, inflation comes into play. More importantly, regular and promoted price are often independent variables in an MMM, so using sales gets confusing. So brand volume is often the way to go. As I noted in the previous post, a single MMM may model an entire brand – which may consist of UPCs with different-sized packages or units: 12 oz, 16 oz, 12 pack, 2 liter for example. This means that to model with brand volume it’s necessary to convert the sales volume of each UPC into what is referred to as “equivalized units”. For example, if equivalized units are expressed in terms of 24 count cases, then a 12 pack counts as 0.5.
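The equivalization arithmetic is simple; a minimal sketch, using the 24-count-case basis from the example above (pack sizes and names here are hypothetical):

```python
# Convert UPC-level unit sales to equivalized units, assuming a
# 24-count case as the equivalization basis.
PACK_SIZES = {"12 pack": 12, "24 count case": 24, "6 pack": 6}

def equivalized_units(upc, units_sold, basis=24):
    return units_sold * PACK_SIZES[upc] / basis

# A 12 pack counts as half of a 24-count case:
print(equivalized_units("12 pack", 1))   # 0.5
```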
You’ve got to mess with the independent variables too. The reason is that you want to scale or transform them so that they are as useful as possible for predicting volume. There are tricks won through experience that depend on the quantity. As an example, weather often affects sales. Papa Murphy’s pizza offers discounts in the summer based on daily high temperature, in recognition of this fact. When average weekly temperature is used as an independent variable, it is often mean centered – that is, the average is subtracted from each week. Variation from average is more useful than a string of values in the 70s. A log transform may be applied in other cases, and so on.
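The two transforms mentioned above look like this in a quick sketch (the series values are invented):

```python
import math

# Mean centering a weekly temperature series: keep variation from average.
temps = [72.0, 75.0, 78.0, 71.0, 74.0]
mean = sum(temps) / len(temps)
centered = [t - mean for t in temps]

# A log transform compresses a wide-ranging series (e.g. spend levels).
spend = [100.0, 1000.0, 10000.0]
logged = [math.log(s) for s in spend]
```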
Once the data is prepared, the regression is run and the model produces estimated volumes. It’s tempting to simply compare the estimated and actual volumes: if they’re close, the model’s good. This can be done using standard statistical measures such as R^2 or MAPE. Evaluating a model solely on fit is a very bad idea. The biggest reason is that you risk overfitting. Overfitting happens when there are too many independent variables in the model, so that you are no longer modeling the underlying phenomena causing sales. I can get amazing R^2 for any marketing mix model by simply adding a bunch of independent variables with random noise. Random bumps in the series will happen to line up with portions of sales, and just like a big room full of monkeys accidentally banging out Shakespeare, out comes great fit. Overfitting is often more subtle, for example by trying to account for differences in regions, channels, stores, and so on. Each new variable by itself may seem reasonable but collectively the model becomes overspecified. The guard against this is to take holdout samples: randomly withhold some percentage of the observations before estimating. Then measure fit on both the modeled data and the holdout sample. If the fit on the modeled data is great but poor for the holdout sample, you’re overfitting.
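The monkeys-and-Shakespeare effect is easy to reproduce. A hedged sketch with entirely synthetic data: one real driver (TV) plus dozens of pure-noise columns, estimated with a random holdout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weekly data: volume truly driven by one factor (TV), plus
# 50 columns of pure noise posing as extra "independent variables".
n = 80
tv = rng.uniform(0, 100, n)
volume = 50 + 2.0 * tv + rng.normal(0, 20, n)
noise = rng.normal(size=(n, 50))

def r_squared(X, y, coef):
    resid = y - X @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Randomly withhold 25% of the weeks before estimating.
idx = rng.permutation(n)
hold, train = idx[:20], idx[20:]

X = np.column_stack([np.ones(n), tv, noise])
coef, *_ = np.linalg.lstsq(X[train], volume[train], rcond=None)

fit_train = r_squared(X[train], volume[train], coef)
fit_hold = r_squared(X[hold], volume[hold], coef)
# Near-perfect fit on the modeled weeks, much worse on the holdout: overfitting.
```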
Another important consideration is the level of data aggregation. A simple rule of thumb is to try to get all of the data at the level of aggregation where it occurs in real life, and model at the lowest common level. If you can’t do that, aggregate up until you can. This implies that for MMM it would be great if I could get individual sales data for everyone in my modeling universe, along with information about all of their media exposure, the grocery store features they were exposed to, the coupons they received, etc. Not bloody likely, even with NSA assistance. And even if I could obtain this data, it might be difficult to clean, prepare and model at this level. Grocery store scanners yield very accurate store level sales data, therefore store level models are frequently used in the US for brands that are sold in grocery stores. In other situations a market level model is more appropriate. The danger of modeling at a higher level of aggregation is that we lose variation in the data, and therefore predictive power. This is easiest to see in the time dimension. Consider a TV ad campaign where we run ads for the first two weeks of a month, and then pull them for the last two. When viewed at the biweekly level, TV activity zigzags up and down in predictable fashion. When viewed at the monthly level, TV activity is uniform and would therefore be useless in an MMM.
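The two-weeks-on, two-weeks-off TV example can be checked in a few lines (activity levels are hypothetical):

```python
# 24 weeks of a pulsing campaign: two weeks on, two weeks off.
weekly_tv = [100, 100, 0, 0] * 6

biweekly = [sum(weekly_tv[i:i + 2]) for i in range(0, 24, 2)]  # zigzags
monthly = [sum(weekly_tv[i:i + 4]) for i in range(0, 24, 4)]   # flat

# Every monthly value is identical -- zero variation, hence zero
# predictive signal at that level of aggregation.
print(biweekly, monthly)
```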
A last (underrated) consideration is reasonableness. Does the result make sense? This seems obvious but when the volume of input and output is considered, this can be laborious and tricky. Looking at different “pivots” of the output results is often helpful. Something is reasonable only with respect to a convention. The convention in this case can come from past models, industry norms, or even the opinion of the client. The latter is dangerous because there can be considerable pressure to bend the model so that the result is exactly what the client expects. Modeling is complicated and it’s usually pretty easy to second guess details at any step in the process, so the safe bet is simply to tell the client what they want to hear. Don’t do that! And don’t tell them what you want them to hear – tell them what they need to hear, based on facts. Reasonableness assessments are intended to ferret out flaws in data preparation or modeling, not to reject uncomfortable truths.
At last spring’s INFORMS Analytics Conference I was invited to speak about Marketing Mix Analytics at Nielsen. I thought I would (belatedly) summarize my talk for those who were not able to attend.
Nielsen’s mission is to provide the most complete understanding of what consumers watch and buy. My team builds analytics solutions that use watch and buy information to help advertisers understand where their sales come from. Our primary analytical tool to do this is called marketing mix modeling. This first post summarizes what marketing mix is about, and how modeling teams assemble the data necessary for a mix model.
Marketing mix models measure the impact of marketing (and other drivers) on sales. Simply put, we go get sales data in partnership with our clients, and find matching time series data for everything that we believe affects sales: their advertising whether TV, radio, or online; their trade activity such as features and displays in grocery stores; their pricing and discounts; events, holidays, and industry trend. Once we obtain all of this data, we build a big regression model that predicts sales based on all of these factors. This has the effect of attributing the dips and spikes in sales to corresponding dips and spikes in activity. A big ad appears in the paper: sales spike. We assume some portion of the spike is because of the ad. When we run the regression model we obtain a decomposition of sales according to the various factors in the model, based on their coefficients. This allows us to make statements such as “7% of your sales are due to your TV advertising,” or, “you lost 3% of your sales due to your competitor’s pricing strategy.”
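The regress-then-decompose step above can be sketched in miniature. This is a toy with invented factor series and coefficients, not Nielsen's production model; the decomposition simply sums each factor's coefficient times its series.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic weekly data: volume driven by TV advertising and price.
n = 52
tv = rng.uniform(0, 50, n)          # e.g. weekly TV GRPs
price = rng.uniform(2.0, 3.0, n)    # average price
volume = 1000.0 + 4.0 * tv - 120.0 * price + rng.normal(0, 5, n)

# Fit the regression.
X = np.column_stack([np.ones(n), tv, price])
coef, *_ = np.linalg.lstsq(X, volume, rcond=None)

# Decompose total volume into the share attributable to each factor.
names = ["base", "tv", "price"]
contributions = {name: c * X[:, j].sum()
                 for j, (name, c) in enumerate(zip(names, coef))}
tv_share = contributions["tv"] / volume.sum()
# tv_share supports statements like "X% of your sales are due to TV."
```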
These kinds of statements are useful by themselves but they’re even better when you turn them into decisions that affect the future. This is done by chaining models together to provide additional insight. A marketing mix model produces coefficients and decomps – which characterize past sales for historical levels of, for example, TV advertising. We can turn those into sales response curves which predict sales for any level of activity – even levels of advertising that were not conducted historically. These curves are the basis for forecasting and optimization models for media planning. Moving from raw sales, advertising, and pricing data to a coordinated, targeted media plan is a huge leap, but not without challenges.
Textbooks and websites will tell you that marketing mix modeling is old hat, but doing it right is hard work. First of all, getting the data is difficult. The point is for the analyst team and client to dream up everything that can impact sales…and obtain matching, correct time series data for up to three years in duration. Some data, like TV or radio advertising, can be sourced from within Nielsen. Sales, revenue, and margin data comes from a combination of client and MMM vendor sources. Other data such as industry trend, macroeconomic data, and so on may come from third parties. Cleaning and verifying data is always hard, but it’s particularly hard in marketing mix because of its dimensionality. The modeled product dimension may be at the brand, sub brand, or even the PPG (price promoted group) level – a collection of UPCs. Sales data is sometimes modeled down to the store level via grocery store scanner reports. The variety and intricacy of the data used for a “straightforward” mix necessitates a data review between the analyst team and client before modeling even begins. This step alone – getting the data – sometimes takes half of the total cycle time in a mix engagement. Time is money, so defining workflows and procedures that result in quick, accurate, repeatable data acquisition is good for vendor and client alike.
I will probably regret writing this post.
So perhaps you have seen these touchless toilets in airports. They flush by themselves. Exhibit A:
Amazing. Wonderful. Sanitary. See that button on the left? That flushes the old fashioned way. You need that. Because no matter how good the sensor is, how elegant your solution for activating the sensor, no matter how impeccably designed the sensor is, you are going to need that button. And when you need that button, boy, you really need that button.
The same thing goes when you are building an “automated” analytics system that hides all of the math and complexity from your poor user. You’re going to need that button, for scarily similar reasons.