A Missing Data Challenge

Be careful with missing values. I have heard advice recently that it’s often okay to just ignore missing values. Sure, sometimes…but be careful! We were recently given some data that looked like this – let’s say that it represents the number of shoppers visiting eight different retail stores over the course of a week. (I have anonymized the data.)

Store     5/4/2015   5/5/2015   5/6/2015   5/7/2015   5/8/2015   5/9/2015  5/10/2015
1                        1150       1065       1155       1091                  1104
3             1167       1328       1189       1151        828        800       1110
4             2130       1853       1064
5             2041       2014       1461                  1578       1346
6             3016       2699       2043       2757       2414       2268
7                        1282        893       1197       1243
8             2752       2001                  2071                             1511
Average     2221.2     1761.0     1285.8     1666.2     1430.8     1471.3     1241.7

Let’s say our team had developed a forecasting method and were asked to compare our results against this data. What sorts of problems could you encounter if you simply ignored the missing values? Are the averages trustworthy? If you wanted to fill in the missing values above, how would you do that? And what the heck do you do about store 2?
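One way to probe the "are the averages trustworthy?" question is to look at which stores actually contribute to each day's average. The sketch below is only an illustration, not the method our team used: it re-enters the table by hand, with None marking a missing cell, and the names VISITS, reporting_stores and average_over are made up for this example.

```python
from statistics import mean

DAYS = ["5/4/2015", "5/5/2015", "5/6/2015", "5/7/2015",
        "5/8/2015", "5/9/2015", "5/10/2015"]

# Table values keyed by store number; None marks a missing observation.
# Store 2 has no row at all, so it does not appear here.
VISITS = {
    1: [None, 1150, 1065, 1155, 1091, None, 1104],
    3: [1167, 1328, 1189, 1151, 828, 800, 1110],
    4: [2130, 1853, 1064, None, None, None, None],
    5: [2041, 2014, 1461, None, 1578, 1346, None],
    6: [3016, 2699, 2043, 2757, 2414, 2268, None],
    7: [None, 1282, 893, 1197, 1243, None, None],
    8: [2752, 2001, None, 2071, None, None, 1511],
}

def reporting_stores(day_index):
    """Stores that have a value on the given day."""
    return [s for s, row in VISITS.items() if row[day_index] is not None]

def average_over(stores, day_index):
    """Mean visits on the given day, restricted to the given stores."""
    return mean(VISITS[s][day_index] for s in stores)

# The "ignore missing values" average, plus who is behind it each day.
for i, day in enumerate(DAYS):
    stores = reporting_stores(i)
    print(f"{day:>9}: mean={average_over(stores, i):7.1f} from stores {stores}")

# Like-for-like check: the 5/5 average over only the stores present on 5/4.
same_stores = reporting_stores(0)
like_for_like = average_over(same_stores, 1)
print(f"5/5 average over the 5/4 stores: {like_for_like:.1f}")
```

The set of reporting stores changes from day to day, so each daily average describes a different group. For example, the table's average falls from 2221.2 on 5/4 to 1761.0 on 5/5, but over the five stores that reported on 5/4 the 5/5 average is 1979.0. Part of the apparent drop comes from the two smaller stores (1 and 7) entering the average, not from fewer shoppers.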

Data acquisition is not as fun to think about as building cool machine learning or optimization models, but it is every bit as important, and the issues are often subtle.

Author: natebrix

Follow me on twitter at @natebrix.

1 thought on “A Missing Data Challenge”

  1. I would be inclined to send a memo to the manager of store 2 (and possibly CC the manager of store 4) noting that stores with no sales will be closed (and their staff let go). That might turn up some of the missing data. 🙂 I realize this is probably not literally store sales, but in my limited dealings with corporate data I’ve found that reliability of the recording process deteriorates if the people responsible for recording it are unmotivated, unsupervised or somehow get the idea that the data is unimportant and is just being recorded from force of habit, never to be used.

    Statistically, I’ve seen a number of people (which here means academics) substitute averages of observations adjacent to missing ones, or fit local regressions (maybe lowess) and substitute the predictions for the missing values. This of course plays hob with autocorrelation in time series, and tends to produce self-fulfilling prophecies (perfectly acceptable to some of my colleagues as long as the prophecy is the one they wanted).
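The adjacent-average substitution mentioned in the comment can be sketched in a few lines. This is only an illustration under simple assumptions: a gap is filled only when both immediate neighbors are present, and the function name fill_from_neighbors is invented for this example. The two series are stores 5 and 8 from the table in the post.

```python
def fill_from_neighbors(series):
    """Replace a missing value (None) with the mean of its two immediate
    neighbors, but only when both neighbors are observed values."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        before, after = series[i - 1], series[i + 1]
        if series[i] is None and before is not None and after is not None:
            filled[i] = (before + after) / 2
    return filled

# Rows for stores 5 and 8, 5/4/2015 through 5/10/2015.
store_5 = [2041, 2014, 1461, None, 1578, 1346, None]
store_8 = [2752, 2001, None, 2071, None, None, 1511]

print(fill_from_neighbors(store_5))
print(fill_from_neighbors(store_8))
```

Even this simple rule shows its limits on the data above. It fills isolated gaps such as store 5 on 5/7, but it leaves runs of consecutive missing days and gaps at the end of the week untouched, and every filled value sits exactly between its neighbors. That smoothing is what distorts the autocorrelation.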
