Optimization In Google Sheets

Good news for those of you who use spreadsheets to do analytics: Google recently announced a Linear Optimization add-on for Google Sheets, and now Frontline Systems has released a free Solver add-on for Google Sheets that solves not only linear optimization problems, but nonlinear ones as well. It has roughly the same capabilities as the Solver App for Excel Online. If you know how to use Excel’s Solver, then you know how to use this. (Disclaimer: I participated in the development of both the Google Sheets and Excel Online apps during my tenure as CTO of Frontline. I think they are great.)

Here’s how to get started with the Solver add-on for Google Sheets.

Step 1: Insert the Add-on. Create a new Google Sheet (for example, by going to drive.google.com and clicking “New”). Then, under the Add-ons menu, click “Get add-ons…”. Search for “solver” and you will see both the Google and Frontline apps:

[Screenshot: add-on search results for “solver”]

Click on the button next to Solver. (Hi Edwin!) Now “Solver” will appear under the Add-ons menu. When you click on it, a pane will show up on the right-hand side of your screen.

[Screenshot: the Solver task pane]

Step 2: Create an optimization model. You can use the task pane to define the variables, objective, and constraints of your optimization model. Clicking on the “Insert Example” button will paste a sample problem into your sheet. Here’s what it looks like: it’s a production planning problem where we want to determine the number of TVs, Stereos, and Speakers to build in order to maximize profit.

[Screenshot: the example production planning model]

In the task pane on the right you can see that the profit cell (F13) has been selected as the objective we are maximizing. Similar to the Excel solver, you can define the constraints by clicking on them in the “Subject To” section.

[Screenshot: the objective and constraints in the task pane]

Step 3: Solve. Clicking on Solve will call Frontline’s Simplex solver to solve your model on the cloud (specifically – Windows Azure…). The variables B3:D3 will be updated, as will any formulas that depend on those values. As you can see, profit goes up:

[Screenshot: the solved model]

WINNING. If you fool around with the app you will see that you can solve models with arbitrary formulas, not just linear models. And it’s free! Go check it out.
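For the curious, here is what the same kind of production planning model looks like in code. This is only a sketch in Python using scipy.optimize.linprog; the profit figures and resource limits below are made-up placeholders, not the numbers from the add-on’s example.

    # A minimal production planning LP, analogous to the Solver example.
    # All coefficients below are illustrative assumptions, not the add-on's data.
    from scipy.optimize import linprog

    # Decision variables: number of TVs, Stereos, and Speakers to build.
    profit = [75, 50, 35]            # profit per unit (assumed)
    A_ub = [                         # resource usage per unit (assumed)
        [1, 1, 0],                   # chassis
        [1, 1, 1],                   # electronic components
        [2, 1, 1],                   # labor hours
    ]
    b_ub = [450, 600, 900]           # available amounts (assumed)

    # linprog minimizes, so negate the profit vector to maximize.
    res = linprog(c=[-p for p in profit], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * 3, method="highs")
    print("Build plan (TVs, Stereos, Speakers):", res.x)
    print("Maximum profit:", -res.fun)

The add-on is solving this same kind of linear program when you click Solve, just with the objective, variables, and constraints taken from your sheet.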

Predicting the 2014-2015 NBA Season

Over the weekend I created a model in Excel to predict the 2014-2015 NBA season. The model simulates the full 82-game schedule, using player Win Shares from the 2013-2014 season to estimate the strength of each team, accounting for roster changes. This model is not perfect, or even particularly sophisticated, but it is interesting. Here are the model’s predictions as of 10/15/2014, with projected playoff teams in bold.

Eastern W L PCT GB Home Road
Cleveland 61 21 0.744 0 31-10 30-12
Toronto 57 25 0.695 4 30-11 27-14
Chicago 49 33 0.598 12 26-15 23-18
Washington 44 38 0.537 17 24-17 21-20
New York 44 38 0.537 17 23-18 21-20
Miami 42 40 0.512 19 22-19 20-21
Detroit 42 40 0.512 19 22-19 20-21
Charlotte 41 41 0.500 20 22-19 19-22
Atlanta 41 41 0.500 20 22-19 19-22
Indiana 37 45 0.451 24 20-21 17-24
Boston 30 52 0.366 31 16-25 14-27
Brooklyn 26 56 0.317 35 14-27 11-30
Orlando 25 57 0.305 36 14-27 12-29
Milwaukee 24 58 0.293 37 14-27 11-30
Philadelphia 14 68 0.171 47 8-33 6-35

Western W L PCT GB Home Road
LA Clippers 57 25 0.695 0 30-11 27-14
San Antonio 57 25 0.695 0 30-11 27-14
Oklahoma City 54 28 0.646 3 29-14 26-15
Phoenix 53 29 0.646 4 28-13 25-16
Golden State 53 29 0.646 4 28-13 25-16
Portland 49 33 0.598 8 26-15 23-18
Houston 47 35 0.573 10 25-16 22-19
Dallas 47 36 0.566 10 24-17 22-19
Memphis 41 41 0.500 16 22-19 19-22
Denver 40 42 0.488 17 21-20 19-22
Minnesota 37 45 0.451 20 20-21 17-24
Sacramento 32 50 0.390 25 17-24 15-26
LA Lakers 31 51 0.378 26 17-24 14-27
New Orleans 27 55 0.329 30 15-26 12-29
Utah 25 58 0.301 33 14-27 11-30

You can download my full spreadsheet here. It’s complicated but not impossible to follow. It does not exactly match the results presented above because I have a messier version that accounts for recent injuries, e.g. Kevin Durant.

The Cavs, Spurs, and Clips are the favorites to win the title in this model (a previous version also had the Thunder in this class, but Kevin Durant is now injured). Comparing these estimates to Vegas over-unders, the biggest differences are Brooklyn (my model is lower), Indiana (higher), Memphis (lower), Minnesota (higher), New Orleans (lower), and Phoenix (higher). If you take the time to read through the methodology at the end of this post, you may be able to see why some of these differences exist. Some are probably reasonable; others may not be.

How It Works

Many of the ingredients for this model were presented in my previous three posts, where I compiled game-by-game results for the 2013-2014 season, built a simple model to predict rookie performance, and tracked roster changes. Now the task is pretty simple: estimate the strength of each team, figure out how unpredictable games are, and then simulate the season using the strengths, accounting for uncertainty. At the end I discuss weaknesses of this model, which, if you are a glass-half-full type of person, also suggest areas for improvement.

Step 1: Estimate the strength of each team. Team strengths are estimated by adding up the 2013-2014 Win Shares for the top twelve players on each NBA team. In my last post I gave a spreadsheet with Win Shares for all 2013-2014 NBA players based on data from basketball-reference.com. I made three adjustments to this data for the purposes of this analysis:

  • Added rookies. I estimated projected 2014-2015 Win Shares for rookies using the logarithmic curve given in this post.
  • Accounted for injuries to good players. Kobe Bryant, Derrick Rose, Rajon Rondo, and a couple of other high profile players were injured in 2013-2014. I replaced their Win Share total with the average of the past three seasons, including the season they were injured. Is this reasonable? I don’t know.
  • Trimmed to 12. I manually trimmed rosters so that only the 12 players with the highest Win Shares remained.

Adding Win Shares gives an overall “strength rating” for each team.
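If you would rather script this step than do it by hand in a spreadsheet, here is a minimal sketch in Python with pandas. The file name and column names (Tm2014, WS) are assumptions about how the roster spreadsheet might be exported, with the rookie and injury adjustments already applied.

    import pandas as pd

    # Assumed export of the roster spreadsheet: one row per player,
    # with current team (Tm2014) and adjusted 2013-14 Win Shares (WS).
    players = pd.read_csv("rosters_2014_2015.csv")

    # Keep only the top 12 players per team by Win Shares, then sum.
    top12 = (players.sort_values("WS", ascending=False)
                    .groupby("Tm2014")
                    .head(12))
    strength = top12.groupby("Tm2014")["WS"].sum().sort_values(ascending=False)
    print(strength)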

Step 2: Estimate the unpredictability of game results. Most of the time, a good team will beat a bad team. Most of the time. Can we quantify this more precisely? Sure. In a previous post, I determined that home court advantage is approximately 2.6 points per game. We also found that although the difference in total season wins is a predictor of who will win in a matchup between two teams, it is a rather weak predictor. In other words, bad teams beat good teams quite often, especially at home. For our prediction model we make another simple assumption: every team’s performance over the course of the season varies according to a normal distribution, with the mean of this distribution corresponding to their overall team strength.

Normal distributions are defined by two parameters: mean and standard deviation. If I know what the normal distribution looks like, then I can estimate the probability of the home team winning in a matchup: take the difference of the two teams’ strengths, then calculate the CDF of the distribution at –2.6 (the home court advantage). But what is the standard deviation? I can estimate it by “replaying” all of the games in the previous season. If I guess a value for the standard deviation, I can calculate win probabilities for all games. If I add up the win probabilities for, say, Boston, then this should sum to Boston’s win total for the season (sadly, 25). So if I want to estimate the standard deviation, all I have to do is minimize the sum of deviations between the estimated and actual 2013-2014 win totals. I can do this using Excel’s Solver: it’s a nonlinear minimization problem involving only one variable (the standard deviation).
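Here is a rough sketch of that calibration in Python rather than Excel. The file names, column names, and search bounds are assumptions, it assumes 2013-2014 team strengths computed the same way as in Step 1, and the win probability formula is one reading of the method described above.

    import numpy as np
    import pandas as pd
    from scipy.stats import norm
    from scipy.optimize import minimize_scalar

    HCA = 2.6  # home court advantage in points, from the earlier post

    # Assumed inputs: the 2013-2014 game results (Home, Visitor, HomePts, VisitorPts)
    # and a 2013-2014 strength rating per team (sum of that season's Win Shares).
    games = pd.read_csv("nba_games_2013_2014.csv")
    strength = pd.read_csv("team_strength_2013_2014.csv", index_col="Team")["Strength"]

    diff = (strength.loc[games["Home"]].to_numpy()
            - strength.loc[games["Visitor"]].to_numpy())

    # Actual 2013-2014 win totals, reconstructed from the game results.
    home_won = games["HomePts"] > games["VisitorPts"]
    actual = (games.loc[home_won, "Home"].value_counts()
              .add(games.loc[~home_won, "Visitor"].value_counts(), fill_value=0))

    def total_error(sigma):
        # Home win probability: strength difference plus home court advantage,
        # filtered through the assumed normal distribution.
        p_home = norm.cdf((diff + HCA) / sigma)
        expected = (pd.Series(p_home, index=games["Home"].to_numpy()).groupby(level=0).sum()
                    .add(pd.Series(1 - p_home, index=games["Visitor"].to_numpy())
                         .groupby(level=0).sum(), fill_value=0))
        return np.abs(expected - actual).sum()

    # One-variable nonlinear minimization, like running Excel's Solver on the sheet.
    best = minimize_scalar(total_error, bounds=(1, 50), method="bounded")
    print("Estimated standard deviation:", best.x)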

It turns out that the resulting estimate does a very good job of matching 2013-2014 results:

  • The estimated win totals for all teams were within 2 wins of their actual values.
  • The estimated home winning percentage matches the actual value quite closely!

Step 3: Determine win totals for the 2014-2015 season. I obtained the 2014-2015 schedule for $5 from nbastuffer.com. Using this schedule, I calculated win probabilities for each game using the team strengths from Step 1 and the standard deviation from Step 2. If I add up the win probabilities for each team, I get their estimated win totals. Voila! Since the prediction is created by looking at each game on the schedule, we also get home and away records, in-conference records, and so on. It’s also easy to update the estimate during the season as games are played, players are traded or injured, and so on.
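And a sketch of this projection step, again with assumed file names and column names, a placeholder value for the standard deviation, and strengths built from the adjusted 2014-2015 rosters:

    import pandas as pd
    from scipy.stats import norm

    HCA = 2.6      # home court advantage in points
    SIGMA = 12.0   # placeholder: use the standard deviation estimated in Step 2

    # Assumed inputs: 2014-2015 team strengths from Step 1 and the purchased schedule.
    strength = pd.read_csv("team_strength_2014_2015.csv", index_col="Team")["Strength"]
    schedule = pd.read_csv("nba_schedule_2014_2015.csv")   # columns Home, Visitor (assumed)

    diff = (strength.loc[schedule["Home"]].to_numpy()
            - strength.loc[schedule["Visitor"]].to_numpy())
    p_home = norm.cdf((diff + HCA) / SIGMA)

    # Expected wins = sum of win probabilities over each team's games.
    projected = (pd.Series(p_home, index=schedule["Home"].to_numpy()).groupby(level=0).sum()
                 .add(pd.Series(1 - p_home, index=schedule["Visitor"].to_numpy())
                      .groupby(level=0).sum(), fill_value=0))
    print(projected.sort_values(ascending=False).round(1))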

Why this model stinks. The biggest virtue of this model is that it was easy to build. I can think of at least ten potential shortcomings:

  1. Win Shares are probably not the best metric for individual and team strength.
  2. It assumes that individual performance for 2014-2015 will be the same as 2013-2014. Paul Pierce isn’t getting any younger.
  3. It does not account for predictable changes in playing time from season-to-season.
  4. 2014-2015 win shares are not normalized to account for players leaving and entering the league.
  5. 2014-2015 win shares do not account for positive and negative synergies between players.
  6. There is no reason to believe that the standard deviation calculated in Step 2 should be the same for all teams.
  7. I have not given any justification for using a normal distribution at all!
  8. The vagaries of the NBA schedule are not accounted for. For example, teams play worse in their second consecutive road game.
  9. Several teams, including the Philadelphia 76ers, will tank games.
  10. Injuries were handled in an arbitrary and inconsistent fashion.

It will be interesting to see how this model performs, in spite of its shortcomings.

NBA Rosters and Team Changes: 2013-2014

I have downloaded statistics for all NBA players from basketball-reference.com, and accounted for roster changes (as of Sunday, October 5).

The Tm2013 column is a three-letter abbreviation for the player’s 2013-2014 team. For players who played on two or more teams, the entry represents the team for which the player played the most minutes. The Tm2014 column is the player’s current NBA team (as of Sunday, October 5) according to nba.com. Rookies are not included in this spreadsheet.

If you’ve read my previous two posts, you may have guessed that I am leading up to a prediction of the upcoming 2014-2015 NBA season. You’d be correct – I will post my model and predictions tomorrow.

In the meantime, you can actually use the spreadsheet above to create a very crude prediction. The last column in this spreadsheet is “win shares”. If you create a pivot table based on Tm2014 and Win Shares, you get the sum of player win shares for current rosters – a crude measure of team strength. Here is what that table looks like:

Team Sum of WS
CLE 60.7
SAS 59.4
TOR 58.6
LAC 58
IND 56.6
OKC 54.1
GSW 53.8
PHO 52.6
POR 50.8
DAL 48.6
HOU 47.9
WAS 46.8
NYK 44.3
DET 43.9
MEM 43.5
MIA 40.8
CHI 40.4
ATL 38.9
DEN 36.9
MIN 34.3
SAC 32.6
CHA 32.1
NOP 31
LAL 26.9
BRK 26.1
BOS 25.9
UTA 23.8
ORL 23.1
MIL 21.3
PHI 7.2
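If you prefer code to pivot tables, the same aggregation is a one-line groupby in pandas. This is only a sketch; the file name and column names are assumptions about how the spreadsheet might be exported to CSV.

    import pandas as pd

    # Assumed CSV export of the spreadsheet: one row per player, with the
    # player's current team (Tm2014) and 2013-2014 Win Shares (WS).
    players = pd.read_csv("nba_rosters_2014.csv")

    # Equivalent of the pivot table: sum of Win Shares by current team.
    team_ws = players.groupby("Tm2014")["WS"].sum().sort_values(ascending=False)
    print(team_ws.round(1))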

I can think of at least five weaknesses in this “pivot table model”. Can you?

Predicting NBA Rookie Performance By Draft Position

Nate Silver (and others) have tracked how NBA draft position relates to total career performance; see, for example, this article. But what about first-year performance?

I pulled two sets of data from basketball-reference.com to answer this question: draft results (player and draft position) and rookie-season statistics, including Win Shares.

I merged the two using Power Query and created a pivot table to calculate the average number of rookie season “win shares” by draft position. You can download my Excel workbook here. Here is what I found:

[Chart: average rookie Win Shares by draft position, with logarithmic trendline]

The first pick in the draft averages nearly five Win Shares in his rookie season, and while the pattern is irregular, win shares decrease as we get deeper into the draft (duh). (The blip at the end is due to Isaiah Thomas, drafted by the Kings, who promptly screwed up by letting him go.) I have drawn a logarithmic trendline which fits the data not too shabbily: an R^2 of 0.7397. Obviously we could do much better if we considered additional factors related to the player (such as their college performance) and team (the strength of teammates playing the same position, who will compete with the rookie for playing time). Here are the averages for the first 31 draft positions:

Draft Position Win Shares
1 4.96
2 2.69
3 2.96
4 4.14
5 2.23
6 1.84
7 3.36
8 1.68
9 2.59
10 1.52
11 0.84
12 1.51
13 1.48
14 1.36
15 1.64
16 1.19
17 2.37
18 1.02
19 0.71
20 1.09
21 1.74
22 2.14
23 1.54
24 2.29
25 0.98
26 1.23
27 1.08
28 0.40
29 0.54
30 0.94
31 0.79
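If you want to reproduce the trendline outside of Excel, here is a quick sketch in Python using the averages from the table above. The coefficients and R^2 will not match the chart exactly, since the chart extends past pick 31 (the blip at the end is pick 60) while this table stops at 31.

    import numpy as np

    # Average rookie Win Shares by draft position (picks 1-31, from the table above).
    ws = np.array([4.96, 2.69, 2.96, 4.14, 2.23, 1.84, 3.36, 1.68, 2.59, 1.52,
                   0.84, 1.51, 1.48, 1.36, 1.64, 1.19, 2.37, 1.02, 0.71, 1.09,
                   1.74, 2.14, 1.54, 2.29, 0.98, 1.23, 1.08, 0.40, 0.54, 0.94,
                   0.79])
    pick = np.arange(1, len(ws) + 1)

    # Logarithmic trendline, same form as Excel's: WS ~ a * ln(pick) + b.
    a, b = np.polyfit(np.log(pick), ws, 1)
    pred = a * np.log(pick) + b
    r2 = 1 - ((ws - pred) ** 2).sum() / ((ws - ws.mean()) ** 2).sum()
    print(f"WS ~ {a:.2f} * ln(pick) + {b:.2f},  R^2 = {r2:.3f}")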

NBA Game Results: 2013-2014

The NBA preseason is in full swing! For those of you who like to fool around with data, I have prepared a CSV file with game-by-game results for the 2013-2014 season. The data was downloaded from basketball-reference.com using Power Query and cleaned up (see below).

The format is simple:

  • Date = When the game was played
  • Visitor = three letter abbreviation of the visiting team
  • VisitorPts = visiting team score
  • VisitorSeasonWins = number of wins by the visiting team for the entire season
  • Home = three letter abbreviation of the home team
  • HomePts = home team score
  • HomeSeasonWins = number of wins by the home team for the entire season
  • WinMargin = HomeSeasonWins – VisitorSeasonWins
  • Margin = HomePts – VisitorPts

I include the number of wins for each team in the file because I wanted to see how often good teams beat bad teams. The diagram below plots the difference in total season wins between the two teams against the margin of victory. I have used the trendline feature in Excel to verify that while good teams beat bad ones frequently (by definition, since “good” here just means more season wins), the variability is quite high. Notice the R^2 value.

[Chart: difference in season wins vs. margin of victory, with linear trendline]

The intercept for the trendline is 2.5967, which represents the home court advantage in points. In a future post I hope to use this data to make some predictions about the upcoming NBA season.
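Here is a sketch of the same trendline calculation in Python, for anyone who would rather script it than use Excel. The column names are the ones listed above; the file name is an assumption.

    import numpy as np
    import pandas as pd

    # Game-by-game results for 2013-14, in the format described above.
    games = pd.read_csv("nba_games_2013_2014.csv")   # file name is an assumption

    # Linear trendline: Margin ~ slope * WinMargin + intercept.
    slope, intercept = np.polyfit(games["WinMargin"], games["Margin"], 1)

    pred = slope * games["WinMargin"] + intercept
    ss_res = ((games["Margin"] - pred) ** 2).sum()
    ss_tot = ((games["Margin"] - games["Margin"].mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot

    # The intercept is the expected margin between evenly matched teams,
    # i.e. the home court advantage in points (about 2.6 in this data).
    print(f"Home court advantage ~ {intercept:.2f} points, R^2 = {r2:.3f}")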

Enjoy!

Liberal Arts: Who’s Our Champion Now?

Last year, Ben Thompson traced the evolution of Apple product introductions in “Whither Liberal Arts?”, from Steve Jobs’ iconic description of Apple as sitting at the “intersection of technology and liberal arts” to Tim Cook’s emphasis on “feeds and speeds”: a move from human stories to product design. These transitions have resulted in difficulties with “that vision thing”, in Thompson’s view. But why? Thompson:

It’s telling that Jobs did not say that Apple was at the intersection of technology and design. While any designer will tell you that design is about how it works, not looks, to pretend there isn’t a tension is naive. The liberal arts, though, and the humanities, are literally about people. They are about how they learn and experience the world, and each other.

Go and check out the introduction of the Mac from 1984. It’s all great, but pay particular attention to the “Chariots of Fire” sequence from 1:43 to 3:15. Six different liberal arts majors are represented:

  • Scrolling MACINTOSH marquee + “insanely great”
  • Paint (geisha) [Art]
  • MacWrite [English]
  • Spreadsheet + calculator [Mathematics]
  • Print Shop certificate
  • A whole bunch of fonts [Graphic Design]
  • Greek columns [Architecture]
  • Pascal code [Computer Science]
  • Paint again (Steve Jobs)
  • Chess

You may find it cheesy, but I eat this stuff up, particularly Jobs’ emotional response to the crowd after the Mac’s ‘speech’ at 4:03. It’s all very…human. Sure, it’s a product introduction by a big company, yet it still resonates emotionally. It’s worth a thousand TED talks.

Recently I spent two days discussing the present and future of the College of Liberal Arts at the University of Iowa in my capacity as Dean’s Advisory Board member. Reflecting on those discussions, and Thompson’s essay, it strikes me that Steve Jobs was perhaps the greatest advocate for liberal arts in our generation, at least in terms of societal impact. Liberal arts “champions”, almost by necessity, are difficult to spot because the power to influence is a product of their ability to articulate the value of liberal arts combined with their notoriety. Notoriety usually comes from exceptional achievement in a relatively narrow domain. We live in an era of specialization; a modern da Vinci would follow a radically different path than the genuine article. So while Jobs was a computer guy, I think it is fair to put him forward as a liberal arts champion.

It embarrasses me slightly that my self-appointed champion of liberal arts was a rich technologist, especially one as hyped as Steve Jobs. Even so, Jobs marshaled learning from related disciplines (architecture, computer science, art, and psychology among others) in pursuit of business success at Apple, resulting in one of the world’s most valued brands. Certainly these are significant achievements, however much or how little we choose to value them. It is uncouth to speak of brands as if they mean something, but because the Apple brand is so universally identified with creativity, aesthetics, and (gulp) a certain humanity, and because these qualities are essential to liberal arts, and because Jobs was the driving force behind the construction of this brand, who else but Jobs?

And after Jobs, now what? It’s hard to envision another technologist, as some descend into Silicon Valley self-parody, and others speak of “gamifying education”. Elon Musk shows signs of becoming the next great industrial titan, but has been content to drag race up and down Technology Avenue thus far. A politician? Doubtful, in this polarized time. A writer? Maybe somewhere else, but probably not in America. Any ideas?

The Only Interesting Thing About the Burrito Bracket Challenge

All models are wrong, but some are useful. Even when it comes to burritos. I had been successfully ignoring the seemingly interminable Burrito Bracket Challenge(*) on Five Thirty Eight until this retweet by Carl Bialik grabbed my attention:

Am I supposed to take any of this seriously? I don’t know. Nevertheless, curiosity got the better of me and I read the Vox article. Matthew Yglesias asserts that Chipotle is better because most people can actually find a Chipotle burrito when they want one, unlike La Taqueria, which is available only to the lucky few within a five-mile radius of the Mission District. In the words of Yglesias:

The best burrito is the burrito you actually want to have in real life, a burrito that is both tasty and available.

OK, so now we’re discussing which definition of “best burrito” is best. Of course, that’s subjective, as I pointed out (in a wildly different context) some time ago. In any decision making process, especially one involving analytics(**), it is critical to openly discuss the criteria for ranking one decision (or burrito) over another. That is obvious. Less obvious is the fact that we often choose our decision criterion based on how easy the criterion is to evaluate, rather than how relevant it is to the matter at hand.

Let’s return to burritos. The Five Thirty Eight criterion was “best tasting burrito”. Here’s what they had to do to figure that out:

  1. Decide on a list of attributes to consider, and their weighting. 
  2. Trim the list of all burrito joints in the US to a reasonable number using readily available data.
  3. Visit each location and sample at least one burrito.
  4. Take detailed notes on the experience and produce ratings based on the attributes.
  5. Review and calibrate the ratings.
  6. Determine the winner by sorting the ratings (and getting Silver’s blessing).

That’s real work! The Vox criterion was “tasty and available” (aka “scalable”). Here’s what they had to do:

  1. Get a list of chain restaurants that serve burritos. Remove Taco Bell from this list, because it is not tasty.
  2. Get the number of locations for each.
  3. Sort the list.

Much easier. A child, or possibly an intern, could do this in about thirty minutes, and you can do it informally in one second by whispering “Chipotle”. A more advanced version of the “scalability” study would be to replace step 2 with:

  • Calculate the total reach for each chain by finding the number of people who live within five miles of chain locations.

Throw in a D3 map with burrito reach for the top three chains, and you’ve got yourself a nice little post that will get a bunch of views and retweets.
 
But that would be pointless(***). This global analysis based on a scalability criterion is useless, and it’s useless even though we all know that the “best burrito criterion” question is subjective. The Five Thirty Eight criterion is certainly useful, even if you don’t agree with it: you’ve got an idea of where to find the best tasting burrito in America, based on a reasonable standard. If you’re looking for the best burrito in your region, you can find that too (Northeast? Head to the Bronx). The Vox criterion as evaluated above is not useful because you can’t use it to make decisions. It only tells you that there are tons of Chipotle locations, which you already knew. Such an analysis might be useful for burrito chains themselves, but not for consumers.   
 
If you are looking for a burrito you can actually eat, like, now, then a useful analysis would consist of opening the Yelp app, searching for “burrito” using “Current Location”, and sorting by rating. Of course, your analysis and my analysis based on this criterion would yield different results because we live in different places; the results of my analysis apply only to me.
 
If we wanted an analysis that would actually be useful in decision making, we’d need to modify our methodology to something like:

  1. Divide the United States into little squares, say 5 x 5 miles.
  2. In each square, determine the best “tasty and available” burrito restaurant based on some combination of Yelp reviews and distance.
  3. Create an interactive map based on these results.

This is much harder to do, but helpful: if you tell me where you are, I can tell you where you should get a burrito. The answer isn’t always Chipotle! Therefore, while the general point that “the best burrito is one that you can actually eat” is quite reasonable, a naive global analysis based on this criterion is quite useless.
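For concreteness, here is a toy sketch in Python of the grid idea. The restaurants, coordinates, and ratings below are entirely made up, and real Yelp data would of course be needed to do this properly.

    import pandas as pd

    # Hypothetical input: one row per burrito restaurant with a location and a
    # Yelp-style rating. The data and column names are made up for illustration.
    restaurants = pd.DataFrame({
        "name":   ["La Taqueria", "Chipotle #1041", "Al Pastor Papi"],
        "lat":    [37.7509, 37.7576, 37.7680],
        "lon":    [-122.4181, -122.4214, -122.4301],
        "rating": [4.5, 3.5, 4.0],
    })

    # Bin locations into roughly 5 x 5 mile cells (~0.07 degrees of latitude).
    CELL_DEG = 5 / 69.0  # one degree of latitude is about 69 miles
    restaurants["cell"] = list(zip((restaurants["lat"] // CELL_DEG).astype(int),
                                   (restaurants["lon"] // CELL_DEG).astype(int)))

    # Best "tasty and available" option per cell: highest rating in that square.
    best_per_cell = (restaurants.sort_values("rating", ascending=False)
                                .groupby("cell")
                                .first())
    print(best_per_cell[["name", "rating"]])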
 
Choosing decision criteria based on what is simplest rather than most relevant is a fundamental flaw of many analytics applications. I’ve seen it at every company I’ve worked for! For example, it’s common to instrument websites and apps to see how frequently different features, buttons, and pages are used. Data can be collected and we can see, for example, that 70% of the time users will choose to leave a product download page rather than register. If we collect additional information we can infer demographic information, the device they are using, and so on. If you’re trying to figure out how to modify the website, or which features to add to a product, you may turn to this data. But this data tells you only how users use what you’ve got, not what they would like to see. You need different data to answer that question: data obtained from A-B testing, surveys, a competitive analysis, and so on. Sorting whatever data you have nearby and making a cool chart out of it is not good analytics.
 
Software doesn’t really help you determine what data you’ll need to answer the questions you care about. It can help you access, process, visualize, and summarize, but that’s it. Our current emphasis on data visualization and infographics obscures this point. In the burrito case, it’s easier to make pretty pictures out of the “most scalable” data than the “best tasting” data, even though the former is less useful for decision making. Software vendors aren’t helping either: the emphasis on “storytelling” with data skims over the fact that “analytics stories” must have a “moral”; otherwise they merely entertain rather than inform.

(*) Congratulations, La Taqueria.
(**) because the process for making the decision itself is understood by fewer people, and if you change your mind you will have to ask a geek to go change their code.
(***) other than to attract page views…