Optimization In Google Sheets

Good news for those of you who use spreadsheets to do analytics: Google recently announced a Linear Optimization add-on for Google Sheets, and now Frontline Systems has released a free Solver add-on for Google Sheets that solves not only linear optimization problems but nonlinear ones as well. It has roughly the same capabilities as the Solver App for Excel Online. If you know how to use Excel's Solver, then you know how to use this. (Disclaimer: I participated in the development of both the Google Sheets and Excel Online apps during my tenure as CTO of Frontline. I think they are great.)

Here's how to get started with the Solver add-on for Google Sheets.

Step 1: Insert the Add-on. Create a new Google Sheet (for example by going to drive.google.com and clicking "New"). Then, under the Add-ons menu, click "Get add-ons…". Search for "solver" and you will see both the Google and Frontline apps:

[Screenshot: add-on search results showing both Solver apps]

Click on the button next to Solver. (Hi Edwin!) Now “Solver” will appear under the Add-ons menu. When you click on it, a pane will show up on the right-hand side of your screen.

[Screenshot: the Solver task pane]

Step 2: Create an optimization model. You can use the task pane to define the variables, objective, and constraints of your optimization model. Clicking on the “Insert Example” button will paste a sample problem into your sheet. Here’s what it looks like: it’s a production planning problem where we want to determine the number of TVs, Stereos, and Speakers to build in order to maximize profit.

[Screenshot: the example production planning model]

In the task pane on the right you can see that the profit cell (F13) has been selected as the objective we are maximizing. Similar to the Excel solver, you can define the constraints by clicking on them in the “Subject To” section.

[Screenshot: the objective and constraints in the task pane]

Step 3: Solve. Clicking Solve calls Frontline's Simplex solver to solve your model in the cloud (specifically, Windows Azure). The variables B3:D3 will be updated, as will any formulas that depend on those values. As you can see, profit goes up:

[Screenshot: the solved model with updated profit]

WINNING. If you fool around with the app you will see that you can solve models with arbitrary formulas, not just linear models. And it’s free! Go check it out.
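If you want a feel for the kind of model in the example outside of a spreadsheet, here is a toy production planning LP solved with scipy. The profit and resource numbers below are made up for illustration; they are not the coefficients from the Insert Example sheet.

```python
# Toy production planning LP (made-up coefficients, not the add-on's example data).
from scipy.optimize import linprog

profit = [75, 50, 35]        # profit per TV, Stereo, Speaker (assumed)
usage = [[1, 1, 0],          # chassis used per unit (assumed)
         [1, 0, 0],          # picture tubes per unit (assumed)
         [2, 2, 1]]          # speaker cones per unit (assumed)
available = [450, 250, 800]  # resources on hand (assumed)

# linprog minimizes, so negate the profits to maximize total profit.
res = linprog(c=[-p for p in profit], A_ub=usage, b_ub=available,
              bounds=[(0, None)] * 3)
print("build:", res.x.round(1), "profit:", round(-res.fun, 2))
```

The Solver add-on is doing the same kind of thing behind the scenes, with the model read from your sheet instead of hard-coded arrays.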


Predicting the 2014-2015 NBA Season

Over the weekend I created a model in Excel to predict the 2014-2015 NBA season. The model simulates the full 82-game schedule, using player Win Shares from the 2013-2014 season to estimate the strength of each team, accounting for roster changes. This model is not perfect, or even particularly sophisticated, but it is interesting. Here are the model’s predictions as of 10/15/2014, with projected playoff teams in bold.

Eastern W L PCT GB Home Road
Cleveland 61 21 0.744 0 31-10 30-12
Toronto 57 25 0.695 4 30-11 27-14
Chicago 49 33 0.598 12 26-15 23-18
Washington 44 38 0.537 17 24-17 21-20
New York 44 38 0.537 17 23-18 21-20
Miami 42 40 0.512 19 22-19 20-21
Detroit 42 40 0.512 19 22-19 20-21
Charlotte 41 41 0.500 20 22-19 19-22
Atlanta 41 41 0.500 20 22-19 19-22
Indiana 37 45 0.451 24 20-21 17-24
Boston 30 52 0.366 31 16-25 14-27
Brooklyn 26 56 0.317 35 14-27 11-30
Orlando 25 57 0.305 36 14-27 12-29
Milwaukee 24 58 0.293 37 14-27 11-30
Philadelphia 14 68 0.171 47 8-33 6-35
Western W L PCT GB Home Road
LA Clippers 57 25 0.695 0 30-11 27-14
San Antonio 57 25 0.695 0 30-11 27-14
Oklahoma City 54 28 0.646 3 29-14 26-15
Phoenix 53 29 0.646 4 28-13 25-16
Golden State 53 29 0.646 4 28-13 25-16
Portland 49 33 0.598 8 26-15 23-18
Houston 47 35 0.573 10 25-16 22-19
Dallas 47 36 0.566 10 24-17 22-19
Memphis 41 41 0.500 16 22-19 19-22
Denver 40 42 0.488 17 21-20 19-22
Minnesota 37 45 0.451 20 20-21 17-24
Sacramento 32 50 0.390 25 17-24 15-26
LA Lakers 31 51 0.378 26 17-24 14-27
New Orleans 27 55 0.329 30 15-26 12-29
Utah 25 58 0.301 33 14-27 11-30

You can download my full spreadsheet here. It’s complicated but not impossible to follow. It does not exactly match the results presented above because I have a messier version that accounts for recent injuries, e.g. Kevin Durant.

The Cavs, Spurs, and Clips are the favorites to win the title in this model (a previous version of this model also had the Thunder in this class, but Kevin Durant is now injured). Comparing these estimates to over-unders in Vegas, the biggest differences are Brooklyn (lower), Indiana (higher), Memphis (lower), Minnesota (higher), New Orleans (lower), Phoenix (higher). If you take the time to read through the methodology at the end of this post, you may be able to see why some of these differences exist. Some are probably reasonable, others may not be.

How It Works

Many of the ingredients for this model were presented in my previous three posts, where I compiled game-by-game results for the 2013-2014 season, built a simple model to predict rookie performance, and tracked roster changes. Now the task is pretty simple: estimate the strength of each team, figure out how unpredictable games are, and then simulate the season using the strengths, accounting for uncertainty. At the end I discuss weaknesses of this model, which, if you are a glass-half-full type of person, also suggest areas for improvement.

Step 1: Estimate the strength of each team. Team strengths are estimated by adding up the 2013-2014 Win Shares for the top twelve players on each NBA team. In my last post I gave a spreadsheet with Win Shares for all 2013-2014 NBA players based on data from basketball-reference.com. I made three adjustments to this data for the purposes of this analysis:

  • Added rookies. I estimated projected 2014-2015 Win Shares for rookies using the logarithmic curve given in this post.
  • Accounted for injuries to good players. Kobe Bryant, Derrick Rose, Rajon Rondo, and a couple of other high profile players were injured in 2013-2014. I replaced their Win Share total with the average of the past three seasons, including the season they were injured. Is this reasonable? I don’t know.
  • Trimmed to 12. I manually trimmed rosters so that only the 12 players with the highest Win Shares remained.

Adding Win Shares gives an overall “strength rating” for each team.

Step 2: Estimate the unpredictability of game results. Most of the time, a good team will beat a bad team. Most of the time. Can we quantify this more precisely? Sure. In a previous post, I determined that home court advantage is approximately 2.6 points per game. I also found that although the difference in total season wins is a predictor of who will win a matchup between two teams, it is a rather weak predictor. In other words, bad teams beat good teams quite often, especially at home. For our prediction model we make another simple assumption: every team's performance over the course of the season varies according to a normal distribution, with the mean of this distribution corresponding to its overall team strength.

A normal distribution is defined by two parameters: mean and standard deviation. If I know what the distribution looks like, then I can estimate the probability of the home team winning a matchup: center the distribution at the difference in team strengths and compute the probability that the margin exceeds –2.6, that is, that the strength edge plus the 2.6-point home court advantage is positive. But what is the standard deviation? I can estimate it by "replaying" all of the games in the previous season. If I guess a value for the standard deviation, I can calculate win probabilities for all games. If I add up the win probabilities for, say, Boston, the result should match Boston's win total for the season (sadly, 25). So to estimate the standard deviation, all I have to do is minimize the sum of deviations between estimated and actual 2013-2014 win totals. I can do this using Excel's Solver: it's a nonlinear minimization problem involving only one variable (the standard deviation).
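If you prefer to see this step as code, here is a minimal Python sketch of the same calibration, not the actual Excel workbook. The file and column names (team_strength.csv, games_2013_2014.csv, and so on) are hypothetical placeholders for the data described above.

```python
# Sketch of Step 2: fit the single standard deviation by "replaying" 2013-14.
# Hypothetical inputs: summed Win Shares per team (Step 1), actual win totals,
# and one row per game with Home/Visitor columns.
import pandas as pd
from scipy.stats import norm
from scipy.optimize import minimize_scalar

HOME_EDGE = 2.6  # home court advantage in points, from the earlier post

strength = pd.read_csv("team_strength.csv", index_col="Team")["WS"].to_dict()
actual_wins = pd.read_csv("wins_2013_2014.csv", index_col="Team")["W"].to_dict()
games = pd.read_csv("games_2013_2014.csv")

def home_win_prob(home, away, sigma):
    # Margin ~ Normal(strength difference, sigma); home wins if margin + HOME_EDGE > 0.
    return norm.cdf((strength[home] - strength[away] + HOME_EDGE) / sigma)

def total_deviation(sigma):
    # Each team's expected wins are its summed win probabilities over the season.
    expected = {team: 0.0 for team in strength}
    for g in games.itertuples():
        p = home_win_prob(g.Home, g.Visitor, sigma)
        expected[g.Home] += p
        expected[g.Visitor] += 1.0 - p
    return sum(abs(expected[t] - actual_wins[t]) for t in strength)

fit = minimize_scalar(total_deviation, bounds=(1.0, 40.0), method="bounded")
sigma = fit.x  # the single unpredictability parameter
```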

It turns out that the resulting estimate does a very good job of matching 2013-2014 results:

  • The estimated win totals for all teams were within 2 wins of their actual values.
  • The estimated home winning percentage matches the actual value quite closely!

Step 3: Determine win totals for the 2014-2015 season. I obtained the 2014-2015 schedule for $5 from nbastuffer.com. Using this schedule, I calculated win probabilities for each game using the team strengths in Step 1 and the standard deviation in Step 2. If I add up these probabilities for each team, I get their estimated win totals. Voila! Since the prediction is created by looking at each game on the schedule, we also get home and away records, in-conference records, and so on. It's also easy to update the estimate during the season as games are played, players are traded or injured, and so on.
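In code, Step 3 is just a few more lines on top of the Step 2 sketch (again with a hypothetical schedule file):

```python
# Project 2014-15 wins by summing per-game win probabilities over the new schedule.
schedule = pd.read_csv("schedule_2014_2015.csv")   # hypothetical; Home/Visitor columns
projected = {team: 0.0 for team in strength}
for g in schedule.itertuples():
    p = home_win_prob(g.Home, g.Visitor, sigma)
    projected[g.Home] += p            # expected home wins
    projected[g.Visitor] += 1.0 - p   # expected road wins

print(pd.Series(projected).sort_values(ascending=False).round())
```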

Why this model stinks. The biggest virtue of this model is that it was easy to build. I can think of at least ten potential shortcomings:

  1. Win Shares are probably not the best metric for individual and team strength.
  2. It assumes that individual performance for 2014-2015 will be the same as 2013-2014. Paul Pierce isn’t getting any younger.
  3. It does not account for predictable changes in playing time from season-to-season.
  4. 2014-2015 win shares are not normalized to account for players leaving and entering the league.
  5. 2014-2015 win shares do not account for positive and negative synergies between players.
  6. There is no reason to believe that the standard deviation calculated in Step 2 should be the same for all teams.
  7. I have not given any justification for using a normal distribution at all!
  8. The vagaries of the NBA schedule are not accounted for. For example, teams play worse in their second consecutive road game.
  9. Several teams, including the Philadelphia 76ers, will tank games.
  10. Injuries were handled in an arbitrary and inconsistent fashion.

It will be interesting to see how this model performs, in spite of its shortcomings.

NBA Rosters and Team Changes: 2013-2014

I have downloaded statistics for all NBA players from basketball-reference.com, and accounted for roster changes (as of Sunday, October 5).

The Tm2013 column is a three-letter abbreviation for the player's 2013-2014 team. For players that played on two or more teams, the entry represents the team for which the player played the most minutes. The Tm2014 column is the player's current NBA team (as of Sunday, October 5) according to nba.com. Rookies are not included in this spreadsheet.

If you’ve read my previous two posts, you may have guessed that I am leading up to a prediction of the upcoming 2014-2015 NBA season. You’d be correct – I will post my model and predictions tomorrow.

In the meantime, you can actually use the spreadsheet above to create a very crude prediction. The last column in this spreadsheet is “win shares”. If you create a pivot table based on Tm2014 and Win Shares, you get the sum of player win shares for current rosters – a crude measure of team strength. Here is what that table looks like:

Team Sum of WS
CLE 60.7
SAS 59.4
TOR 58.6
LAC 58
IND 56.6
OKC 54.1
GSW 53.8
PHO 52.6
POR 50.8
DAL 48.6
HOU 47.9
WAS 46.8
NYK 44.3
DET 43.9
MEM 43.5
MIA 40.8
CHI 40.4
ATL 38.9
DEN 36.9
MIN 34.3
SAC 32.6
CHA 32.1
NOP 31
LAL 26.9
BRK 26.1
BOS 25.9
UTA 23.8
ORL 23.1
MIL 21.3
PHI 7.2
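
If you would rather script this than build a pivot table, a rough pandas equivalent looks like the sketch below; it assumes the spreadsheet has been exported as players.csv with columns named Tm2014 and WS (both names hypothetical).

```python
# Rough pandas analogue of the pivot table above (hypothetical file/column names).
import pandas as pd

players = pd.read_csv("players.csv")   # one row per player
team_strength = (players.groupby("Tm2014")["WS"]
                        .sum()
                        .sort_values(ascending=False)
                        .round(1))
print(team_strength)   # crude strength rating: summed Win Shares per current roster
```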

I can think of at least five weaknesses in this “pivot table model”. Can you?

Predicting NBA Rookie Performance By Draft Position

Nate Silver and others have tracked how NBA draft position relates to total career performance; see, for example, this article. But what about first-year performance?

I pulled two sets of data from basketball-reference.com to answer this question: draft history and rookie-season statistics.

I merged the two data sets using Power Query and then created a pivot table to calculate the average number of rookie-season "win shares" by draft position. You can download my Excel workbook here. Here is what I found:

[Chart: average rookie-season Win Shares by draft position, with logarithmic trendline]

The first pick in the draft averages nearly five Win Shares in his rookie season, and while the pattern is irregular, Win Shares decrease as we get deeper into the draft (duh). (The blip at the end is due to Isaiah Thomas, drafted by the Kings, who promptly screwed up by letting him go.) I have drawn a logarithmic trendline, which fits the data not too shabbily: R^2 of 0.7397. Obviously we could do much better if we considered additional factors related to the player (such as college performance) and the team (the strength of teammates playing the same position, who will compete with the rookie for playing time). Here are the averages for the first 31 draft positions; a quick curve-fitting sketch follows the table.

Draft Position Win Shares
1 4.96
2 2.69
3 2.96
4 4.14
5 2.23
6 1.84
7 3.36
8 1.68
9 2.59
10 1.52
11 0.84
12 1.51
13 1.48
14 1.36
15 1.64
16 1.19
17 2.37
18 1.02
19 0.71
20 1.09
21 1.74
22 2.14
23 1.54
24 2.29
25 0.98
26 1.23
27 1.08
28 0.40
29 0.54
30 0.94
31 0.79
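
Here is the quick curve-fitting sketch mentioned above. It fits the same logarithmic trendline to the table of averages; note that the R^2 it reports will not exactly match the 0.7397 from the chart, which was fit over the full set of draft positions.

```python
# Fit WS ~ a*ln(pick) + b to the averages in the table above.
import numpy as np

ws = np.array([4.96, 2.69, 2.96, 4.14, 2.23, 1.84, 3.36, 1.68, 2.59, 1.52,
               0.84, 1.51, 1.48, 1.36, 1.64, 1.19, 2.37, 1.02, 0.71, 1.09,
               1.74, 2.14, 1.54, 2.29, 0.98, 1.23, 1.08, 0.40, 0.54, 0.94, 0.79])
pick = np.arange(1, len(ws) + 1)

a, b = np.polyfit(np.log(pick), ws, 1)   # same form as Excel's logarithmic trendline
pred = a * np.log(pick) + b
r2 = 1 - ((ws - pred) ** 2).sum() / ((ws - ws.mean()) ** 2).sum()
print(f"WS ~ {a:.3f}*ln(pick) + {b:.3f}, R^2 = {r2:.3f}")
```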

NBA Game Results: 2013-2014

The NBA preseason is in full swing! For those of you who like to fool around with data, I have prepared a CSV file with game-by-game results for the 2013-2014 season. The data was downloaded from basketball-reference.com using Power Query and cleaned up (see below).

The format is simple:

  • Date = When the game was played
  • Visitor = three letter abbreviation of the visiting team
  • VisitorPts = visiting team score
  • VisitorSeasonWins = number of wins by the visiting team for the entire season
  • Home = three letter abbreviation of the home team
  • HomePts = home team score
  • HomeSeasonWins = number of wins by the home team for the entire season
  • WinMargin = HomeSeasonWins – VisitorSeasonWins
  • Margin = HomePts – VisitorPts

I include the number of wins for each team in the file because I wanted to see how often good teams beat bad teams. The diagram below plots the difference in total season wins against the margin of victory. I have used the trendline feature in Excel to verify that while good teams (by definition) beat bad ones more often than not, the variability is quite high. Notice the R^2 value.

[Chart: difference in season wins vs. margin of victory, with linear trendline]

The intercept for the trendline is 2.5967, which represents the home court advantage in points. In a future post I hope to use this data to make some predictions about the upcoming NBA season.
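To double-check the trendline outside Excel, here is a minimal sketch against the CSV described above; the file name is a hypothetical placeholder.

```python
# Recover the trendline of Margin vs. WinMargin from the game-by-game CSV.
import numpy as np
import pandas as pd

games = pd.read_csv("nba_games_2013_2014.csv")   # hypothetical file name
slope, intercept = np.polyfit(games["WinMargin"], games["Margin"], 1)

pred = slope * games["WinMargin"] + intercept
ss_res = ((games["Margin"] - pred) ** 2).sum()
ss_tot = ((games["Margin"] - games["Margin"].mean()) ** 2).sum()
print(f"intercept = {intercept:.2f} points (home court advantage), R^2 = {1 - ss_res / ss_tot:.3f}")
```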

Enjoy!

401k Simulation Using Analytic Solver Platform

You can build a pretty decent 401k simulation in a few minutes in Excel using Analytic Solver Platform:

[Screenshot: the completed 401k simulation workbook]

Let’s give it a shot! You can download the completed workbook here.

First, let’s build a worksheet that calculates 401k balances for 10 years. At the top of the worksheet let’s enter a yearly contribution rate:

[Screenshot: the yearly contribution rate cell]

Let’s compute 401k balances for the next 10 years, based on this contribution. A simple calculation for the balance for a given year involves five factors:

  1. The 401k balance for the previous year.
  2. The rate of return for the 401k.
  3. The previous year’s salary.
  4. The rate of increase in the salary (your raise).
  5. The rate of contribution (entered above).

In row 6 we will enter the starting values for return, salary increase, salary, and balance in columns B, C, D, E respectively. For now let's assume:

  • Return = 0.05
  • Salary Increase = 0.05
  • Balance = 5,000
  • Salary = 100,000

With a couple of small assumptions, the new balance is the old balance times (1 + return), plus the contribution rate times the new salary (last year's salary adjusted for the raise). In the next row we will compute Year 1, using these formulas:

  • Salary = D6 * (1 + C6). This simply means that this year’s salary is last year’s adjusted by raise. (Obviously salary could be modeled differently depending on when the raise kicks in.)
  • Balance = E6*(1 + B6) + D6*$B$3. There are two terms: the first grows the old balance by the portfolio return, and the second adds the salary times the contribution rate.

We can fill these values down, giving us the 401k balance for the entire period:

[Screenshot: 401k balances filled down for ten years]
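To make the arithmetic concrete, here is the same fill-down as a quick Python sketch. The flat 5% guesses come from above; the contribution rate is never shown in the post, so 5% is assumed here.

```python
# Deterministic mirror of the spreadsheet fill-down (5% contribution rate assumed).
RETURN, RAISE, CONTRIB = 0.05, 0.05, 0.05
balance, salary = 5_000.0, 100_000.0

for year in range(1, 11):
    salary *= 1 + RAISE                                  # this year's salary
    balance = balance * (1 + RETURN) + CONTRIB * salary  # grow the balance, add contribution
    print(year, round(salary), round(balance))
```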

Here’s the thing: we don’t actually know what our portfolio return and salary increases will be in future years. They’re uncertain. We can use Analytic Solver Platform to turn the wild guesses in columns B and C into probability distributions. Using simulation we can then determine the most likely range for future 401k balances.

For portfolio return, a reasonable thing to do is to go back and look at past performance. Rates of return for the S&P 500 (and other financial instruments) are given on this page. Using the “From Web” feature of Power Query (or by simply copy-pasting) you can bring this data into another Excel worksheet with no sweat:

[Screenshot: historical S&P 500 returns imported into Excel]

Now let's turn this historical data into a probability distribution we can use in our model. Select the S&P 500 historical return data and select Distributions –> Distribution Wizard in the Analytic Solver Platform tab:

[Screenshot: the Distribution Wizard menu]

Fill in the first page of the wizard:

[Screenshot: the first page of the Distribution Wizard]

Select “continuous values” in the next step, “Fit the data” in the next, and then pick an empty cell for “Location” in the final step. In the cell that you selected, you will see a formula something like this:

=PsiWeibull(3.55593208704872,0.692234009779183, PsiShift(-0.509633992648591))

This is a Weibull distribution that fits the historical data. If you hit “F9” to recalculate the spreadsheet you will see that the value for this cell changes as a result of sampling from this distribution. Each sample is a different plausible yearly return. Let’s copy this formula in place of the 0.05 values we entered in column B of our original spreadsheet. If we click on the “Model” button in the Analytic Solver Platform ribbon, we will see that these cells have been labeled as “Uncertain Variables” in the Simulation section.
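If you are curious what a fit like that looks like outside the wizard, here is a rough scipy sketch; it is not Frontline's fitting code, and the returns file name is a hypothetical placeholder.

```python
# Fit a shifted Weibull to historical annual returns, analogous to
# PsiWeibull(shape, scale, PsiShift(shift)).
import numpy as np
from scipy.stats import weibull_min

returns = np.loadtxt("sp500_annual_returns.txt")   # hypothetical one-column file
shape, shift, scale = weibull_min.fit(returns)     # scipy's loc plays the role of PsiShift
print(shape, scale, shift)                         # compare with the PsiWeibull parameters
print(weibull_min.rvs(shape, loc=shift, scale=scale, size=5))  # plausible yearly returns
```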

For Salary Increase we will do something simpler. Let’s just assume that the increase will be between 2% and 7% each year. Enter =PsiUniform(0.02, 0.07) in cell C6, and fill down.

The last thing we need to do is to define an "output" for the simulation, called an Uncertain Function. When we define Uncertain Functions, we get nice charts and stats for these cells when we run a simulation. Click on the Balance entry for Year 10, then click on the arrow next to the "+" in the Model Pane, and then Add Uncertain Function. Your Model Pane will look something like this:

[Screenshot: the Model Pane with the uncertain function added]

And your spreadsheet will look something like this:

[Screenshot: the worksheet with uncertain variables and the uncertain function]

Now all we need to do is click Simulate in the ribbon. Analytic Solver Platform draws samples for the uncertain variables (evaluating everything in parallel for fast performance) and then shows a chart of the different possible 401k balances. As you can see, the possible balances vary widely but are concentrated around $100,000:

 

[Chart: simulated distribution of Year 10 401k balances]
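For comparison, here is a minimal Monte Carlo sketch of the whole model in Python. It is not Analytic Solver Platform's engine: it reuses the rounded Weibull parameters shown earlier, the Uniform(2%, 7%) raises, and an assumed 5% contribution rate (the post does not show the value in B3).

```python
# Minimal Monte Carlo sketch of the 401k model above (5% contribution rate assumed).
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)
YEARS, TRIALS = 10, 10_000
CONTRIB = 0.05                                # assumed contribution rate (cell B3)
SHAPE, SCALE, SHIFT = 3.556, 0.692, -0.510    # rounded PsiWeibull parameters from above

rets = weibull_min.rvs(SHAPE, loc=SHIFT, scale=SCALE,
                       size=(TRIALS, YEARS), random_state=rng)
raises = rng.uniform(0.02, 0.07, size=(TRIALS, YEARS))

balances = np.full(TRIALS, 5_000.0)
salaries = np.full(TRIALS, 100_000.0)
for y in range(YEARS):
    salaries *= 1 + raises[:, y]                                  # this year's salaries
    balances = balances * (1 + rets[:, y]) + CONTRIB * salaries   # grow and contribute

print(np.percentile(balances, [10, 50, 90]))   # likely range of Year 10 balances
```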

Here's the great thing: you can now extend this spreadsheet to your heart's content, building simulations that incorporate more factors. If you want to get really fancy, you can correlate yearly returns. Check out the extensive help on solver.com for more.

Time Series Forecasting using Analytic Solver Platform, Windows Azure Marketplace, and Power Query

In this post, I’ll show you how to use Analytic Solver Platform in Excel 2013 with Power BI to build a time series analysis model using data from the cloud. Together, Analytic Solver Platform and Power BI provide a powerful, easy-to-use predictive analytics platform at a fraction of the cost of alternatives. A new release of Analytic Solver Platform is coming later this week! Visit solver.com to download a free trial or purchase.

My goal is to forecast Gross Domestic Product (GDP) in the United States. Historical GDP is available for free on Windows Azure Marketplace, so our first step is to subscribe to the data feed.

  1. Go to http://datamarket.azure.com and sign in using your Windows ID.
  2. Go to the Real GDP Per Capita By State data set and subscribe to it. The subscription is free.

If you look around you will find a number of interesting data sets. 

Now we'd like to bring this data into Excel so we can work with it. Power Query, a free Excel add-in provided by Microsoft, makes this extremely easy. If you use Excel and do not know about Power Query, it is definitely worth your time!

  1. Download and install Microsoft Power Query for Excel here.
  2. Start Excel and click on the Power Query tab.
  3. Select "From Other Sources" and then "From Windows Azure Marketplace".
  4. The Navigator task pane will open on the right hand side. Your Windows Azure Marketplace subscriptions will be listed, including Real GDP Per Capita By State.
  5. Expand the Real GDP Per Capita By State entry and click on its only sub-item, called RealGDPByStatePerCapital [sic].
  6. The Query Editor window is displayed. You should see GDP information by state as well as rolled up to the US:
    [Screenshot: GDP by state in the Query Editor]
  7. Let’s focus on national GDP. Click on the Area column header, uncheck “Select All” to clear all entries and scroll down to United States (screenshot below). Click OK.
  8. Click Apply and Close. A new worksheet with national GDP data will be created by Power Query.

These steps only scratch the surface of Power Query. For one thing, Power Query supports a number of interesting data sources, as you may have noticed in the menu in step 3. For another, Power Query lets you easily join and clean data, loading it into worksheets, or PowerPivot data models.

Now that we’ve got the data into Excel, we can use XLMiner to run a time series forecast.

  1. Click on the XLMiner ribbon tab.
  2. Under Time Series, select ARIMA -> ARIMA Model.
  3. Enter the following options (and check out the screenshot below):
    1. Choose RealGDPPerCapital as the “Selected Variable”. This is what we want to forecast.
    2. Select Year as the Time Variable.
    3. Set Autoregressive (p) = 4, Difference (d) = 0, Moving average (q) = 1.
      [Screenshot: the ARIMA options dialog]
  4. Click Advanced and:
    1. Check “Fitted values and residuals”. This will show the error in the time series analysis in the report.
    2. Check “Produce forecasts” and change “Number of forecasts” to 5.
    3. Click OK.
  5. Click OK. XLMiner produces an output worksheet, a residuals worksheet, and a stored model worksheet [which can be used for scoring].
  6. We want to look at how well the ARIMA model performed. Click on the ARIMA_Residuals worksheet and compare the Actual Value and Fitted Value columns. Not too shabby!

Exploring the tables and charts in the output worksheets reveals other aspects of the model results, including the forecasted GDP values.
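If you want to reproduce the fit outside of XLMiner, here is a rough statsmodels sketch of the same ARIMA(4,0,1) model; it assumes the national GDP series has been exported to a CSV with Year and RealGDPPerCapital columns (a hypothetical file name).

```python
# ARIMA(4,0,1) on the national GDP series, analogous to the XLMiner run above.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

gdp = pd.read_csv("us_real_gdp_per_capita.csv", index_col="Year")["RealGDPPerCapital"]
fit = ARIMA(gdp, order=(4, 0, 1)).fit()   # p=4, d=0, q=1 as in the post

print(fit.fittedvalues.tail())   # compare to the Fitted Value column in ARIMA_Residuals
print(fit.forecast(steps=5))     # the five forecasts requested in the Advanced dialog
```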

Once you get going, you’ll find that it’s really easy to build cool predictive models using Power Query and Analytic Solver Platform. In the coming days I will share some other interesting scenarios.