2014 In Review: Five Data Science Trends

2014 was another transformative, exciting year for data science. Summarizing even one measly year of progress is difficult! Here, in no particular order, are five trends that caught my attention. They are likely to continue in 2015.

Adoption of higher-productivity analytics programming environments. Traditional languages and environments such as C, C++, and SAS are diminishing in importance as R, Python, and Scala ascend. It is not that data scientists are dumping the old stuff; it is that a flood of new data scientists have entered the fray, overwhelmingly choosing more modern environments. These newer systems provide language conveniences as well as a rich set of built-in (or easy-to-install) libraries that handle higher-abstraction analytics tasks. Modern data scientists don’t want to write CSV read routines, JSON parsers, SQL INSERTs, or logging systems. R is notable in that its productivity gains come from its packages and community support, not from the language itself. R’s clunkiness will be its downfall as Python, Scala, and other languages gain greater traction within the analytics community and narrow the library gap.

Machine learning is king. Data science, however you define it, is a broad discipline, but the thunder belongs to machine learning. Indeed, machine learning is becoming synonymous with data science. Optimization, for example, is increasingly described as a backend algorithm supporting prediction or classification; objective functions are described as “learning functions”, and so on. We are collectively inventing definitions and classifications as we go along, so in some sense this is merely semantics. There is a real problem, however: we risk ignoring the collective wisdom of past masters, throwing broad but shallow techniques at models with quite particular structure. Somewhere in a coffee shop right now, somebody is using a genetic algorithm to solve a network model.

Visualization is everywhere. To the point that it’s starting to get annoying. Whether in solutions such as Tableau or TIBCO Spotfire, add-ons such as Excel Power View or Frontline’s XLMiner Visualization app (plug!), or programming libraries such as matplotlib or d3.js, the machinery to build good visualizations is maturing. The explosion of infographics in popular media has raised expectations: users expect visualizations to guide and inform them at all stages of the analytics lifecycle. As you have no doubt seen, the problem is that it is so easy to build shitty viz: bar charts with no content; furious heat maps signifying nothing. We’ll start to see broader, if unarticulated, consensus on appropriate visualizations and visual metaphors for the results of quantitative analysis. I hope.

Spark is supplanting Hadoop. Apache Spark wins on performance, has well-designed streaming, text analytics, machine learning, and data access libraries, and has huge community momentum. This was all true at the beginning of 2014, but now, at the end of 2014, we are starting to see a breakthrough in industry investment. Hadoop isn’t going anywhere, but in 2015 many new, “big time” big data projects will be built on Spark. The flexible graph-based execution pipeline at the heart of Spark is begging to be exploited by great data scientists; what will 2015 bring?

Service-oriented architectures are coming to analytics. The ubiquity of REST-based endpoints in web development, combined with a new culture of code sharing, has engendered a new “mixtape” development paradigm. A kid (or even an older guy…) can whip out their MacBook, create a web service in Django, deploy it on AWS using Elastic Beanstalk, connect it to interactive visualizations in an afternoon, and submit an app that night. Amazon has built a $1B+ side business on the strength of cloud-only services, and these same forces will drive analytics forward. The prime mover in data science is not big data; it is cloud. RESTful, service-based analytics will explode.

Optimization In Google Sheets

Good news for those of you who use spreadsheets to do analytics: Google recently announced a Linear Optimization add-on for Google Sheets, and now Frontline Systems has released a free Solver add-on for Google Sheets that solves not only linear optimization problems but nonlinear ones as well. It has roughly the same capabilities as the Solver App for Excel Online. If you know how to use Excel’s Solver, then you know how to use this. (Disclaimer: I participated in the development of both the Google Sheets and Excel Online apps during my tenure as CTO of Frontline. I think they are great.)

Here’s how to get started with the Solver add-on for Google Sheets.

Step 1: Insert the Add-on. Create a new Google Sheet (for example by going to drive.google.com and clicking “New”). Then, under the Add-ons menu, click “Get add-ons…”. Search for “solver” and you will see both the Google and Frontline apps:

[screenshot: add-on search results showing the Google and Frontline solver apps]

Click on the button next to Solver. (Hi Edwin!) Now “Solver” will appear under the Add-ons menu. When you click on it, a pane will show up on the right-hand side of your screen.

[screenshot: the Solver task pane on the right-hand side of the sheet]

Step 2: Create an optimization model. You can use the task pane to define the variables, objective, and constraints of your optimization model. Clicking on the “Insert Example” button will paste a sample problem into your sheet. Here’s what it looks like: it’s a production planning problem where we want to determine the number of TVs, Stereos, and Speakers to build in order to maximize profit.

[screenshot: the Insert Example production planning worksheet]

In the task pane on the right you can see that the profit cell (F13) has been selected as the objective we are maximizing. Similar to the Excel solver, you can define the constraints by clicking on them in the “Subject To” section.

[screenshot: the objective and constraints defined in the Solver task pane]
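If it helps to see the structure of such a model in code, here is a minimal sketch of a similar product mix problem using scipy.optimize.linprog. The profit and parts numbers are made up for illustration; they are not the values from the Insert Example worksheet.

```python
from scipy.optimize import linprog

# A hypothetical product mix model in the spirit of the example: choose how many TVs,
# stereos, and speakers to build to maximize profit, given limited parts inventory.
# All numbers are illustrative, not the values in the example worksheet.
profit = [75, 50, 35]                 # profit per TV, stereo, speaker
usage = [                             # parts used per unit built
    [1, 1, 0],                        # chassis
    [1, 0, 0],                        # picture tube
    [2, 2, 1],                        # speaker cone
    [1, 1, 0],                        # power supply
]
available = [450, 250, 800, 450]      # parts on hand

# linprog minimizes, so negate the profits to maximize.
result = linprog(c=[-p for p in profit], A_ub=usage, b_ub=available,
                 bounds=[(0, None)] * 3, method="highs")

print("Build (TVs, stereos, speakers):", result.x)
print("Maximum profit:", -result.fun)
```

The task pane captures exactly these pieces: variable cells, an objective cell, and constraint cells.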

Step 3: Solve. Clicking on Solve will call Frontline’s Simplex solver to solve your model in the cloud (specifically, on Windows Azure…). The variable cells B3:D3 will be updated, as will any formulas that depend on those values. As you can see, profit goes up:

[screenshot: the solved worksheet with updated variable cells and profit]

WINNING. If you fool around with the app you will see that you can solve models with arbitrary formulas, not just linear models. And it’s free! Go check it out.

Predicting the 2014-2015 NBA Season

Over the weekend I created a model in Excel to predict the 2014-2015 NBA season. The model simulates the full 82-game schedule, using player Win Shares from the 2013-2014 season to estimate the strength of each team, accounting for roster changes. This model is not perfect, or even particularly sophisticated, but it is interesting. Here are the model’s predictions as of 10/15/2014, with projected playoff teams in bold.

Eastern W L PCT GB Home Road
Cleveland 61 21 0.744 0 31-10 30-12
Toronto 57 25 0.695 4 30-11 27-14
Chicago 49 33 0.598 12 26-15 23-18
Washington 44 38 0.537 17 24-17 21-20
New York 44 38 0.537 17 23-18 21-20
Miami 42 40 0.512 19 22-19 20-21
Detroit 42 40 0.512 19 22-19 20-21
Charlotte 41 41 0.500 20 22-19 19-22
Atlanta 41 41 0.500 20 22-19 19-22
Indiana 37 45 0.451 24 20-21 17-24
Boston 30 52 0.366 31 16-25 14-27
Brooklyn 26 56 0.317 35 14-27 11-30
Orlando 25 57 0.305 36 14-27 12-29
Milwaukee 24 58 0.293 37 14-27 11-30
Philadelphia 14 68 0.171 47 8-33 6-35
Western W L PCT GB Home Road
LA Clippers 57 25 0.695 0 30-11 27-14
San Antonio 57 25 0.695 0 30-11 27-14
Oklahoma City 54 28 0.646 3 29-14 26-15
Phoenix 53 29 0.646 4 28-13 25-16
Golden State 53 29 0.646 4 28-13 25-16
Portland 49 33 0.598 8 26-15 23-18
Houston 47 35 0.573 10 25-16 22-19
Dallas 47 36 0.566 10 24-17 22-19
Memphis 41 41 0.500 16 22-19 19-22
Denver 40 42 0.488 17 21-20 19-22
Minnesota 37 45 0.451 20 20-21 17-24
Sacramento 32 50 0.390 25 17-24 15-26
LA Lakers 31 51 0.378 26 17-24 14-27
New Orleans 27 55 0.329 30 15-26 12-29
Utah 25 58 0.301 33 14-27 11-30

You can download my full spreadsheet here. It’s complicated but not impossible to follow. It does not exactly match the results presented above, because those came from a messier version that also accounts for recent injuries (e.g., Kevin Durant’s).

The Cavs, Spurs, and Clips are the favorites to win the title in this model (a previous version of this model also had the Thunder in this class, but Kevin Durant is now injured). Comparing these estimates to over-unders in Vegas, the biggest differences are Brooklyn (lower), Indiana (higher), Memphis (lower), Minnesota (higher), New Orleans (lower), Phoenix (higher). If you take the time to read through the methodology at the end of this post, you may be able to see why some of these differences exist. Some are probably reasonable, others may not be.

How It Works

Many of the ingredients for this model were presented in my previous three posts, where I compiled game-by-game results for the 2013-2014 season, built a simple model to predict rookie performance, and tracked roster changes. Now the task is pretty simple: estimate the strength of each team, figure out how unpredictable games are, and then simulate the season using the strengths, accounting for uncertainty. At the end I discuss weaknesses of this model, which, if you are a glass-half-full type of person, also suggest areas for improvement.

Step 1: Estimate the strength of each team. Team strengths are estimated by adding up the 2013-2014 Win Shares for the top twelve players on each NBA team. In my last post I gave a spreadsheet with Win Shares for all 2013-2014 NBA players based on data from basketball-reference.com. I made three adjustments to this data for the purposes of this analysis:

  • Added rookies. I estimated projected 2014-2015 Win Shares for rookies using the logarithmic curve given in this post.
  • Accounted for injuries to good players. Kobe Bryant, Derrick Rose, Rajon Rondo, and a couple of other high profile players were injured in 2013-2014. I replaced their Win Share total with the average of the past three seasons, including the season they were injured. Is this reasonable? I don’t know.
  • Trimmed to 12. I manually trimmed rosters so that only the 12 players with the highest Win Shares remained.

Adding Win Shares gives an overall “strength rating” for each team.
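If you want to reproduce this step outside Excel, it boils down to a grouped sum. Here is a minimal pandas sketch, assuming the adjusted player data has been exported to a CSV with (hypothetical) columns Tm2014 and WS:

```python
import pandas as pd

# Hypothetical export of the adjusted player data: one row per player.
players = pd.read_csv("players_2014_15.csv")    # columns: Player, Tm2014, WS

# Keep the 12 highest-Win-Share players on each roster, then sum per team.
top12 = (players.sort_values("WS", ascending=False)
                .groupby("Tm2014")
                .head(12))
strength = top12.groupby("Tm2014")["WS"].sum().sort_values(ascending=False)
print(strength)
```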

Step 2: Estimate the unpredictability of game results. Most of the time, a good team will beat a bad team. Most of the time. Can we quantify this more precisely? Sure. In a previous post I determined that home court advantage is approximately 2.6 points per game. We also found that although the difference in total season wins is a predictor of who will win a matchup between two teams, it is a rather weak predictor. In other words, bad teams beat good teams quite often, especially at home. For our prediction model we make another simple assumption: every team’s performance over the course of the season varies according to a normal distribution, with the mean of this distribution corresponding to its overall team strength.

Normal distributions are defined by two parameters: mean and standard deviation. If I know what the distribution looks like, I can estimate the probability of the home team winning a matchup: take the difference of the two team strengths, then evaluate the CDF of the resulting distribution at -2.6 (the home court advantage). But what is the standard deviation? I can estimate it by “replaying” all of the games in the previous season. If I guess a value for the standard deviation, I can calculate win probabilities for all games. If I add up the win probabilities for, say, Boston, then the sum should match Boston’s actual win total for the season (sadly, 25). So to estimate the standard deviation, all I have to do is minimize the total deviation between estimated and actual 2013-2014 win totals. I can do this using Excel’s Solver: it’s a nonlinear minimization problem involving only one variable (the standard deviation).
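Here is a rough Python sketch of one reading of this step (file and column names are hypothetical): the home team’s margin, before the home court bump, is normal with mean equal to the strength difference; the home team wins whenever that margin exceeds -2.6; and the standard deviation is chosen to minimize the total gap between estimated and actual 2013-2014 win totals.

```python
import pandas as pd
from scipy.stats import norm
from scipy.optimize import minimize_scalar

HOME_EDGE = 2.6  # home court advantage in points

# Hypothetical inputs: one row per 2013-14 game, plus team strengths and actual win totals.
games = pd.read_csv("games_2013_14.csv")                    # columns: Home, Visitor
strength = pd.read_csv("strengths_2013_14.csv", index_col="Team")["Strength"]
actual = pd.read_csv("wins_2013_14.csv", index_col="Team")["Wins"]

def expected_wins(sigma):
    """Expected season wins per team if each game margin is N(strength difference, sigma)."""
    diff = strength.loc[games["Home"]].to_numpy() - strength.loc[games["Visitor"]].to_numpy()
    p_home = norm.sf(-HOME_EDGE, loc=diff, scale=sigma)     # P(home margin > -2.6)
    wins = pd.Series(0.0, index=strength.index)
    for team in strength.index:
        wins[team] = (p_home[(games["Home"] == team).to_numpy()].sum()
                      + (1 - p_home)[(games["Visitor"] == team).to_numpy()].sum())
    return wins

def fit_error(sigma):
    # Total absolute gap between estimated and actual 2013-14 win totals.
    return (expected_wins(sigma) - actual).abs().sum()

best = minimize_scalar(fit_error, bounds=(1.0, 30.0), method="bounded")
print("Fitted standard deviation (points):", best.x)
```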

It turns out that the resulting estimate does a very good job of matching 2013-2014 results:

  • The estimated win totals for all teams were within 2 wins of their actual values.
  • The estimated home winning percentage matches the actual value quite closely!

Step 3: Determine win totals for the 2014-2015 season. I obtained the 2014-2015 schedule for $5 from nbastuffer.com. Using this schedule, I calculated win probabilities for each game using the team strengths from Step 1 and the standard deviation from Step 2. If I add up the probabilities for each team, I get their estimated win totals. Voila! Since the prediction is created by looking at each game on the schedule, we also get home and away records, in-conference records, and so on. It’s also easy to update the estimate during the season as games are played, players are traded or injured, and so on.
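Continuing the sketch above (same hypothetical names), Step 3 reuses the fitted standard deviation on the 2014-2015 schedule and sums the per-game win probabilities by team:

```python
# strength now holds the 2014-15 strengths from Step 1; best.x is the Step 2 standard deviation.
schedule = pd.read_csv("schedule_2014_15.csv")              # columns: Home, Visitor (hypothetical)
sigma = best.x

diff = strength.loc[schedule["Home"]].to_numpy() - strength.loc[schedule["Visitor"]].to_numpy()
p_home = norm.sf(-HOME_EDGE, loc=diff, scale=sigma)         # P(home team wins each game)

projected = pd.Series(0.0, index=strength.index)
for team in strength.index:
    projected[team] = (p_home[(schedule["Home"] == team).to_numpy()].sum()
                       + (1 - p_home)[(schedule["Visitor"] == team).to_numpy()].sum())

# Projected 2014-15 win totals; home/road and conference splits fall out of the same per-game loop.
print(projected.round().sort_values(ascending=False))
```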

Why this model stinks. The biggest virtue of this model is that it was easy to build. I can think of at least ten potential shortcomings:

  1. Win Shares are probably not the best metric for individual and team strength.
  2. It assumes that individual performance for 2014-2015 will be the same as 2013-2014. Paul Pierce isn’t getting any younger.
  3. It does not account for predictable changes in playing time from season-to-season.
  4. 2014-2015 win shares are not normalized to account for players leaving and entering the league.
  5. 2014-2015 win shares do not account for positive and negative synergies between players.
  6. There is no reason to believe that the standard deviation calculated in Step 2 should be the same for all teams.
  7. I have not given any justification for using a normal distribution at all!
  8. The vagaries of the NBA schedule are not accounted for. For example, teams play worse in their second consecutive road game.
  9. Several teams, including the Philadelphia 76ers, will tank games.
  10. Injuries were handled in an arbitrary and inconsistent fashion.

It will be interesting to see how this model performs, in spite of its shortcomings.

NBA Rosters and Team Changes: 2013-2014

I have downloaded statistics for all NBA players from basketball-reference.com, and accounted for roster changes (as of Sunday, October 5).

The Tm2013 column is a three-letter abbreviation for the player’s 2013-2014 team. For players who played on two or more teams, the entry represents the team for which the player played the most minutes. The Tm2014 column is the player’s current NBA team (as of Sunday, October 5) according to nba.com. Rookies are not included in this spreadsheet.

If you’ve read my previous two posts, you may have guessed that I am leading up to a prediction of the upcoming 2014-2015 NBA season. You’d be correct – I will post my model and predictions tomorrow.

In the meantime, you can actually use the spreadsheet above to create a very crude prediction. The last column in this spreadsheet is “win shares”. If you create a pivot table based on Tm2014 and Win Shares, you get the sum of player win shares for current rosters – a crude measure of team strength. Here is what that table looks like:

Team Sum of WS
CLE 60.7
SAS 59.4
TOR 58.6
LAC 58
IND 56.6
OKC 54.1
GSW 53.8
PHO 52.6
POR 50.8
DAL 48.6
HOU 47.9
WAS 46.8
NYK 44.3
DET 43.9
MEM 43.5
MIA 40.8
CHI 40.4
ATL 38.9
DEN 36.9
MIN 34.3
SAC 32.6
CHA 32.1
NOP 31
LAL 26.9
BRK 26.1
BOS 25.9
UTA 23.8
ORL 23.1
MIL 21.3
PHI 7.2
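If you would rather skip the pivot table, a pandas groupby does the same thing. A minimal sketch, assuming the spreadsheet above has been exported to a CSV (the filename is a placeholder; Tm2014 and WS are the columns described earlier):

```python
import pandas as pd

# Hypothetical CSV export of the roster spreadsheet described above.
roster = pd.read_csv("nba_rosters_2014.csv")     # columns include Tm2014, WS

team_ws = (roster.groupby("Tm2014")["WS"]
                 .sum()
                 .sort_values(ascending=False))
print(team_ws)    # crude team strength: summed Win Shares of each current roster
```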

I can think of at least five weaknesses in this “pivot table model”. Can you?

Predicting NBA Rookie Performance By Draft Position

Nate Silver (and others) have tracked how NBA draft position relates to total career performance; see, for example, this article. But what about first-year performance?

I pulled two sets of data from basketball-reference.com to answer this question: draft position for each drafted player, and rookie-season statistics (including Win Shares).

I merged them using Power Query and created a pivot table to calculate the average number of rookie-season “win shares” by draft position. You can download my Excel workbook here. Here is what I found:

[chart: average rookie Win Shares by draft position, with logarithmic trendline]

The first pick in the draft averages nearly five Win Shares in his rookie season, and while the pattern is irregular, win shares decrease as we get deeper into the draft (duh). (The blip at the end is due to Isaiah Thomas, drafted by the Kings, who promptly screwed up by letting him go.) I have drawn a logarithmic trendline which fits the data not too shabbily: R^2 of 0.7397 (a code sketch of this fit appears after the table below). Obviously we could do much better if we considered additional factors related to the player (such as their college performance) and team (the strength of teammates playing the same position, who will compete with the rookie for playing time). Here are the averages for the first 31 draft positions:

Draft Position Win Shares
1 4.96
2 2.69
3 2.96
4 4.14
5 2.23
6 1.84
7 3.36
8 1.68
9 2.59
10 1.52
11 0.84
12 1.51
13 1.48
14 1.36
15 1.64
16 1.19
17 2.37
18 1.02
19 0.71
20 1.09
21 1.74
22 2.14
23 1.54
24 2.29
25 0.98
26 1.23
27 1.08
28 0.40
29 0.54
30 0.94
31 0.79
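For the curious, here is a minimal sketch of the logarithmic fit, using the averages from the table above. Note that the R^2 of 0.7397 quoted earlier came from the full chart, so this 31-pick subset may come out somewhat different.

```python
import numpy as np

# Average rookie Win Shares by draft position, copied from the table above (picks 1-31).
position = np.arange(1, 32)
win_shares = np.array([4.96, 2.69, 2.96, 4.14, 2.23, 1.84, 3.36, 1.68, 2.59, 1.52,
                       0.84, 1.51, 1.48, 1.36, 1.64, 1.19, 2.37, 1.02, 0.71, 1.09,
                       1.74, 2.14, 1.54, 2.29, 0.98, 1.23, 1.08, 0.40, 0.54, 0.94,
                       0.79])

# Fit win_shares ~ a * ln(position) + b, the same form as Excel's logarithmic trendline.
a, b = np.polyfit(np.log(position), win_shares, 1)
predicted = a * np.log(position) + b

# R^2 for this fit.
ss_res = np.sum((win_shares - predicted) ** 2)
ss_tot = np.sum((win_shares - win_shares.mean()) ** 2)
print(f"win_shares ~ {a:.3f} * ln(pick) + {b:.3f}, R^2 = {1 - ss_res / ss_tot:.3f}")
```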

NBA Game Results: 2013-2014

The NBA preseason is in full swing! For those of you who like to fool around with data, I have prepared a CSV file with game-by-game results for the 2013-2014 season. The data was downloaded from basketball-reference.com using Power Query and cleaned up (see below).

The format is simple:

  • Date = when the game was played
  • Visitor = three-letter abbreviation of the visiting team
  • VisitorPts = visiting team score
  • VisitorSeasonWins = number of wins by the visiting team for the entire season
  • Home = three-letter abbreviation of the home team
  • HomePts = home team score
  • HomeSeasonWins = number of wins by the home team for the entire season
  • WinMargin = HomeSeasonWins - VisitorSeasonWins
  • Margin = HomePts - VisitorPts

I include the number of wins for each team in the file because I wanted to see how often good teams beat bad teams. The diagram below plots the difference in total season wins between the two teams against the margin of victory. I have used the trendline feature in Excel to verify that while (by definition) good teams beat bad ones frequently, the variability is quite high. Notice the R^2 value.

[chart: difference in season wins vs. margin of victory, with linear trendline]

The intercept for the trendline is 2.5967, which represents the home court advantage in points. In a future post I hope to use this data to make some predictions about the upcoming NBA season.
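For the Excel-averse, here is the same regression in a few lines of Python, using the column names listed above (the CSV filename is a placeholder):

```python
import numpy as np
import pandas as pd

# Game results CSV described above; the filename is a placeholder.
games = pd.read_csv("nba_2013_14_games.csv")

x = games["WinMargin"].to_numpy()    # HomeSeasonWins - VisitorSeasonWins
y = games["Margin"].to_numpy()       # HomePts - VisitorPts

slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]

print(f"Margin ~ {slope:.4f} * WinMargin + {intercept:.4f}")   # intercept of about 2.6 = home court advantage
print(f"R^2 = {r ** 2:.4f}")
```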

Enjoy!

Liberal Arts: Who’s Our Champion Now?

Last year, in “Whither Liberal Arts?”, Ben Thompson traced the evolution of Apple product introductions, from Steve Jobs’ iconic description of Apple as being at the “intersection of technology and liberal arts” through Tim Cook’s emphasis on “feeds and speeds”: a move from human stories to product design. These transitions have resulted in difficulties with “that vision thing”, in Thompson’s view. But why? Thompson:

It’s telling that Jobs did not say that Apple was at the intersection of technology and design. While any designer will tell you that design is about how it works, not looks, to pretend there isn’t a tension is naive. The liberal arts, though, and the humanities, are literally about people. They are about how they learn and experience the world, and each other.

Go and check out the introduction of the Mac from 1984. It’s all great, but pay particular attention to the “Chariots of Fire” sequence from 1:43 to 3:15. Six different liberal arts majors are represented:

  • Scrolling MACINTOSH marquee + “insanely great”
  • Paint (geisha) [Art]
  • MacWrite [English]
  • Spreadsheet + calculator [Mathematics]
  • Print Shop certificate
  • A whole bunch of fonts [Graphic Design]
  • Greek columns [Architecture]
  • Pascal code [Computer Science]
  • Paint again (Steve Jobs)
  • Chess

You may find it cheesy, but I eat this stuff up, particularly Jobs’ emotional response to the crowd after the Mac’s ‘speech’ at 4:03. It’s all very…human. Sure, it’s a product introduction by a big company, yet it still resonates emotionally. It’s worth a thousand TED talks.

Recently I spent two days discussing the present and future of the College of Liberal Arts at the University of Iowa in my capacity as Dean’s Advisory Board member. Reflecting on those discussions, and Thompson’s essay, it strikes me that Steve Jobs was perhaps the greatest advocate for liberal arts in our generation, at least in terms of societal impact. Liberal arts “champions”, almost by necessity, are difficult to spot because the power to influence is a product of their ability to articulate the value of liberal arts combined with their notoriety. Notoriety usually comes from exceptional achievement in a relatively narrow domain. We live in an era of specialization; a modern da Vinci would follow a radically different path than the genuine article. So while Jobs was a computer guy, I think it is fair to put him forward as a liberal arts champion.

It embarrasses me slightly that my self-appointed champion of liberal arts was a rich technologist, especially one as hyped as Steve Jobs. Even so, Jobs marshaled learning from related disciplines (architecture, computer science, art, and psychology, among others) in pursuit of business success at Apple, resulting in one of the world’s most valued brands. Certainly these are significant achievements, however much or little we choose to value them. It is uncouth to speak of brands as if they mean something, but because the Apple brand is so universally identified with creativity, aesthetics, and (gulp) a certain humanity, and because these qualities are essential to the liberal arts, and because Jobs was the driving force behind the construction of this brand, who else but Jobs?

And after Jobs, now what? It’s hard to envision another technologist, as some descend into Silicon Valley self-parody, and others speak of “gamifying education”. Elon Musk shows signs of becoming the next great industrial titan, but has been content to drag race up and down Technology Avenue thus far. A politician? Doubtful, in this polarized time. A writer? Maybe somewhere else, but probably not in America. Any ideas?