Chaining Machine Learning and Optimization Models

Rahul Swamy recently wrote about mixed integer programming and machine learning. I encourage you to go and read his article.

Though Swamy’s article focuses on mixed integer programming (MIP), a specific category of optimization problems for which there is robust, efficient software, his points apply to optimization generally. Optimization is goal seeking: searching for the values of variables that lead to the best outcomes. Optimizers solve for the best variable values.

Swamy describes two relationships between optimization and machine learning:

1. Optimization as a means for doing machine learning,
2. Machine learning as a means for doing optimization.

I want to put forward a third, but we’ll get to that in a moment.

Relationship 1: you can always describe predicting in terms of solving. A typical flow for prediction in ML is:

1. Get historical data for:
    1. The thing you want to predict (the outcome).
    2. Things that you believe may influence the predicted variable (“features” or “predictors”).
2. Train a model using the past data.
3. Use the trained model to predict future values of the outcome.

Training a model often means “find model parameters that minimize prediction error on the training data”. Training is solving. Here is a visual representation:

Relationship 2: you can also use ML to optimize. Swamy gives several examples of steps in optimization algorithms that can be described using the verbs “predict” or “classify”, so I won’t belabor the point. If the steps in our optimization algorithm are numbered 1, 2, 3, the relationship is like this:

In these two relationships, one verb is used as a subroutine for the other: solving as part of predicting, or predicting as part of solving.
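To make Relationship 1 concrete, here is a toy sketch (illustrative only, not any particular library’s fit routine) in which “training” a one-parameter model is literally solving a minimization problem:

```python
# "Training is solving": fit y = w * x by explicitly minimizing
# squared prediction error on the historical data.

def train(xs, ys, lr=0.01, steps=1000):
    """Find w by gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # gradient of (1/n) * sum((w*x - y)^2) with respect to w
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]     # feature values from "historical data"
ys = [2.1, 3.9, 6.2, 7.8]     # observed outcomes, roughly y = 2x
w = train(xs, ys)
print(round(w, 2))            # close to 2
```

Swap in a richer model and a fancier solver and you have the training step of most ML methods: an optimizer searching for the parameters with the smallest error.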

There is a third way in which optimization and ML relate: using the results of machine learning as input data for an optimization model. In other words, ML and optimization are independent operations but chained together sequentially, like this:

My favorite example involves sales forecasting. Sales forecasting is a machine learning problem: predict sales given a set of features (weather, price, coupons, competition, etc.). Typically businesses want to go further than this. They want to take actions that will increase future sales. This leads to the following chain of reasoning:

• If I can reliably predict future sales…
• and I can characterize the relationship between changes in feature values and changes in sales (‘elasticities’)…
• then I can find the set of feature values that will increase sales as much as possible.

The last step is an optimization problem.
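Here is a toy sketch of the chain (the numbers are made up for illustration, not a real sales model): step 1 predicts by fitting a linear demand model to historical data, and step 2 solves by searching for the price that maximizes revenue under that fitted model.

```python
# Step 1: machine learning -- least-squares fit of sales = a + b * price.
prices = [1.0, 1.5, 2.0, 2.5, 3.0]
sales  = [80.0, 62.0, 45.0, 30.0, 12.0]

n = len(prices)
mean_p = sum(prices) / n
mean_s = sum(sales) / n
num = sum((p - mean_p) * (s - mean_s) for p, s in zip(prices, sales))
den = sum((p - mean_p) ** 2 for p in prices)
b = num / den
a = mean_s - b * mean_p          # fitted model: sales ~ a + b * price

# Step 2: optimization -- the fitted (a, b) are now just input data.
# Grid search over prices between 1.00 and 3.00 for maximum revenue.
best_price = max((p / 100.0 for p in range(100, 301)),
                 key=lambda p: p * (a + b * p))
print(round(a, 1), round(b, 1), best_price)   # demand model and best price
```

The point of the chain is the hand-off: the fitted coefficients from the ML step become plain input data for a separate optimization step.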

But why are we breaking this apart? Why not just stick the machine learning (prediction) step inside the optimization? Why separate them? A couple of reasons:

• If the ML and optimization steps are separate, I can improve or change one without disturbing the other.
• I do not have to do the ML at the same time as I do the optimization.
• I can simplify or approximate the results of the ML model to produce a simpler optimization model, so it can run faster and/or at scale. Put a different way, I want the structure of the ML and optimization models to differ for practical reasons.

In the machine learning world it is common to refer to data pipelines. But ML pipelines can involve models feeding models, too! Chaining ML and optimization like this is often useful, so keep it in mind.

Forecasting with Facebook’s Prophet

The past couple of days I’ve been playing around with Facebook’s Prophet, a time series forecasting package.

I used Prophet to forecast quarterly sales of the Apple iPad, all in about 30 lines of Python. The repository for my code is here, and here’s a Jupyter notebook that walks through how it works.

It’s a lot of fun, and you get nice little visualizations like this one:

Check it out!

2017 NCAA Tournament Picks

Every year since 2010 I have used analytics to predict the results of the NCAA Men’s Basketball Tournament. I missed the boat on posting the model prior to the start of this year’s tournament. However, I did build and run a model, and I did submit picks based on the results. Here are my model’s picks – as I write this (before the Final Four) these picks are better than 88% of those submitted to ESPN.

Here are the ground rules I set for myself:

• The picks should not be embarrassingly bad.
• I shall spend no more than one work day on this activity (and 30 minutes for this post).
• I will share my code and raw data. (Here it is.)

I used a combination of game-by-game results and team metrics from 2003-2016 to build the features in my model. Here is a summary:

I also performed some post-processing:

• I transformed team ranks to continuous variables using a heuristic created by Jeff Sonas.
• Standard normalization.
• One hot encoding of categorical features.
• Upset generation. I found the results aesthetically displeasing for bracket purposes, so I added a post-processing function that looks for games between Davids and Goliaths (i.e. I compare seeds) where David and Goliath are relatively close in strength. For those games, I go with David.

I submitted the model to the Kaggle NCAA competition, which asks for win probabilities for all possible tourney games, where submissions are scored by evaluating the log-loss of actual results and predictions. This naturally suggests logistic regression, which I used. I also built a fancy pants neural network model using Keras (which means to run my code you’ll need to get TensorFlow and Keras in addition to the usual Anaconda packages). Keras produces slightly better results in the log-loss sense. Both models predict something like 78% of past NCAA tournament games correctly.
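For concreteness, here is what the log-loss metric computes. This is a minimal sketch, not Kaggle’s official scorer:

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of binary outcomes under predicted
    win probabilities. Probabilities are clipped away from 0 and 1 so one
    overconfident, wrong pick cannot produce an infinite score."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident-and-right beats hedged; confident-and-wrong is punished hard.
print(log_loss([1, 0], [0.9, 0.1]))   # ~0.105
print(log_loss([1, 0], [0.5, 0.5]))   # ~0.693
print(log_loss([1, 0], [0.1, 0.9]))   # ~2.303
```

This asymmetry is why a good Kaggle submission hedges on toss-up games, while a “good” bracket has to commit to a winner.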

There are a couple of obvious problems with the model:

• I did not actually train the model on past tournaments, only on regular season games. That’s just because I didn’t take the time.
• Not accounting for injuries.
• NCAA games are not purely “neutral site” games because sometimes game sites are closer to one team than another. I have code for this that I will probably use next year.
• I am splitting the difference between trying to create a good Kaggle submission and trying to create a “good” bracket. There are subtle differences between the two but I will spare you the details.

I will create a github repo for this code…sometime. For now, you can look at the code and raw data files here. The code is in ncaa.py.

The Logit and Sigmoid Functions

If you mess around with machine learning for long enough, you’ll eventually run into the logit and sigmoid functions. These are useful functions when you are working with probabilities or trying to classify data.

Given a probability p, the corresponding odds are calculated as p / (1 – p). For example if p=0.75, the odds are 3 to 1: 0.75/0.25 = 3.

The logit function is simply the logarithm of the odds: logit(p) = log(p / (1 – p)). Here is a plot of the logit function:

The value of the logit function heads towards infinity as p approaches 1 and towards negative infinity as it approaches 0.

The logit function is useful in analytics because it maps probabilities (which are values in the range [0, 1]) to the full range of real numbers. In particular, if you are working with “yes-no” (binary) inputs it can be useful to transform them into real-valued quantities prior to modeling. This is essentially what happens in logistic regression.

The inverse of the logit function is the sigmoid function. That is, if you have a probability p, sigmoid(logit(p)) = p. The sigmoid function maps arbitrary real values back to the range [0, 1]. The larger the value, the closer to 1 you’ll get.

The formula for the sigmoid function is σ(x) = 1/(1 + exp(-x)). Here is a plot of the function:

The sigmoid might be useful if you want to transform a real valued variable into something that represents a probability. This sometimes happens at the end of a classification process. (As Wikipedia and other sources note, the term “sigmoid function” is used to refer to a class of functions with S-shaped curves. In most machine learning contexts, “sigmoid” usually refers specifically to the function described above.)
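Both functions are one-liners in code. This sketch also verifies the round-trip property sigmoid(logit(p)) = p described above:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Inverse of logit: maps any real x back into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(logit(0.75))            # log of 3-to-1 odds, about 1.0986
print(sigmoid(logit(0.75)))   # recovers 0.75
```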

There are other functions that map probabilities to reals (and vice-versa), so what’s so special about the logit and sigmoid? One reason is that the logit function has the nice connection to odds described at the beginning of the article. A second is that the gradients of the logit and sigmoid are simple to calculate (try it and see). The reason why this is important is that many optimization and machine learning techniques make use of gradients, for example when estimating parameters for a neural network.
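Taking up the “try it and see”: the sigmoid’s gradient works out to the tidy closed form sigma'(x) = sigma(x) * (1 - sigma(x)), which a quick finite-difference check confirms:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Central finite difference versus the closed-form gradient at x = 0.5.
x, h = 0.5, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
closed_form = sigmoid(x) * (1 - sigmoid(x))
print(abs(numeric - closed_form) < 1e-8)    # True

# The same formula shows why gradients can vanish: far from zero,
# sigma(x) * (1 - sigma(x)) is tiny.
flat = sigmoid(10) * (1 - sigmoid(10))
print(flat < 1e-4)                          # True
```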

The biggest drawback of the sigmoid function for many analytics practitioners is the so-called “vanishing gradient” problem. You can read more about this problem here (and here), but the point is that it pertains not only to the sigmoid function but to any function that squeezes real values into the [0, 1] range. In neural networks, where the vanishing gradient problem is particularly annoying, it is often a good idea to seek alternatives as suggested here.

2014 In Review: Five Data Science Trends

2014 was another transformative, exciting year for data science. Summarizing even one measly year of progress is difficult! Here, in no particular order, are five trends that caught my attention. They are likely to continue in 2015.

Adoption of higher productivity analytics programming environments. Traditional languages and environments such as C, C++, and SAS are diminishing in importance as R, Python, and Scala ascend. It is not that data scientists are dumping the old stuff; it is that a flood of new data scientists have entered the fray, overwhelmingly choosing more modern environments. These newer systems provide language conveniences as well as a rich set of built-in (or easy-to-install) libraries that handle higher-abstraction analytics tasks. Modern data scientists don’t want to write CSV readers, JSON parsers, SQL INSERTs, or logging systems. R is notable in that its productivity gains come from its packages and community support, not from the language itself. R’s clunkiness will be its downfall as Python, Scala, and other languages gain greater traction within the analytics community and narrow the library gap.

Machine learning is king. Data science, however you define it, is a broad discipline, but the thunder belongs to machine learning. Rather, machine learning is becoming synonymous with data science. Optimization, for example, is increasingly being described as a backend algorithm supporting prediction or classification. Objective functions are described as “learning functions”, and so on. We are collectively inventing definitions and classifications as we go along, so in some sense this is merely semantics. There is a real problem however: we risk ignoring the collective wisdom of past masters, throwing rather broad but shallow techniques at models with rather particular structure. Somewhere in a coffee shop right now, somebody is using a genetic algorithm to solve a network model.

Visualization is everywhere. To the point that it’s starting to get annoying. Whether in solutions such as Tableau or TIBCO Spotfire, add-ons such as Excel Power View or Frontline’s XLMiner Visualization app (plug!), or programming libraries such as matplotlib or d3.js, the machinery to build good visualizations is maturing. The explosion of infographics in popular media has raised expectations: users expect visualizations to guide and inform them at all stages of the analytics lifecycle. As you have no doubt seen, the problem is that it is so easy to build shitty viz: bar charts with no content; furious heat maps signifying nothing. We’ll start to see broader, if unarticulated, consensus on appropriate visualizations and visual metaphors for the results of quantitative analysis. I hope.

Spark is supplanting Hadoop. Apache Spark wins on performance, has well-designed streaming, text analytics, machine learning, and data access libraries, and has huge community momentum. This was all true at the beginning of 2014, but now at the end of 2014 we are starting to see a breakthrough in industry investment. Hadoop isn’t going anywhere, but in 2015 many new, “big time” big data projects will be built on Spark. The more flexible graph-based pipeline at the heart of Spark is begging for great data scientists to exploit it – what will 2015 bring?

Service oriented architectures are coming to analytics. The ubiquity of REST-based endpoints in web development, combined with a new culture of code sharing, has engendered a new “mixtape” development paradigm. A kid (or even an older guy…) can whip out their MacBook, create a web service on Django, deploy it on AWS using Elastic Beanstalk, connect it to interactive visualizations in an afternoon, and submit an app that night. Amazon has built a $1B+ side business on the strength of cloud-only services, and these same forces will drive analytics forward. The prime mover in data science is not big data. It is cloud. RESTful, service-based analytics services will explode.

Spark Summit Keynote Notes

Here is a summary of my key impressions from the Day 1 keynotes of the 2014 Spark Summit.

This year’s Spark Summit was one of the deepest, most interesting technical conferences I have attended, and I don’t say that lightly. It is easy to get caught up in the excitement of a conference filled with enthusiasts, but trust me when I say that conventional MapReduce–based Hadoop is over and technologies like Spark will be part of the tipping point that will turn Big Data hype into real applications and much more widespread deployment. Hadoop is legacy.

Spark, like Hadoop, is more than one “thing”. The base component of Spark is a cluster computation engine that is like MapReduce on steroids. Instead of the simple two-stage “map then reduce” computational model, Spark supports more general DAG-structured computational flows (Microsoft watchers will remember Dryad). This, in itself, is a big innovation, especially for analytics scenarios. Indeed, Spark has been shown to be 10, 100, even 1000 times faster than Hadoop on a number of real workloads. More important than this, in my view, is that Spark includes higher level libraries for data access and analytics, surfaced in clean, consistent APIs that are available from three languages: Java, Scala, and Python. Three important Spark components are MLLib, a machine learning library; GraphX, a graph processing library; and Spark SQL, introduced at this conference. An analogue for those familiar with the Microsoft ecosystem is the .Net Framework – .Net provides languages, a runtime, and a set of libraries together. The integration of the pieces makes each much more useful.

The Summit is organized and principally sponsored by Databricks (tagline: “making big data easy”). This is the company founded by the Berkeley-based creators of Spark. Ion Stoica, CEO of Databricks, kicked off Monday’s festivities, introducing Databricks Cloud, a web-based Spark workbench for doing big data analytics. You can find screenshots on the Databricks Cloud site, or on Twitter. Key points:

• Databricks Cloud is currently in a private beta.
• It’s a standalone web interface.
• It has a command-line “REPL” interface.
• The examples I saw were in Scala (which is kind of like a mix of F# and Java).
• You can bring in data from Amazon S3 or other sources using Spark SQL (more on that in future posts).
• It includes pre-canned datasets such as a twitter snapshot/firehose (can’t tell which).
• You can do SQL queries right from the REPL.
• It has incredibly simple, clean looking visualizations tied to results.
• You can drag and drop dashboard components that correspond to REPL / program outputs. You can configure how often these components are “refreshed”.
• We were shown a live 10-minute demo creating a dashboard to filter live tweets, based on a similarity model authored with the help of MLLib and trained on Wikipedia.
• Databricks Cloud would be quite useful even as a standalone, single node analytics workbench, but recall that all of this is running on top of Spark, on the cluster, without the user doing any “parallel programming”.
• Everything you create in the workbench is Spark 1.0 compliant, meaning you can move it over to any other Spark 1.0 distribution without changes.

The development plan is sound, and there is a ton of corporate support for Spark from Cloudera, Hortonworks, Databricks, SAP, IBM, and others. If time permits I will summarize some of the other keynotes and sessions.

Simulating data for a logistic regression model in Excel

Rick Wicklin from SAS recently wrote a very nice article about simulating data for a logistic regression model. (I have been a loyal reader for years.) I thought it would be interesting to see if we can do the same thing in Excel using Analytic Solver Platform. Yes!

Simulating Logistic Data

As Rick describes in his post, the first step is to generate random explanatory and response variables. He breaks this down as follows:

1. Assign the design matrix (X) of the explanatory variables. This step is done once. It establishes the values of the explanatory variables in the (simulated) study.

2. Compute the linear predictor, η = X β, where β is a vector of parameters. The parameters are the "true values" of the regression coefficients.

3. Transform the linear predictor by the logistic (inverse logit) function. The transformed values are in the range (0,1) and represent probabilities for each observation of the explanatory variables.

4. Simulate a binary response vector from the Bernoulli distribution, where each 0/1 response is randomly generated according to the specified probabilities from Step 3.
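As a cross-check, here are the same four steps as a quick plain-Python sketch (standard library only; random.uniform, random.gauss, and a manual Bernoulli draw stand in for PsiUniform, PsiNormal, and PsiBernoulli):

```python
import math
import random

random.seed(42)
beta = (2.0, 4.0, -1.0)          # the "true" parameters for the model
rows = []
for _ in range(200):
    # Step 1: one row of the design matrix: intercept, uniform x1, normal x2.
    x = (1.0, random.uniform(0, 1), random.gauss(0, 2))
    # Step 2: linear predictor eta = x . beta.
    eta = sum(b * v for b, v in zip(beta, x))
    # Step 3: inverse logit gives the probability mu in (0, 1).
    mu = 1 / (1 + math.exp(-eta))
    # Step 4: Bernoulli draw for the 0/1 response.
    y = 1 if random.random() < mu else 0
    rows.append((x, mu, y))

print(len(rows))   # 200 simulated observations
```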

We can do this in Excel using Analytic Solver Platform’s built-in probability distributions.

Step 0. In Rick’s example, the parameters for the simulated logistic model are (2, 4, –1). So go get Analytic Solver Platform, fire up Excel, and enter 2, 4, –1 in a new worksheet:

Step 1. Let’s create the design matrix.

• Create a new worksheet and type in column headers x0, x1, x2 in row 1.
• Go to the next row. x0 is the intercept, so fill in 1 for A2.
• x1 is a uniform random number between 0 and 1, so use the Analytic Solver Platform formula =PsiUniform(0,1). (Excel’s RAND() would also work here.)
• x2 is normally distributed. Enter =PsiNormal(0,2).
• Now you have something like this. Your values will be different because x1 and x2 are random.

Step 2. Create the linear predictor. Add a column header “eta” and enter the following formula: =SUMPRODUCT(Sheet1!\$A\$1:\$C\$1,Sheet2!A2:C2). This multiplies the parameters for the simulated logistic model with the design matrix values, and sums them up.

Step 3. Transform the linear predictor by the inverse logit function. Add a column “mu” with the inverse logit formula: =EXP(D2)/(1+EXP(D2)).

Step 4. Simulate a binary response vector. You can use the PsiBernoulli function to simulate 0-1 values. Add a column “y” with the formula =PsiBernoulli(E2). Your spreadsheet now looks something like this:

Step 5. Now you’ve got one simulated point. Copy and paste this row down as far as desired, for example a few hundred rows. If you use Excel’s “Fill Down” feature then make sure that column A is a column of “1” values, not “1, 2, 3, …”. Now you will have a big table with your simulated data.

Step 6. Freeze it! You may have noticed that the values of certain columns jiggle around. This is because new random values are being generated every time Excel recalcs. Click on the “Analytic Solver Platform” ribbon tab, click Tools, and click Freeze. This will lock the current samples in place.

Exploring the Data

Now we can use XLMiner to explore and model the simulated data! Click on the XLMiner ribbon tab, and then Explore –> Chart Wizard:

Select “Scatter Plot” as the chart type. Then select x2 for the y-axis, x1 for the x-axis, and color by y. You will get a nice color-coded scatter plot that you can customize to your heart’s content interactively:

Running Logistic Regression

Now we can run logistic regression to recover the coefficients for the simulated data. Click on the XLMiner ribbon tab and select Classify –> Logistic Regression. The simulation data should be preselected in the “Data range” box. Select x1 and x2 as Input Variables and y as the Output Variable. Your dialog will look something like this:

By default, XLMiner logistic regression assumes an intercept, so you can simply click Finish. The logistic regression runs, and the results are reported in an output worksheet. If you examine the coefficients, you should see that they are rather close to (2, 4, –1)!
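The same recover-the-coefficients experiment can be reproduced outside Excel. In this sketch, a hand-rolled gradient descent stands in for XLMiner’s logistic regression (illustrative only; it is not XLMiner’s algorithm):

```python
# Simulate data with beta = (2, 4, -1), then fit a logistic regression by
# gradient descent and confirm the estimates land near the true values.

import math
import random

random.seed(7)
true_beta = (2.0, 4.0, -1.0)

# Simulate observations exactly as in the steps above.
X, y = [], []
for _ in range(800):
    x = (1.0, random.uniform(0, 1), random.gauss(0, 2))
    eta = sum(b * v for b, v in zip(true_beta, x))
    mu = 1 / (1 + math.exp(-eta))
    X.append(x)
    y.append(1 if random.random() < mu else 0)

# Gradient descent on the average logistic log-loss.
beta = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(1500):
    grad = [0.0, 0.0, 0.0]
    for xi, yi in zip(X, y):
        p = 1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
        for j in range(3):
            grad[j] += (p - yi) * xi[j]
    for j in range(3):
        beta[j] -= lr * grad[j] / len(X)

print([round(b, 1) for b in beta])   # roughly [2, 4, -1]
```

The estimates will not match (2, 4, –1) exactly; with only 800 simulated observations, sampling noise alone moves the fitted coefficients around.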

Since everything is in Excel, you can perform additional analysis, build charts, or even score new data.

Presenting Analytic Solver Platform 2014-R2

In 2014 Frontline Systems released the newest version of its flagship product, Analytic Solver Platform. You can download a free trial of Analytic Solver Platform here.

Analytic Solver Platform makes it easy to learn from your data and make good decisions quickly. You don’t have to learn a new programming language, suffer through a complex deployment process, or abandon what you already know: you can grab data from your desktop, the web, or the cloud and build powerful predictive models in minutes from Excel.

In this release of Analytic Solver Platform you’ll find world class time series, prediction, classification, data cleaning, and clustering methods in XLMiner. XLMiner’s 30+ data mining methods have been rewritten from the ground up, combining the latest advances in machine learning with a straightforward Excel interface. Data sets that crash more expensive competitive products run flawlessly in XLMiner. Better yet, XLMiner produces reports with all the information you need to make the business case for your findings, including built-in charts and visualizations.

Analytic Solver Platform works with Microsoft Power BI to turn data into insight. My recent post showed how cloud hosted data can be ingested, cleaned, and mined for insight in minutes. Analytic Solver Platform supplements Power Query’s data cleaning with additional methods to help you categorize, clean, and handle missing data, and provides built in connectors to allow you to sample and score with popular data sources including Power Pivot.

Finally, Analytic Solver Platform helps you bridge the gap between experimentation and production deployment. Using Analytic Solver Platform with SharePoint allows your organization to audit and version your models. Use Frontline’s Solver SDK to integrate simulation and optimization in your application whether you use C++, C#, or web technologies. The latest version of Solver SDK will provide support for the popular F# language, allowing your team to build predictive models in a fraction of the development cost and lines of code.

Give it a try!

Time Series Forecasting using Analytic Solver Platform, Windows Azure Marketplace, and Power Query

In this post, I’ll show you how to use Analytic Solver Platform in Excel 2013 with Power BI to build a time series analysis model using data from the cloud. Together, Analytic Solver Platform and Power BI provide a powerful, easy-to-use predictive analytics platform at a fraction of the cost of alternatives. A new release of Analytic Solver Platform is coming later this week! Visit solver.com to download a free trial or purchase.

My goal is to forecast Gross Domestic Product (GDP) in the United States. Historical GDP is available for free on Windows Azure Marketplace, so our first step is to subscribe to the data feed.

2. Go to the Real GDP Per Capita By State data set and subscribe to it. The subscription is free.

If you look around you will find a number of interesting data sets.

Now we’d like to bring this data into Excel so we can work with it. Power Query, a free Excel add-on provided by Microsoft, makes this extremely easy. If you use Excel and do not know about Power Query, it is definitely worth your time!

2. Start Excel and click on the Power Query tab.
3. Select “From Other Sources” and then “From Windows Azure Marketplace”:
4. The Navigator task pane will open on the right hand side. Your Windows Azure Marketplace subscriptions will be listed, including Real GDP Per Capita By State:
5. Expand the Real GDP Per Capita By State entry and click on its only sub-item, called RealGDPByStatePerCapital [sic].
6. The Query Editor window is displayed. You should see GDP information by state as well as rolled up to the US:

7. Let’s focus on national GDP. Click on the Area column header, uncheck “Select All” to clear all entries and scroll down to United States (screenshot below). Click OK.
8. Click Apply and Close. A new worksheet with national GDP data will be created by Power Query.

These steps only scratch the surface of Power Query. For one thing, Power Query supports a number of interesting data sources, as you may have noticed in the menu in step 3. For another, Power Query lets you easily join and clean data, loading it into worksheets, or PowerPivot data models.

Now that we’ve got the data into Excel, we can use XLMiner to run a time series forecast.

1. Click on the XLMiner ribbon tab.
2. Under Time Series, select ARIMA -> ARIMA Model.
3. Enter the following options (and check out the screenshot below):
    1. Choose RealGDPPerCapital as the “Selected Variable”. This is what we want to forecast.
    2. Select Year as the Time Variable.
    3. Set Autoregressive (p) = 4, Difference (d) = 0, Moving average (q) = 1.