Bertrand Russell said, “Mathematics, rightly viewed, possesses not only truth, but supreme beauty – a beauty cold and austere, like that of sculpture.” Analytical models, born of math and forged in code, should possess the same properties.
A painting, a sculpture, or a piece of music is not made better by cramming more stuff into it, a lesson George Lucas famously unlearned. Adding too much to an analytical model results in overfitting, confusion, and mistakes. We must resist this temptation and move to the next level: viewing the empty spaces as elements that enhance the whole. This is the concept of negative space.
A sales model should account not only for those who do purchase, but those who do not – otherwise you will overestimate the contributions of factors that sometimes generate sales (I am looking at you, Twitter). A scheduling model must consider not only which options are possible, but those that are not. A social model should consider not only those who are active tweeters, bloggers, and commenters, but also those who do their talking off the grid. Otherwise those who shout the loudest in one general direction are mistakenly interpreted as having real influence. There is no truth, and no beauty, in that!
You can build a pretty decent 401k simulation in a few minutes in Excel using Analytic Solver Platform:
Let’s give it a shot! You can download the completed workbook here.
First, let’s build a worksheet that calculates 401k balances for 10 years. At the top of the worksheet let’s enter a yearly contribution rate:
Let’s compute 401k balances for the next 10 years, based on this contribution. A simple calculation for the balance for a given year involves five factors:
- The 401k balance for the previous year.
- The rate of return for the 401k.
- The previous year’s salary.
- The rate of increase in the salary (your raise).
- The rate of contribution (entered above).
In row 6 we will enter the starting values for return, salary increase, balance, and salary in columns B, C, D, E respectively. For now let’s assume:
- Return = 0.05
- Salary Increase = 0.05
- Balance = 5,000
- Salary = 100,000
With a couple of small assumptions, the new balance is old balance * (1 + return) + contribution rate * salary. In the next row we will compute Year 1, using this formula:
- Salary = D6 * (1 + C6). This simply means that this year’s salary is last year’s adjusted by raise. (Obviously salary could be modeled differently depending on when the raise kicks in.)
- Balance = E6*(1 + B6)+D6*$B$3. There are two terms. The first grows the old balance by the portfolio return. The second is the previous year’s salary times the contribution rate.
We can fill these values down, giving us the 401k balance for the entire period:
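For readers who want to check the arithmetic outside of Excel, the same year-by-year recursion is a few lines of Python. The 10% contribution rate is a placeholder of my own; use whatever you entered in B3:

```python
# Deterministic 10-year 401k balance, mirroring the worksheet formulas.
contrib_rate = 0.10            # cell B3 (a placeholder; use your own rate)
ret, raise_pct = 0.05, 0.05    # columns B and C starting values
balance, salary = 5000.0, 100000.0

for year in range(1, 11):
    new_salary = salary * (1 + raise_pct)                    # D7 = D6*(1+C6)
    balance = balance * (1 + ret) + salary * contrib_rate    # E7 = E6*(1+B6)+D6*$B$3
    salary = new_salary
    print(year, round(balance, 2))
```

Each pass through the loop is one filled-down row of the worksheet.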
Here’s the thing: we don’t actually know what our portfolio return and salary increases will be in future years. They’re uncertain. We can use Analytic Solver Platform to turn the wild guesses in columns B and C into probability distributions. Using simulation we can then determine the most likely range for future 401k balances.
For portfolio return, a reasonable thing to do is to go back and look at past performance. Rates of return for the S&P 500 (and other financial instruments) are given on this page. Using the “From Web” feature of Power Query (or by simply copy-pasting) you can bring this data into another Excel worksheet with no sweat:
Now let’s turn this historical data into a probability distribution we can use in our model. Select the S&P 500 historical return data and select Distributions –> Distribution Wizard in the Analytic Solver Platform tab:
Fill in the first page of the wizard:
Select “continuous values” in the next step, “Fit the data” in the next, and then pick an empty cell for “Location” in the final step. In the cell that you selected, you will see a formula something like this:
This is a Weibull distribution that fits the historical data. If you hit “F9” to recalculate the spreadsheet you will see that the value for this cell changes as a result of sampling from this distribution. Each sample is a different plausible yearly return. Let’s copy this formula in place of the 0.05 values we entered in column B of our original spreadsheet. If we click on the “Model” button in the Analytic Solver Platform ribbon, we will see that these cells have been labeled as “Uncertain Variables” in the Simulation section.
For Salary Increase we will do something simpler. Let’s just assume that the increase will be between 2% and 7% each year. Enter =PsiUniform(0.02, 0.07) in cell C6, and fill down.
The last thing we need to do is to define an “output” for the simulation, called an Uncertain Function. When we define Uncertain Functions, we get nice charts and stats for these cells when we run a simulation. Click on the Balance entry for Year 10, then click on the arrow next to the “+” in the Model Pane, and then Add Uncertain Function. Your Model Pane will look something like this:
And your spreadsheet will look something like this:
Now all we need to do is click Simulate in the ribbon. Analytic Solver Platform draws samples for the uncertain variables (evaluating everything in parallel for fast performance) and then shows a chart of the possible 401k balances. As you can see, the outcomes vary widely but are concentrated around $100,000:
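For the curious, here is a rough Python sketch of what the simulation is doing under the hood. The normal return distribution is only a stand-in for the Weibull we fit to the S&P 500 history, and the 10% contribution rate is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(42)
trials, years = 10_000, 10
contrib_rate = 0.10                   # cell B3 (assumed; pick your own rate)
start_balance, start_salary = 5_000.0, 100_000.0

# Stand-ins for the workbook's uncertain variables: the post fits a Weibull
# to S&P 500 history; a normal with a broadly similar mean and spread is
# used here purely as a placeholder.
returns = rng.normal(0.08, 0.15, size=(trials, years))
raises = rng.uniform(0.02, 0.07, size=(trials, years))  # like =PsiUniform(0.02, 0.07)

balance = np.full(trials, start_balance)
salary = np.full(trials, start_salary)
for yr in range(years):
    salary = salary * (1 + raises[:, yr])
    balance = balance * (1 + returns[:, yr]) + contrib_rate * salary

print(np.percentile(balance, [10, 50, 90]))  # spread of 10-year outcomes
```

Every row of `returns` and `raises` is one simulation trial, the same sampling that Simulate performs across the uncertain variable cells.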
Here’s the great thing: you can now build out this spreadsheet to your heart’s content to build simulations that incorporate more factors. If you want to get really fancy, you can correlate yearly returns. Check out the extensive help on solver.com for more.
Here is a summary of my key impressions from the Day 1 keynotes of the 2014 Spark Summit.
This year’s Spark Summit was one of the deepest, most interesting technical conferences I have attended, and I don’t say that lightly. It is easy to get caught up in the excitement of a conference filled with enthusiasts, but trust me when I say that conventional MapReduce–based Hadoop is over and technologies like Spark will be part of the tipping point that will turn Big Data hype into real applications and much more widespread deployment. Hadoop is legacy.
Spark, like Hadoop, is more than one “thing”. The base component of Spark is a cluster computation engine that is like MapReduce on steroids. Instead of the simple two stage “map then reduce” computational model, Spark supports more general DAG-structured computational flows (Microsoft watchers will remember Dryad). This, in itself, is a big innovation, especially for analytics scenarios. Indeed, Spark has been shown to be 10, 100, even 1000 times faster than Hadoop on a number of real workloads. More important than this, in my view, is that Spark includes higher level libraries for data access and analytics, surfaced in clean, consistent APIs that are available from three languages: Java, Scala, and Python. Three important Spark components are MLlib, a machine learning library; GraphX, a graph processing library; and Spark SQL, introduced at this conference. An analogue for those familiar with the Microsoft ecosystem is the .Net Framework – .Net provides languages, a runtime, and a set of libraries together. The integration of the pieces makes each much more useful.
The Summit is organized and principally sponsored by Databricks (tagline: “making big data easy”). This is the company founded by the Berkeley-based creators of Spark. Ion Stoica, CEO of Databricks, kicked off Monday’s festivities by introducing Databricks Cloud, a web-based Spark workbench for doing big data analytics. You can find screenshots on the Databricks Cloud site, or on Twitter. Key points:
- Databricks Cloud is currently in a private beta.
- It’s a standalone web interface.
- It has a command-line “REPL” interface.
- The examples I saw were in Scala (which is kind of like a mix of F# and Java).
- You can bring in data from Amazon S3 or other sources using Spark SQL (more on that in future posts).
- It includes pre-canned datasets such as a Twitter snapshot/firehose (can’t tell which).
- You can do SQL queries right from the REPL.
- It has incredibly simple, clean looking visualizations tied to results.
- You can drag and drop dashboard components that correspond to REPL / program outputs. You can configure how often these components are “refreshed”.
- We were shown a live 10-minute demo that created a dashboard to filter live tweets, based on a similarity model authored with the help of MLlib and trained on Wikipedia.
- Databricks Cloud would be quite useful even as a standalone, single-node analytics workbench, but recall that all of this runs on top of Spark on a cluster, without the user writing any explicit “parallel programming”.
- Everything you create in the workbench is Spark 1.0 compliant, meaning you can move it over to any other Spark 1.0 distribution without changes.
The development plan is sound, and there is a ton of corporate support for Spark from Cloudera, Hortonworks, Databricks, SAP, IBM, and others. If time permits I will summarize some of the other keynotes and sessions.
Many moons have passed since my last conference report: let’s do this. I will follow Steven Sinofsky’s style and try to keep this fact-based, saving a few more subjective thoughts for the end. I attended the INFORMS Big Data Conference in San Jose, representing Frontline Systems. At the conference we announced the release of the newest version of Analytic Solver Platform.
This was the very first INFORMS Big Data Conference. INFORMS was originally focused on “operations research” aka optimization aka prescriptive analytics. In recent years, it has embraced analytics more broadly: the spring “practice conference” was rebranded as a “Business Analytics” conference, an analytics professional certification program was rolled out, and last week INFORMS introduced an analytics maturity model for organizations. Holding a “big data” conference is a natural extension.
The conference was relatively small. There were between 15 and 20 exhibitors, including Frontline. The two biggest guns were SAS and FICO. There were several booths that were connected to academia: a couple of graduate programs and a booth for a company run by students. There were also several booths by smaller analytics and/or big data firms, mostly offering web-based experiences for authoring and visualizing predictive models. I think it is fair to say that the majority of the exhibitors have an analytics, as opposed to a big data, emphasis.
There were several technology workshops on Sunday, followed by two days of talks. Monday’s keynote was given by Bill Franks, Chief Analytics Officer at Teradata and author of Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics. Tuesday’s keynote was given by Michael Svilar of Accenture. The talks were divided into several tracks, among them Big Data 101, Case Studies, and Emerging Trends. The talks in the Big Data 101 track were generally well attended. The attendance in other tracks varied widely. In general, talks given by people from “cutting edge” Analytics 3.0 organizations such as Kaggle were quite popular.
Diego Klabjan, the founding director of Northwestern University’s MS in Analytics, gave a nice overview of the Hadoop stack beyond basic Java-based MapReduce, in particular Pig and Hive. This talk, along with others in the “Big Data 101” track, functioned as a mini “survey course” in various aspects of practicing analytics on Big Data platforms. These talks were helpful for framing technologies and concepts, weaving them into a more coherent whole.
Paul Kent, VP of Big Data at SAS, gave a talk in the “Emerging Trends” track titled “Big Data and Big Analytics – So Much More Gunpowder!” Paul’s talk focused on four themes: abundance, Hadoop, SAS on Hadoop, and Big Data ideas for organizations. We find ourselves in an “era of abundance” because the cost of storing information has become less than the cost of making the decision to throw it away. We can use this data to answer questions that had not even been formulated at the time of data collection. Paul summarized the Hadoop ecosystem which supports the collection and processing of such data. He went on to describe several SAS offerings which interact with Hadoop in various ways. It was interesting to me to learn how many SAS procedures are now supported “on node” for high performance, among them HPMIXED, HPNEURAL, HPFOREST, HPSVM, and so on. SAS’s continued investment in Hadoop is reflective of a more general challenge: how can organizations realize the potential of Big Data and “Big Data Analytics” when they often have large existing investments in “good old fashioned” storage and analytics?
Paul provided his own definition of Big Data: it is “the amount of data or complexity that puts you out of your comfort zone”. Indeed I heard several different definitions of Big Data at the conference. This variety is indicative of the buzz that surrounds Big Data, particularly among commercial organizations looking to position their offerings.
Frontline has long been in the business of providing analytics solutions to business analysts, and has had a strong presence at various INFORMS conferences for years. I spent a lot of time at the Frontline booth talking to students, professors, business analysts and consultants. Our booth had lots of traffic, and it was interesting to note both how many familiar “INFORMS faces” came by, as well as the number of people who had never heard of Frontline before. In this sense, the INFORMS Big Data conference achieved its mission of connecting the traditional analytics and big data communities. It was interesting to note how many questions there were about Excel’s capabilities. Many did not seem to realize that Excel’s row and column limits increased years ago, and that PowerPivot can bring in much larger data sets with ease – let alone features offered in the recently released Power BI. The hype cycle has largely left Excel behind.
In short, the conference was worth Frontline’s time. We were able to tell the story of not only our most recent release (go get it!) but also Frontline’s overall value proposition. INFORMS Big Data was a nice first bridge between two communities that really should be one. Bluntly, the conference should be much more interesting in a couple of years’ time, as hype diminishes and case studies increase.
Rick Wicklin from SAS recently wrote a very nice article about simulating data for a logistic regression model. (I have been a loyal reader for years.) I thought it would be interesting to see if we can do the same thing in Excel using Analytic Solver Platform. Yes!
Simulating Logistic Data
As Rick describes in his post, the first step is to generate random explanatory and response variables. He breaks this down as follows:
1. Assign the design matrix (X) of the explanatory variables. This step is done once. It establishes the values of the explanatory variables in the (simulated) study.
2. Compute the linear predictor, η = X β, where β is a vector of parameters. The parameters are the "true values" of the regression coefficients.
3. Transform the linear predictor by the logistic (inverse logit) function. The transformed values are in the range (0,1) and represent probabilities for each observation of the explanatory variables.
4. Simulate a binary response vector from the Bernoulli distribution, where each 0/1 response is randomly generated according to the specified probabilities from Step 3.
We can do this in Excel using Analytic Solver Platform’s built-in probability distributions.
Step 0. In Rick’s example, the parameters for the simulated logistic model are (2, 4, –1). So go get Analytic Solver Platform, fire up Excel, and enter 2, 4, –1 in a new worksheet:
Step 1. Let’s create the design matrix.
- Create a new worksheet and type in column headers x0, x1, x2 in row 1.
- Go to the next row. x0 is the intercept, so fill in 1 for A2.
- x1 is a uniform random number between 0 and 1, so use the Analytic Solver Platform formula =PsiUniform(0,1). (Excel’s RAND() would also work here.)
- x2 is normally distributed. Enter =PsiNormal(0,2).
Now you have something like this. Your values will be different because x1 and x2 are random.
Step 2. Create the linear predictor. Add a column header “eta” and enter the following formula: =SUMPRODUCT(Sheet1!$A$1:$C$1,Sheet2!A2:C2). This multiplies the parameters for the simulated logistic model with the design matrix values, and sums them up.
Step 3. Transform the linear predictor by the inverse logit function. Add a column “mu” with the inverse logit formula: =EXP(D2)/(1+EXP(D2)).
Step 4. Simulate a binary response vector. You can use the PsiBernoulli function to simulate 0-1 values. Add a column “y” with the formula =PsiBernoulli(E2). Your spreadsheet now looks something like this:
Step 5. Now you’ve got one simulated point. Copy and paste this row as far down as desired, for example a few hundred rows. If you use Excel’s “Fill Down” feature then make sure that column A is a column of “1” values, not “1, 2, 3, …”. Now you will have a big table with your simulated data.
Step 6. Freeze it! You may have noticed that the values of certain columns jiggle around. This is because new random values are being generated every time Excel recalcs. Click on the “Analytic Solver Platform” ribbon tab, click Tools, and click Freeze. This will lock the current samples in place.
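For readers who prefer code, the steps above translate directly to NumPy. This is just a sketch; the 500 rows and the random seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
beta = np.array([2.0, 4.0, -1.0])   # true parameters from Step 0

x0 = np.ones(n)                     # intercept column (the "1"s in column A)
x1 = rng.uniform(0, 1, n)           # like =PsiUniform(0,1)
x2 = rng.normal(0, 2, n)            # like =PsiNormal(0,2)
X = np.column_stack([x0, x1, x2])   # Step 1: design matrix

eta = X @ beta                      # Step 2: linear predictor
mu = 1 / (1 + np.exp(-eta))         # Step 3: inverse logit
y = rng.binomial(1, mu)             # Step 4: Bernoulli responses

print(y[:10])                       # one frozen draw of the responses
```

There is no “Freeze” step needed here: the arrays are sampled once and stay put, just like the frozen worksheet.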
Exploring the Data
Now we can use XLMiner to explore and model the simulated data! Click on the XLMiner ribbon tab, and then Explore –> Chart Wizard:
Select “Scatter Plot” as the chart type. Then select x2 for the y-axis, x1 for the x-axis, and color by y. You will get a nice color-coded scatter plot that you can customize to your heart’s content interactively:
Running Logistic Regression
Now we can run logistic regression to recover the coefficients for the simulated data. Click on the XLMiner ribbon tab and select Classify –> Logistic Regression. The simulation data should be preselected in the “Data range” box. Select x1 and x2 as Input Variables and y as the Output Variable. Your dialog will look something like this:
By default, XLMiner logistic regression assumes an intercept, so you can simply click Finish. The logistic regression runs, and the results are reported in an output worksheet. If you examine the coefficients, you should see that they are rather close to (2, 4, –1)!
Since everything is in Excel, you can perform additional analysis, build charts, or even score new data.
I am pleased to announce that Frontline Systems has released the newest version of our flagship product, Analytic Solver Platform. You can download a free trial of Analytic Solver Platform here.
Analytic Solver Platform makes it easy to learn from your data and make good decisions quickly. You don’t have to learn a new programming language, suffer through a complex deployment process, or abandon what you already know: you can grab data from your desktop, the web, or the cloud and build powerful predictive models in minutes from Excel.
In this release of Analytic Solver Platform you’ll find world-class time series, prediction, classification, data cleaning, and clustering methods in XLMiner. XLMiner’s 30+ data mining methods have been rewritten from the ground up, combining the latest advances in machine learning with a straightforward Excel interface. Data sets that crash more expensive competitive products run flawlessly in XLMiner. Better yet, XLMiner produces reports with all the information you need to make the business case for your findings, including built-in charts and visualizations.
Analytic Solver Platform works with Microsoft Power BI to turn data into insight. My recent post showed how cloud hosted data can be ingested, cleaned, and mined for insight in minutes. Analytic Solver Platform supplements Power Query’s data cleaning with additional methods to help you categorize, clean, and handle missing data, and provides built in connectors to allow you to sample and score with popular data sources including Power Pivot.
Finally, Analytic Solver Platform helps you bridge the gap between experimentation and production deployment. Using Analytic Solver Platform with SharePoint allows your organization to audit and version your models. Use Frontline’s Solver SDK to integrate simulation and optimization in your application whether you use C++, C#, or web technologies. The latest version of Solver SDK will provide support for the popular F# language, allowing your team to build predictive models with a fraction of the development cost and lines of code.
It’s all backed by our live support, comprehensive user guides, and over a hundred examples out of the box. We’re excited about Analytic Solver Platform 2014-R2 – we hope you’ll give it a try.
In this post, I’ll show you how to use Analytic Solver Platform in Excel 2013 with Power BI to build a time series analysis model using data from the cloud. Together, Analytic Solver Platform and Power BI provide a powerful, easy-to-use predictive analytics platform at a fraction of the cost of alternatives. A new release of Analytic Solver Platform is coming later this week! Visit solver.com to download a free trial or purchase.
My goal is to forecast Gross Domestic Product (GDP) in the United States. Historical GDP is available for free on Windows Azure Marketplace, so our first step is to subscribe to the data feed.
- Go to http://datamarket.azure.com and sign in using your Windows ID.
- Go to the Real GDP Per Capita By State data set and subscribe to it. The subscription is free.
If you look around you will find a number of interesting data sets.
Now we’d like to bring this data into Excel so we can work with it. Power Query, a free Excel add-on provided by Microsoft, makes this extremely easy. If you use Excel and do not know about Power Query, it is definitely worth your time!
- Download and install Microsoft Power Query for Excel here.
- Start Excel and click on the Power Query tab.
- Select “From Other Sources” and then “From Windows Azure Marketplace”:
- The Navigator task pane will open on the right hand side. Your Windows Azure Marketplace subscriptions will be listed, including Real GDP Per Capita By State:
- Expand the Real GDP Per Capita By State entry and click on its only sub-item, called RealGDPByStatePerCapital [sic].
- The Query Editor window is displayed. You should see GDP information by state as well as rolled up to the US:
- Let’s focus on national GDP. Click on the Area column header, uncheck “Select All” to clear all entries and scroll down to United States (screenshot below). Click OK.
- Click Apply and Close. A new worksheet with national GDP data will be created by Power Query.
These steps only scratch the surface of Power Query. For one thing, Power Query supports a number of interesting data sources, as you may have noticed in the menu in step 3. For another, Power Query lets you easily join and clean data, loading it into worksheets, or PowerPivot data models.
Now that we’ve got the data into Excel, we can use XLMiner to run a time series forecast.
- Click on the XLMiner ribbon tab.
- Under Time Series, select ARIMA -> ARIMA Model.
- Enter the following options (and check out the screenshot below):
- Choose RealGDPPerCapital as the “Selected Variable”. This is what we want to forecast.
- Select Year as the Time Variable.
- Set Autoregressive (p) = 4, Difference (d) = 0, Moving average (q) = 1.
- Click Advanced and:
- Check “Fitted values and residuals”. This will show the error in the time series analysis in the report.
- Check “Produce forecasts” and change “Number of forecasts” to 5.
- Click OK.
- Click OK. XLMiner produces an output worksheet, a residuals worksheet, and a stored model worksheet (which can be used for scoring).
- We want to look at how well the ARIMA model performed. Click on the ARIMA_Residuals worksheet and compare the Actual Value and Fitted Value columns. Not too shabby!
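For intuition about what the model is doing, here is a rough NumPy analogue: an AR(4) fit by least squares on a synthetic stand-in series (the real feed requires a Marketplace subscription), with five iterated forecasts. It approximates only the p = 4, d = 0 part; the q = 1 moving-average term needs a full ARIMA routine such as statsmodels:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(60)
# Synthetic GDP-like series: upward trend plus noise (a stand-in only).
series = 40000 + 300 * t + rng.normal(0, 200, t.size)

p = 4                        # autoregressive order, matching p = 4 above
n = series.size
# Design matrix: predict series[t] from an intercept and lags 1..p.
X = np.column_stack([np.ones(n - p)] +
                    [series[p - k : n - k] for k in range(1, p + 1)])
y = series[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coef
residuals = y - fitted       # analogous to the ARIMA_Residuals sheet

# Iterate one-step forecasts 5 steps ahead ("Number of forecasts" = 5).
history = list(series)
for _ in range(5):
    lags = [history[-k] for k in range(1, p + 1)]
    history.append(coef[0] + float(np.dot(coef[1:], lags)))
forecasts = history[-5:]
print(np.round(forecasts, 1))
```

The residuals play the same role as the Actual vs. Fitted comparison in the next step.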
Exploring the tables and charts in the output worksheets reveals other aspects of the model results, including the forecasted GDP values.
Once you get going, you’ll find that it’s really easy to build cool predictive models using Power Query and Analytic Solver Platform. In the coming days I will share some other interesting scenarios.