Simulating data for a logistic regression model in Excel

Rick Wicklin from SAS recently wrote a very nice article about simulating data for a logistic regression model. (I have been a loyal reader for years.) I thought it would be interesting to see if we can do the same thing in Excel using Analytic Solver Platform. Yes!

Download the completed workbook here.

Download Analytic Solver Platform here.

Simulating Logistic Data

As Rick describes in his post, the first step is to generate random explanatory and response variables. He breaks this down as follows:

1. Assign the design matrix (X) of the explanatory variables. This step is done once. It establishes the values of the explanatory variables in the (simulated) study.

2. Compute the linear predictor, η = X β, where β is a vector of parameters. The parameters are the "true values" of the regression coefficients.

3. Transform the linear predictor by the logistic (inverse logit) function. The transformed values are in the range (0,1) and represent probabilities for each observation of the explanatory variables.

4. Simulate a binary response vector from the Bernoulli distribution, where each 0/1 response is randomly generated according to the specified probabilities from Step 3.

We can do this in Excel using Analytic Solver Platform’s built-in probability distributions.

Step 0. In Rick’s example, the parameters for the simulated logistic model are (2, 4, –1). So go get Analytic Solver Platform, fire up Excel, and enter 2, 4, –1 in a new worksheet:

image

Step 1. Let’s create the design matrix.

  • Create a new worksheet and type in column headers x0, x1, x2 in row 1.
  • Go to the next row. x0 is the intercept, so fill in 1 for A2.
  • x1 is a uniform random number between 0 and 1, so use the Analytic Solver Platform formula =PsiUniform(0,1). (Excel’s RAND() would also work here.)
  • x2 is normally distributed. Enter =PsiNormal(0,2).
  • Now you have something like this. Your values will be different because x1 and x2 are random.

image

 

Step 2. Create the linear predictor. Add a column header “eta” and enter the following formula: =SUMPRODUCT(Sheet1!$A$1:$C$1,Sheet2!A2:C2). This multiplies the parameters for the simulated logistic model with the design matrix values, and sums them up.

Step 3. Transform the linear predictor by the inverse logit function. Add a column “mu” with the inverse logit formula: =EXP(D2)/(1+EXP(D2)).

Step 4. Simulate a binary response vector. You can use the PsiBernoulli function to simulate 0-1 values. Add a column “y” with the formula =PsiBernoulli(E2). Your spreadsheet now looks something like this:

image

Step 5. Now you’ve got one simulated point. Copy and paste this row down as far down as desired, for example a few hundred rows. If you use Excel’s “Fill Down” feature then make sure that column A is a column of “1” values, not “1, 2, 3, …”. Now you will have a big table with your simulated data.

Step 6. Freeze it! You may have noticed that the values of certain columns jiggle around. This is because new random values are being generated every time Excel recalcs. Click on the “Analytic Solver Platform” ribbon tab, click Tools, and click Freeze. This will lock the current samples in place.

image

Exploring the Data

Now we can use XLMiner to explore and model the simulated data! Click on the XLMiner ribbon tab, and then Explore –> Chart Wizard:

image

Select “Scatter Plot” as the chart type. Then select x2 for the y-axis, x1 for the x-axis, and color by y. You will get a nice color-coded scatter plot that you can customize to your heart’s content interactively:

image

Running Logistic Regression

Now we can run logistic regression to recover the coefficients for the simulated data. Click on the XLMiner ribbon tab and select Classify –> Logistic Regression. The simulation data should be preselected in the “Data range” box. Select x1 and x2 as Input Variables and y as the Output Variable. Your dialog will look something like this:

image

By default, XLMiner logistic regression assumes an intercept, so you can simply click Finish. The logistic regression runs, and the results are reported in an output worksheet. If you examine the coefficients, you should see that they are rather close to (2, 4, –1)!

image

Since everything is in Excel, you can perform additional analysis, build charts, or even score new data.

Download the complete workbook here, and download Analytic Solver platform here.

Advertisements

Presenting Analytic Solver Platform 2014-R2

In 2014 Frontline Systems released the newest version of its flagship product, Analytic Solver Platform. You can download a free trial of Analytic Solver Platform here.

Analytic Solver Platform makes it easy to learn from your data and make good decisions quickly. You don’t have to learn a new programming language, suffer through a complex deployment process, or abandon what you already know: you can grab data from your desktop, the web, or the cloud and build powerful predictive models in minutes from Excel.

ASP1

In this release of Analytic Solver Platform you’ll find world class time series, prediction, classification, data cleaning, and clustering methods in XLMiner. XLMiner’s 30+ data mining methods have been rewritten from the ground up, combining the latest advances in machine learning with a straightforward Excel interface. Data sets that crash more expensive competitive products run flawlessly in XLMiner. Better yet, XLMiner produces reports with all the information you need to make the business case for your findings, including built-in charts and visualizations.

ASPTree

ASPChart

Analytic Solver Platform works with Microsoft Power BI to turn data into insight. My recent post showed how cloud hosted data can be ingested, cleaned, and mined for insight in minutes. Analytic Solver Platform supplements Power Query’s data cleaning with additional methods to help you categorize, clean, and handle missing data, and provides built in connectors to allow you to sample and score with popular data sources including Power Pivot.

Finally, Analytic Solver Platform helps you bridge the gap between experimentation and production deployment. Using Analytic Solver Platform with SharePoint allows your organization to audit and version your models. Use Frontline’s Solver SDK to integrate simulation and optimization in your application whether you use C++, C#, or web technologies. The latest version of Solver SDK will provide support for the popular F# language, allowing your team to build predictive models in a fraction of the development cost and lines of code.

SDK1

Give it a try!

Time Series Forecasting using Analytic Solver Platform, Windows Azure Marketplace, and Power Query

In this post, I’ll show you how to use Analytic Solver Platform in Excel 2013 with Power BI to build a time series analysis model using data from the cloud. Together, Analytic Solver Platform and Power BI provide a powerful, easy-to-use predictive analytics platform at a fraction of the cost of alternatives. A new release of Analytic Solver Platform is coming later this week! Visit solver.com to download a free trial or purchase.

My goal is to forecast Gross Domestic Product (GDP) in the United States. Historical GDP is available for free on Windows Azure Marketplace, so our first step is to subscribe to the data feed.

  1. Go to http://datamarket.azure.com and sign in using your Windows ID.
  2. Go to the Real GDP Per Capita By State data set and subscribe to it. The subscription is free.

If you look around you will find a number of interesting data sets. 

Now we’d like to bring this data into Excel so we can work with it. Power Query, a free Excel add-on provided by Microsoft, makes this extremely easy. If you use Excel and do not know about Power Query, it is definitely worth your time!

  1. Download and install Microsoft Power Query for Excel here.
  2. Start Excel and click on the Power Query tab.
  3. Select “From Other Sources” and then “From Windows Azure Marketplace”:From Windows Azure Marketplace
  4. The Navigator task pane will open on the right hand side. Your Windows Azure Marketplace subscriptions will be listed, including Real GDP Per Capita By State:Selecting Data Source
  5. Expand the Real GDP Per Capity By State entry and click on its only sub-item, called RealGDPByStatePerCapital [sic].
  6. The Query Editor window is displayed. You should see GDP information by state as well as rolled up to the US:
    GDP by State in Power Query 
  7. Let’s focus on national GDP. Click on the Area column header, uncheck “Select All” to clear all entries and scroll down to United States (screenshot below). Click OK.
  8. Click Apply and Close. A new worksheet with national GDP data will be created by Power Query.

These steps only scratch the surface of Power Query. For one thing, Power Query supports a number of interesting data sources, as you may have noticed in the menu in step 3. For another, Power Query lets you easily join and clean data, loading it into worksheets, or PowerPivot data models.

Now that we’ve got the data into Excel, we can use XLMiner to run a time series forecast.

  1. Click on the XLMiner ribbon tab.
  2. Under Time Series, select ARIMA -> ARIMA Model.
  3. Enter the following options (and check out the screenshot below):
    1. Choose RealGDPPerCapital as the “Selected Variable”. This is what we want to forecast.
    2. Select Year as the Time Variable.
    3. Set Autoregressive (p) = 4, Difference (d) = 0, Moving average (q) = 1.
      ARIMA Options 
  4. Click Advanced and:
    1. Check “Fitted values and residuals”. This will show the error in the time series analysis in the report.
    2. Check “Produce forecasts” and change “Number of forecasts” to 5.
    3. Click OK.
  5. Click OK. XLMiner produces an output worksheet, a residuals worksheet, and a stored model worksheet [which can be used for scoring].
  6. We want to look at how well the ARIMA model performed. Click on the ARIMA_Residuals worksheet and compare the Actual Value and Fitted Value columns. Not too shabby!

If I explore the tables and charts in the output worksheets, I can explore other aspects of the model results, including the forecasted GDP values.

Once you get going, you’ll find that it’s really easy to build cool predictive models using Power Query and Analytic Solver Platform. In the coming days I will share some other interesting scenarios.