Simulating data for a logistic regression model in Excel

Rick Wicklin from SAS recently wrote a very nice article about simulating data for a logistic regression model. (I have been a loyal reader for years.) I thought it would be interesting to see if we can do the same thing in Excel using Analytic Solver Platform. Yes!

Download the completed workbook here.

Download Analytic Solver Platform here.

Simulating Logistic Data

As Rick describes in his post, the first step is to generate random explanatory and response variables. He breaks this down as follows:

1. Assign the design matrix (X) of the explanatory variables. This step is done once. It establishes the values of the explanatory variables in the (simulated) study.

2. Compute the linear predictor, η = X β, where β is a vector of parameters. The parameters are the "true values" of the regression coefficients.

3. Transform the linear predictor by the logistic (inverse logit) function. The transformed values are in the range (0,1) and represent probabilities for each observation of the explanatory variables.

4. Simulate a binary response vector from the Bernoulli distribution, where each 0/1 response is randomly generated according to the specified probabilities from Step 3.

We can do this in Excel using Analytic Solver Platform’s built-in probability distributions.

Step 0. In Rick’s example, the parameters for the simulated logistic model are (2, 4, –1). So go get Analytic Solver Platform, fire up Excel, and enter 2, 4, –1 in a new worksheet:

image

Step 1. Let’s create the design matrix.

  • Create a new worksheet and type in column headers x0, x1, x2 in row 1.
  • Go to the next row. x0 is the intercept, so fill in 1 for A2.
  • x1 is a uniform random number between 0 and 1, so use the Analytic Solver Platform formula =PsiUniform(0,1). (Excel’s RAND() would also work here.)
  • x2 is normally distributed. Enter =PsiNormal(0,2).
  • Now you have something like this. Your values will be different because x1 and x2 are random.

image

 

Step 2. Create the linear predictor. Add a column header “eta” and enter the following formula: =SUMPRODUCT(Sheet1!$A$1:$C$1,Sheet2!A2:C2). This multiplies the parameters for the simulated logistic model with the design matrix values, and sums them up.

Step 3. Transform the linear predictor by the inverse logit function. Add a column “mu” with the inverse logit formula: =EXP(D2)/(1+EXP(D2)).

Step 4. Simulate a binary response vector. You can use the PsiBernoulli function to simulate 0-1 values. Add a column “y” with the formula =PsiBernoulli(E2). Your spreadsheet now looks something like this:

image

Step 5. Now you’ve got one simulated point. Copy and paste this row down as far down as desired, for example a few hundred rows. If you use Excel’s “Fill Down” feature then make sure that column A is a column of “1” values, not “1, 2, 3, …”. Now you will have a big table with your simulated data.

Step 6. Freeze it! You may have noticed that the values of certain columns jiggle around. This is because new random values are being generated every time Excel recalcs. Click on the “Analytic Solver Platform” ribbon tab, click Tools, and click Freeze. This will lock the current samples in place.

image

Exploring the Data

Now we can use XLMiner to explore and model the simulated data! Click on the XLMiner ribbon tab, and then Explore –> Chart Wizard:

image

Select “Scatter Plot” as the chart type. Then select x2 for the y-axis, x1 for the x-axis, and color by y. You will get a nice color-coded scatter plot that you can customize to your heart’s content interactively:

image

Running Logistic Regression

Now we can run logistic regression to recover the coefficients for the simulated data. Click on the XLMiner ribbon tab and select Classify –> Logistic Regression. The simulation data should be preselected in the “Data range” box. Select x1 and x2 as Input Variables and y as the Output Variable. Your dialog will look something like this:

image

By default, XLMiner logistic regression assumes an intercept, so you can simply click Finish. The logistic regression runs, and the results are reported in an output worksheet. If you examine the coefficients, you should see that they are rather close to (2, 4, –1)!

image

Since everything is in Excel, you can perform additional analysis, build charts, or even score new data.

Download the complete workbook here, and download Analytic Solver platform here.

Author: natebrix

Follow me on twitter at @natebrix.

1 thought on “Simulating data for a logistic regression model in Excel”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s