Regression in Microsoft EXCEL




This past week, I assigned a project in my Applied Calculus class in which the students must obtain a logistic regression model of the population growth of the United States. This is quite easy to do on the TI-83Plus graphing calculator. In a previous posting, I outlined the steps needed to produce an exponential regression of the form f(x) = abx. The only real difference between the steps needed to obtain an exponential model and a logistic model is that for the logistic model, after you've entered the data, you press STAT, use the arrow keys to select CALC--> B:Logistic. You can obtain regression models in MS EXCEL, too. For an exponential regression model of the form y=f(x) = aebx, you must find a linear regression model for ln(y)= ln(a)+ bx.
Read more for details.


The following figure shows United States Census population data from 1790 to 1940, stored in an MS EXCEL workbook. The year is stored in Column A, the population (in millions) is stored in Column C, and Column B stores the value of X, which denotes the number of years after 1790. In the class project, I gave an exponential function which fits the data quite well for the years 1790-1860. I actually used only those years to obtain the exponential model. (Later, I'll show you the exponential regression model obtained using the entire data, 1790-1940.)

To obtain the exponential model for years 1790-1860, store the natural logarithm of the population in Column D, as shown in the following figure.




Next, define a few other columns. Column E will store the value of a, Column F will store the value of F, column H will store the values of the exponential model (evaluated at the x-values stored in Column B), and Column I will store the corresponding square of the difference between the actual population data and the exponential model's population value. Cell I18 will store the sum of the square differences. (I've also labeled Column G as constant c, which is not used in the exponential regression. That's simply not used here and can be ignored.) I've initialized the values of a and b, but those are not actually needed here. I'm just going to change them later, anyway.

















Once all the data has been entered, you can produce the regression model. From Tools, select Data Analysis..., and then in the Data Analysis Window, select Regression, as shown in the following two figures.




Fill in the appropriate data and output ranges, as shown in the following figure.




After you click OK, the output will be occupy the cells you specified in the Output Range field. The coefficients store the ln(a) and b values, as shown in the figure below. In particular, the intercept value, ln(a) = 1.368117339, and the slope value b = 0.029583301.




The corresponding cells are used to store the appropriate values of a and b. (Notice that, since cell B37 contains the value of ln(a), we must take EXP(B37) to obtain the correct value of a. )



The following figure shows the finished product, after reformatting the P(x) cells to show only one decimal place.



The exponential model produced is

The following figure shows the results of obtaining an exponential model following the procedure just described, but using the data for the every 10 years beginning with 1790 and ending with 1940. I'll leave it to the reader to examine the quality and appropriateness of the two exponential models.





In the class project, I gave an exponential model developed from data in the range 1790-1860 that I obtained with MS EXCEL's Solver. Here's how I did it.

Enter the data, as shown in the figure below. Fill in values for a and b. I've chosen 3.9 (initial population) and an initial guess of 0.039 for b. (I had a good idea of what values to take from the earlier computation. That's cheating! On the other hand, you can obtain reasonable initial guesses by taking a couple of data points and finding the exponential function that satisfies the two points. Then take the resulting values of a and b as your initial guesses.)





Highlight the cell containing the average square difference, and select Tools --> Solver...




In the Solver Parameters window, set the appropriate fields to minimize the average square difference by modifying the values of a and b, in Cells E2 and F2, respectively. This is illustrated in the following figure.




Clicking the Options button, presents you with a few additional choices. You can select Assume Non-negative, since all values involved are positive.




Click OK in the Solver Options window and then click OK in the Solver Parameters window. You'll be presented with some options on whether or not to replace the original values of a and b, in Cells E2 and F2, respectively, and how much detail to include in the report on the computation.



The final product is shown in the figure below. To see the report, you much click on the Limits Report tab at the bottom of the EXCEL workbook window.




The exponential model produced by the procedure and used in the class project was .

In a manner similar to the one outlined above, Solver can be used to obtain a logistic model.

Of course, you'll need to have the Data Analysis tools and Solver included with your MS Office installation. These nifty tools may not be installed by default, but you can install them after your initial installation of MS Office. In my opinion Solver is one of the chief advantages of MS Office over OpenOffice. There's a prerelease version of "Solver" for OpenOffice, but it's not ready for prime time, and is not yet available for every OS. There's no version for Mac OS X, and the Windows version I tried seemed a bit buggy. (I couldn't obtain a logistic model with it.) None of that is unusual for prerelease software, though. It is a very promising contribution to the OpenOffice community.

Once there's a decent version of Solver for OpenOffice (for Mac OS X, Linux, and Windows), MS Office will have some real competition.





Posted: Friday - November 24, 2006 at 11:03 AM        


©