[Q&A] [Discussion] How to run a regression in SPSS!







                          David P. Nichols
                     Senior Support Statistician
                             SPSS, Inc.
                 From SPSS Keywords, Number 56, 1995

When we polled Keywords readers to find out what kinds of topics they most
wanted to see covered in future Statistically Speaking articles, we found
that many SPSS users are concerned about the proper use of categorical
predictor variables in regression models. Since the interpretation of the
estimated coefficients is a major part of the analysis of a regression
model, and since this interpretation depends upon how the predictors have
been coded (or in technical terms, how the model has been parameterized),
this is indeed an important topic.

To begin with, we will assume that the model under consideration involves
only first order or main effects of predictor variables. That is, no
higher order polynomial terms such as squares or cubes are used, and no
interactions between predictors are involved. Such higher order or product
terms introduce complexities beyond those introduced by the presence of main
effects involving categorical variables. We will avoid these complexities
for the time being. We will further assume that we have complete data; that
is, no missing values on any predictor or dependent variables. We begin with
a brief review of the interpretation of estimated regression coefficients.

As you may remember, in a linear regression model the estimated raw or
unstandardized regression coefficient for a predictor variable (referred
to as B on the SPSS REGRESSION output) is interpreted as the change in the
predicted value of the dependent variable for a one unit increase in the
predictor variable. Thus a B coefficient of 1.0 would indicate that for
every unit increase in the predictor, the predicted value of the dependent
variable also increases by one unit. In the common case where there are two
or more correlated predictors in the model, the B coefficient is known as a
partial regression coefficient, and it represents the predicted change in
the dependent variable when that predictor is increased by one unit while
holding all other predictors constant. The intercept or constant term gives
the predicted value of the dependent variable when all predictors are set
to 0.

For our purposes the important distinction between different types of
predictor variables is between those measured on at least an interval scale,
where a change of one unit in the predictor has a constant meaning across
the entire scale, and those where such consistency of unit differences is
not assumed. Though these are theoretically distinct, in practice it often
happens that the terms interval and subinterval are replaced by continuous
and categorical. The interpretation of estimated regression coefficients
given above applies in a fairly straightforward manner to interval
predictors, continuous or not, and their use in procedures like REGRESSION
is quite simple as a practical matter: just name them as independent
variables and specify when you want them used. For subinterval variables,
which is what SPSS assumes categorical variables to be, things are more
complicated. Despite the fact that equating continuous with interval and
categorical with subinterval is an abuse of language, we'll proceed to do
just that, to avoid confusion related to use of SPSS procedures.

One reason that the handling of categorical predictors is so important is
that by the time one gets to the actual computation of the regression
equation, no distinction is made between subinterval and interval variables.
To put it another way, a matrix algebra routine knows nothing about different
types of numbers; they're all just numbers. Some SPSS procedures used to
analyze linear and generalized linear regression models are designed to
handle the translation from categorical to interval representations with
only minimal guidance from the user. These include the T-TEST procedure,
the analysis of variance procedures ONEWAY, ANOVA and MANOVA, and the newer
nonlinear regression procedures LOGISTIC REGRESSION and COX REGRESSION.
However, even when such automatic handling of categorical predictors is
available, it is still incumbent upon the user to make sure that he or she
understands categorical variable representations well enough to produce
useful results and to be able to interpret these results.

The simplest possible regression involving categorical predictors is one
with a single dichotomous (two level) independent variable. An example of
such a regression model would be the prediction of 1990 murder rates in
each of the 50 states in the U.S.A. based upon whether or not each state
had a death penalty statute in force just prior to and during that time.
The data are compiled from almanac sources; murder rates are measured in
number per 100,000 population. The variable of interest, denoted MURDER90,
has a mean value of about 4.97 for the fourteen states without a death
penalty statute, and about 7.86 for the 36 states with the death penalty.

Figure 1 presents the results of a dummy variable regression of MURDER90
on DEATHPEN, a categorical variable taking on a value of 0 for the no
death penalty states and 1 for the death penalty states. 0-1 coding, known
as dummy or indicator coding, is quite popular, as it often lends itself to
the simplest possible interpretation.

Figure 1
Multiple R           .33556
R Square             .11260
Adjusted R Square    .09411
Standard Error      3.72103

Analysis of Variance
                    DF      Sum of Squares      Mean Square
Regression           1            84.33257         84.33257
Residual            48           664.61163         13.84608

F =       6.09072       Signif F =  .0172

------------------ Variables in the Equation ------------------

Variable              B        SE B       Beta         T  Sig T

DEATHPEN       2.892460    1.172015    .335562     2.468  .0172
(Constant)     4.971429     .994488                4.999  .0000
End Figure 1

Here we have two coefficients, a constant or intercept term, and a "slope"
coefficient for the DEATHPEN variable. Recall that the interpretation is
that the constant is the predicted value when all predictors are set to
0, which here simply represents those states with no death penalty. Thus
the constant coefficient is equal to the mean murder rate for this group.
The DEATHPEN coefficient is the predicted increase in murder rate for a
unit increase in the DEATHPEN variable. Since those states with a DEATHPEN
value of 1 are those states with a death penalty statute, this coefficient
represents the change in estimated or predicted murder rate for these
states relative to those without the death penalty. The 2.89 value is
exactly the difference between the two means, so that adding it to the
constant produces the mean for the death penalty states. Since we are
considering the entire population of states, the significance level is
not necessarily of particular interest, though if we were to conceptualize
the current situation as resulting from a sampling from some hypothetical
populations, the p-value of .0172 would indicate that so large a coefficient
is unlikely to result from chance were random samples of this size drawn
from hypothetical populations with equal means.
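
The relationship between the dummy variable coefficients and the group
means can be checked with a few lines of code. The following is a minimal
sketch in Python, not SPSS syntax; the function name simple_ols and the
seven data values are made up for illustration and are NOT the 1990 state
murder rates. It fits a one-predictor least-squares line and compares the
intercept and slope with the two group means.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical toy data, not the data analyzed in Figure 1.
group = [0, 0, 0, 1, 1, 1, 1]                  # 0-1 dummy (indicator) coding
rate = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]    # dependent variable

intercept, slope = simple_ols(group, rate)
mean0 = mean(r for g, r in zip(group, rate) if g == 0)
mean1 = mean(r for g, r in zip(group, rate) if g == 1)

# The intercept matches the mean of the group coded 0, and the slope
# matches the difference between the two group means.
print(intercept, mean0)
print(slope, mean1 - mean0)
```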

Other results of note are that the p-value for the t-test for the DEATHPEN
coefficient is the same as that for the overall regression F-test. This is
due to the fact that the t-test tests the null hypothesis that this
coefficient is 0 in the population, while the F-test tests the null
hypothesis that all coefficients other than intercept are 0 in the
population, and with only one predictor, these hypotheses are the same.
The F-value is precisely the square of the t-value. This holds only for
a simple regression involving one predictor. Also of note is the fact
that the Multiple R, which reduces to the absolute value of the
correlation between the predictor and the dependent variable in a simple
regression, is equal to the standardized regression coefficient (Beta).
In a simple regression, the standardized coefficient is the correlation
between the predictor and dependent variables, and is thus constrained
to be between -1 and +1. Note that this generally holds true only for a
simple regression, and that with correlated predictor variables, the
standardized coefficients may be larger than 1 in absolute value.
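
These identities can be verified from the numbers printed in Figure 1.
The short Python sketch below uses only values copied from that figure;
the variable names are our own. It recomputes F from the mean squares,
t from B and its standard error, and Multiple R from the sums of squares.
Small discrepancies are expected because the printed values are rounded.

```python
import math

# Values copied from Figure 1.
ss_reg, df_reg = 84.33257, 1       # regression sum of squares and DF
ss_res, df_res = 664.61163, 48     # residual sum of squares and DF
b, se_b = 2.892460, 1.172015       # DEATHPEN coefficient and its SE
beta = 0.335562                    # standardized coefficient

f_value = (ss_reg / df_reg) / (ss_res / df_res)   # overall F-test
t_value = b / se_b                                # t-test for DEATHPEN
r_square = ss_reg / (ss_reg + ss_res)
multiple_r = math.sqrt(r_square)

# With one predictor, F equals t squared and Multiple R equals |Beta|.
print(round(f_value, 3), round(t_value ** 2, 3))
print(round(multiple_r, 4), round(abs(beta), 4))
```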

This correlation between a dichotomous variable and a continuous variable
is sometimes known as a point-biserial correlation. No special formula is
required; special computational formulas in texts are simply special cases
of the general Pearson product moment correlation coefficient formula
applied to this combination of variable types. If both variables are
dichotomous, the standard formula reduces further to that for a phi
coefficient.
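
As an illustration of the point that no special formula is needed, the
sketch below computes the correlation both ways in Python. The helper
names pearson_r and point_biserial and the data values are hypothetical,
chosen only for this example. The textbook point-biserial formula used
here is the difference in group means divided by the standard deviation
of the dependent variable (computed with n in the denominator), times the
square root of the product of the two group proportions.

```python
import math
from statistics import mean, pstdev

def pearson_r(x, y):
    """General Pearson product moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(group, y):
    """Textbook point-biserial formula for a 0-1 coded group variable."""
    y0 = [v for g, v in zip(group, y) if g == 0]
    y1 = [v for g, v in zip(group, y) if g == 1]
    n0, n1, n = len(y0), len(y1), len(y)
    return (mean(y1) - mean(y0)) / pstdev(y) * math.sqrt(n0 * n1 / n ** 2)

# Hypothetical toy data.
group = [0, 0, 0, 1, 1, 1, 1]
y = [4.0, 5.5, 6.0, 7.0, 8.5, 9.0, 10.5]

r_general = pearson_r(group, y)
r_special = point_biserial(group, y)
print(r_general, r_special)   # the two values agree up to rounding error
```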

Finally, note that there are a number of ways in SPSS to achieve the same
results we obtained from REGRESSION, if our purpose were to test the null
hypothesis of equality of means between the two groups of states drawn
from our hypothetical populations. Precisely the same t-statistic (or the
negative of the value from REGRESSION, which means the same thing, given
the variable codings) could be obtained from T-TEST, the CONTRAST option in
ONEWAY or parameter estimate output in MANOVA, and the F-statistic could be
duplicated in ONEWAY, ANOVA or MANOVA. In ONEWAY or ANOVA we would have to
use the dummy variable for DEATHPEN as a two level factor, while in MANOVA
we could either specify it as a factor or as a covariate. The results in
any case would be the same in terms of test statistics and p-values. One
example is given in Figure 2, using default DEVIATION contrasts in MANOVA:

Figure 2
Tests of Significance for MURDER90 using UNIQUE sums of squares
Source of Variation          SS      DF        MS         F  Sig of F

WITHIN+RESIDUAL          664.61      48     13.85
DEATHPEN                  84.33       1     84.33      6.09      .017
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Estimates for MURDER90
--- Individual univariate .9500 confidence intervals


  Parameter     Coeff.  Std. Err.    t-Value     Sig. t Lower -95%  CL- Upper

        1   6.41765873     .58601   10.95150     .00000    5.23941    7.59591


  Parameter     Coeff.  Std. Err.    t-Value     Sig. t Lower -95%  CL- Upper

        2   -1.4462302     .58601   -2.46794     .01721   -2.62448    -.26798
End Figure 2

We see that the significance level for the t-test for the DEATHPEN parameter
estimate is the same as we obtained in REGRESSION. However, it is opposite
in sign to our earlier coefficient, and only half the size. Our constant
has also changed; note that its value is now halfway between the means of
the two groups. The differences we see here arise because MANOVA uses a
different set of predictor codings internally. That is, MANOVA has
parameterized the model somewhat differently than we did earlier. The default
DEVIATION contrasts in MANOVA are designed to compare each level of a factor
to the mean of all levels. In this case the DEATHPEN coefficient compares the
no death penalty mean to the simple average of the two group means. The
CONSTANT coefficient is this simple mean of group means. The F-statistic
remains the same, the square of the t-value, as was the case in REGRESSION.

These results point to three important features of the regression model. One
is that the interpretation of the estimated model coefficients is dependent
upon the parameterization of the model; in order to know how to interpret the
coefficients, we must be aware of how the predictor values have been coded.
Second, despite having used two different parameterizations, we obtained the
same results in terms of the test statistics for DEATHPEN. This result would
occur regardless of the two numerical values used to represent the groups;
all we could do by changing these values would be to flip the sign of the
coefficient and inflate or deflate the coefficient and its standard error by
an equal factor, so that the absolute value of the ratio remains the same.
This is true because the existence of only two groups means that a numerical
representation of a comparison between them can only be done in one way; any
differences in practical results are due to scaling considerations. Another
way of saying this is to note that since we have only two groups, there can
be only one degree of freedom in any test used to compare them, and the
results must therefore always be the same. Finally, though the identical
error sums of squares only intimate this and do not necessarily prove it,
it is the case that the predicted values produced by the two approaches are
identical. In other words, we have really fitted the same overall model in
two slightly different ways.
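
The invariance described above can be demonstrated numerically. The
following Python sketch is an assumption-laden illustration rather than a
reproduction of either SPSS procedure: the function fit and the data are
hypothetical. It fits the same toy data under 0-1 coding and under 1/-1
coding, then compares the coefficients, t statistics and predicted values.

```python
import math
from statistics import mean

def fit(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, t statistic for b)."""
    n = len(y)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2) / sxx)     # standard error of b
    return a, b, b / se_b

# Hypothetical toy data, not the state murder rates.
group = [0, 0, 0, 1, 1, 1, 1]
y = [4.0, 5.5, 6.0, 7.0, 8.5, 9.0, 10.5]

dummy = group                                    # 0-1 coding
deviation = [1 if g == 0 else -1 for g in group] # first group 1, second -1

a1, b1, t1 = fit(dummy, y)
a2, b2, t2 = fit(deviation, y)

pred1 = [a1 + b1 * x for x in dummy]
pred2 = [a2 + b2 * x for x in deviation]

print(b1, b2)   # b2 is -1/2 times b1
print(t1, t2)   # same absolute value, opposite sign
```

The predicted values in pred1 and pred2 agree case by case, which is the
sense in which the two codings fit the same overall model.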

We have yet to identify the codings given to the two levels of DEATHPEN that
resulted in the MANOVA parameter estimates. In MANOVA we specified DEATHPEN
as a categorical factor variable with codes of 0 and 1, and had the procedure
internally create the design or basis matrix required for the model fitting.
In REGRESSION, only the constant or intercept column of 1's is provided
automatically by the procedure; the other columns are provided by the user in
the form of the predictor variables specified. In MANOVA, the procedure
automatically creates a set of predictor variables to represent a factor
instead of requiring the user to do so. In the case of a dichotomous factor,
MANOVA creates only one predictor in addition to the constant term, and by
default it gives this variable values of 1 and -1, respectively. In our
example, the states without the death penalty are the first group (having
factor variable value 0), and are coded 1, while states with the death
penalty receive a value of -1.

If we recall the interpretation of the regression coefficient as the increase
in the predicted value of the dependent variable for a unit increase in the
predictor, we can see why the DEATHPEN coefficient in MANOVA is -1/2 that of
the one in REGRESSION. First, the directionality has been changed. That is,
an increase in the predictor means moving from the death penalty group toward
the no death penalty group. Thus the change in sign. Second, in order to
compare the two groups in this parameterization, we must move two units, from
-1 to 1, rather than from 0 to 1. Thus the two parameterizations are really
telling us exactly the same thing. This is further illustrated by using the
MANOVA results to predict the murder rates of the two groups. For the states
with no death penalty, we add the CONSTANT and DEATHPEN coefficients, giving
us a predicted value of about 4.97. For the death penalty group, we subtract
the DEATHPEN coefficient from the CONSTANT, and obtain a predicted value of
about 7.86. These are of course the same values obtained using REGRESSION.
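
The arithmetic can be spelled out using the two parameter estimates
printed in Figure 2. This is a minimal Python sketch; the names codes and
predicted are our own, and the codes of 1 and -1 are the default MANOVA
codings described above.

```python
# Parameter estimates copied from Figure 2.
constant = 6.41765873    # parameter 1 (CONSTANT)
deathpen = -1.4462302    # parameter 2 (DEATHPEN)

# Default DEVIATION codes: first group 1, second group -1.
codes = {"no death penalty": 1, "death penalty": -1}

predicted = {label: constant + deathpen * code
             for label, code in codes.items()}

for label, value in predicted.items():
    print(label, round(value, 2))
```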

What if we wanted to produce the same estimates in MANOVA that we obtained
in REGRESSION? The only straightforward way to produce exactly the same
estimates would be to enter the DEATHPEN predictor as a covariate coded 0-1.
(There is a way to trick MANOVA into providing the same coefficients as
REGRESSION even with DEATHPEN as a factor, but we'll ignore that here.)
The reason for this is that in its automatic reparameterization or internal
recoding of the factor(s), MANOVA enforces a sum to 0 restriction on the
values of the category codings. Thus 0-1 coding is not available. We can
still obtain the same parameter value for the difference between the two
groups of states however. This can be obtained by using SIMPLE contrasts
with the first category as the reference category. This uses category codes
of -1/2 and 1/2, so that an increase of one unit in the predictor would mean
a change from no death penalty to the death penalty, and the resulting
coefficient would be the same in both magnitude and sign as that given in
REGRESSION. However, the constant or intercept term would still be the
unweighted mean of the two group means. Using the CONSTANT coefficient from
the MANOVA output plus or minus 1/2 times the DEATHPEN coefficient from the
REGRESSION output, you can verify that this parameterization again produces
exactly the same predicted values as our earlier approaches.
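
For readers who want to carry out that verification, here is a short
Python sketch. It combines the CONSTANT from Figure 2 with the DEATHPEN
coefficient from Figure 1, using the -1/2 and 1/2 codes described above;
the variable names are illustrative only.

```python
# CONSTANT from the MANOVA output (Figure 2): unweighted mean of group means.
constant = 6.41765873
# DEATHPEN coefficient from the REGRESSION output (Figure 1).
deathpen = 2.892460

# SIMPLE contrast codes with the first category as the reference.
simple_codes = {"no death penalty": -0.5, "death penalty": 0.5}

simple_predicted = {label: constant + deathpen * code
                    for label, code in simple_codes.items()}

for label, value in simple_predicted.items():
    print(label, round(value, 2))
```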

So much for the simple situation of a dichotomous predictor. As we have seen,
in this situation the coding of the variable is important in interpreting the
value of the regression coefficient, but not when we want to test whether the
predictor has a nonzero population relationship with the dependent variable.
One way to think about this fact is that when there are only two values of a
predictor, there is only one interval between those values, so the assumption
of equal meanings of intervals is automatically satisfied. However, once we
move to predictors with more than two levels, things become more complicated.
We'll save those complications for the next issue.

