http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm
A similar approach is used here; it is worth reading through carefully.
3.3 Multicollinearity
Multicollinearity (or collinearity for short) occurs when two or more independent variables in the model are approximately determined by a linear combination of the other independent variables in the model. For example, we would have a problem with multicollinearity if we had both height measured in inches and height measured in feet in the same model. The degree of multicollinearity can vary and can have different effects on the model. When perfect collinearity occurs, that is, when one independent variable is an exact linear combination of the others, it is impossible to obtain a unique estimate of the regression coefficients with all the independent variables in the model. What Stata does in this case is drop a variable that is a perfect linear combination of the others, leaving in the model only variables that are not exact linear combinations of one another, so that the remaining coefficients are uniquely estimable.

For example, we can artificially create a new variable called perli as the sum of yr_rnd and meals. The only purpose of this example, and of the variable perli, is to show what Stata does when perfect collinearity occurs. Notice that Stata issues a note informing us that the variable yr_rnd has been dropped from the model due to collinearity. We cannot assume that the variable Stata drops is the "correct" variable to omit from the model; rather, we need to rely on theory to determine which variable should be omitted.
use http://www.ats.ucla.edu/stat/Stata/webbooks/logistic/apilog, clear
gen perli=yr_rnd+meals
logit hiqual perli meals yr_rnd
note: yr_rnd dropped due to collinearity
(Iterations omitted.)
Logit estimates Number of obs = 1200
LR chi2(2) = 898.30
Prob > chi2 = 0.0000
Log likelihood = -308.27755 Pseudo R2 = 0.5930
------------------------------------------------------------------------------
hiqual | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
perli | -.9908119 .3545667 -2.79 0.005 -1.68575 -.2958739
meals | .8833963 .3542845 2.49 0.013 .1890113 1.577781
_cons | 3.61557 .2418967 14.95 0.000 3.141462 4.089679
------------------------------------------------------------------------------
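Why a unique estimate is impossible can be checked outside of Stata. The sketch below (pure Python, with made-up 0/1 and percent values rather than the apilog data) builds a small design matrix containing a constant, yr_rnd, meals, and perli = yr_rnd + meals, and shows that the matrix is rank-deficient: it has fewer independent columns than variables, so one column must be dropped before coefficients can be uniquely estimated.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]
    rank = 0
    n_rows, n_cols = len(m), len(m[0])
    for col in range(n_cols):
        # pick the row (at or below the current pivot row) with the largest entry
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # column is (numerically) dependent on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(n_rows):
            if r != rank and abs(m[r][col]) > tol:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

# Hypothetical values, not the apilog data:
yr_rnd = [0, 1, 0, 1, 1]
meals  = [30, 75, 50, 90, 60]
perli  = [a + b for a, b in zip(yr_rnd, meals)]  # perfect linear combination

# constant + 3 predictors = 4 columns, but only 3 are independent
X = [[1, y, m, p] for y, m, p in zip(yr_rnd, meals, perli)]
print(matrix_rank(X))  # 3, one short of the 4 columns
```

This is exactly the situation Stata detects before dropping yr_rnd: no matter which of the dependent columns is removed, the fit of the model is unchanged, which is why theory rather than software should decide which variable to omit.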
Moderate multicollinearity is fairly common since any correlation among the independent variables is an indication of collinearity. When severe multicollinearity occurs, the standard errors for the coefficients tend to be very large (inflated), and sometimes the estimated logistic regression coefficients can be highly unreliable. Let's consider the following example. In this model, the dependent variable will be hiqual, and the predictor variables will include avg_ed, yr_rnd, meals, full, and the interaction between yr_rnd and full, yxfull. After the logit procedure, we will also run a goodness-of-fit test. Notice that the goodness-of-fit test indicates that, overall, our model fits pretty well.
gen yxfull= yr_rnd*full
logit hiqual avg_ed yr_rnd meals full yxfull, nolog or
Logit estimates Number of obs = 1158
LR chi2(5) = 933.71
Prob > chi2 = 0.0000
Log likelihood = -263.83452 Pseudo R2 = 0.6389
------------------------------------------------------------------------------
hiqual | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avg_ed | 7.163138 2.041592 6.91 0.000 4.097315 12.52297
yr_rnd | 70719.31 208020 3.80 0.000 221.6864 2.26e+07
meals | .9240607 .0073503 -9.93 0.000 .9097661 .93858
full | 1.051269 .0152644 3.44 0.001 1.021773 1.081617
yxfull | .8755202 .0284632 -4.09 0.000 .8214734 .9331228
------------------------------------------------------------------------------
lfit, group(10)
Logistic model for hiqual, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
number of observations = 1158
number of groups = 10
Hosmer-Lemeshow chi2(8) = 5.50
Prob > chi2 = 0.7034
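The lfit, group(10) statistic above is the Hosmer-Lemeshow test: observations are sorted into deciles of predicted probability, and in each group the observed number of successes is compared with the number expected under the model. A minimal Python sketch of the arithmetic, using hypothetical per-decile summaries (not the apilog data):

```python
def hosmer_lemeshow(groups):
    """Hosmer-Lemeshow chi-square statistic.

    groups: list of (n_g, observed_successes, mean_predicted_probability),
    one tuple per decile. Degrees of freedom = number of groups - 2.
    """
    chi2 = 0.0
    for n, obs, p in groups:
        expected = n * p
        # denominator n*p*(1-p) is the binomial variance within the group
        chi2 += (obs - expected) ** 2 / (expected * (1 - p))
    return chi2

# Hypothetical decile summary: (group size, observed successes, mean p-hat)
groups = [(100, 3, 0.02), (100, 5, 0.06), (100, 11, 0.12),
          (100, 22, 0.25), (100, 41, 0.40), (100, 58, 0.60),
          (100, 73, 0.75), (100, 85, 0.86), (100, 94, 0.93),
          (100, 99, 0.98)]
stat = hosmer_lemeshow(groups)
print(round(stat, 2))  # small chi2 on 8 df: no evidence of poor fit
```

With 10 groups the statistic is referred to a chi-square distribution with 8 degrees of freedom, matching the chi2(8) label in the Stata output; a large p-value, as here, means the test finds no evidence of lack of fit.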
Nevertheless, notice that the odds ratio and standard error for the variable yr_rnd are incredibly high. Apparently something went wrong. The direct cause of the incredibly large odds ratio and very large standard error is multicollinearity among the independent variables. We can use a program called collin to detect it. You can download the program from the ATS website of Stata programs for teaching and research, or locate it from within Stata by typing findit collin.
collin avg_ed yr_rnd meals full yxfull
Collinearity Diagnostics
SQRT Cond
Variable VIF VIF Tolerance Eigenval Index
-------------------------------------------------------------
avg_ed 3.28 1.81 0.3050 2.7056 1.0000
yr_rnd 35.53 5.96 0.0281 1.4668 1.3581
meals 3.80 1.95 0.2629 0.6579 2.0279
full 1.72 1.31 0.5819 0.1554 4.1728
yxfull 34.34 5.86 0.0291 0.0144 13.7284
-------------------------------------------------------------
Mean VIF 15.73 Condition Number 13.7284
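The condition indices in the last column come from the eigenvalues of the scaled cross-product matrix: each index is the square root of the largest eigenvalue divided by that eigenvalue, and the condition number is the largest index. The collin output is computed from the full design matrix, but the idea is easiest to see in the two-predictor case, where the scaled matrix is [[1, r], [r, 1]] with eigenvalues 1 + |r| and 1 - |r|. A small sketch (the 0.981 below is the yxfull/yr_rnd correlation reported later in this chapter, used only to illustrate the formula, so the result will not match collin's multi-variable condition number):

```python
from math import sqrt

def condition_number_2x2(r):
    """Condition number for two standardized predictors with correlation r.

    The scaled X'X matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and
    1 - |r|; the condition number is sqrt(largest / smallest eigenvalue).
    """
    lam_max, lam_min = 1 + abs(r), 1 - abs(r)
    return sqrt(lam_max / lam_min)

print(round(condition_number_2x2(0.0), 2))    # uncorrelated predictors: 1.0
print(round(condition_number_2x2(0.981), 2))  # near-collinear pair: large
```

As r approaches 1 the smallest eigenvalue approaches 0 and the condition number blows up, which is the eigenvalue-based signature of the same problem that the VIF measures.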
All of the measures in the above output gauge the strength of the interrelationships among the variables. Two commonly used measures are tolerance (an indicator of how much collinearity a regression analysis can tolerate) and the VIF (variance inflation factor, an indicator of how much of the inflation of the standard error could be caused by collinearity). The tolerance for a particular variable is 1 minus the R2 that results from regressing that variable on all of the other independent variables. The corresponding VIF is simply 1/tolerance. If all of the variables are orthogonal to each other, in other words, completely uncorrelated with each other, both the tolerance and the VIF are 1. If a variable is very closely related to the other variable(s), the tolerance goes to 0 and the variance inflation factor gets very large. For example, in the output above, we see that the tolerance and VIF for the variable yxfull are 0.0291 and 34.34, respectively. We can reproduce these results by running the corresponding regression.
regress yxfull full meals yr_rnd avg_ed
Source | SS df MS Number of obs = 1158
-------------+------------------------------ F( 4, 1153) = 9609.80
Model | 1128915.43 4 282228.856 Prob > F = 0.0000
Residual | 33862.2808 1153 29.3688472 R-squared = 0.9709
-------------+------------------------------ Adj R-squared = 0.9708
Total | 1162777.71 1157 1004.9937 Root MSE = 5.4193
------------------------------------------------------------------------------
yxfull | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
full | .2313279 .0140312 16.49 0.000 .2037983 .2588574
meals | -.00088 .0099863 -0.09 0.930 -.0204733 .0187134
yr_rnd | 83.10644 .4408941 188.50 0.000 82.2414 83.97149
avg_ed | -.4611434 .3744277 -1.23 0.218 -1.195779 .2734925
_cons | -19.38205 2.100101 -9.23 0.000 -23.5025 -15.2616
------------------------------------------------------------------------------
Notice that the R2 is .9709. Therefore, the tolerance is 1-.9709 = .0291. The VIF is 1/.0291 = 34.36 (the difference between 34.34 and 34.36 being rounding error). As a rule of thumb, a tolerance of 0.1 or less (equivalently VIF of 10 or greater) is a cause for concern.
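The arithmetic connecting R2, tolerance, and VIF is simple enough to verify directly. A quick sketch using the R2 reported in the regression above:

```python
def tolerance_and_vif(r_squared):
    """Tolerance = 1 - R^2 from regressing one predictor on the others;
    VIF = 1 / tolerance."""
    tol = 1 - r_squared
    return tol, 1 / tol

# R^2 = 0.9709 from regressing yxfull on full, meals, yr_rnd, and avg_ed
tol, vif = tolerance_and_vif(0.9709)
print(round(tol, 4), round(vif, 2))  # 0.0291 34.36
```

The rule of thumb quoted above, tolerance at or below 0.1, corresponds to an R2 of 0.9 or higher in this auxiliary regression, i.e. the other predictors explain at least 90% of that variable's variance.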
Now that we have seen what tolerance and VIF measure, and have established that there is a serious collinearity problem, what do we do about it? Notice that in the regression above, the variables full and yr_rnd are the only significant predictors, and the coefficient for yr_rnd is very large. This is because creating an interaction term often also creates a collinearity problem, as can be seen in the correlation output below. One way of fixing the collinearity problem is to center the variable full, as shown below. We use the sum command to obtain the mean of the variable full, and then generate a new variable called fullc, which is full minus its mean. Next, we generate the interaction of yr_rnd and fullc, called yxfc. Finally, we run the logit command with fullc and yxfc as predictors instead of full and yxfull. Remember that if you use a centered variable as a predictor, you should create any necessary interaction terms using the centered version of that variable (rather than the uncentered version).
corr yxfull yr_rnd full
(obs=1200)
| yxfull yr_rnd full
-------------+---------------------------
yxfull | 1.0000
yr_rnd | 0.9810 1.0000
full | -0.1449 -0.2387 1.0000
sum full
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
full | 1200 88.12417 13.39733 13 100
gen fullc=full-r(mean)
gen yxfc=yr_rnd*fullc
corr yxfc yr_rnd fullc
(obs=1200)
| yxfc yr_rnd fullc
-------------+---------------------------
yxfc | 1.0000
yr_rnd | -0.3910 1.0000
fullc | 0.5174 -0.2387 1.0000
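Why centering helps can be reproduced with a small pure-Python check on made-up numbers (not the apilog data). Because full is a percentage far from zero, yr_rnd*full is essentially a rescaled copy of yr_rnd: it is zero exactly when yr_rnd is zero and large whenever yr_rnd is one. Subtracting the mean of full first breaks that pattern:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical data: yr_rnd is 0/1, full is a percentage well above zero.
yr_rnd = [0, 1, 0, 1, 1, 0, 1, 0]
full   = [95, 70, 88, 60, 75, 99, 85, 92]
mean_full = sum(full) / len(full)

yxfull = [y * f for y, f in zip(yr_rnd, full)]               # uncentered
yxfc   = [y * (f - mean_full) for y, f in zip(yr_rnd, full)]  # centered

print(round(pearson(yxfull, yr_rnd), 3))  # near 1: severe collinearity
print(round(pearson(yxfc, yr_rnd), 3))    # much smaller in magnitude
```

This mirrors the two correlation matrices above, where the correlation between the interaction term and yr_rnd falls from 0.9810 to -0.3910 after centering; centering changes the correlations among the predictors without changing what the model can fit.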
logit hiqual avg_ed yr_rnd meals fullc yxfc, nolog or
Logit estimates Number of obs = 1158
LR chi2(5) = 933.71
Prob > chi2 = 0.0000
Log likelihood = -263.83452 Pseudo R2 = 0.6389
------------------------------------------------------------------------------
hiqual | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avg_ed | 7.163138 2.041592 6.91 0.000 4.097315 12.52297
yr_rnd | .5778193 .2126551 -1.49 0.136 .280882 1.188667
meals | .9240607 .0073503 -9.93 0.000 .9097661 .93858
fullc | 1.051269 .0152644 3.44 0.001 1.021773 1.081617
yxfc | .8755202 .0284632 -4.09 0.000 .8214734 .9331228
------------------------------------------------------------------------------
collin hiqual avg_ed yr_rnd meals fullc yxfc
Collinearity Diagnostics
SQRT Cond
Variable VIF VIF Tolerance Eigenval Index
-------------------------------------------------------------
hiqual 2.40 1.55 0.4173 3.1467 1.0000
avg_ed 3.46 1.86 0.2892 1.2161 1.6086
yr_rnd 1.24 1.12 0.8032 0.7789 2.0100
meals 4.46 2.11 0.2241 0.4032 2.7938
fullc 1.72 1.31 0.5816 0.3044 3.2153
yxfc 1.54 1.24 0.6488 0.1508 4.5685
-------------------------------------------------------------
Mean VIF 2.47 Condition Number 4.5685
Comparing the correlation matrices displayed before and after the centering shows how much change the centering has produced: the correlation between the interaction term and yr_rnd drops from 0.9810 to -0.3910. Centering the variable full has fixed the collinearity problem in this case, and our model fits well overall. The variable yr_rnd is no longer a significant predictor, but the interaction term between yr_rnd and full is. Because we are able to keep all of the predictors in our model, it is easy to interpret the effect of each of them.

This centering method is a special case of a transformation of the variables. Transformation of the variables is the best remedy for multicollinearity when it works, since we don't lose any variables from our model; however, apart from straightforward transformations such as centering, the choice of transformation is often difficult to make, and a transformation is most useful when it makes sense in terms of the model, so that the results remain interpretable. Other commonly suggested remedies include deleting some of the variables and increasing the sample size to get more information. The first is not always a good option, as it might lead to a misspecified model, and the second is not always possible. We refer our readers to Berry and Feldman, Multiple Regression in Practice (1985, pp. 46-50), for a more detailed discussion of remedies for collinearity.