OP: wilfrid613

[General statistics question] In a zero-inflated negative binomial regression, how do I test for multicollinearity among the variables?


In a zero-inflated negative binomial regression, how do I test for multicollinearity among the variables? What is the Stata command? I typed vif and estat vif and neither works...

Thanks, everyone!


Keywords: multicollinearity, collinearity, estat, Stata, model

Reply #1
jerker, 2015-4-25 17:48:01
This thread should help you:
https://bbs.pinggu.org/thread-3679732-1-1.html

Reply #2
论坛答疑 (forum Q&A staff), 2015-4-25 21:43:34
(quoting jerker's reply above)
Thanks to moderator jerker for the helpful pointer. This has been logged, and an expert answer will follow shortly.

Reply #3
jerker, 2015-4-25 22:32:56
(quoting 论坛答疑's reply above)

Reply #4
蓝色, 2015-4-25 22:41:35
A similar approach applies here; read through this carefully:
http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm

3.3 Multicollinearity

Multicollinearity (or collinearity for short) occurs when two or more independent variables in the model are approximately determined by a linear combination of other independent variables in the model. For example, we would have a problem with multicollinearity if we had both height measured in inches and height measured in feet in the same model. The degree of multicollinearity can vary and can have different effects on the model. When perfect collinearity occurs, that is, when one independent variable is a perfect linear combination of the others, it is impossible to obtain a unique estimate of the regression coefficients with all the independent variables in the model. What Stata does in this case is drop a variable that is a perfect linear combination of the others, leaving only variables that are not exact linear combinations of the others, to ensure unique estimates of the regression coefficients.

For example, we can artificially create a new variable called perli as the sum of yr_rnd and meals; the only purpose of this example and of the variable perli is to show what Stata does when perfect collinearity occurs. Notice that Stata issues a note informing us that the variable yr_rnd has been dropped from the model due to collinearity. We cannot assume that the variable Stata drops is the "correct" variable to omit from the model; rather, we need to rely on theory to determine which variable should be omitted.

use http://www.ats.ucla.edu/stat/Stata/webbooks/logistic/apilog, clear
gen perli=yr_rnd+meals
logit hiqual perli meals yr_rnd


note: yr_rnd dropped due to collinearity
(Iterations omitted.)

Logit estimates                                   Number of obs   =       1200
                                                  LR chi2(2)      =     898.30
                                                  Prob > chi2     =     0.0000
Log likelihood = -308.27755                       Pseudo R2       =     0.5930

------------------------------------------------------------------------------
      hiqual |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       perli |  -.9908119   .3545667    -2.79   0.005     -1.68575   -.2958739
       meals |   .8833963   .3542845     2.49   0.013     .1890113    1.577781
       _cons |    3.61557   .2418967    14.95   0.000     3.141462    4.089679
------------------------------------------------------------------------------
Moderate multicollinearity is fairly common since any correlation among the independent variables is an indication of collinearity. When severe multicollinearity occurs, the standard errors for the coefficients tend to be very large (inflated), and sometimes the estimated logistic regression coefficients can be highly unreliable. Let's consider the following example. In this model, the dependent variable will be hiqual, and the predictor variables will include avg_ed, yr_rnd, meals, full, and the interaction between yr_rnd and full, yxfull. After the logit procedure, we will also run a goodness-of-fit test. Notice that the goodness-of-fit test indicates that, overall, our model fits pretty well.

gen yxfull=  yr_rnd*full
logit  hiqual avg_ed yr_rnd meals full yxfull, nolog or

Logit estimates                                   Number of obs   =       1158
                                                  LR chi2(5)      =     933.71
                                                  Prob > chi2     =     0.0000
Log likelihood = -263.83452                       Pseudo R2       =     0.6389

------------------------------------------------------------------------------
      hiqual | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avg_ed |   7.163138   2.041592     6.91   0.000     4.097315    12.52297
      yr_rnd |   70719.31     208020     3.80   0.000     221.6864    2.26e+07
       meals |   .9240607   .0073503    -9.93   0.000     .9097661      .93858
        full |   1.051269   .0152644     3.44   0.001     1.021773    1.081617
      yxfull |   .8755202   .0284632    -4.09   0.000     .8214734    .9331228
------------------------------------------------------------------------------
lfit, group(10)
Logistic model for hiqual, goodness-of-fit test
  (Table collapsed on quantiles of estimated probabilities)
       number of observations =      1158
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =         5.50
                  Prob > chi2 =         0.7034
Nevertheless, notice that the odds ratio and standard error for the variable yr_rnd are incredibly high. Apparently something went wrong. A direct cause of the incredibly large odds ratio and very large standard error is multicollinearity among the independent variables. We can use a user-written program called collin to detect the multicollinearity. You can download the program from the ATS website of Stata programs for teaching and research (in Stata, findit collin will locate it).
collin avg_ed yr_rnd meals full yxfull

  Collinearity Diagnostics

                        SQRT                           Cond
  Variable       VIF    VIF    Tolerance  Eigenval     Index
-------------------------------------------------------------
    avg_ed      3.28    1.81    0.3050     2.7056     1.0000
    yr_rnd     35.53    5.96    0.0281     1.4668     1.3581
     meals      3.80    1.95    0.2629     0.6579     2.0279
      full      1.72    1.31    0.5819     0.1554     4.1728
    yxfull     34.34    5.86    0.0291     0.0144    13.7284
-------------------------------------------------------------
  Mean VIF     15.73              Condition Number   13.7284
All the measures in the above output gauge the strength of the interrelationships among the variables. Two commonly used measures are tolerance (an indicator of how much collinearity a regression analysis can tolerate) and VIF (the variance inflation factor, an indicator of how much of the inflation of the standard error could be caused by collinearity). The tolerance for a particular variable is 1 minus the R2 from regressing that variable on the other variables; the corresponding VIF is simply 1/tolerance. If all of the variables are orthogonal to each other, in other words completely uncorrelated with each other, both the tolerance and the VIF are 1. If a variable is very closely related to other variable(s), the tolerance goes to 0 and the variance inflation gets very large. For example, in the output above, the tolerance and VIF for the variable yxfull are 0.0291 and 34.34, respectively. We can reproduce these results by running the corresponding regression.
regress  yxfull full meals yr_rnd avg_ed

      Source |       SS       df       MS              Number of obs =    1158
-------------+------------------------------           F(  4,  1153) = 9609.80
       Model |  1128915.43     4  282228.856           Prob > F      =  0.0000
    Residual |  33862.2808  1153  29.3688472           R-squared     =  0.9709
-------------+------------------------------           Adj R-squared =  0.9708
       Total |  1162777.71  1157   1004.9937           Root MSE      =  5.4193

------------------------------------------------------------------------------
      yxfull |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        full |   .2313279   .0140312    16.49   0.000     .2037983    .2588574
       meals |    -.00088   .0099863    -0.09   0.930    -.0204733    .0187134
      yr_rnd |   83.10644   .4408941   188.50   0.000      82.2414    83.97149
      avg_ed |  -.4611434   .3744277    -1.23   0.218    -1.195779    .2734925
       _cons |  -19.38205   2.100101    -9.23   0.000     -23.5025    -15.2616
------------------------------------------------------------------------------
Notice that the R2 is .9709. Therefore, the tolerance is 1-.9709 = .0291. The VIF is 1/.0291 =  34.36 (the difference between 34.34 and 34.36 being rounding error). As a rule of thumb, a tolerance of 0.1 or less (equivalently VIF of 10 or greater)  is a cause for concern.
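This arithmetic can also be reproduced directly from Stata's stored results right after the auxiliary regression (a small sketch; e(r2) is the R-squared that regress saves):

```stata
* Run immediately after: regress yxfull full meals yr_rnd avg_ed
display "tolerance = " 1 - e(r2)        // 1 - .9709 = .0291
display "VIF       = " 1 / (1 - e(r2))  // about 34.4
```
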
Now that we have seen what tolerance and VIF measure, and we are convinced there is a serious collinearity problem, what do we do about it? Notice that in the above regression the variables full and yr_rnd are the only significant predictors, and the coefficient for yr_rnd is very large. This is because oftentimes, when we create an interaction term, we also create a collinearity problem. This can be seen in the correlation output below. One way of fixing the collinearity problem is to center the variable full, as shown below. We use the sum command to obtain the mean of the variable full, and then generate a new variable called fullc, which is full minus its mean. Next, we generate the interaction of yr_rnd and fullc, called yxfc. Finally, we run the logit command with fullc and yxfc as predictors instead of full and yxfull. Remember that if you use a centered variable as a predictor, you should create any necessary interaction terms from the centered version of that variable (rather than the uncentered version).

corr yxfull yr_rnd full
(obs=1200)

             |   yxfull   yr_rnd     full
-------------+---------------------------
      yxfull |   1.0000
      yr_rnd |   0.9810   1.0000
        full |  -0.1449  -0.2387   1.0000
        
sum full

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        full |    1200    88.12417   13.39733         13        100

gen fullc=full-r(mean)
gen yxfc=yr_rnd*fullc
corr yxfc  yr_rnd fullc
(obs=1200)

             |     yxfc   yr_rnd    fullc
-------------+---------------------------
        yxfc |   1.0000
      yr_rnd |  -0.3910   1.0000
       fullc |   0.5174  -0.2387   1.0000


logit  hiqual avg_ed yr_rnd meals fullc yxfc, nolog or

Logit estimates                                   Number of obs   =       1158
                                                  LR chi2(5)      =     933.71
                                                  Prob > chi2     =     0.0000
Log likelihood = -263.83452                       Pseudo R2       =     0.6389

------------------------------------------------------------------------------
      hiqual | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avg_ed |   7.163138   2.041592     6.91   0.000     4.097315    12.52297
      yr_rnd |   .5778193   .2126551    -1.49   0.136      .280882    1.188667
       meals |   .9240607   .0073503    -9.93   0.000     .9097661      .93858
       fullc |   1.051269   .0152644     3.44   0.001     1.021773    1.081617
        yxfc |   .8755202   .0284632    -4.09   0.000     .8214734    .9331228
------------------------------------------------------------------------------

collin hiqual avg_ed yr_rnd meals fullc yxfc

  Collinearity Diagnostics

                        SQRT                           Cond
  Variable       VIF    VIF    Tolerance  Eigenval     Index
-------------------------------------------------------------
    hiqual      2.40    1.55    0.4173     3.1467     1.0000
    avg_ed      3.46    1.86    0.2892     1.2161     1.6086
    yr_rnd      1.24    1.12    0.8032     0.7789     2.0100
     meals      4.46    2.11    0.2241     0.4032     2.7938
     fullc      1.72    1.31    0.5816     0.3044     3.2153
      yxfc      1.54    1.24    0.6488     0.1508     4.5685
-------------------------------------------------------------
  Mean VIF      2.47              Condition Number    4.5685  
Comparing the correlation matrices before and after centering (shown in the corr output above), notice how much change the centering has produced. Centering the variable full has fixed the collinearity problem, and our model fits well overall. The variable yr_rnd is no longer a significant predictor, but the interaction term between yr_rnd and full is. Because we can keep all the predictors in our model, the effect of each predictor remains easy to interpret. This centering method is a special case of transforming the variables. Transformation is the best remedy for multicollinearity when it works, since we don't lose any variables from the model; but apart from straightforward choices such as centering, a good transformation is often hard to find, and it should make sense in terms of the model so that the results stay interpretable. Other commonly suggested remedies include deleting some of the variables and increasing the sample size to get more information. The first is not always a good option, as it might lead to a misspecified model, and the second is not always possible. We refer readers to Berry and Feldman (Multiple Regression in Practice, 1985, pp. 46-50) for a more detailed discussion of remedies for collinearity.
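Coming back to the original zinb question: collinearity is a property of the predictor (X) matrix alone, not of the link function or count model, so the same diagnostics apply regardless of whether you fit logit, zinb, or anything else. A sketch with placeholder variable names (y, x1-x3 stand in for your own variables):

```stata
* Option 1: the user-written collin command (install via: findit collin).
* It takes a varlist directly, so no model needs to be fit first.
collin x1 x2 x3

* Option 2: estat vif is only available after regress, which is why it
* fails after zinb. Run an auxiliary linear regression on the same
* predictors purely to obtain the VIF diagnostics.
quietly regress y x1 x2 x3
estat vif
```

Either route answers the OP's question: the zinb fit itself is irrelevant to the VIF computation.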


Reply #5
peyzf, 2015-8-5 10:17:29
Still learning from this~
