http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm
A similar approach is used here; it is worth reading through carefully.
3.3 Multicollinearity
Multicollinearity (or collinearity for short) occurs when two or more independent variables in the model are approximately determined by a linear combination of the other independent variables in the model. For example, we would have a problem with multicollinearity if we had both height measured in inches and height measured in feet in the same model. The degree of multicollinearity can vary and can have different effects on the model. When perfect collinearity occurs, that is, when one independent variable is an exact linear combination of the others, it is impossible to obtain a unique estimate of the regression coefficients with all the independent variables in the model. What Stata does in this case is drop a variable that is a perfect linear combination of the others, leaving in the model only variables that are not exact linear combinations of one another, so that the remaining coefficients are uniquely estimable.

For example, we can artificially create a new variable called perli as the sum of yr_rnd and meals. The only purpose of this example, and of the variable perli, is to show what Stata does when perfect collinearity occurs. Notice that Stata issues a note informing us that the variable yr_rnd has been dropped from the model due to collinearity. We cannot assume that the variable Stata drops is the "correct" variable to omit from the model; rather, we need to rely on theory to determine which variable should be omitted.
use http://www.ats.ucla.edu/stat/Stata/webbooks/logistic/apilog, clear
gen perli=yr_rnd+meals
logit hiqual perli meals yr_rnd
note: yr_rnd dropped due to collinearity
(Iterations omitted.)
Logit estimates Number of obs = 1200
LR chi2(2) = 898.30
Prob > chi2 = 0.0000
Log likelihood = -308.27755 Pseudo R2 = 0.5930
------------------------------------------------------------------------------
hiqual | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
perli | -.9908119 .3545667 -2.79 0.005 -1.68575 -.2958739
meals | .8833963 .3542845 2.49 0.013 .1890113 1.577781
_cons | 3.61557 .2418967 14.95 0.000 3.141462 4.089679
------------------------------------------------------------------------------
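Why a unique estimate is impossible can be checked outside of Stata. The sketch below (pure Python, with made-up 0/1 and percent values rather than the apilog data) builds a small design matrix containing a constant, yr_rnd, meals, and perli = yr_rnd + meals, and shows that the matrix is rank-deficient: it has fewer independent columns than variables, so one column must be dropped before coefficients can be uniquely estimated.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]
    rank = 0
    n_rows, n_cols = len(m), len(m[0])
    for col in range(n_cols):
        # pick the row (at or below the current pivot row) with the largest entry
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # column is (numerically) dependent on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(n_rows):
            if r != rank and abs(m[r][col]) > tol:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

# Hypothetical values, not the apilog data:
yr_rnd = [0, 1, 0, 1, 1]
meals  = [30, 75, 50, 90, 60]
perli  = [a + b for a, b in zip(yr_rnd, meals)]  # perfect linear combination

# constant + 3 predictors = 4 columns, but only 3 are independent
X = [[1, y, m, p] for y, m, p in zip(yr_rnd, meals, perli)]
print(matrix_rank(X))  # 3, one short of the 4 columns
```

This is exactly the situation Stata detects before dropping yr_rnd: no matter which of the dependent columns is removed, the fit of the model is unchanged, which is why theory rather than software should decide which variable to omit.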
Moderate multicollinearity is fairly common since any correlation among the independent variables is an indication of collinearity. When severe multicollinearity occurs, the standard errors for the coefficients tend to be very large (inflated), and sometimes the estimated logistic regression coefficients can be highly unreliable. Let's consider the following example. In this model, the dependent variable will be hiqual, and the predictor variables will include avg_ed, yr_rnd, meals, full, and the interaction between yr_rnd and full, yxfull. After the logit procedure, we will also run a goodness-of-fit test. Notice that the goodness-of-fit test indicates that, overall, our model fits pretty well.
gen yxfull= yr_rnd*full
logit hiqual avg_ed yr_rnd meals full yxfull, nolog or
Logit estimates Number of obs = 1158
LR chi2(5) = 933.71
Prob > chi2 = 0.0000
Log likelihood = -263.83452 Pseudo R2 = 0.6389
------------------------------------------------------------------------------
hiqual | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avg_ed | 7.163138 2.041592 6.91 0.000 4.097315 12.52297
yr_rnd | 70719.31 208020 3.80 0.000 221.6864 2.26e+07
meals | .9240607 .0073503 -9.93 0.000 .9097661 .93858
full | 1.051269 .0152644 3.44 0.001 1.021773 1.081617
yxfull | .8755202 .0284632 -4.09 0.000 .8214734 .9331228
------------------------------------------------------------------------------
lfit, group(10)
Logistic model for hiqual, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
number of observations = 1158
number of groups = 10
Hosmer-Lemeshow chi2(8) = 5.50
Prob > chi2 = 0.7034
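The lfit, group(10) statistic above is the Hosmer-Lemeshow test: observations are sorted into deciles of predicted probability, and in each group the observed number of successes is compared with the number expected under the model. A minimal Python sketch of the arithmetic, using hypothetical per-decile summaries (not the apilog data):

```python
def hosmer_lemeshow(groups):
    """Hosmer-Lemeshow chi-square statistic.

    groups: list of (n_g, observed_successes, mean_predicted_probability),
    one tuple per decile. Degrees of freedom = number of groups - 2.
    """
    chi2 = 0.0
    for n, obs, p in groups:
        expected = n * p
        # denominator n*p*(1-p) is the binomial variance within the group
        chi2 += (obs - expected) ** 2 / (expected * (1 - p))
    return chi2

# Hypothetical decile summary: (group size, observed successes, mean p-hat)
groups = [(100, 3, 0.02), (100, 5, 0.06), (100, 11, 0.12),
          (100, 22, 0.25), (100, 41, 0.40), (100, 58, 0.60),
          (100, 73, 0.75), (100, 85, 0.86), (100, 94, 0.93),
          (100, 99, 0.98)]
stat = hosmer_lemeshow(groups)
print(round(stat, 2))  # small chi2 on 8 df: no evidence of poor fit
```

With 10 groups the statistic is referred to a chi-square distribution with 8 degrees of freedom, matching the chi2(8) label in the Stata output; a large p-value, as here, means the test finds no evidence of lack of fit.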
Nevertheless, notice that the odds ratio and standard error for the variable yr_rnd are incredibly high. Apparently something went wrong. The direct cause of the incredibly large odds ratio and very large standard error is multicollinearity among the independent variables. We can use a program called collin to detect it. You can download the program from the ATS website of Stata programs for teaching and research, or locate it from within Stata by typing findit collin.
collin avg_ed yr_rnd meals full yxfull
Collinearity Diagnostics
SQRT Cond
Variable VIF VIF Tolerance Eigenval Index
-------------------------------------------------------------
avg_ed 3.28 1.81 0.3050 2.7056 1.0000
yr_rnd 35.53 5.96 0.0281 1.4668 1.3581
meals 3.80 1.95 0.2629 0.6579 2.0279
full 1.72 1.31 0.5819 0.1554 4.1728
yxfull 34.34 5.86 0.0291 0.0144 13.7284
-------------------------------------------------------------
Mean VIF 15.73 Condition Number 13.7284
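The condition indices in the last column come from the eigenvalues of the scaled cross-product matrix: each index is the square root of the largest eigenvalue divided by that eigenvalue, and the condition number is the largest index. The collin output is computed from the full design matrix, but the idea is easiest to see in the two-predictor case, where the scaled matrix is [[1, r], [r, 1]] with eigenvalues 1 + |r| and 1 - |r|. A small sketch (the 0.981 below is the yxfull/yr_rnd correlation reported later in this chapter, used only to illustrate the formula, so the result will not match collin's multi-variable condition number):

```python
from math import sqrt

def condition_number_2x2(r):
    """Condition number for two standardized predictors with correlation r.

    The scaled X'X matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and
    1 - |r|; the condition number is sqrt(largest / smallest eigenvalue).
    """
    lam_max, lam_min = 1 + abs(r), 1 - abs(r)
    return sqrt(lam_max / lam_min)

print(round(condition_number_2x2(0.0), 2))    # uncorrelated predictors: 1.0
print(round(condition_number_2x2(0.981), 2))  # near-collinear pair: large
```

As r approaches 1 the smallest eigenvalue approaches 0 and the condition number blows up, which is the eigenvalue-based signature of the same problem that the VIF measures.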
All of the measures in the above output gauge the strength of the interrelationships among the variables. Two commonly used measures are tolerance (an indicator of how much collinearity a regression analysis can tolerate) and the VIF (variance inflation factor, an indicator of how much of the inflation of the standard error could be caused by collinearity). The tolerance for a particular variable is 1 minus the R2 that results from regressing that variable on all of the other independent variables. The corresponding VIF is simply 1/tolerance. If all of the variables are orthogonal to each other, in other words, completely uncorrelated with each other, both the tolerance and the VIF are 1. If a variable is very closely related to the other variable(s), the tolerance goes to 0 and the variance inflation factor gets very large. For example, in the output above, we see that the tolerance and VIF for the variable yxfull are 0.0291 and 34.34, respectively. We can reproduce these results by running the corresponding regression.
regress yxfull full meals yr_rnd avg_ed
Source | SS df MS Number of obs = 1158
-------------+------------------------------ F( 4, 1153) = 9609.80
Model | 1128915.43 4 282228.856 Prob > F = 0.0000
Residual | 33862.2808 1153 29.3688472 R-squared = 0.9709
-------------+------------------------------ Adj R-squared = 0.9708
Total | 1162777.71 1157 1004.9937 Root MSE = 5.4193
------------------------------------------------------------------------------
yxfull | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
full | .2313279 .0140312 16.49 0.000 .2037983 .2588574
meals | -.00088 .0099863 -0.09 0.930 -.0204733 .0187134
yr_rnd | 83.10644 .4408941 188.50 0.000 82.2414 83.97149
avg_ed | -.4611434 .3744277 -1.23 0.218 -1.195779 .2734925
_cons | -19.38205 2.100101 -9.23 0.000 -23.5025 -15.2616
------------------------------------------------------------------------------
Notice that the R2 is .9709. Therefore, the tolerance is 1-.9709 = .0291. The VIF is 1/.0291 = 34.36 (the difference between 34.34 and 34.36 being rounding error). As a rule of thumb, a tolerance of 0.1 or less (equivalently VIF of 10 or greater) is a cause for concern.
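The arithmetic connecting R2, tolerance, and VIF is simple enough to verify directly. A quick sketch using the R2 reported in the regression above:

```python
def tolerance_and_vif(r_squared):
    """Tolerance = 1 - R^2 from regressing one predictor on the others;
    VIF = 1 / tolerance."""
    tol = 1 - r_squared
    return tol, 1 / tol

# R^2 = 0.9709 from regressing yxfull on full, meals, yr_rnd, and avg_ed
tol, vif = tolerance_and_vif(0.9709)
print(round(tol, 4), round(vif, 2))  # 0.0291 34.36
```

The rule of thumb quoted above, tolerance at or below 0.1, corresponds to an R2 of 0.9 or higher in this auxiliary regression, i.e. the other predictors explain at least 90% of that variable's variance.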
Now that we have seen what tolerance and VIF measure, and have established that there is a serious collinearity problem, what do we do about it? Notice that in the regression above, the variables full and yr_rnd are the only significant predictors, and the coefficient for yr_rnd is very large. This is because creating an interaction term often also creates a collinearity problem, as can be seen in the correlation output below. One way of fixing the collinearity problem is to center the variable full, as shown below. We use the sum command to obtain the mean of the variable full, and then generate a new variable called fullc, which is full minus its mean. Next, we generate the interaction of yr_rnd and fullc, called yxfc. Finally, we run the logit command with fullc and yxfc as predictors instead of full and yxfull. Remember that if you use a centered variable as a predictor, you should create any necessary interaction terms using the centered version of that variable (rather than the uncentered version).
corr yxfull yr_rnd full
(obs=1200)
| yxfull yr_rnd full
-------------+---------------------------
yxfull | 1.0000
yr_rnd | 0.9810 1.0000
full | -0.1449 -0.2387 1.0000
sum full
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
full | 1200 88.12417 13.39733 13 100
gen fullc=full-r(mean)
gen yxfc=yr_rnd*fullc
corr yxfc yr_rnd fullc
(obs=1200)
| yxfc yr_rnd fullc
-------------+---------------------------
yxfc | 1.0000
yr_rnd | -0.3910 1.0000
fullc | 0.5174 -0.2387 1.0000
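Why centering helps can be reproduced with a small pure-Python check on made-up numbers (not the apilog data). Because full is a percentage far from zero, yr_rnd*full is essentially a rescaled copy of yr_rnd: it is zero exactly when yr_rnd is zero and large whenever yr_rnd is one. Subtracting the mean of full first breaks that pattern:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical data: yr_rnd is 0/1, full is a percentage well above zero.
yr_rnd = [0, 1, 0, 1, 1, 0, 1, 0]
full   = [95, 70, 88, 60, 75, 99, 85, 92]
mean_full = sum(full) / len(full)

yxfull = [y * f for y, f in zip(yr_rnd, full)]               # uncentered
yxfc   = [y * (f - mean_full) for y, f in zip(yr_rnd, full)]  # centered

print(round(pearson(yxfull, yr_rnd), 3))  # near 1: severe collinearity
print(round(pearson(yxfc, yr_rnd), 3))    # much smaller in magnitude
```

This mirrors the two correlation matrices above, where the correlation between the interaction term and yr_rnd falls from 0.9810 to -0.3910 after centering; centering changes the correlations among the predictors without changing what the model can fit.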
logit hiqual avg_ed yr_rnd meals fullc yxfc, nolog or
Logit estimates Number of obs = 1158
LR chi2(5) = 933.71
Prob > chi2 = 0.0000
Log likelihood = -263.83452 Pseudo R2 = 0.6389
------------------------------------------------------------------------------
hiqual | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avg_ed | 7.163138 2.041592 6.91 0.000 4.097315 12.52297
yr_rnd | .5778193 .2126551 -1.49 0.136 .280882 1.188667
meals | .9240607 .0073503 -9.93 0.000 .9097661 .93858
fullc | 1.051269 .0152644 3.44 0.001 1.021773 1.081617
yxfc | .8755202 .0284632 -4.09 0.000 .8214734 .9331228
------------------------------------------------------------------------------
collin hiqual avg_ed yr_rnd meals fullc yxfc
Collinearity Diagnostics
SQRT Cond
Variable VIF VIF Tolerance Eigenval Index
-------------------------------------------------------------
hiqual 2.40 1.55 0.4173 3.1467 1.0000
avg_ed 3.46 1.86 0.2892 1.2161 1.6086
yr_rnd 1.24 1.12 0.8032 0.7789 2.0100
meals 4.46 2.11 0.2241 0.4032 2.7938
fullc 1.72 1.31 0.5816 0.3044 3.2153
yxfc 1.54 1.24 0.6488 0.1508 4.5685
-------------------------------------------------------------
Mean VIF 2.47 Condition Number 4.5685
Comparing the correlation matrices displayed before and after the centering shows how much change the centering has produced: the correlation between the interaction term and yr_rnd drops from 0.9810 to -0.3910. Centering the variable full has fixed the collinearity problem in this case, and our model fits well overall. The variable yr_rnd is no longer a significant predictor, but the interaction term between yr_rnd and full is. Because we are able to keep all of the predictors in our model, it is easy to interpret the effect of each of them.

This centering method is a special case of a transformation of the variables. Transformation of the variables is the best remedy for multicollinearity when it works, since we don't lose any variables from our model; however, apart from straightforward transformations such as centering, the choice of transformation is often difficult to make, and a transformation is most useful when it makes sense in terms of the model, so that the results remain interpretable. Other commonly suggested remedies include deleting some of the variables and increasing the sample size to get more information. The first is not always a good option, as it might lead to a misspecified model, and the second is not always possible. We refer our readers to Berry and Feldman, Multiple Regression in Practice (1985, pp. 46-50), for a more detailed discussion of remedies for collinearity.