| | y=0 | y=1 |
|---|---|---|
| x=0 | 54 | 6628 |
| x=1 | 2 | 3316 |
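As you can imagine, a distribution like this makes parameter estimation very hard for the computer: even if x and y are truly associated, the model cannot recover the effect. Here the two y=0 cells hold only 54 and 2 of the 10,000 observations (0.54% and 0.02%). As a rule of thumb, each cell should contain no less than 5% of the total sample.
The fix is fairly simple: lower the mean of the latent variable behind y so that x and y are more evenly distributed. See the code below: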
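The 5% rule of thumb can be checked directly from the cell counts. The following is a minimal sketch that re-enters the counts from the table above by hand; the names `tab`, `cell_share`, and `sparse` are illustrative and not part of the original post.

```r
# Cell counts copied from the cross-tabulation above (rows: x, columns: y)
tab <- matrix(c(54, 6628,
                 2, 3316),
              nrow = 2, byrow = TRUE,
              dimnames = list(x = c("0", "1"), y = c("0", "1")))

# Share of the total sample that falls in each cell
cell_share <- prop.table(tab)
print(round(cell_share, 4))

# Flag cells below the 5% rule of thumb
sparse <- cell_share < 0.05
print(sparse)
```

With a data frame at hand, the same check would start from `table(probit_data$x, probit_data$y)` instead of a hand-typed matrix.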
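- ### Step 0
- set.seed(129)
- g1 <- rbinom(10000, 1, 0.9)            # binary variable, P(g1 = 1) = 0.9
- c1 <- rnorm(10000, mean = 10, sd = 2)  # continuous covariate
- c2 <- rbinom(10000, 2, 0.4)            # count covariate taking values 0, 1, 2
- u <- rnorm(10000, mean = 10, sd = 2)   # enters both the x and the y equation
- exi <- rnorm(10000, mean = 0, sd = 1)  # error term of the x equation
- eyi <- rnorm(10000, mean = 0, sd = 1)  # error term of the y equation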
- ### Step 1
- # Latent variable for x; x = 1 when it is positive, 0 otherwise
- probit_x <- 0.3*g1 + 0.2*c1 - 0.3*c2 + u + exi - 13
- probit_data <- as.data.frame(cbind(g1, c1, c2, u, probit_x))
- probit_data$x <- ifelse(probit_x > 0, 1, 0)
- anorex.1 <- glm(x ~ g1 + c1 + c2 + u,
- family = binomial(link = "probit"), data = probit_data)
- summary(anorex.1)
- ### Step 2
- x <- probit_data$x # x is a 0/1 variable
- probit_y <- 0.3*x + 2*c1 - 4*c2 + u + eyi - 25 # intercept lowered so the latent mean sits near 0
- # probit_y <- 0.3*x + 2*c1 - 4*c2 + u + eyi - 13 # original: latent mean far above 0, about 0.1 + 20 - 3.2 + 10 + 0 - 13 = 13.9
- # hist(probit_y)
- # table(probit_y > 0)
- # probit_data$probit_y = NULL
- # probit_data$y = NULL
- probit_data$probit_y <- probit_y
- probit_data$y <- ifelse(probit_data$probit_y > 0, 1, 0)
- # table(probit_data$x, probit_data$y) # this is where the problem shows up: too few observations with y = 0
- anorex.2 <- glm(y ~ x + c1 + c2 + u,
- family = binomial(link = "probit"), data = probit_data)
- summary(anorex.2)
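Why moving the intercept from -13 to -25 helps can be seen with a back-of-the-envelope calculation. The sketch below is only an approximation under stated assumptions: it treats the latent variable as roughly normal, ignores the small contribution of x to the variance and its correlation with the other terms, and takes the share of x = 1 as about 0.33 from the cross-tabulation. The names `latent_mean`, `latent_sd`, and `p_x` are illustrative and do not appear in the original code.

```r
# Population moments implied by the simulation settings
mean_c1 <- 10;      var_c1 <- 2^2              # c1 ~ N(10, sd = 2)
mean_c2 <- 2 * 0.4; var_c2 <- 2 * 0.4 * 0.6    # c2 ~ Binomial(2, 0.4)
mean_u  <- 10;      var_u  <- 2^2              # u ~ N(10, sd = 2)
var_e   <- 1                                   # eyi ~ N(0, 1)
p_x     <- 0.33  # assumption: approximate share of x = 1 in the table

# Expected value of the latent variable for a given intercept
latent_mean <- function(intercept) {
  0.3 * p_x + 2 * mean_c1 - 4 * mean_c2 + mean_u + intercept
}

# Approximate standard deviation (x's contribution ignored)
latent_sd <- sqrt(2^2 * var_c1 + 4^2 * var_c2 + var_u + var_e)

for (intercept in c(-13, -25)) {
  m <- latent_mean(intercept)
  p_y1 <- pnorm(m / latent_sd)  # normal approximation to P(latent > 0)
  cat(sprintf("intercept %g: latent mean = %.2f, approx P(y = 1) = %.3f\n",
              intercept, m, p_y1))
}
```

Under these assumptions the -13 intercept puts nearly all observations at y = 1, which agrees with the 9,944 of 10,000 in the table, while -25 moves the split much closer to balance.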









