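The lm() function for linear regression in R has a parameter, na.action, that controls how missing values are handled. According to the documentation, its default is taken from the na.action setting in R's options(), which is normally na.omit. As I understand it, that means cases with missing values are ignored. But when I manually picked out the cases without NA values and ran the regression on them, the fitted equation differed from the one I got with the full data. Could anyone point out the reason? I'll send a reward afterwards. Thanks in advance!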
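- #### Uses the sleep dataset from the VIM package
- library(VIM)
- data(sleep, package = "VIM")  # make sure this is VIM's sleep, not datasets::sleep
- ### No preprocessing of the data, regress directly
- lm_res <- lm(BodyWgt ~ BrainWgt + NonD + Dream, data = sleep, na.action = na.omit)
- summary(lm_res)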
The result is as follows:
- Call:
- lm(formula = BodyWgt ~ BrainWgt + NonD + Dream, data = sleep,
- na.action = na.omit)
- Residuals:
-     Min      1Q  Median      3Q     Max
- -619.28   -3.23    8.17   19.86  242.61
- Coefficients:
-             Estimate Std. Error t value Pr(>|t|)
- (Intercept) 12.45354   47.21972   0.264    0.793
- BrainWgt     0.51703    0.02615  19.771   <2e-16 ***
- NonD        -3.92827    5.69966  -0.689    0.494
- Dream        5.42715   13.49113   0.402    0.689
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 113.6 on 44 degrees of freedom
- (14 observations deleted due to missingness)
- Multiple R-squared: 0.9149, Adjusted R-squared: 0.9091
- F-statistic: 157.8 on 3 and 44 DF, p-value: < 2.2e-16
Next, I used complete.cases() to pick out the cases with no missing data and ran the linear regression on those rows only. The code and the result are as follows:
- # Keep only the rows that have no NA in ANY column of sleep
- sleep_complete <- sleep[complete.cases(sleep), ]
- lm_complete <- lm(BodyWgt ~ BrainWgt + NonD + Dream, data = sleep_complete)
- summary(lm_complete)
- Call:
- lm(formula = BodyWgt ~ BrainWgt + NonD + Dream, data = sleep_complete)
- Residuals:
-     Min      1Q  Median      3Q     Max
- -618.56   -4.46   10.30   23.98  242.33
- Coefficients:
-             Estimate Std. Error t value Pr(>|t|)
- (Intercept)  13.1329    53.2917   0.246    0.807
- BrainWgt      0.5173     0.0286  18.090   <2e-16 ***
- NonD         -3.7833     6.3681  -0.594    0.556
- Dream         4.0130    16.2712   0.247    0.807
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 122.2 on 38 degrees of freedom
- Multiple R-squared: 0.9145, Adjusted R-squared: 0.9077
- F-statistic: 135.4 on 3 and 38 DF, p-value: < 2.2e-16
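Comparing the two regressions, it is easy to see that the results do not agree. To me that suggests na.omit does not remove every case that contains an NA value. So what exactly does na.omit do with the NA values? Any guidance from the folks on this forum would be much appreciated!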
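One thing worth checking is how many rows each approach actually uses. The first summary reports 44 residual degrees of freedom (so 48 rows with 4 coefficients), while the second reports 38 (so 42 rows). A plausible explanation is that lm() applies na.omit only to the variables named in the formula, whereas complete.cases() on the whole data frame also drops rows that have NA in columns the model never uses. The sketch below illustrates that difference on a small made-up data frame; the names toy, extra, and the fit_* objects are hypothetical and are not part of the sleep data.

```r
# Hypothetical toy data: 'extra' is NOT in the model formula but contains NAs.
toy <- data.frame(
  y     = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1),
  x1    = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2    = c(2, NA, 1, 3, 2, 5, 4, 6),
  extra = c(NA, 1, NA, 1, 1, 1, 1, 1)
)

# lm() applies na.action to the model frame, i.e. to y, x1 and x2 only.
fit_all <- lm(y ~ x1 + x2, data = toy, na.action = na.omit)
n_lm <- nobs(fit_all)                      # rows actually used by lm()

# complete.cases() on the whole data frame also looks at 'extra'.
n_all_cols <- sum(complete.cases(toy))

# Restricting complete.cases() to the model variables mirrors what lm() did.
model_vars <- all.vars(formula(fit_all))   # "y" "x1" "x2"
toy_model  <- toy[complete.cases(toy[, model_vars]), ]
fit_model  <- lm(y ~ x1 + x2, data = toy_model)

cat("rows used by lm():                 ", n_lm, "\n")
cat("rows complete in all columns:      ", n_all_cols, "\n")
cat("rows complete in model variables:  ", nrow(toy_model), "\n")
print(all.equal(coef(fit_all), coef(fit_model)))
```

If the same pattern holds for sleep, calling complete.cases() on only the four model columns (BodyWgt, BrainWgt, NonD, Dream) should reproduce the first regression.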