[Q&A] Why do the sklearn and statsmodels implementations of OLS regression give different R-squared values?

Original poster

olympic posted on 2021-4-3 10:51:55


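The OLS models implemented in sklearn and statsmodels produce different R-squared values when no intercept is fitted. With an intercept, the two agree. The following code reproduces the behavior: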

import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm

np.random.seed(42)

# Simulated data: true slope 2, true intercept 4, standard normal noise.
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Fit each library with and without an intercept.
sklearn_intercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearn_no_intercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodels_intercept = sm.OLS(Y, sm.add_constant(X)).fit()
statsmodels_no_intercept = sm.OLS(Y, X).fit()

# R-squared from sklearn (score) and statsmodels (rsquared), side by side.
print(sklearn_intercept.score(X, Y), statsmodels_intercept.rsquared)
print(sklearn_no_intercept.score(X, Y), statsmodels_no_intercept.rsquared)

print(sklearn.__version__, statsmodels.__version__)

Where does the difference come from?


Reply #1

yunnandlg

posted on 2021-4-6 13:55:20

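The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In other words, R2 can be negative: a negative score appears when the model's predictions fit the data worse than simply predicting the mean of the observed outputs.
score(X, y[, sample_weight]) is defined as (1 - u/v), where
u = ((y_true - y_pred) ** 2).sum(),
v = ((y_true - y_true.mean()) ** 2).sum()
The best score is 1.0; scores are usually below 1.0, and the lower the score, the worse the fit.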
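The (1 - u/v) definition can be checked by hand. The following is a minimal sketch in plain Python, not sklearn's own code; the function name `r2_score_centered` and the toy data are assumptions made for illustration.

```python
def r2_score_centered(y_true, y_pred):
    """Return 1 - u/v, where u is the residual sum of squares and
    v is the total sum of squares around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    u = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    v = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - u / v


y = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_centered(y, y)                    # predictions equal the data
baseline = r2_score_centered(y, [2.5] * 4)           # always predict the mean of y
worse = r2_score_centered(y, [4.0, 3.0, 2.0, 1.0])   # predictions in reversed order

print(perfect, baseline, worse)
```

Predicting the mean gives exactly 0, so any prediction with a larger residual sum of squares than the mean gives a negative score.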

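The decomposition SST = SSE + SSR (total sum of squares = explained part + residual part, all measured around the mean of y) does not hold for a regression model without an intercept. As a result, the R^2 of a no-intercept linear regression can fall below 0 when it is computed as 1 minus the residual sum of squares over SST, or exceed 1 when it is computed as the explained sum of squares over SST. In that case the uncentered R-squared can be used instead.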
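A quick numerical check of the decomposition, as a sketch only: it fits a one-regressor model through the origin using the closed form beta = sum(x*y) / sum(x*x) and compares the centered and uncentered sums of squares. The names `tss_*`, `ess_*` and `rss` (total, explained, residual) are labels chosen here, and the data are simulated with the standard library's `random` module rather than numpy, so the values will not match the question's run.

```python
import random

random.seed(0)
n = 200
x = [random.gauss(1.0, 1.0) for _ in range(n)]
y = [2.0 * xi + 4.0 + random.gauss(0.0, 1.0) for xi in x]

# No-intercept OLS with a single regressor: beta = sum(x*y) / sum(x*x).
beta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
fitted = [beta * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

y_bar = sum(y) / n
rss = sum(e * e for e in resid)

# Centered sums of squares: deviations from the mean of y.
tss_centered = sum((yi - y_bar) ** 2 for yi in y)
ess_centered = sum((fi - y_bar) ** 2 for fi in fitted)

# Uncentered sums of squares: deviations from zero.
tss_uncentered = sum(yi * yi for yi in y)
ess_uncentered = sum(fi * fi for fi in fitted)

# A gap of (numerically) zero means "total = explained + residual" holds.
centered_gap = tss_centered - (ess_centered + rss)
uncentered_gap = tss_uncentered - (ess_uncentered + rss)
print(centered_gap, uncentered_gap)
```

Without an intercept the residuals are orthogonal to the fitted values but need not sum to zero, which is why only the uncentered identity survives.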
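[1] The score (r2) computed by sklearn follows the formula above strictly, whether or not an intercept is fitted.
[2] The r2 computed by statsmodels changes when there is no intercept: "R2 is computed without centering (uncentered) since the model does not contain a constant." The statsmodels documentation defines: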

rsquared – R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
rsquared_adj – Adjusted R-squared. This is defined here as 1 - (nobs-1)/df_resid * (1-rsquared) if a constant is included and 1 - nobs/df_resid * (1-rsquared) if no constant is included.

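[3] Note: centered versus uncentered R2 is a different distinction from R2 versus adjusted R2. Textbooks also generally caution that R2 has little meaning under IV estimation (so it is usually not reported).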
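To see how points [1] and [2] lead to different numbers for the same no-intercept fit, the quoted definitions can be evaluated directly. This is a sketch in plain Python, not a call into sklearn or statsmodels; `ssr`, `centered_tss`, `uncentered_tss`, `nobs` and `df_resid` follow the names in the quoted definitions, while the data generator mirrors the question's setup but uses the standard library's `random` module, so the values will differ from the numpy run.

```python
import random

random.seed(42)
nobs = 1000
x = [random.gauss(1.0, 1.0) for _ in range(nobs)]
y = [2.0 * xi + 4.0 + random.gauss(0.0, 1.0) for xi in x]

# No-intercept OLS with a single regressor.
beta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# ssr is the sum of squared residuals, as in the quoted definitions.
ssr = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

y_bar = sum(y) / nobs
centered_tss = sum((yi - y_bar) ** 2 for yi in y)
uncentered_tss = sum(yi * yi for yi in y)

# Centered form: the 1 - u/v formula, applied even without an intercept.
r2_centered = 1.0 - ssr / centered_tss

# Uncentered form: the definition quoted for a model with no constant.
r2_uncentered = 1.0 - ssr / uncentered_tss

# Adjusted R-squared for the no-constant case: 1 - nobs/df_resid * (1 - rsquared).
df_resid = nobs - 1  # one estimated coefficient, no constant
r2_adj_uncentered = 1.0 - nobs / df_resid * (1.0 - r2_uncentered)

print(r2_centered, r2_uncentered, r2_adj_uncentered)
```

Because the true intercept is 4 while the fit is forced through the origin, the residual sum of squares exceeds the centered total, so the centered value is negative while the uncentered value stays between 0 and 1.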

For reference, see https://www.stata-journal.com/sjpdf.html?articlenum=st0030

Regression through the origin is an important and useful tool in applied statistics, but it remains a subject of pedagogical neglect, controversy and confusion. Hopefully, this synthesis provides some clarity. However, in the light of the unresolved debate, perhaps the strongest conclusion to be drawn from this review is that the practice of statistics remains as much an art as it is a science, and the development of statistical judgment is therefore as important as computational skill.
