R or Python: Which Is Better for Statistical Analysis?
import numpy as np
import scipy.stats as spst
from scipy.optimize import leastsq


class NLS:
    """Wrapper around scipy.optimize.leastsq for statistical nonlinear
    least squares. scipy provides curve_fit for this purpose, but curve_fit
    only returns parameter estimates and covariances. This wrapper also
    returns the statistics and diagnostics R's nls() reports: standard
    errors, t-values, p-values, the residual standard error, and the
    log-likelihood."""

    def __init__(self, func, p0, xdata, ydata):
        # Check the data
        if len(xdata) != len(ydata):
            raise ValueError('The number of observations does not match '
                             'the number of rows for the predictors')

        # Check the initial parameter estimates
        if not isinstance(p0, dict):
            raise ValueError("Initial parameter estimates (p0) must be a "
                             "dictionary of the form p0={'a':1, 'b':2, etc}")

        self.func = func
        self.xdata = xdata
        self.ydata = ydata
        self.nobs = len(ydata)
        self.nparm = len(p0)

        # Truncate long parameter names so the summary table lines up
        self.parmNames = [name[:5] for name in p0.keys()]

        # Run the model (dicts preserve insertion order in Python 3.7+,
        # so the names and initial values stay aligned)
        inits = list(p0.values())
        mod1 = leastsq(self.func, inits, args=(self.xdata, self.ydata),
                       full_output=1)

        # Get the parameters
        self.parmEsts = np.round(mod1[0], 4)

        # Get the error variance and standard deviation
        self.RSS = np.sum(mod1[2]['fvec']**2)  # residual sum of squares
        self.df = self.nobs - self.nparm       # residual degrees of freedom
        self.MSE = self.RSS / self.df          # mean squared error
        self.RMSE = np.sqrt(self.MSE)          # residual standard error

        # Get the covariance matrix
        self.cov = self.MSE * mod1[1]

        # Get the parameter standard errors
        self.parmSE = np.sqrt(np.diag(self.cov))

        # Calculate the t-values and their two-tailed p-values
        self.tvals = self.parmEsts / self.parmSE
        self.pvals = (1 - spst.t.cdf(np.abs(self.tvals), self.df)) * 2

        # Get the biased (MLE) variance and calculate the log-likelihood:
        # logLik = -n/2*log(2*pi) - n/2*log(s2b) - RSS/(2*s2b), s2b = RSS/n
        s2b = self.RSS / self.nobs
        self.logLik = (-self.nobs / 2 * np.log(2 * np.pi)
                       - self.nobs / 2 * np.log(s2b)
                       - 1 / (2 * s2b) * self.RSS)

    # Get the AIC; add 1 to the parameter count to account for estimation
    # of the error standard deviation
    def AIC(self, k=2):
        return -2 * self.logLik + k * (self.nparm + 1)

    # Print an R-style summary
    def summary(self):
        print()
        print('Non-linear least squares')
        print('Model: ' + self.func.__name__)
        print('Parameters:')
        print('       Estimate  Std. Error  t-value  P(>|t|)')
        for name, est, se, t, p in zip(self.parmNames, self.parmEsts,
                                       self.parmSE, self.tvals, self.pvals):
            print('%-5s  %8.4f  %10.4f  %7.4f  %7.4f' % (name, est, se, t, p))
        print()
        print('Residual Standard Error: %5.4f' % self.RMSE)
        print('Df: %i' % self.df)
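For contrast, here is roughly what the curve_fit interface mentioned in the docstring looks like. It takes the model function itself rather than a residual function, and it returns only the estimates and their covariance matrix. This is a minimal sketch with toy data; powerMod and the generated values are for illustration only:

import numpy as np
from scipy.optimize import curve_fit

# curve_fit wants the model itself, f(x, *params), not a residual function
def powerMod(mass, a, c):
    return a * mass**c

# Toy data purely for illustration
rng = np.random.default_rng(0)
mass = rng.uniform(1, 10, 50)
resp = powerMod(mass, 2.0, 0.75) + rng.normal(0, 0.1, 50)

# Only the estimates and their covariance matrix come back; the standard
# errors, t-values, p-values, and log-likelihood above must be derived by hand
popt, pcov = curve_fit(powerMod, mass, resp, p0=[1, 1])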
import pandas as pd

# Import the data and standardize respiration to a daily rate
respData = pd.read_csv('my datafile here')
respData['respDaily'] = respData['C_Resp_Mass'] * 24

# Create the Arrhenius temperature: Ar = -1/(k*T), where k is Boltzmann's
# constant (8.617e-5 eV/K) and T is the temperature in Kelvin
respData['Ar'] = -1 / (8.617 * 10**-5 * (respData['Temp'] + 273))

# First, define the likelihood null model (mass dependence only); it
# returns the residuals, as leastsq expects
def nullMod(params, mass, yObs):
    a = params[0]
    c = params[1]

    yHat = a * mass**c
    err = yObs - yHat
    return err

# Initial estimates; the keys must match the parameters nullMod unpacks
p0 = {'a': 1, 'c': 1}

tMod = NLS(nullMod, p0, respData['UrchinMass'], respData['respDaily'])

tMod.summary()

tMod.AIC()

tMod.logLik
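The Arrhenius temperature computed above is not used by the null model; presumably it feeds a temperature-dependent alternative model that this excerpt does not show. Purely to illustrate how the class supports model comparison, here is a hypothetical Arrhenius model in the metabolic-theory style; arrMod, the activation-energy parameter E, and the two-column predictor layout are my assumptions, not part of the original analysis:

# Hypothetical alternative model: respiration depends on mass and on the
# Arrhenius temperature, with E an assumed activation energy in eV
def arrMod(params, X, yObs):
    a, c, E = params
    yHat = a * X['UrchinMass']**c * np.exp(E * X['Ar'])
    return yObs - yHat

p0Arr = {'a': 1, 'c': 1, 'E': 0.6}
arrFit = NLS(arrMod, p0Arr, respData[['UrchinMass', 'Ar']],
             respData['respDaily'])

arrFit.summary()

# The model with the lower AIC is the better supported of the two
arrFit.AIC() < tMod.AIC()

Because NLS only checks len(xdata) against len(ydata), a DataFrame with one row per observation works fine as the predictor argument here.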
The above NLS code is the first that I know of in Python to do statistical NLS (not just curve fitting) and get the output I needed. But it wasn't easy: it took me about a week of my off (and on) hours. The following R code to do the same took me maybe 15 minutes:
# Import the data
respiration <- read.csv("my file")

# Standardize to 24 h
respiration$C_Resp_Day <- respiration$C_Resp_Mass * 24

# Create the Arrhenius temperature
respiration$Ar <- -1 / (8.617 * 10^-5 * (respiration$Temp + 273))

Null.mod <- nls(C_Resp_Day ~ a * UrchinMass^c, data = respiration,
                start = list(a = exp(6), c = 0.25))

summary(Null.mod)
AIC(Null.mod)
logLik(Null.mod)
R is just so much easier for real data analysis and has more functions. Python can do these things, but the modules are scattered (there are at least three separate curve-fitting modules, written by different people to do different things) and they don't always give the needed output. The NLS example using Python above is, more or less, a replica of R's nls() output, and the missing pieces can be easily added (one is sketched below).
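As one example of a missing piece: R pairs nls() with confint(). A Wald-type (asymptotic) interval can be bolted onto the class above in a few lines. conf_int is my own name for this helper, and unlike R's confint() it does not profile the likelihood; it is a minimal sketch assuming the NLS class as written above:

import numpy as np
import scipy.stats as spst

def conf_int(fit, alpha=0.05):
    # t-based (Wald) intervals from the stored estimates and standard errors
    tcrit = spst.t.ppf(1 - alpha / 2, fit.df)
    lower = fit.parmEsts - tcrit * fit.parmSE
    upper = fit.parmEsts + tcrit * fit.parmSE
    return dict(zip(fit.parmNames, zip(lower, upper)))

# Usage: conf_int(tMod) returns {'a': (low, high), 'c': (low, high)}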
R is currently head-and-shoulders above Python for data analysis, but we remain convinced that Python CAN catch up, easily and quickly. It is entirely possible to do statistical analysis in Python if you want to spend the time coding the analyses yourself. By and large, though, R is the way to go (for now). We would not be surprised if, in 10 years or so, Python's data analysis libraries coalesce and evolve into something that can rival R.
For those of you interested in trying this out, the data can be downloaded here.

