R or Python: Which Is Better for Statistical Analysis?
import numpy as np
import scipy.stats as spst
from scipy.optimize import leastsq


class NLS:
    """Wrapper around scipy.optimize.leastsq for statistical nonlinear
    least squares. scipy provides curve_fit for this purpose, but curve_fit
    only returns parameter estimates and covariances. This wrapper also
    returns the statistics and diagnostics R's nls() reports: standard
    errors, t-values, p-values, the residual standard error, and the
    log-likelihood."""

    def __init__(self, func, p0, xdata, ydata):
        # Check the data
        if len(xdata) != len(ydata):
            raise ValueError('The number of observations does not match '
                             'the number of rows for the predictors')

        # Check the initial parameter estimates
        if not isinstance(p0, dict):
            raise ValueError("Initial parameter estimates (p0) must be a "
                             "dictionary of the form p0={'a':1, 'b':2, etc}")

        self.func = func
        self.xdata = xdata
        self.ydata = ydata
        self.nobs = len(ydata)
        self.nparm = len(p0)

        # Truncate long parameter names so the summary table lines up
        self.parmNames = [name[:5] for name in p0.keys()]

        # Run the model (dicts preserve insertion order in Python 3.7+,
        # so the names and initial values stay aligned)
        inits = list(p0.values())
        mod1 = leastsq(self.func, inits, args=(self.xdata, self.ydata),
                       full_output=1)

        # Get the parameters
        self.parmEsts = np.round(mod1[0], 4)

        # Get the error variance and standard deviation
        self.RSS = np.sum(mod1[2]['fvec']**2)  # residual sum of squares
        self.df = self.nobs - self.nparm       # residual degrees of freedom
        self.MSE = self.RSS / self.df          # mean squared error
        self.RMSE = np.sqrt(self.MSE)          # residual standard error

        # Get the covariance matrix
        self.cov = self.MSE * mod1[1]

        # Get the parameter standard errors
        self.parmSE = np.sqrt(np.diag(self.cov))

        # Calculate the t-values and their two-tailed p-values
        self.tvals = self.parmEsts / self.parmSE
        self.pvals = (1 - spst.t.cdf(np.abs(self.tvals), self.df)) * 2

        # Get the biased (MLE) variance and calculate the log-likelihood:
        # logLik = -n/2*log(2*pi) - n/2*log(s2b) - RSS/(2*s2b), s2b = RSS/n
        s2b = self.RSS / self.nobs
        self.logLik = (-self.nobs / 2 * np.log(2 * np.pi)
                       - self.nobs / 2 * np.log(s2b)
                       - 1 / (2 * s2b) * self.RSS)

    # Get the AIC; add 1 to the parameter count to account for estimation
    # of the error standard deviation
    def AIC(self, k=2):
        return -2 * self.logLik + k * (self.nparm + 1)

    # Print an R-style summary
    def summary(self):
        print()
        print('Non-linear least squares')
        print('Model: ' + self.func.__name__)
        print('Parameters:')
        print('       Estimate  Std. Error  t-value  P(>|t|)')
        for name, est, se, t, p in zip(self.parmNames, self.parmEsts,
                                       self.parmSE, self.tvals, self.pvals):
            print('%-5s  %8.4f  %10.4f  %7.4f  %7.4f' % (name, est, se, t, p))
        print()
        print('Residual Standard Error: %5.4f' % self.RMSE)
        print('Df: %i' % self.df)
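For contrast, here is roughly what the curve_fit interface mentioned in the docstring looks like. It takes the model function itself rather than a residual function, and it returns only the estimates and their covariance matrix. This is a minimal sketch with toy data; powerMod and the generated values are for illustration only:

import numpy as np
from scipy.optimize import curve_fit

# curve_fit wants the model itself, f(x, *params), not a residual function
def powerMod(mass, a, c):
    return a * mass**c

# Toy data purely for illustration
rng = np.random.default_rng(0)
mass = rng.uniform(1, 10, 50)
resp = powerMod(mass, 2.0, 0.75) + rng.normal(0, 0.1, 50)

# Only the estimates and their covariance matrix come back; the standard
# errors, t-values, p-values, and log-likelihood above must be derived by hand
popt, pcov = curve_fit(powerMod, mass, resp, p0=[1, 1])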
import pandas as pd

# Import the data and standardize respiration to a daily rate
respData = pd.read_csv('my datafile here')
respData['respDaily'] = respData['C_Resp_Mass'] * 24

# Create the Arrhenius temperature: Ar = -1/(k*T), where k is Boltzmann's
# constant (8.617e-5 eV/K) and T is the temperature in Kelvin
respData['Ar'] = -1 / (8.617 * 10**-5 * (respData['Temp'] + 273))

# First, define the likelihood null model (mass dependence only); it
# returns the residuals, as leastsq expects
def nullMod(params, mass, yObs):
    a = params[0]
    c = params[1]

    yHat = a * mass**c
    err = yObs - yHat
    return err

# Initial estimates; the keys must match the parameters nullMod unpacks
p0 = {'a': 1, 'c': 1}

tMod = NLS(nullMod, p0, respData['UrchinMass'], respData['respDaily'])

tMod.summary()

tMod.AIC()

tMod.logLik
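The Arrhenius temperature computed above is not used by the null model; presumably it feeds a temperature-dependent alternative model that this excerpt does not show. Purely to illustrate how the class supports model comparison, here is a hypothetical Arrhenius model in the metabolic-theory style; arrMod, the activation-energy parameter E, and the two-column predictor layout are my assumptions, not part of the original analysis:

# Hypothetical alternative model: respiration depends on mass and on the
# Arrhenius temperature, with E an assumed activation energy in eV
def arrMod(params, X, yObs):
    a, c, E = params
    yHat = a * X['UrchinMass']**c * np.exp(E * X['Ar'])
    return yObs - yHat

p0Arr = {'a': 1, 'c': 1, 'E': 0.6}
arrFit = NLS(arrMod, p0Arr, respData[['UrchinMass', 'Ar']],
             respData['respDaily'])

arrFit.summary()

# The model with the lower AIC is the better supported of the two
arrFit.AIC() < tMod.AIC()

Because NLS only checks len(xdata) against len(ydata), a DataFrame with one row per observation works fine as the predictor argument here.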
The above NLS code is the first that I know of in Python to do statistical NLS (not just curve fitting) and get the output I needed. But it wasn't easy: it took me about a week of my off (and on) hours. The following R code to do the same took me maybe 15 minutes:
# Import the data
respiration <- read.csv("my file")

# Standardize to 24 h
respiration$C_Resp_Day <- respiration$C_Resp_Mass * 24

# Create the Arrhenius temperature
respiration$Ar <- -1 / (8.617 * 10^-5 * (respiration$Temp + 273))

Null.mod <- nls(C_Resp_Day ~ a * UrchinMass^c, data = respiration,
                start = list(a = exp(6), c = 0.25))

summary(Null.mod)
AIC(Null.mod)
logLik(Null.mod)
R is just so much easier for real data analysis and has more functions. Python can do these things, but the modules are scattered (there are at least three separate curve-fitting modules, written by different people to do different things) and they don't always give the needed output. The NLS example using Python above is, more or less, a replica of R's nls() output, and the missing pieces can be easily added (one is sketched below).
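As one example of a missing piece: R pairs nls() with confint(). A Wald-type (asymptotic) interval can be bolted onto the class above in a few lines. conf_int is my own name for this helper, and unlike R's confint() it does not profile the likelihood; it is a minimal sketch assuming the NLS class as written above:

import numpy as np
import scipy.stats as spst

def conf_int(fit, alpha=0.05):
    # t-based (Wald) intervals from the stored estimates and standard errors
    tcrit = spst.t.ppf(1 - alpha / 2, fit.df)
    lower = fit.parmEsts - tcrit * fit.parmSE
    upper = fit.parmEsts + tcrit * fit.parmSE
    return dict(zip(fit.parmNames, zip(lower, upper)))

# Usage: conf_int(tMod) returns {'a': (low, high), 'c': (low, high)}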
R is currently head-and-shoulders above Python for data analysis, but we remain convinced that Python CAN catch up, easily and quickly. It is entirely possible to do statistical analysis in Python if you want to spend the time coding the analyses yourself. By and large, though, R is the way to go (for now). We would not be surprised if, in 10 years or so, Python's data analysis libraries coalesce and evolve into something that can rival R.
For those of you interested in trying this out, the data can be downloaded here.

