# hierreg: Python Package for Linear Models

Nicolle posted on 2018-2-14 22:46:38
hierreg is a Python package intended for fitting linear models whose coefficients can deviate according to the groups to which observations belong, in a similar way to multilevel / random-effects and hierarchical Bayes models, but following more of a 'statistical learning' procedure for the estimation, by applying regularization to the group deviations (random effects).

An example application would be predicting some variable of interest from predictors/covariates collected at different physical locations (e.g. students from different schools, sales from different stores, etc.), or from survey answers given by different people, when data from these different groups (people/schools/regions/etc.) doesn't behave the same and it's desired to have different coefficients for each group, but with these coefficients still being close to each other.

Under the package's default settings, the loss to minimize is as follows:

    L(w, v) = norm( y - X*(w + sum_groups(v_group * I[x in group])) ) / sqrt(nobs)
              + reg_param * norm(group_weights * v)

Where:

* 'X' is the predictors/covariates matrix
* 'y' is the value to predict
* 'w' are the coefficients for each variable
* 'v_group' are deviations from those coefficients for each group (one variable per coefficient per group)
* 'group_weights' are weights for the deviations of each group, summing to 1 and inversely proportional to the number of observations coming from each group
* 'I[x in group]' is an indicator column matrix with ones when a row of X belongs to a given group

While this doesn't provide the same inference power or statistical-testing capabilities as hierarchical Bayes or random-effects models, respectively, it can sometimes result in models that make better predictions than either of them (see references at the bottom).

### Installation

The package is available on PyPI and can be installed with `pip install hierreg`.
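The loss above can be computed directly for a toy problem; here is a rough numpy sketch of it (the shapes and variable names are my own illustration, not hierreg internals):

```python
import numpy as np

# Toy sizes: 100 observations, 3 covariates, 2 groups
rng = np.random.default_rng(0)
n, p, k = 100, 3, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
Z = np.zeros((n, k))                      # indicator matrix I[x in group]
Z[np.arange(n), rng.integers(0, k, n)] = 1.0

w = rng.normal(size=p)                    # shared coefficients
v = rng.normal(size=(k, p))               # per-group deviations v_group
reg_param = 1.0

# group_weights: sum to 1, inversely proportional to group sizes
sizes = Z.sum(axis=0)
group_weights = (1.0 / sizes) / (1.0 / sizes).sum()

# Effective coefficients for each row: w + deviation of that row's group
coef_per_row = w + Z @ v                  # shape (n, p)
resid = y - np.sum(X * coef_per_row, axis=1)
loss = np.linalg.norm(resid) / np.sqrt(n) \
       + reg_param * np.linalg.norm(group_weights[:, None] * v)
```

Minimizing this jointly over `w` and `v` (which the package does through cvxpy or casadi) shrinks each group's deviations toward zero, i.e. toward the shared coefficients.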


Nicolle posted on 2018-2-14 22:50:17
## Example usage

```python
import numpy as np, pandas as pd
from hierreg import HierarchicalRegression

## Generating sample data
X = np.random.normal(size=(1000, 3))
groups = pd.get_dummies(np.random.randint(low=1, high=4, size=1000).astype('str')).to_numpy()

## Making a variable dependent on the input
y = (37 + 2*groups[:,0] - 4*groups[:,2]) \
    + X[:,0]*(0 + 2*groups[:,1] - groups[:,0]) \
    + 0.5*X[:,1] + 7*X[:,2] - 2*X[:,1]**2

## Adding noise
y += np.random.normal(scale=5, size=1000)

## Fitting the model
hlr = HierarchicalRegression()
hlr.fit(X, y, groups)

## Making predictions
yhat = hlr.predict(X, groups)
```
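The `groups` argument above is just a one-hot indicator matrix built with `pd.get_dummies`; the same construction in plain numpy, as a sketch (not part of the package's API):

```python
import numpy as np

# 6 observations assigned to 3 groups
group_ids = np.array([1, 3, 2, 1, 2, 3])
labels = np.unique(group_ids)                       # sorted unique labels: [1, 2, 3]
Z = (group_ids[:, None] == labels[None, :]).astype(float)
# Each row of Z has exactly one 1, marking that observation's group
```

Each column of `Z` then corresponds to one group's deviation vector in the loss.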

Nicolle posted on 2018-2-14 22:51:18
## Example - Predicting student attainment

Here I'll try to build a model to predict student attainment based on standardized test scores, family information and the schools that children attended. The data was downloaded from here: http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#lev-xc

Description from the webpage (2LEV-XC): the data are on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. There are 11 fields in the data set, of which the following are used:

* VRQ: a verbal reasoning score from tests pupils took when they entered secondary school
* ATTAIN: attainment score of pupils at age 16
* PID: primary school identifying code
* SEX: pupil's gender (0 = boy, 1 = girl)
* SC: pupil's social class scale (continuous score from low to high social class)
* SID: secondary school identifying code

Taking a look at the data:

```python
import pandas as pd, numpy as np, warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('2lev-xc.txt', names=['VRQ','ATTAIN','PID','SEX','SC','SID'],
                 sep='\s', engine='python')
df.head()
```

|   | VRQ   | ATTAIN | PID | SEX | SC   | SID |
|---|-------|--------|-----|-----|------|-----|
| 0 | 11.0  | 10.0   | 1.0 | 0   | 0.0  | 9.0 |
| 1 | 0.0   | 3.0    | 1.0 | 1   | 0.0  | 9.0 |
| 2 | -14.0 | 2.0    | 1.0 | 0   | 0.0  | 9.0 |
| 3 | -6.0  | 3.0    | 1.0 | 0   | 20.0 | 9.0 |
| 4 | -30.0 | 2.0    | 1.0 | 1   | 0.0  | 9.0 |

There are, overall, over a hundred primary schools and over a dozen secondary schools. For a dataset of this size, this is a very large number of parameters:

```python
print(len(set(df.PID)))
print(len(set(df.SID)))
```

```
148
19
```

Now fitting a model with school-varying coefficients:

```python
from hierreg import HierarchicalRegression
from sklearn.model_selection import train_test_split

y = df['ATTAIN']
X = df[['VRQ','SEX','SC']]
groups = df[['PID','SID']]

## simple train-test split - note that I'm using a random seed to get the same split again later
X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
    X, y, groups, test_size=0.33, random_state=42)

## for small problems like this, solver 'ECOS' works faster than the default 'SCS'
## it would still work fine without changing the default options, only a bit slower
## note the large regularization parameter
hr = HierarchicalRegression(l2_reg=1e4, cvxpy_opts={'solver':'ECOS'})
hr.fit(X_train, y_train, groups_train)
print(np.mean((y_test - hr.predict(X_test, groups_test))**2))
```

```
4.503111161611763
```

The choice of which variables to turn into groups with deviations in coefficients, and which to use as plain categorical variables, can be quite hard though. In this very simple example, a tuned regularized model with no group-varying coefficients ends up performing better:

```python
from sklearn.linear_model import Ridge

y = df['ATTAIN']
X = df[['VRQ','SEX','SC','PID','SID']]
X['PID'] = X.PID.map(lambda x: str(x))
X['SID'] = X.SID.map(lambda x: str(x))
X = pd.get_dummies(X)

## same train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

lr = Ridge(100).fit(X_train, y_train)
print(np.mean((y_test - lr.predict(X_test))**2))
```

```
4.386018754899887
```

(With larger datasets with more variables, this might not necessarily be the case, as illustrated in the next example.)
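A "tuned" regularized model like the Ridge above amounts to trying several regularization strengths and keeping the one that performs best on held-out data. A minimal numpy sketch of such a sweep, using closed-form ridge on hypothetical synthetic data (not the attainment dataset):

```python
import numpy as np

# Synthetic linear data with noise
rng = np.random.default_rng(42)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=2.0, size=n)

# Simple train/validation split
X_tr, X_va = X[:150], X[150:]
y_tr, y_va = y[:150], y[150:]

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: (X'X + alpha*I)^-1 X'y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

best_alpha, best_mse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    b = ridge_fit(X_tr, y_tr, alpha)
    mse = np.mean((y_va - X_va @ b) ** 2)
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse
```

The same idea applies to hierreg's `l2_reg` parameter, though each candidate value there requires solving a full optimization problem.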

Nicolle posted on 2018-2-14 22:52:55
## Example - Predicting sales at different stores

```python
rossman = pd.read_csv('train.csv', engine='python')
rossman['StateHoliday'] = rossman.StateHoliday.map(lambda x: str(x))

# excluding days with no sales
rossman = rossman.loc[rossman.Sales > 0]
rossman.head()
```

Nicolle posted on 2018-2-14 22:53:56
## Example - Predicting sales at different stores

```python
rossman.shape
```

```
(844338, 9)
```

```python
stores = pd.read_csv('C:\\ipython\\mixed effects\\rossman sales\\store.csv')
stores.head()
```

Nicolle posted on 2018-2-14 22:54:49
## Example - Predicting sales at different stores

```python
rossman['Year'] = rossman.Date.map(lambda x: x[:4])
rossman['Month'] = rossman.Date.map(lambda x: x[5:7])
rossman['DayOfWeek'] = rossman.DayOfWeek.map(lambda x: str(x))

max_comp_dist = stores.CompetitionDistance.max()
stores['CompetitionDistance'].loc[stores.CompetitionDistance.isnull()] = max_comp_dist

rossman = pd.merge(rossman, stores[['Store','StoreType','Assortment','CompetitionDistance']], on='Store')
rossman.head()
```

|   | Store | DayOfWeek | Date       | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | Year | Month | StoreType | Assortment | CompetitionDistance |
|---|-------|-----------|------------|-------|-----------|------|-------|--------------|---------------|------|-------|-----------|------------|---------------------|
| 0 | 1     | 5         | 2015-07-31 | 5263  | 555       | 1    | 1    | 0            | 1             | 2015 | 07    | c         | a          | 1270.0              |
| 1 | 1     | 4         | 2015-07-30 | 5020  | 546       | 1    | 1    | 0            | 1             | 2015 | 07    | c         | a          | 1270.0              |
| 2 | 1     | 3         | 2015-07-29 | 4782  | 523       | 1    | 1    | 0            | 1             | 2015 | 07    | c         | a          | 1270.0              |
| 3 | 1     | 2         | 2015-07-28 | 5011  | 560       | 1    | 1    | 0            | 1             | 2015 | 07    | c         | a          | 1270.0              |
| 4 | 1     | 1         | 2015-07-27 | 6102  | 612       | 1    | 1    | 0            | 1             | 2015 | 07    | c         | a          | 1270.0              |
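The string slicing above assumes the dates are always formatted `YYYY-MM-DD`; an alternative sketch using `pd.to_datetime`, which also validates the format:

```python
import pandas as pd

# Small sample in the same format as the Rossmann 'Date' column
df = pd.DataFrame({'Date': ['2015-07-31', '2015-07-30', '2014-01-05']})
dates = pd.to_datetime(df['Date'])
df['Year'] = dates.dt.year.astype(str)
df['Month'] = dates.dt.month.astype(str).str.zfill(2)  # keep the leading zero, e.g. '07'
```

Either way works here; the parsed-datetime route is just less fragile if the raw format ever changes.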

Nicolle posted on 2018-2-14 22:56:00

## Example - Predicting sales at different stores

As a comparison point, I'll first fit a linear regression model, using the stores and store types as categorical variables. Note that this model is linear in all parameters (no interactions, no polynomials, etc.), but as the dataset is very large, processing the data and fitting the model can still take a very long time. (The models in the winning solutions for this competition took over a day to fit, according to their authors.)

```python
%%time
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix, hstack

y = rossman['Sales']
X = rossman[['Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance']]
Xcateg = rossman[['Store','DayOfWeek','StoreType','Assortment','Year','Month']]
Xcateg['Store'] = Xcateg.Store.map(lambda x: str(x))
Xcateg = coo_matrix(pd.get_dummies(Xcateg).to_numpy())
X = hstack([pd.get_dummies(X).to_numpy(), Xcateg])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=100)

lr = Ridge()
lr.fit(csr_matrix(X_train), y_train)
preds_lr = lr.predict(X_test)
print(np.sqrt(np.mean((y_test - preds_lr)**2)))
```

```
2678.6297086
Wall time: 11min 23s
```

Nicolle posted on 2018-2-14 22:57:15
## Example - Predicting sales at different stores

Now fitting a model with coefficients that vary by store. Note that this is a far larger problem with over 40k model parameters, so it's no surprise that it takes a long time to fit and uses 4GB of RAM. Nevertheless, this results in a huge improvement over the previous model:

```python
%%time
from hierreg import HierarchicalRegression

y = rossman['Sales']
X = rossman[['DayOfWeek','Customers','Open','Promo','StateHoliday','SchoolHoliday',
             'CompetitionDistance','Year','Month']]
group = rossman[['Store','StoreType','Assortment']]

X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    pd.get_dummies(X), y, group, test_size=0.33, random_state=100)

## for larger datasets, casadi --> IPOPT can provide better running times,
## but it only supports the default parameters
hlr = HierarchicalRegression(solver_interface='casadi')
hlr.fit(X_train, y_train, group_train)
preds_hlr1 = hlr.predict(X_test, group_test)
print(np.sqrt(np.mean((y_test - preds_hlr1)**2)))
```

```
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************
673.711325158
Wall time: 7h 21min 50s
```
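Part of what keeps the per-store deviations under control here is the group weighting described in the first post: weights summing to 1 and inversely proportional to group size, so groups with few observations have their deviations penalized hardest. A small numpy sketch with hypothetical store counts:

```python
import numpy as np

# Hypothetical observation counts for 4 groups (stores)
counts = np.array([500, 100, 50, 10])
inv = 1.0 / counts
group_weights = inv / inv.sum()
# The smallest group gets the largest weight, so its deviations are shrunk
# most strongly toward the shared coefficients
```

This mirrors the partial-pooling behavior of random-effects models: stores with little data borrow strength from the overall fit.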

jianyu1118 posted on 2018-2-14 23:14:07
Taking a look.

qqchun posted on 2018-2-15 03:57:51
The description looks very interesting.
