Thread starter: Nicolle

hierreg: a Python Package for Linear Models


Nicolle (verified student) posted on 2018-2-14 22:46:38
hierreg

This is a Python package for fitting linear models whose coefficients can deviate according to the groups to which observations belong, similarly to multilevel / random-effects and hierarchical Bayes models, but following more of a 'statistical learning' procedure for the estimation, applying regularization to the group deviations (random effects).

An example of such an application would be predicting some variable of interest from predictors/covariates collected at different physical locations (e.g. students from different schools, sales from different stores) or from survey answers given by different people, when data from these different groups (people/schools/regions/etc.) doesn't behave the same way and it's desirable to have different coefficients for each group, while still keeping these coefficients close to each other.
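Concretely, group membership can be encoded as a matrix of 0/1 indicator columns, one per group. A minimal sketch with a hypothetical 'school' grouping, using pandas:

```python
import pandas as pd

# hypothetical grouping: six observations from three schools
school = pd.Series(['A', 'B', 'A', 'C', 'B', 'A'])

# one 0/1 indicator column per group; each row has exactly one 1
indicators = pd.get_dummies(school)
print(indicators)
```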

Under the package’s default settings, the loss to minimize is as follows:

L(w, v) = norm( y - X*(w + sum_groups(v_group * I[x in group])) )/sqrt(nobs) + reg_param*norm(group_weights * v)

Where:
- 'X' is the predictors/covariates matrix
- 'y' is the value to predict
- 'w' are the coefficients for each variable
- 'v_group' are the deviations from those coefficients for each group (one variable per coefficient per group)
- 'group_weights' are weights for the deviations of each group, summing to 1 and inversely proportional to the number of observations coming from each group
- 'I[x in group]' is an indicator column-matrix with ones in the rows of X that belong to a given group
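To make the objective concrete, here is a plain-NumPy sketch of the loss above (an illustration only, not the package's actual implementation; names follow the formula):

```python
import numpy as np

def hierreg_loss(w, v, X, y, group_ids, reg_param=1.0):
    """Sketch of the regularized loss: v holds one row of coefficient
    deviations per group; group_ids maps each row of X to a group."""
    nobs, nvars = X.shape
    groups, counts = np.unique(group_ids, return_counts=True)
    # group weights inversely proportional to group size, summing to 1
    group_weights = (1.0 / counts) / np.sum(1.0 / counts)
    # per-observation coefficients: shared w plus that row's group deviation
    group_index = np.searchsorted(groups, group_ids)
    coefs = w + v[group_index]                    # shape (nobs, nvars)
    residuals = y - np.sum(X * coefs, axis=1)
    fit_term = np.linalg.norm(residuals) / np.sqrt(nobs)
    penalty = reg_param * np.linalg.norm(group_weights[:, None] * v)
    return fit_term + penalty

# toy check: with w = v = 0 only the fit term contributes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
g = rng.integers(0, 4, size=100)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
loss = hierreg_loss(np.zeros(3), np.zeros((4, 3)), X, y, g)
```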

While this doesn’t provide the same inference power or statistical testing capabilities as hierarchical bayes or random effects models, respectively, it can sometimes result in models that are able to make better predictions than either of them (see references at the bottom).

Installation

The package is available on PyPI and can be installed with:

pip install hierreg



Nicolle (verified student) posted on 2018-2-14 22:50:17
Example usage:

import numpy as np, pandas as pd
from hierreg import HierarchicalRegression

## Generating sample data
X = np.random.normal(size=(1000, 3))
groups = pd.get_dummies(np.random.randint(low=1, high=4, size=1000).astype('str')).values

## Making a variable dependent on the input
y = (37 + 2*groups[:,0] - 4*groups[:,2]) + X[:,0]*(0 + 2*groups[:,1] - groups[:,0]) + 0.5*X[:,1] + 7*X[:,2] - 2*X[:,1]**2

## Adding noise
y += np.random.normal(scale=5, size=1000)

## Fitting the model
hlr = HierarchicalRegression()
hlr.fit(X, y, groups)

## Making predictions
yhat = hlr.predict(X, groups)

Nicolle (verified student) posted on 2018-2-14 22:51:18
Example - Predicting student attainment

Here I'll try to build a model to predict student attainment based on standardized test scores, family information and the schools that children attended - the data was downloaded from here: http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#lev-xc

Description from the webpage:

2LEV-XC
The data are on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. There are 11 fields in the data set, of which the following are used:

VRQ: A verbal reasoning score from tests pupils took when they entered secondary school
ATTAIN: Attainment score of pupils at age 16
PID: Primary school identifying code
SEX: Pupil's gender (0 = boy, 1 = girl)
SC: Pupil's social class scale (continuous scale score from low to high social class)
SID: Secondary school identifying code

Taking a look at the data:

import pandas as pd, numpy as np, warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('2lev-xc.txt', names=['VRQ','ATTAIN','PID','SEX','SC','SID'], sep='\s', engine='python')
df.head()

     VRQ  ATTAIN  PID  SEX    SC  SID
0   11.0    10.0  1.0    0   0.0  9.0
1    0.0     3.0  1.0    1   0.0  9.0
2  -14.0     2.0  1.0    0   0.0  9.0
3   -6.0     3.0  1.0    0  20.0  9.0
4  -30.0     2.0  1.0    1   0.0  9.0

There are, overall, over a hundred primary schools and over a dozen secondary schools - for a dataset of this size, that implies a very large number of parameters:

print(len(set(df.PID)))
print(len(set(df.SID)))
148
19

Now fitting a model with school-varying coefficients:

from hierreg import HierarchicalRegression
from sklearn.model_selection import train_test_split

y = df['ATTAIN']
X = df[['VRQ','SEX','SC']]
groups = df[['PID','SID']]

## simple train-test split - note that I'm using a random seed to get the same split again later
X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
     X, y, groups, test_size=0.33, random_state=42)

## for small problems like this, solver 'ECOS' works faster than the default 'SCS'
## it would still work fine without changing the default options, only a bit slower
## note the large regularization parameter
hr = HierarchicalRegression(l2_reg=1e4, cvxpy_opts={'solver':'ECOS'})
hr.fit(X_train, y_train, groups_train)
print(np.mean((y_test - hr.predict(X_test, groups_test))**2))
4.503111161611763

The choice of which variables to turn into groups with deviations in coefficients, and which to use as categorical variables, can be quite hard though - in this very simple example, a tuned regularized model with no group-varying coefficients ends up performing better:

from sklearn.linear_model import Ridge

y = df['ATTAIN']
X = df[['VRQ','SEX','SC','PID','SID']]
X['PID'] = X.PID.map(lambda x: str(x))
X['SID'] = X.SID.map(lambda x: str(x))
X = pd.get_dummies(X)

## same train-test split
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=42)

lr = Ridge(100).fit(X_train, y_train)
print(np.mean((y_test - lr.predict(X_test))**2))
4.386018754899887

(With larger datasets with more variables, this might not necessarily be the case though, as illustrated in the next example.)
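As an aside, a Ridge penalty like the 100 used above would typically itself be chosen by cross-validation. A minimal sketch with synthetic stand-in data (the attainment dataframe is not reproduced here), using scikit-learn's RidgeCV:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# synthetic stand-in for the dummy-encoded school data
rng = np.random.RandomState(42)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=2.0, size=500)

# RidgeCV selects the penalty from a candidate grid by cross-validation
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
model = RidgeCV(alphas=alphas).fit(X, y)
print(model.alpha_)  # chosen penalty
```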

Nicolle (verified student) posted on 2018-2-14 22:52:55
Example - Predicting sales at different stores

rossman = pd.read_csv('train.csv', engine='python')
rossman['StateHoliday'] = rossman.StateHoliday.map(lambda x: str(x))

# excluding days with no sales
rossman = rossman.loc[rossman.Sales > 0]
rossman.head()




Nicolle (verified student) posted on 2018-2-14 22:53:56
Example - Predicting sales at different stores

rossman.shape
(844338, 9)

stores = pd.read_csv('C:\\ipython\\mixed effects\\rossman sales\\store.csv')
stores.head()

Nicolle (verified student) posted on 2018-2-14 22:54:49
Example - Predicting sales at different stores

rossman['Year'] = rossman.Date.map(lambda x: x[:4])
rossman['Month'] = rossman.Date.map(lambda x: x[5:7])
rossman['DayOfWeek'] = rossman.DayOfWeek.map(lambda x: str(x))

## filling missing competition distances with the maximum observed distance
max_comp_dist = stores.CompetitionDistance.max()
stores.loc[stores.CompetitionDistance.isnull(), 'CompetitionDistance'] = max_comp_dist
rossman = pd.merge(rossman, stores[['Store','StoreType','Assortment','CompetitionDistance']], on='Store')
rossman.head()

   Store DayOfWeek        Date  Sales  Customers  Open  Promo StateHoliday  SchoolHoliday  Year Month StoreType Assortment  CompetitionDistance
0      1         5  2015-07-31   5263        555     1      1            0              1  2015    07         c          a               1270.0
1      1         4  2015-07-30   5020        546     1      1            0              1  2015    07         c          a               1270.0
2      1         3  2015-07-29   4782        523     1      1            0              1  2015    07         c          a               1270.0
3      1         2  2015-07-28   5011        560     1      1            0              1  2015    07         c          a               1270.0
4      1         1  2015-07-27   6102        612     1      1            0              1  2015    07         c          a               1270.0

Nicolle (verified student) posted on 2018-2-14 22:56:00

Example - Predicting sales at different stores

As a comparison point, I'll first fit a linear regression model, using the stores and store types as categorical variables. Note that this model is linear in all parameters (no interactions, no polynomials, etc.), but as the dataset is very large, processing the data and fitting the model can still take a very long time. The models in the winning solutions for this competition took over a day to fit, according to their authors.

%%time
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix, hstack

y = rossman['Sales']
X = rossman[['Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance']]
Xcateg = rossman[['Store','DayOfWeek','StoreType','Assortment','Year','Month']].copy()
Xcateg['Store'] = Xcateg.Store.map(lambda x: str(x))
Xcateg = coo_matrix(pd.get_dummies(Xcateg).values)
X = hstack([pd.get_dummies(X).values, Xcateg])
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=100)

lr = Ridge()
lr.fit(csr_matrix(X_train), y_train)
preds_lr = lr.predict(X_test)
print(np.sqrt(np.mean((y_test - preds_lr)**2)))
2678.6297086
Wall time: 11min 23s

Nicolle (verified student) posted on 2018-2-14 22:57:15
Example - Predicting sales at different stores
Now fitting a model with coefficients that vary by store - note that this is a far larger problem with over 40k model parameters, so it's no surprise that it takes a long time to fit and uses 4 GB of RAM. Nevertheless, it results in a huge improvement over the previous model:

%%time
from hierreg import HierarchicalRegression

y = rossman['Sales']
X = rossman[['DayOfWeek','Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance','Year','Month']]
group = rossman[['Store','StoreType','Assortment']]
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
     pd.get_dummies(X), y, group, test_size=0.33, random_state=100)

## for larger datasets, casadi --> IPOPT can provide better running times, but only supports the default parameters
hlr = HierarchicalRegression(solver_interface='casadi')
hlr.fit(X_train, y_train, group_train)
preds_hlr1 = hlr.predict(X_test, group_test)
print(np.sqrt(np.mean((y_test - preds_hlr1)**2)))
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

673.711325158
Wall time: 7h 21min 50s

jianyu1118 (verified employment) posted on 2018-2-14 23:14:07
Taking a look.

qqchun posted on 2018-2-15 03:57:51
The description looks very interesting.
