楼主: Lisrelchen
4844 41

[Lecture Notes]Genevera Allen:Statistical Learning using R and Matlab [推广有奖]

  • 0关注
  • 62粉丝

VIP

院士

67%

还不是VIP/贵宾

-

TA的文库  其他...

Bayesian NewOccidental

Spatial Data Analysis

东西方数据挖掘

威望
0
论坛币
50164 个
通用积分
81.5628
学术水平
253 点
热心指数
300 点
信用等级
208 点
经验
41518 点
帖子
3256
精华
14
在线时间
766 小时
注册时间
2006-5-4
最后登录
2022-11-6

相似文件 换一批

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币
Genevera Allen.JPG

Course Description
This course is a survey of statistical learning methods and will cover major techniques and concepts for both supervised and unsupervised learning. Topics covered include penalized regression and classification, support vector machines, kernel methods, model selection, matrix factorizations, graphical models, clustering, boosting, random forests, and ensemble learning. Students will learn how and when to apply statistical learning techniques, their comparative strengths and weaknesses, and how to critically evaluate the performance of learning algorithms. Students completing this course should be able to
  • (i) apply basic statistical learning methods to build predictive models or perform exploratory analysis,
  • (ii) properly tune and select statistical learning models,
  • (iii) correctly assess model fit and error, and
  • (iv) build an ensemble of learning algorithms.

本帖隐藏的内容

Genevera Allen Machine Learning Rice University.zip (4.45 MB, 需要: 1 个论坛币)




二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

关键词:Statistical statistica statistic Learning earning techniques machines learning concepts include

已有 1 人评分论坛币 学术水平 热心指数 信用等级 收起 理由
jerker + 5 + 5 + 5 + 5 精彩帖子

总评分: 论坛币 + 5  学术水平 + 5  热心指数 + 5  信用等级 + 5   查看全部评分

本帖被以下文库推荐

沙发
Lisrelchen 发表于 2016-5-29 22:02:34 |只看作者 |坛友微信交流群
  1. #########################
  2. #lecture 2 code
  3. #ozone data

  4. #load in ozone data (taken from "cosso" R package)
  5. ozone = read.csv("ozone.csv")

  6. #take a look at the data
  7. plot(ozone,pch=16,cex=.5)

  8. #directly fit least squares
  9. Y = as.numeric(ozone[,1])
  10. aX = cbind(rep(1,nrow(ozone)),as.matrix(ozone[,-1]))

  11. #estimated coefficients
  12. betahat = solve(t(aX)%*%aX)%*%t(aX)%*%Y

  13. #fitted data
  14. Yhat = aX%*%betahat

  15. #hat matrix
  16. H = aX%*%solve(t(aX)%*%aX)%*%t(aX)
  17. evH =  eigen(H)
  18. evH$values[1:10]

  19. #fit linear model using "lm"
  20. fit = lm(ozone~.,data=ozone)
  21. summary(fit)

  22. cbind(betahat,t(t(fit$coefficients)))


  23. #diagnostic plots
  24. layout(matrix(c(1,2,3,4),2,2))
  25. plot(fit)
  26. dev.off()
复制代码

使用道具

藤椅
Lisrelchen 发表于 2016-5-29 22:03:36 |只看作者 |坛友微信交流群
  1. ###########################
  2. #lecture 3 code
  3. #ozone data

  4. #load in ozone data (taken from "cosso" R package)
  5. ozone = read.csv("ozone.csv")

  6. #take a look at the data
  7. plot(ozone,pch=16,cex=.5)


  8. #ridge regression
  9. Y = as.numeric(ozone[,1]); Y = Y - mean(Y)
  10. X = as.matrix(ozone[,-1]); X = scale(X,center=T,scale=F)

  11. #ridge regression solution
  12. lam = nrow(ozone)

  13. betals = solve(t(X)%*%X)%*%t(X)%*%Y
  14. betar = solve(t(X)%*%X + diag(rep(lam,8)))%*%t(X)%*%Y
  15. cbind(betals,betar)


  16. #ridge regression coefficient paths
  17. lambdas = exp(seq(log(.01),log(10*nrow(ozone)),l=100))
  18. betasr = matrix(0,length(lambdas),8)
  19. for(i in 1:length(lambdas))
  20. {
  21.   betasr[i,] = solve(t(X)%*%X + diag(rep(lambdas[i],8)))%*%t(X)%*%Y
  22. }

  23. plot(c(1,length(lambdas)),range(betasr),type="n",ylab="Coefficients",xlab="Lambda Index")
  24. for(j in 1:8)
  25. {
  26.   lines(betasr[length(lambdas):1,j],col=j)
  27. }
  28. legend(0,20,legend=names(ozone)[2:9],col=1:9,lty=rep(1,9))


  29. ################################
  30. #looking at principal components
  31. svdx = svd(X)

  32. #scatterplots of samples PCs
  33. par(mar=c(1,1,1,1))
  34. layout(matrix(1:25,5,5))
  35. mycols = rainbow(length(Y))
  36. orY = order(Y)
  37. for(i in 1:5)
  38. {
  39.   for(j in 1:5)
  40.     {
  41.       plot(svdx$u[,i],svdx$u[,j],type="p",pch=16,col=mycols[orY])
  42.     }
  43. }

  44. #amount of variance explained
  45. varex = 0; cumvarex = 0;
  46. for(i in 1:8)
  47. {
  48.   varex[i] = svdx$d[i]/sum(svdx$d)
  49.   cumvarex[i] = sum(varex)
  50. }
  51. par(mfrow=c(1,2))
  52. par(mar=c(5,4,4,2))
  53. barplot(varex,ylab="Amount of Var Explained",xlab="PCs")
  54. barplot(cumvarex,ylab="Cummulative Var Explained",xlab="PCs")


  55. #PC direction weights
  56. par(mfrow=c(3,2))
  57. par(mar=c(5,4,3,2))
  58. for(i in 1:6)
  59.   {
  60.     barplot(svdx$v[,i],names.arg=names(ozone)[2:9])
  61.   }


  62. #ridge paths again
  63. dev.off()
  64. plot(c(1,length(lambdas)),range(betasr),type="n",ylab="Coefficients",xlab="Lambda Index")
  65. for(j in 1:8)
  66. {
  67.   lines(betasr[length(lambdas):1,j],col=j)
  68. }
  69. legend(0,20,legend=names(ozone)[2:9],col=1:9,lty=rep(1,9))



  70. #####################
  71. #least squares on derived inputs

  72. #PC regression
  73. betapcr = diag(svdx$d)%*%t(svdx$u)%*%Y


  74. #PLS regression
  75. plsfunc = function(x,y)
  76. {
  77.   p = ncol(x); n = nrow(x);
  78.   M = t(x)%*%y
  79.   Z = NULL; V = NULL; P = NULL;
  80.   for(k in 1:p)
  81.     {
  82.       svdm = svd(M)
  83.       z = x%*%svdm$u
  84.       z = z*as.numeric(1/sqrt(t(z)%*%z))
  85.       V = cbind(V,svdm$u)
  86.       p = t(x)%*%z/as.numeric(t(z)%*%z)
  87.       P = cbind(P,p);
  88.       Z = cbind(Z,z);
  89.       M = M - P%*%solve(t(P)%*%P)%*%t(P)%*%M;
  90.     }
  91.   return(list(Z=Z,V=V))
  92. }

  93. plsx = plsfunc(X,Y)


  94. #scatterplots of pls components
  95. par(mar=c(1,1,1,1))
  96. layout(matrix(1:25,5,5))
  97. mycols = rainbow(length(Y))
  98. orY = order(Y)
  99. for(i in 1:5)
  100. {
  101.   for(j in 1:5)
  102.     {
  103.       plot(plsx$Z[,i],plsx$Z[,j],type="p",pch=16,col=mycols[orY])
  104.     }
  105. }

  106. betapls = t(plsx$Z)%*%Y

  107. cbind(betapcr,betapls)
复制代码

使用道具

板凳
Lisrelchen 发表于 2016-5-29 22:04:37 |只看作者 |坛友微信交流群
  1. ###########################
  2. #lecture 4 code
  3. #ozone data
  4. #ridge regression, PCA, PC regression & PLS regression


  5. #load in ozone data (taken from "cosso" R package)
  6. ozone = read.csv("ozone.csv")

  7. #take a look at the data
  8. plot(ozone,pch=16,cex=.5)

  9. ##################
  10. #ridge regression

  11. Y = as.numeric(ozone[,1]); Y = Y - mean(Y)
  12. X = as.matrix(ozone[,-1]); X = scale(X,center=T,scale=F)

  13. #ridge regression solution
  14. lam = nrow(ozone)*.1

  15. betals = solve(t(X)%*%X)%*%t(X)%*%Y
  16. betar = solve(t(X)%*%X + diag(rep(lam,8)))%*%t(X)%*%Y
  17. cbind(betals,betar)


  18. #ridge regression coefficient paths
  19. lambdas = exp(seq(log(.01),log(10*nrow(ozone)),l=100))
  20. betasr = matrix(0,length(lambdas),8)
  21. for(i in 1:length(lambdas))
  22. {
  23.   betasr[i,] = solve(t(X)%*%X + diag(rep(lambdas[i],8)))%*%t(X)%*%Y
  24. }

  25. plot(c(1,length(lambdas)),range(betasr),type="n",ylab="Coefficients",xlab="Lambda Index")
  26. for(j in 1:8)
  27. {
  28.   lines(betasr[length(lambdas):1,j],col=j)
  29. }
  30. legend(0,20,legend=names(ozone)[2:9],col=1:9,lty=rep(1,9))


  31. ################################
  32. #looking at principal components
  33. svdx = svd(X)

  34. #scatterplots of samples PCs
  35. par(mar=c(1,1,1,1))
  36. layout(matrix(1:25,5,5))
  37. mycols = rainbow(length(Y))
  38. orY = order(Y)
  39. for(i in 1:5)
  40. {
  41.   for(j in 1:5)
  42.     {
  43.       plot(svdx$u[,i],svdx$u[,j],type="p",pch=16,col=mycols[orY])
  44.     }
  45. }

  46. #amount of variance explained
  47. varex = 0; cumvarex = 0;
  48. for(i in 1:8)
  49. {
  50.   varex[i] = svdx$d[i]/sum(svdx$d)
  51.   cumvarex[i] = sum(varex)
  52. }
  53. par(mfrow=c(1,2))
  54. par(mar=c(5,4,4,2))
  55. barplot(varex,ylab="Amount of Var Explained",xlab="PCs")
  56. barplot(cumvarex,ylab="Cummulative Var Explained",xlab="PCs")


  57. #PC direction weights
  58. par(mfrow=c(3,2))
  59. par(mar=c(5,4,3,2))
  60. for(i in 1:6)
  61.   {
  62.     barplot(svdx$v[,i],names.arg=names(ozone)[2:9])
  63.   }


  64. #ridge paths again
  65. dev.off()
  66. plot(c(1,length(lambdas)),range(betasr),type="n",ylab="Coefficients",xlab="Lambda Index")
  67. for(j in 1:8)
  68. {
  69.   lines(betasr[length(lambdas):1,j],col=j)
  70. }
  71. legend(0,20,legend=names(ozone)[2:9],col=1:9,lty=rep(1,9))



  72. #####################
  73. #least squares on derived inputs

  74. #PC regression
  75. betapcr = diag(svdx$d)%*%t(svdx$u)%*%Y


  76. #PLS regression
  77. plsfunc = function(x,y)
  78. {
  79.   p = ncol(x); n = nrow(x);
  80.   M = t(x)%*%y
  81.   Z = NULL; V = NULL; P = NULL;
  82.   for(k in 1:p)
  83.     {
  84.       svdm = svd(M)
  85.       z = x%*%svdm$u
  86.       z = z*as.numeric(1/sqrt(t(z)%*%z))
  87.       V = cbind(V,svdm$u)
  88.       p = t(x)%*%z/as.numeric(t(z)%*%z)
  89.       P = cbind(P,p);
  90.       Z = cbind(Z,z);
  91.       M = M - P%*%solve(t(P)%*%P)%*%t(P)%*%M;
  92.     }
  93.   return(list(Z=Z,V=V))
  94. }

  95. plsx = plsfunc(X,Y)


  96. #scatterplots of pls components
  97. par(mar=c(1,1,1,1))
  98. layout(matrix(1:25,5,5))
  99. mycols = rainbow(length(Y))
  100. orY = order(Y)
  101. for(i in 1:5)
  102. {
  103.   for(j in 1:5)
  104.     {
  105.       plot(plsx$Z[,i],plsx$Z[,j],type="p",pch=16,col=mycols[orY])
  106.     }
  107. }

  108. betapls = t(plsx$Z)%*%Y

  109. cbind(betapcr,betapls)


  110. ########################################
  111. #feature selection in linear models

  112. #libraries required
  113. library(MASS)
  114. library(leaps)
  115. library(glmnet)

  116. #####################
  117. #Algorithmic

  118. #best subsets selection
  119. fitbsub = regsubsets(x=ozone[,2:9],y=ozone[,1])
  120. summary(fitbsub)

  121. #forward step-wise  - via BIC
  122. fit0 = lm(ozone~1,data=ozone)
  123. fitf = stepAIC(fit0,scope=ozone~wind+temp+invHt+press+hum+vis+milPress+invTemp,direction="forward",data=ozone,k=log(nrow(ozone)))
  124. summary(fitf)

  125. #backwards step-wise - via BIC
  126. fit = lm(ozone~.,data=ozone)
  127. fitb = stepAIC(fit,direction="backward",data=ozone,k=log(nrow(ozone)))
  128. summary(fitb)
复制代码

使用道具

报纸
yangbing1008 发表于 2016-5-29 22:11:05 |只看作者 |坛友微信交流群
感谢分享
已有 1 人评分论坛币 收起 理由
Nicolle + 20 鼓励积极发帖讨论

总评分: 论坛币 + 20   查看全部评分

使用道具

地板
Lisrelchen 发表于 2016-5-29 22:12:42 |只看作者 |坛友微信交流群
  1. ###########################
  2. #lecture 5 code
  3. #ozone data

  4. ozone = read.csv("ozone.csv")

  5. ##################
  6. #ridge regression

  7. Y = as.numeric(ozone[,1]); Y = Y - mean(Y)
  8. X = as.matrix(ozone[,-1]); X = scale(X,center=T,scale=F)

  9. #ridge regression coefficient paths
  10. lambdas = exp(seq(log(.01),log(10*nrow(ozone)),l=100))
  11. betasr = matrix(0,length(lambdas),8)
  12. for(i in 1:length(lambdas))
  13. {
  14.   betasr[i,] = solve(t(X)%*%X + diag(rep(lambdas[i],8)))%*%t(X)%*%Y
  15. }

  16. plot(c(1,length(lambdas)),range(betasr),type="n",ylab="Coefficients",xlab="Lambda Index")
  17. for(j in 1:8)
  18. {
  19.   lines(betasr[length(lambdas):1,j],col=j)
  20. }
  21. legend(0,20,legend=names(ozone)[2:9],col=1:9,lty=rep(1,9))

  22. plot(ozone)

  23. ########################################
  24. #feature selection in linear models

  25. #libraries required
  26. library(MASS)
  27. library(leaps)
  28. library(glmnet)

  29. #####################
  30. #Algorithmic

  31. #best subsets selection
  32. fitbsub = regsubsets(x=ozone[,2:9],y=ozone[,1])
  33. summary(fitbsub)

  34. #forward step-wise  - via BIC
  35. fit0 = lm(ozone~1,data=ozone)
  36. fitf = stepAIC(fit0,scope=ozone~wind+temp+invHt+press+hum+vis+milPress+invTemp,direction="forward",data=ozone,k=log(nrow(ozone)))
  37. summary(fitf)

  38. #backwards step-wise - via BIC
  39. fit = lm(ozone~.,data=ozone)
  40. fitb = stepAIC(fit,direction="backward",data=ozone,k=log(nrow(ozone)))
  41. summary(fitb)

  42. #######################
  43. #L1 Regularization

  44. fit0 = lm(ozone~.-1,data=ozone)
  45. Y = as.numeric(ozone[,1])
  46. X = as.matrix(ozone[,-1])

  47. lam = 1
  48. fitl = glmnet(x=X,y=Y,family="gaussian",lambda=lam,alpha=1)
  49. cbind(fit0$coef,as.matrix(fitl$beta))

  50. #lasso paths
  51. fitl = glmnet(x=X,y=Y,family="gaussian",alpha=1)
  52. plot(fitl,col=1:8)
  53. legend(0,19,legend=names(ozone)[2:9],col=1:8,lty=rep(1,8),cex=.8)
复制代码

使用道具

7
Lisrelchen 发表于 2016-5-29 22:13:44 |只看作者 |坛友微信交流群
  1. ###########################
  2. #lecture 6 code
  3. #ozone data
  4. #comparison: sparse regression methods

  5. #libraries required
  6. library(glmnet)
  7. library(ncvreg)

  8. #load in ozone data
  9. ozone = read.csv("ozone.csv")
  10. Y = as.numeric(ozone[,1]); Y = Y - mean(Y)
  11. X = as.matrix(ozone[,-1]); X = scale(X,center=T,scale=F)

  12. #Lasso
  13. lam = 1
  14. fit0 = lm(Y~X-1)
  15. fitl = glmnet(x=X,y=Y,family="gaussian",lambda=lam,alpha=1)
  16. cbind(fit0$coef,as.matrix(fitl$beta))

  17. #Lasso paths
  18. fitl = glmnet(x=X,y=Y,family="gaussian",alpha=1)
  19. plot(fitl,col=1:8)
  20. legend(0,19,legend=names(ozone)[2:9],col=1:8,lty=rep(1,8),cex=.8)



  21. ###############################
  22. #least squares, lasso, adaptive lasso, SCAD, ridge, elastic net, MC+
  23. lam = 1

  24. betals = solve(t(X)%*%X)%*%t(X)%*%Y
  25. betar = solve(t(X)%*%X + diag(rep(lam/2*nrow(ozone),8)))%*%t(X)%*%Y
  26. fitl = glmnet(x=X,y=Y,family="gaussian",lambda=lam,alpha=1)
  27. fital = glmnet(x=X,y=Y,family="gaussian",lambda=lam,alpha=1,penalty.factor=1/abs(betals))
  28. fitel = glmnet(x=X,y=Y,family="gaussian",lambda=lam,alpha=.5)
  29. fitscad = ncvreg(X,Y,family="gaussian",penalty="SCAD",lambda=lam)
  30. fitmcp = ncvreg(X,Y,family="gaussian",penalty="MCP",lambda=lam)
  31. mat = cbind(betals,betar,as.matrix(fitl$beta),as.matrix(fital$beta),as.matrix(fitel$beta),fitscad$beta[-1],fitmcp$beta[-1])
  32. colnames(mat) = c("LS","Ridge","Lasso","A-Lasso","EL","SCAD","MC+")
  33. mat


  34. #############################
  35. #compare ridge, lasso, elastic net & SCAD regualrization paths

  36. par(mfrow=c(2,3))
  37. par(mar=c(5,4,3,2))
  38. betals = solve(t(X)%*%X)%*%t(X)%*%Y
  39. lambdas = exp(seq(log(.01),log(100*nrow(ozone)),l=100))
  40. betasr = matrix(0,length(lambdas),8)
  41. for(i in 1:length(lambdas))
  42. {
  43.   betasr[i,] = solve(t(X)%*%X + diag(rep(lambdas[i],8)))%*%t(X)%*%Y
  44. }
  45. plot(c(1,length(lambdas)),range(betals),type="n",ylab="Coefficients",xlab="Lambda Index",main="Ridge")
  46. for(j in 1:8)
  47. {
  48.   lines(betasr[length(lambdas):1,j],col=j)
  49. }
  50. legend(0,20,legend=names(ozone)[2:9],col=1:9,lty=rep(1,9),cex=.75)

  51. fitl = glmnet(x=X,y=Y,family="gaussian",alpha=1)
  52. plot(fitl,col=1:8,main="Lasso")
  53. legend(0,20,legend=names(ozone)[2:9],col=1:8,lty=rep(1,8),cex=.75)

  54. fitel = glmnet(x=X,y=Y,family="gaussian",alpha=.5)
  55. plot(fitel,col=1:8,main="EL alpha=.5")
  56. legend(0,20,legend=names(ozone)[2:9],col=1:8,lty=rep(1,8),cex=.75)

  57. fitel = glmnet(x=X,y=Y,family="gaussian",alpha=.25)
  58. plot(fitel,col=1:8,main="EL alpha=.25")
  59. legend(0,20,legend=names(ozone)[2:9],col=1:8,lty=rep(1,8),cex=.75)

  60. fitscad = ncvreg(X,Y,family="gaussian",penalty="SCAD")
  61. plot(fitscad,col=1:8,main="SCAD",shade=F)
  62. legend(6,30,legend=names(ozone)[2:9],col=1:8,lty=rep(1,8),cex=.75)

  63. fitmcp = ncvreg(X,Y,family="gaussian",penalty="MCP")
  64. plot(fitmcp,col=1:8,main="MC+",shade=F)
  65. legend(6,30,legend=names(ozone)[2:9],col=1:8,lty=rep(1,8),cex=.75)
复制代码

使用道具

8
Lisrelchen 发表于 2016-5-29 22:15:56 |只看作者 |坛友微信交流群
  1. %%%%%%%%%%%%%%%%%%%%%%%
  2. %lecture 8 code
  3. %LDA

  4. dat = dlmread('zip.train');
  5. x = dat(:,2:end);
  6. [n,p] = size(x);

  7. imshow(flipud(rot90(reshape(dat(1,2:end),16,16))))


  8. nclass = length(unique(dat(:,1)));
  9. Y = zeros(n,nclass);
  10. nK = 0;
  11. for i=1:nclass
  12.     Y(dat(:,1)==(i-1),i) = 1;
  13.     nK(i) = sum(Y(:,i));
  14. end


  15. X = x - ones(n,1)*mean(x);
  16. mvecs = zeros(p,nclass);
  17. for i=1:nclass
  18.     mvecs(:,i) = mean(X(Y(:,i)==1,:));
  19. end

  20. SigB = zeros(p,p);
  21. for i=1:nclass
  22.     SigB = SigB + nK(i)*mvecs(:,i)*mvecs(:,i)';
  23. end

  24. SigT = X'*X;
  25. SigW = SigT - SigB;

  26. [V,D] = eig(SigB,SigW);


  27. scatter(X*V(:,1),X*V(:,2),'SizeData',1)
  28. axis tight
  29. text(X*V(:,1),X*V(:,2),num2str(dat(:,1)))


  30. scatter(X*V(:,3),X*V(:,4),'SizeData',1)
  31. axis tight
  32. text(X*V(:,3),X*V(:,4),num2str(dat(:,1)))


  33. scatter(X*V(:,5),X*V(:,6),'SizeData',1)
  34. axis tight
  35. text(X*V(:,5),X*V(:,6),num2str(dat(:,1)))


  36. [classNB,errNB] = classify(X,X,dat(:,1),'diaglinear');
  37. [classLDA,errLDA] = classify(X,X,dat(:,1),'linear');
  38. [classGLDA,errGLDA] = classify(X*V(:,1:10),X*V(:,1:10),dat(:,1),'diaglinear');

  39. [errNB errLDA  errGLDA]


  40. tdat = dlmread('zip.test');
  41. [tn,tp] = size(tdat);
  42. tx = tdat(:,2:end);
  43. tX = tx - ones(size(tdat,1),1)*mean(tx);

  44. [classNB,errNB] = classify(tX,X,dat(:,1),'diaglinear');
  45. [classLDA,errLDA] = classify(tX,X,dat(:,1),'linear');
  46. [classGLDA,errGLDA] = classify(tX*V(:,1:10),X*V(:,1:10),dat(:,1),'diaglinear');

  47. [sum(tdat(:,1)~=classNB)/tn,  sum(tdat(:,1)~=classLDA)/tn,  sum(tdat(:,1)~=classGLDA)/tn]
复制代码

使用道具

9
Lisrelchen 发表于 2016-5-29 22:18:21 |只看作者 |坛友微信交流群
  1. ###############
  2. #lecture 9 code

  3. #univariate logistic regression - simulated example
  4. #note: change  coefficients to understand logistic function
  5. n = 250; p = 1;
  6. x = matrix(rnorm(n*p),n,p)
  7. beta0 = 0
  8. beta = 1
  9. eps = matrix(rnorm(n),n,1)
  10. probs = exp(beta0 + x*beta + eps)/(1 + exp(beta0 + x*beta + eps))
  11. Y = as.numeric(probs>.5)

  12. #fitting logistic
  13. fit = glm(Y ~ x,family="binomial")
  14. summary(fit)
  15. plot(x,Y)
  16. xs = seq(min(x),max(x),l=1000)
  17. pihat = exp(fit$coefficients[1] + xs*fit$coefficients[2])/(1 + exp(fit$coefficients[1] + xs*fit$coefficients[2]))
  18. lines(xs,pihat)
  19. lines(c(min(x),max(x)),c(.5,.5),lty=2)


  20. ###############
  21. #looking at spam data

  22. #reading in data
  23. data = read.csv("spam_dat.csv",header=FALSE)
  24. ndat = read.delim("spam_vars.txt",header=FALSE)

  25. #parsing variable names
  26. nams = NULL
  27. for(i in 1:nrow(ndat))
  28. {
  29.   vec = strsplit(as.character(ndat[i,]),split="_")
  30.   for(j in 1:length(vec[[1]]))
  31.     {
  32.       if(length(grep(":",vec[[1]][j]))>0)
  33.         {
  34.           vars = strsplit(vec[[1]][j],split=":")
  35.           nams = c(nams,vars[[1]][1])
  36.         }
  37.     }
  38. }

  39. Y = data[,58]
  40. n = length(Y)
  41. sum(Y)/n
  42. X = as.matrix(log(1 + data[,1:57]))
  43. colnames(X) = nams
  44. X = scale(X)/sqrt(n-1)
  45. dat = data.frame(Y,X)

  46. #taking a subset of data
  47. sdat = data.frame(Y,dat$george,dat$meeting,dat$total,dat$re,dat$edu,dat$free,dat$your)
  48. plot(sdat,pch=16,cex=.5)

  49. #########
  50. #fit logistic model

  51. fits = glm(Y~.,data=sdat,family="binomial",maxit=50)
  52. summary(fits)

  53. #individual variable total
  54. fit = glm(Y~dat.total,data=sdat,family="binomial")
  55. summary(fit)
  56. plot(sdat$dat.total,sdat$Y)
  57. xs = seq(min(sdat$dat.total),max(sdat$dat.total),l=1000)
  58. pihat = exp(fit$coefficients[1] + xs*fit$coefficients[2])/(1 + exp(fit$coefficients[1] + xs*fit$coefficients[2]))
  59. lines(xs,pihat)
  60. lines(c(min(xs),max(xs)),c(.5,.5),lty=2)


  61. #individual variable george
  62. fit = glm(Y~dat.george,data=sdat,family="binomial")
  63. summary(fit)
  64. plot(sdat$dat.george,sdat$Y)
  65. xs = seq(min(sdat$dat.george),max(sdat$dat.george),l=1000)
  66. pihat = exp(fit$coefficients[1] + xs*fit$coefficients[2])/(1 + exp(fit$coefficients[1] + xs*fit$coefficients[2]))
  67. lines(xs,pihat)
  68. lines(c(min(xs),max(xs)),c(.5,.5),lty=2)


  69. #individual variable free
  70. fit = glm(Y~dat.free,data=sdat,family="binomial")
  71. summary(fit)
  72. plot(sdat$dat.free,sdat$Y)
  73. xs = seq(min(sdat$dat.free),max(sdat$dat.free),l=1000)
  74. pihat = exp(fit$coefficients[1] + xs*fit$coefficients[2])/(1 + exp(fit$coefficients[1] + xs*fit$coefficients[2]))
  75. lines(xs,pihat)
  76. lines(c(min(xs),max(xs)),c(.5,.5),lty=2)
复制代码

使用道具

10
Lisrelchen 发表于 2016-5-29 22:20:05 |只看作者 |坛友微信交流群
  1. ###############
  2. #lecture 9 code
  3. #classifying spam emails - HP data set

  4. library(glmnet)
  5. library(ncvreg)


  6. #reading in data
  7. data = read.csv("spam_dat.csv",header=FALSE)
  8. ndat = read.delim("spam_vars.txt",header=FALSE)

  9. #parsing variable names
  10. nams = NULL
  11. for(i in 1:nrow(ndat))
  12. {
  13.   vec = strsplit(as.character(ndat[i,]),split="_")
  14.   for(j in 1:length(vec[[1]]))
  15.     {
  16.       if(length(grep(":",vec[[1]][j]))>0)
  17.         {
  18.           vars = strsplit(vec[[1]][j],split=":")
  19.           nams = c(nams,vars[[1]][1])
  20.         }
  21.     }
  22. }

  23. Y = data[,58]
  24. n = length(Y)
  25. sum(Y)/n
  26. X = as.matrix(log(1 + data[,1:57]))
  27. colnames(X) = nams
  28. X = scale(X)/sqrt(n-1)
  29. dat = data.frame(Y,X)

  30. #taking a subset of data
  31. sdat = data.frame(Y,dat$george,dat$meeting,dat$total,dat$re,dat$edu,dat$free,dat$your)
  32. plot(sdat,pch=16,cex=.5)


  33. #########
  34. #fit logistic model

  35. fits = glm(Y~.,data=sdat,family="binomial",maxit=50)
  36. summary(fits)


  37. ###########
  38. #sparse logistic regression

  39. #penalized logistic - coefficients
  40. lam = .1

  41. Xs = as.matrix(sdat[,2:8])
  42. sfitl = glmnet(x=Xs,y=Y,family="binomial",lambda=lam,alpha=1)
  43. sfitr = glmnet(x=Xs,y=Y,family="binomial",lambda=lam,alpha=0)
  44. sfitel = glmnet(x=Xs,y=Y,family="binomial",lambda=lam,alpha=.5)
  45. sfitscad = ncvreg(Xs,Y,family="binomial",penalty="SCAD",lambda=lam)
  46. mat = cbind(fits$coefficients[-1],as.matrix(sfitl$beta),as.matrix(sfitr$beta),as.matrix(sfitel$beta),sfitscad$beta[-1])
  47. colnames(mat) = c("Logistic","Lasso","Ridge","EL","SCAD")
  48. mat

  49. #penalized logistic - regularization paths
  50. sfitl = glmnet(x=Xs,y=Y,family="binomial",alpha=1)
  51. sfitr = glmnet(x=Xs,y=Y,family="binomial",alpha=0)
  52. sfitel = glmnet(x=Xs,y=Y,family="binomial",alpha=.5)
  53. sfitscad = ncvreg(Xs,Y,family="binomial",,penalty="SCAD")

  54. par(mfrow=c(2,2))
  55. plot(sfitl,col=1:7,main="L1")
  56. legend(0,-100,legend=names(sdat)[2:8],col=1:7,lty=rep(1,7),cex=.75)

  57. plot(sfitr,col=1:7,main="Ridge")
  58. legend(0,-10,legend=names(sdat)[2:8],col=1:7,lty=rep(1,7),cex=.75)

  59. plot(sfitel,col=1:7,main="Elastic Net")
  60. legend(0,-100,legend=names(sdat)[2:8],col=1:7,lty=rep(1,7),cex=.75)

  61. plot(sfitscad,col=1:7,main="SCAD",shade=F)
  62. legend(.22,-100,legend=names(sdat)[2:8],col=1:7,lty=rep(1,7),cex=.75)


  63. #note: this takes a while to run
  64. #full data L1 regularizaiton paths
  65. fit1 = glmnet(x=X,y=Y,family="binomial")
  66. plot(fit1)
复制代码

使用道具

您需要登录后才可以回帖 登录 | 我要注册

本版微信群
加好友,备注jltj
拉您入交流群

京ICP备16021002-2号 京B2-20170662号 京公网安备 11010802022788号 论坛法律顾问:王进律师 知识产权保护声明   免责及隐私声明

GMT+8, 2024-11-6 00:44