OP: 牛尾巴

[Exclusive Release] [Kindle] R Data Analysis Cookbook - More Than 80 Recipes to Help You Delive

41
Lisrelchen posted on 2016-7-25 20:37:48

Classification Trees using R

Building, plotting, and evaluating classification trees

How to do it...

This recipe shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams:

1. Load the rpart, rpart.plot, and caret packages:

> library(rpart)
> library(rpart.plot)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Create data partitions. We need two partitions: training and validation. Rather than copying the data into the partitions, we will just keep the indices of the cases that represent the training cases, and subset as and when needed:

> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

4. Build the tree:

> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class", control = rpart.control(minsplit = 20, cp = 0.01))

5. View the text output (your result could differ if you did not set the random seed as in step 3):

> mod

6. Generate a diagram of the tree (your tree might differ if you did not set the random seed as in step 3):

> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")

7. Prune the tree (see the sketch after this recipe for a way to automate the CP choice):

> # First see the cptable
> # !!Note!!: Your table can be different because of the
> # random aspect in cross-validation
> mod$cptable
> # Choose the CP value as the highest value whose
> # xerror is not greater than minimum xerror + xstd
> # With the above data that happens to be
> # the fifth one, 0.01182033
> # Your values could be different because of random sampling
> mod.pruned <- prune(mod, mod$cptable[5, "CP"])

8. View the pruned tree (your tree will look different):

> prp(mod.pruned, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")

9. Use the pruned model to predict for the validation partition (note the minus sign before train.idx to select the cases in the validation partition):

> pred.pruned <- predict(mod.pruned, bn[-train.idx,], type = "class")

10. Generate the error/classification-confusion matrix:

> table(bn[-train.idx,]$class, pred.pruned, dnn = c("Actual", "Predicted"))

42
Lisrelchen posted on 2016-7-25 21:43:32

Random Forest Models using R

1. Load the randomForest and caret packages:

> library(randomForest)
> library(caret)

2. Read the data and convert the response variable to a factor:

> bn <- read.csv("banknote-authentication.csv")
> bn$class <- factor(bn$class)

3. Select a subset of the data for building the model. In random forests, we do not need to actually partition the data for model evaluation, since the tree-construction process has partitioning inherent in every step. However, we keep aside some of the data here just to illustrate the process of using the model for prediction and also to get an idea of the model's performance:

> set.seed(1000)
> sub.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

4. Build the random forest model:

> mod <- randomForest(x = bn[sub.idx, 1:4], y = bn[sub.idx, 5], ntree = 500, keep.forest = TRUE)

5. Use the model to predict for the cases that we set aside in step 3:

> pred <- predict(mod, bn[-sub.idx,])

6. Build the error matrix:

> table(bn[-sub.idx, "class"], pred, dnn = c("Actual", "Predicted"))

43
Lisrelchen posted on 2016-7-25 21:46:14

Support Vector Machine using R

To classify using SVM, follow these steps:

1. Load the e1071 and caret packages:

> library(e1071)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Partition the data:

> set.seed(1000)
> t.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Build the model:

> mod <- svm(class ~ ., data = bn[t.idx,])

6. Check model performance on the training data by generating an error/classification-confusion matrix:

> table(bn[t.idx, "class"], fitted(mod), dnn = c("Actual", "Predicted"))

44
Lisrelchen posted on 2016-7-25 21:53:22

Naïve Bayes Modeling using R

To classify using the Naïve Bayes method, follow these steps:

1. Load the e1071 and caret packages:

> library(e1071)
> library(caret)

2. Read the data:

> ep <- read.csv("electronics-purchase.csv")

3. Partition the data:

> set.seed(1000)
> train.idx <- createDataPartition(ep$Purchase, p = 0.67, list = FALSE)

4. Build the model:

> epmod <- naiveBayes(Purchase ~ ., data = ep[train.idx,])

5. Look at the model:

> epmod

6. Predict for each case of the validation partition:

> pred <- predict(epmod, ep[-train.idx,])

7. Generate and view the error matrix/classification-confusion matrix for the validation partition:

> tab <- table(ep[-train.idx,]$Purchase, pred, dnn = c("Actual", "Predicted"))
> tab
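A small follow-up sketch: since tab holds the confusion matrix, its diagonal counts the correctly classified validation cases, so overall accuracy is one line:

# correct classifications divided by all validation cases
sum(diag(tab)) / sum(tab)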

45
Lisrelchen posted on 2016-7-25 21:57:07

K-Nearest Neighbours Modeling using R

1. Load the class and caret packages:

> library(class)
> library(caret)

2. Read the data:

> vac <- read.csv("vacation-trip-classification.csv")

3. Standardize the predictor variables Income and Family_size:

> vac$Income.z <- scale(vac$Income)
> vac$Family_size.z <- scale(vac$Family_size)

4. Partition the data. You need three partitions for KNN:

> set.seed(1000)
> train.idx <- createDataPartition(vac$Result, p = 0.5, list = FALSE)
> train <- vac[train.idx, ]
> temp <- vac[-train.idx, ]
> val.idx <- createDataPartition(temp$Result, p = 0.5, list = FALSE)
> val <- temp[val.idx, ]
> test <- temp[-val.idx, ]

5. Generate predictions for the validation cases with k=1:

> pred1 <- knn(train[, 4:5], val[, 4:5], train[, 3], 1)

6. Generate an error matrix for k=1:

> errmat1 <- table(val$Result, pred1, dnn = c("Actual", "Predicted"))

7. Repeat the preceding process for many values of k and choose the best value for k. The book's There's more... section gives a way to automate this process; a minimal sketch also appears after this recipe.

8. Use that value of k to generate predictions and the error matrix for the cases in the test partition (in the following code, we assume that k=1 was preferred):

> pred.test <- knn(train[, 4:5], test[, 4:5], train[, 3], 1)
> errmat.test <- table(test$Result, pred.test, dnn = c("Actual", "Predicted"))

46
Lisrelchen posted on 2016-7-25 22:02:03

Neural Networks Classification Modeling using R

1. Load the nnet and caret packages:

> library(nnet)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Partition the data:

> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Build the neural network model:

> mod <- nnet(class ~ ., data = bn[train.idx,], size = 3, maxit = 10000, decay = 0.001, rang = 0.05)

6. Use the model to predict for the validation partition:

> pred <- predict(mod, newdata = bn[-train.idx,], type = "class")

7. Build and display the error/classification-confusion matrix on the validation partition:

> table(bn[-train.idx,]$class, pred)
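One hedged caveat, not from the recipe as posted: nnet initializes its weights randomly, and step 4 here does not set a seed, so repeated runs give different partitions and different fits. A sketch that fixes the seed before both (1000 is just the value the other recipes in this thread use):

set.seed(1000)  # any fixed seed makes the run repeatable
train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)
mod <- nnet(class ~ ., data = bn[train.idx,], size = 3, maxit = 10000,
            decay = 0.001, rang = 0.05)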

47
Lisrelchen posted on 2016-7-25 22:06:17

Linear Discriminant Function Analysis using R

1. Load the MASS and caret packages:

> library(MASS)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Partition the data:

> set.seed(1000)
> t.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Build the linear discriminant function model:

> ldamod <- lda(bn[t.idx, 1:4], bn[t.idx, 5])

6. Check how the model performs on the training partition:

> bn[t.idx, "Pred"] <- predict(ldamod, bn[t.idx, 1:4])$class
> table(bn[t.idx, "class"], bn[t.idx, "Pred"], dnn = c("Actual", "Predicted"))

7. Generate predictions on the validation partition and check performance:

> bn[-t.idx, "Pred"] <- predict(ldamod, bn[-t.idx, 1:4])$class
> table(bn[-t.idx, "class"], bn[-t.idx, "Pred"], dnn = c("Actual", "Predicted"))

48
Lisrelchen posted on 2016-7-25 22:09:40

Logistic Regression using R

1. Load the caret package:

> library(caret)

2. Read the data:

> bh <- read.csv("boston-housing-logistic.csv")

3. Partition the data:

> set.seed(1000)
> train.idx <- createDataPartition(bh$CLASS, p = 0.7, list = FALSE)

4. Build the logistic regression model:

> logit <- glm(CLASS ~ ., data = bh[train.idx,], family = binomial)

5. Examine the model:

> summary(logit)

6. Compute the probabilities of "success" for cases in the validation partition and store them in a variable called PROB_SUCC:

> bh[-train.idx, "PROB_SUCC"] <- predict(logit, newdata = bh[-train.idx,], type = "response")

7. Classify the cases using a cutoff probability of 0.5:

> bh[-train.idx, "PRED_50"] <- ifelse(bh[-train.idx, "PROB_SUCC"] >= 0.5, 1, 0)

8. Generate the error/classification-confusion matrix (your results could differ):

> table(bh[-train.idx, "CLASS"], bh[-train.idx, "PRED_50"], dnn = c("Actual", "Predicted"))
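The 0.5 cutoff in step 7 is a modeling choice, not a requirement. A minimal sketch that prints the validation error matrix at a few alternative cutoffs, so you can trade false positives against false negatives:

for (cutoff in c(0.3, 0.5, 0.7)) {
  pred <- ifelse(bh[-train.idx, "PROB_SUCC"] >= cutoff, 1, 0)
  cat("cutoff =", cutoff, "\n")
  print(table(bh[-train.idx, "CLASS"], pred, dnn = c("Actual", "Predicted")))
}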

49
Lisrelchen posted on 2016-7-25 22:13:35

Combining Classification Tree Models (AdaBoost)

1. Load the caret and ada packages:

> library(caret)
> library(ada)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Create partitions:

> set.seed(1000)
> t.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Create an rpart.control object:

> cont <- rpart.control()

6. Build the model:

> mod <- ada(class ~ ., data = bn[t.idx,], iter = 50, loss = "e", type = "discrete", control = cont)

7. View the model result:

> mod

8. Generate predictions on the validation partition:

> pred <- predict(mod, newdata = bn[-t.idx,], type = "vector")

9. Build the error/classification-confusion matrix on the validation partition:

> table(bn[-t.idx, "class"], pred, dnn = c("Actual", "Predicted"))
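A hedged addition for inspecting the boosted ensemble beyond the error matrix, using the ada package's own diagnostics:

plot(mod)      # training error as a function of the boosting iteration
varplot(mod)   # variable importance for the boosted trees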

50
Lisrelchen posted on 2016-7-25 22:21:50
KNN Regression Models using R

1. Load the dummies, FNN, scales, and caret packages as follows:

> library(dummies)
> library(FNN)
> library(scales)
> library(caret)

2. Read the data:

> educ <- read.csv("education.csv")

3. Generate dummies for the categorical variable region and add them to educ as follows:

> dums <- dummy(educ$region, sep = "_")
> educ <- cbind(educ, dums)

4. Because KNN performs distance computations, we should either rescale or standardize the predictors. In the present example, we have three numeric predictors and a categorical predictor in the form of three dummy variables. Standardizing dummy variables is tricky, and hence we will scale the numeric ones to [0, 1] and leave the dummies alone because they are already in the 0-1 range:

> educ$urban.s <- rescale(educ$urban)
> educ$income.s <- rescale(educ$income)
> educ$under18.s <- rescale(educ$under18)

5. Create three partitions (because we are creating random partitions, your results can differ) as follows:

> set.seed(1000)
> t.idx <- createDataPartition(educ$expense, p = 0.6, list = FALSE)
> trg <- educ[t.idx,]
> rest <- educ[-t.idx,]
> set.seed(2000)
> v.idx <- createDataPartition(rest$expense, p = 0.5, list = FALSE)
> val <- rest[v.idx,]
> test <- rest[-v.idx,]

6. Build the model for several values of k. In the following code, we show how to compute the RMS error from scratch. You can also use the convenience rdacb.rmse function, which was shown in the recipe Computing the root mean squared error earlier in this chapter (a minimal equivalent is sketched after this recipe):

> # for k=1
> res1 <- knn.reg(trg[, 7:12], val[, 7:12], trg[, 6], 1, algorithm = "brute")
> rmse1 <- sqrt(mean((res1$pred - val[, 6])^2))
> rmse1

7. We obtained the lowest RMS error for k=2. Evaluate the model on the test partition as follows:

> res.test <- knn.reg(trg[, 7:12], test[, 7:12], trg[, 6], 2, algorithm = "brute")
> rmse.test <- sqrt(mean((res.test$pred - test[, 6])^2))
> rmse.test
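The convenience function rdacb.rmse comes with the book's companion code and is not reproduced in this thread; a minimal equivalent sketch, plus a loop that scans several k values on the validation partition (the claim in step 7 that k=2 is best comes from a scan like this):

# simple root-mean-squared-error helper, analogous to rdacb.rmse
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))
for (k in 1:5) {
  res <- knn.reg(trg[, 7:12], val[, 7:12], trg[, 6], k, algorithm = "brute")
  cat("k =", k, " RMSE =", rmse(val[, 6], res$pred), "\n")
}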
