OP: 牛尾巴

[Exclusive Release] [Kindle] R Data Analysis Cookbook - More Than 80 Recipes to Help You Delive

41
Lisrelchen posted on 2016-7-25 20:37:48

Classification Trees using R

Building, plotting, and evaluating classification trees

How to do it...

This recipe shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams:

1. Load the rpart, rpart.plot, and caret packages:

> library(rpart)
> library(rpart.plot)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Create data partitions. We need two partitions: training and validation. Rather than copying the data into the partitions, we will just keep the indices of the cases that represent the training cases, and subset as and when needed:

> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

4. Build the tree:

> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class", control = rpart.control(minsplit = 20, cp = 0.01))

5. View the text output (your result could differ if you did not set the random seed as in step 3):

> mod

6. Generate a diagram of the tree (your tree might differ if you did not set the random seed as in step 3):

> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")

7. Prune the tree (see the sketch after this recipe for a way to automate the CP choice):

> # First see the cptable
> # !!Note!!: Your table can be different because of the
> # random aspect in cross-validation
> mod$cptable
> # Choose the CP value as the highest value whose
> # xerror is not greater than minimum xerror + xstd
> # With the above data that happens to be
> # the fifth one, 0.01182033
> # Your values could be different because of random sampling
> mod.pruned <- prune(mod, mod$cptable[5, "CP"])

8. View the pruned tree (your tree will look different):

> prp(mod.pruned, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")

9. Use the pruned model to predict for the validation partition (note the minus sign before train.idx to select the cases in the validation partition):

> pred.pruned <- predict(mod.pruned, bn[-train.idx,], type = "class")

10. Generate the error/classification-confusion matrix:

> table(bn[-train.idx,]$class, pred.pruned, dnn = c("Actual", "Predicted"))

42
Lisrelchen posted on 2016-7-25 21:43:32

Random Forest Models using R

1. Load the randomForest and caret packages:

> library(randomForest)
> library(caret)

2. Read the data and convert the response variable to a factor:

> bn <- read.csv("banknote-authentication.csv")
> bn$class <- factor(bn$class)

3. Select a subset of the data for building the model. In random forests, we do not need to actually partition the data for model evaluation, since the tree-construction process has partitioning inherent in every step. However, we keep aside some of the data here just to illustrate the process of using the model for prediction and also to get an idea of the model's performance:

> set.seed(1000)
> sub.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

4. Build the random forest model:

> mod <- randomForest(x = bn[sub.idx, 1:4], y = bn[sub.idx, 5], ntree = 500, keep.forest = TRUE)

5. Use the model to predict for the cases that we set aside in step 3:

> pred <- predict(mod, bn[-sub.idx,])

6. Build the error matrix:

> table(bn[-sub.idx, "class"], pred, dnn = c("Actual", "Predicted"))

43
Lisrelchen posted on 2016-7-25 21:46:14

Support Vector Machine using R

To classify using SVM, follow these steps:

1. Load the e1071 and caret packages:

> library(e1071)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Partition the data:

> set.seed(1000)
> t.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Build the model:

> mod <- svm(class ~ ., data = bn[t.idx,])

6. Check model performance on the training data by generating an error/classification-confusion matrix:

> table(bn[t.idx, "class"], fitted(mod), dnn = c("Actual", "Predicted"))

44
Lisrelchen posted on 2016-7-25 21:53:22

Naïve Bayes Modeling using R

To classify using the Naïve Bayes method, follow these steps:

1. Load the e1071 and caret packages:

> library(e1071)
> library(caret)

2. Read the data:

> ep <- read.csv("electronics-purchase.csv")

3. Partition the data:

> set.seed(1000)
> train.idx <- createDataPartition(ep$Purchase, p = 0.67, list = FALSE)

4. Build the model:

> epmod <- naiveBayes(Purchase ~ ., data = ep[train.idx,])

5. Look at the model:

> epmod

6. Predict for each case of the validation partition:

> pred <- predict(epmod, ep[-train.idx,])

7. Generate and view the error matrix/classification-confusion matrix for the validation partition:

> tab <- table(ep[-train.idx,]$Purchase, pred, dnn = c("Actual", "Predicted"))
> tab
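A small follow-up sketch: since tab holds the confusion matrix, its diagonal counts the correctly classified validation cases, so overall accuracy is one line:

# correct classifications divided by all validation cases
sum(diag(tab)) / sum(tab)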

45
Lisrelchen posted on 2016-7-25 21:57:07

K-Nearest Neighbours Modeling using R

1. Load the class and caret packages:

> library(class)
> library(caret)

2. Read the data:

> vac <- read.csv("vacation-trip-classification.csv")

3. Standardize the predictor variables Income and Family_size:

> vac$Income.z <- scale(vac$Income)
> vac$Family_size.z <- scale(vac$Family_size)

4. Partition the data. You need three partitions for KNN:

> set.seed(1000)
> train.idx <- createDataPartition(vac$Result, p = 0.5, list = FALSE)
> train <- vac[train.idx, ]
> temp <- vac[-train.idx, ]
> val.idx <- createDataPartition(temp$Result, p = 0.5, list = FALSE)
> val <- temp[val.idx, ]
> test <- temp[-val.idx, ]

5. Generate predictions for the validation cases with k=1:

> pred1 <- knn(train[, 4:5], val[, 4:5], train[, 3], 1)

6. Generate an error matrix for k=1:

> errmat1 <- table(val$Result, pred1, dnn = c("Actual", "Predicted"))

7. Repeat the preceding process for many values of k and choose the best value for k. The book's There's more... section gives a way to automate this process; a minimal sketch also appears after this recipe.

8. Use that value of k to generate predictions and the error matrix for the cases in the test partition (in the following code, we assume that k=1 was preferred):

> pred.test <- knn(train[, 4:5], test[, 4:5], train[, 3], 1)
> errmat.test <- table(test$Result, pred.test, dnn = c("Actual", "Predicted"))

46
Lisrelchen posted on 2016-7-25 22:02:03

Neural Networks Classification Modeling using R

1. Load the nnet and caret packages:

> library(nnet)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Partition the data:

> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Build the neural network model:

> mod <- nnet(class ~ ., data = bn[train.idx,], size = 3, maxit = 10000, decay = 0.001, rang = 0.05)

6. Use the model to predict for the validation partition:

> pred <- predict(mod, newdata = bn[-train.idx,], type = "class")

7. Build and display the error/classification-confusion matrix on the validation partition:

> table(bn[-train.idx,]$class, pred)
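One hedged caveat, not from the recipe as posted: nnet initializes its weights randomly, and step 4 here does not set a seed, so repeated runs give different partitions and different fits. A sketch that fixes the seed before both (1000 is just the value the other recipes in this thread use):

set.seed(1000)  # any fixed seed makes the run repeatable
train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)
mod <- nnet(class ~ ., data = bn[train.idx,], size = 3, maxit = 10000,
            decay = 0.001, rang = 0.05)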

47
Lisrelchen posted on 2016-7-25 22:06:17

Linear Discriminant Function Analysis using R

1. Load the MASS and caret packages:

> library(MASS)
> library(caret)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Partition the data:

> set.seed(1000)
> t.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Build the linear discriminant function model:

> ldamod <- lda(bn[t.idx, 1:4], bn[t.idx, 5])

6. Check how the model performs on the training partition:

> bn[t.idx, "Pred"] <- predict(ldamod, bn[t.idx, 1:4])$class
> table(bn[t.idx, "class"], bn[t.idx, "Pred"], dnn = c("Actual", "Predicted"))

7. Generate predictions on the validation partition and check performance:

> bn[-t.idx, "Pred"] <- predict(ldamod, bn[-t.idx, 1:4])$class
> table(bn[-t.idx, "class"], bn[-t.idx, "Pred"], dnn = c("Actual", "Predicted"))

48
Lisrelchen posted on 2016-7-25 22:09:40

Logistic Regression using R

1. Load the caret package:

> library(caret)

2. Read the data:

> bh <- read.csv("boston-housing-logistic.csv")

3. Partition the data:

> set.seed(1000)
> train.idx <- createDataPartition(bh$CLASS, p = 0.7, list = FALSE)

4. Build the logistic regression model:

> logit <- glm(CLASS ~ ., data = bh[train.idx,], family = binomial)

5. Examine the model:

> summary(logit)

6. Compute the probabilities of "success" for cases in the validation partition and store them in a variable called PROB_SUCC:

> bh[-train.idx, "PROB_SUCC"] <- predict(logit, newdata = bh[-train.idx,], type = "response")

7. Classify the cases using a cutoff probability of 0.5:

> bh[-train.idx, "PRED_50"] <- ifelse(bh[-train.idx, "PROB_SUCC"] >= 0.5, 1, 0)

8. Generate the error/classification-confusion matrix (your results could differ):

> table(bh[-train.idx, "CLASS"], bh[-train.idx, "PRED_50"], dnn = c("Actual", "Predicted"))
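The 0.5 cutoff in step 7 is a modeling choice, not a requirement. A minimal sketch that prints the validation error matrix at a few alternative cutoffs, so you can trade false positives against false negatives:

for (cutoff in c(0.3, 0.5, 0.7)) {
  pred <- ifelse(bh[-train.idx, "PROB_SUCC"] >= cutoff, 1, 0)
  cat("cutoff =", cutoff, "\n")
  print(table(bh[-train.idx, "CLASS"], pred, dnn = c("Actual", "Predicted")))
}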

49
Lisrelchen posted on 2016-7-25 22:13:35

Combining Classification Tree Models (AdaBoost)

1. Load the caret and ada packages:

> library(caret)
> library(ada)

2. Read the data:

> bn <- read.csv("banknote-authentication.csv")

3. Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

4. Create partitions:

> set.seed(1000)
> t.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

5. Create an rpart.control object:

> cont <- rpart.control()

6. Build the model:

> mod <- ada(class ~ ., data = bn[t.idx,], iter = 50, loss = "e", type = "discrete", control = cont)

7. View the model result:

> mod

8. Generate predictions on the validation partition:

> pred <- predict(mod, newdata = bn[-t.idx,], type = "vector")

9. Build the error/classification-confusion matrix on the validation partition:

> table(bn[-t.idx, "class"], pred, dnn = c("Actual", "Predicted"))
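A hedged addition for inspecting the boosted ensemble beyond the error matrix, using the ada package's own diagnostics:

plot(mod)      # training error as a function of the boosting iteration
varplot(mod)   # variable importance for the boosted trees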

50
Lisrelchen posted on 2016-7-25 22:21:50
KNN Regression Models using R

1. Load the dummies, FNN, scales, and caret packages as follows:

> library(dummies)
> library(FNN)
> library(scales)
> library(caret)

2. Read the data:

> educ <- read.csv("education.csv")

3. Generate dummies for the categorical variable region and add them to educ as follows:

> dums <- dummy(educ$region, sep = "_")
> educ <- cbind(educ, dums)

4. Because KNN performs distance computations, we should either rescale or standardize the predictors. In the present example, we have three numeric predictors and a categorical predictor in the form of three dummy variables. Standardizing dummy variables is tricky, and hence we will scale the numeric ones to [0, 1] and leave the dummies alone because they are already in the 0-1 range:

> educ$urban.s <- rescale(educ$urban)
> educ$income.s <- rescale(educ$income)
> educ$under18.s <- rescale(educ$under18)

5. Create three partitions (because we are creating random partitions, your results can differ) as follows:

> set.seed(1000)
> t.idx <- createDataPartition(educ$expense, p = 0.6, list = FALSE)
> trg <- educ[t.idx,]
> rest <- educ[-t.idx,]
> set.seed(2000)
> v.idx <- createDataPartition(rest$expense, p = 0.5, list = FALSE)
> val <- rest[v.idx,]
> test <- rest[-v.idx,]

6. Build the model for several values of k. In the following code, we show how to compute the RMS error from scratch. You can also use the convenience rdacb.rmse function, which was shown in the recipe Computing the root mean squared error earlier in this chapter (a minimal equivalent is sketched after this recipe):

> # for k=1
> res1 <- knn.reg(trg[, 7:12], val[, 7:12], trg[, 6], 1, algorithm = "brute")
> rmse1 <- sqrt(mean((res1$pred - val[, 6])^2))
> rmse1

7. We obtained the lowest RMS error for k=2. Evaluate the model on the test partition as follows:

> res.test <- knn.reg(trg[, 7:12], test[, 7:12], trg[, 6], 2, algorithm = "brute")
> rmse.test <- sqrt(mean((res.test$pred - test[, 6])^2))
> rmse.test
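The convenience function rdacb.rmse comes with the book's companion code and is not reproduced in this thread; a minimal equivalent sketch, plus a loop that scans several k values on the validation partition (the claim in step 7 that k=2 is best comes from a scan like this):

# simple root-mean-squared-error helper, analogous to rdacb.rmse
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))
for (k in 1:5) {
  res <- knn.reg(trg[, 7:12], val[, 7:12], trg[, 6], k, algorithm = "brute")
  cat("k =", k, " RMSE =", rmse(val[, 6], res$pred), "\n")
}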
