OP: ipple7

R Data Analysis Cookbook and R Data Visualization Cookbook

#31 Lisrelchen, posted 2015-09-10 08:42:04
> library(jsonlite)
> g <- fromJSON("groups.json")
> groups <- g$results
> head(groups)

For each group, we will now use the Meetup.com API to download member information into a data frame called users. Among the code files that you downloaded for this chapter is a file called rdacb.getusers.R. Source this file into your R environment now and run the following code. For each group from our list of groups, this code uses the Meetup.com API to get the group's members. It generates a data frame with a set of (group_id, user_id) pairs. In the following command, replace <<apikey>> with your actual API key from the Getting ready section. Be sure to enclose the key in double quotes and also be sure that you do not have any angle brackets in the command. This command can take a while to execute because of the sheer number of web requests and the volume of data involved. If you get an error message, see the How it works... section for this recipe:

> source("rdacb.getusers.R")
> # In the command below, substitute your API key for
> # <<apikey>> and enclose it in double quotes
> members <- rdacb.getusers(groups, <<apikey>>)

This creates a data frame with the variables (group_id, user_id).

The members data frame now has information about the social network, and normally we would use it for all further processing. However, since it is very large, many of the steps in subsequent recipes would take a lot of processing time. For convenience, we reduce the size of the social network by retaining only members who belong to more than 16 groups. This step uses data tables; see Chapter 9, Work Smarter, Not Harder – Efficient and Elegant R Code, for more details. If you would like to work with the complete network, execute the users <- members command and skip to step 8 without executing these two code lines:

> library(data.table)
> users <- setDT(members)[, .SD[.N > 16], by = user_id]

Save the users data frame before further processing:

> save(users, file = "meetup_users.Rdata")
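To see what the filtering step is doing without data.table, here is a minimal base-R sketch on made-up (group_id, user_id) pairs; a threshold of 2 stands in for the recipe's 16, and all values are illustrative only:

```r
# Toy (group_id, user_id) pairs, shaped like the members data frame
members <- data.frame(
  group_id = c(1, 2, 3, 1, 2, 1),
  user_id  = c("a", "a", "a", "b", "b", "c")
)
# Keep only users who appear in more than 2 groups (the recipe uses 16)
counts <- table(members$user_id)
keep   <- names(counts)[counts > 2]
users  <- members[members$user_id %in% keep, ]
users  # only user "a" survives, with her 3 rows
```

The data.table version does the same thing in one pass, which matters at the real network's size.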

#32 Lisrelchen, posted 2015-09-10 08:46:20
How to do it...

> load("users_edgelist_upper.Rdata")
> edgelist.filtered <-
> edgelist.filtered  # Your results could differ
> nrow(edgelist.filtered)
> uids <- unique(c(edgelist.filtered$i, edgelist.filtered$j))
> i <- match(edgelist.filtered$i, uids)
> j <- match(edgelist.filtered$j, uids)
> nw.new <- data.frame(i, j, x = edgelist.filtered$x)

Create the graph object and plot the network:

> library(igraph)
> g <- graph.data.frame(nw.new, directed = FALSE)
> g
IGRAPH UN-- 18 19 --
+ attr: name (v/c), x (e/n)
> # Save the graph for use in later recipes:
> save(g, file = "undirected-graph.Rdata")
> plot.igraph(g, vertex.size = 20)
> plot.igraph(g, layout = layout.circle, vertex.size = 20)
> plot.igraph(g, edge.curved = TRUE, vertex.color = "pink", edge.color = "black")
> V(g)$size <- degree(g) * 4
> plot.igraph(g, edge.curved = TRUE, vertex.color = "pink", edge.color = "black")
> color <- ifelse(degree(g) > 5, "red", "blue")
> size <- degree(g) * 4
> plot.igraph(g, vertex.label = NA, layout = layout.fruchterman.reingold, vertex.color = color, vertex.size = size)
> E(g)$x
> plot.igraph(g, edge.curved = TRUE, edge.color = "black", edge.width = E(g)$x / 5)
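The re-indexing idiom in the middle of the recipe, mapping arbitrary user IDs to consecutive integers with match() so igraph gets a compact vertex set, can be seen on toy data; the IDs and weights below are made up for illustration:

```r
# Toy edge list with arbitrary numeric user IDs
el <- data.frame(i = c(10, 10, 42), j = c(42, 99, 99), x = c(1, 2, 3))
uids <- unique(c(el$i, el$j))   # 10 42 99
i <- match(el$i, uids)          # 1 1 2
j <- match(el$j, uids)          # 2 3 3
nw.new <- data.frame(i, j, x = el$x)
nw.new
```

match() returns the position of each ID in the unique vector, so the same ID always maps to the same small integer.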

#33 ReneeBK, posted 2015-09-13 08:03:27
Generating error/classification-confusion matrices

You might build a classification model and want to evaluate it by comparing the model's predictions with the actual outcomes. You will typically do this on the holdout data. Getting an idea of how the model does on the training data itself is also useful, but you should never use that as an objective measure.

Getting ready

If you have not already downloaded the files for this chapter, do so now and ensure that the college-perf.csv file is in your R working directory. The file has data about a set of college students. The Perf variable has their college performance classified as High, Medium, or Low. The Pred variable contains a classification model's predictions of the performance level. The following code reads the data and converts the factor levels to a meaningful order (by default, R orders factors alphabetically):

> cp <- read.csv("college-perf.csv")
> cp$Perf <- ordered(cp$Perf, levels =
+             c("Low", "Medium", "High"))
> cp$Pred <- ordered(cp$Pred, levels =
+             c("Low", "Medium", "High"))

How to do it...

To generate error/classification-confusion matrices, follow these steps:

First, create and display a two-way table based on the actual and predicted values:

> tab <- table(cp$Perf, cp$Pred,
+             dnn = c("Actual", "Predicted"))
> tab

This yields:

        Predicted
Actual    Low Medium High
  Low    1150     84   98
  Medium  166   1801  170
  High     35     38  458

Display the raw numbers as proportions or percentages. To get overall table-level proportions, use:

> prop.table(tab)

        Predicted
Actual       Low  Medium    High
  Low    0.28750 0.02100 0.02450
  Medium 0.04150 0.45025 0.04250
  High   0.00875 0.00950 0.11450

We often find it more convenient to interpret row-wise or column-wise percentages. To get row-wise percentages rounded to one decimal place, pass 1 as the second argument:

> round(prop.table(tab, 1)*100, 1)

        Predicted
Actual    Low Medium High
  Low    86.3    6.3  7.4
  Medium  7.8   84.3  8.0
  High    6.6    7.2 86.3
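The same idea extends to column-wise percentages by passing 2 as the margin. A self-contained sketch that rebuilds the matrix printed above (so it runs without the CSV file):

```r
# Rebuild the confusion matrix shown above, then column-wise percentages
tab <- matrix(c(1150, 166, 35,    # Predicted Low column
                  84, 1801, 38,   # Predicted Medium column
                  98, 170, 458),  # Predicted High column
              nrow = 3,
              dimnames = list(Actual    = c("Low", "Medium", "High"),
                              Predicted = c("Low", "Medium", "High")))
round(prop.table(tab, 2) * 100, 1)  # each column now sums to 100
```

Row-wise percentages answer "of the actual Lows, how many were predicted Low?"; column-wise percentages answer the converse question about each predicted class.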

#34 ReneeBK, posted 2015-09-13 08:04:53

Building, plotting, and evaluating – classification trees

How to do it...

This recipe shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams:

Load the rpart, rpart.plot, and caret packages:

> library(rpart)
> library(rpart.plot)
> library(caret)

Read the data:

> bn <- read.csv("banknote-authentication.csv")

Create data partitions. We need two partitions: training and validation. Rather than copying the data into the partitions, we will just keep the indices of the cases that represent the training cases and subset as and when needed:

> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

Build the tree:

> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class",
+             control = rpart.control(minsplit = 20, cp = 0.01))

View the text output (your result could differ if you did not set the random seed as in step 3):

> mod
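caret's createDataPartition draws a stratified sample of row indices. As a rough base-R stand-in, here is a plain (non-stratified) 70/30 split on a made-up data frame; the bn columns below are invented stand-ins, not the banknote data:

```r
# Simple 70/30 index split in base R (createDataPartition additionally
# preserves the class proportions; this sketch does not)
set.seed(1000)
bn <- data.frame(class = rep(0:1, each = 50), x = rnorm(100))
train.idx <- sample(nrow(bn), size = round(0.7 * nrow(bn)))
train <- bn[train.idx, ]
valid <- bn[-train.idx, ]
c(nrow(train), nrow(valid))  # 70 30
```

Keeping only the indices, as the recipe does, avoids duplicating the data in memory: bn[train.idx, ] and bn[-train.idx, ] subset on demand.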

#35 ReneeBK, posted 2015-09-13 08:09:42
Classifying using Support Vector Machine

How to do it...

To classify using SVM, follow these steps:

Load the e1071 and caret packages:

> library(e1071)
> library(caret)

Read the data:

> bn <- read.csv("banknote-authentication.csv")

Convert the outcome variable class to a factor:

> bn$class <- factor(bn$class)

Partition the data:

> set.seed(1000)
> t.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)

Build the model:

> mod <- svm(class ~ ., data = bn[t.idx, ])

Check model performance on the training data by generating an error/classification-confusion matrix:

> table(bn[t.idx, "class"], fitted(mod), dnn = c("Actual", "Predicted"))
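The final table() call works on any pair of actual/predicted factors, not just SVM output; a self-contained sketch with made-up labels:

```r
# Confusion matrix via table() on toy actual/predicted vectors
actual    <- factor(c("0", "0", "0", "1", "1"))
predicted <- factor(c("0", "1", "0", "1", "1"))
tab <- table(actual, predicted, dnn = c("Actual", "Predicted"))
tab
sum(diag(tab)) / sum(tab)  # overall accuracy: 0.8
```

The diagonal holds the correct classifications, so accuracy is the diagonal sum over the table total.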


#37 ReneeBK, posted 2015-09-13 08:11:54

Classifying using the Naïve Bayes approach

How to do it...

To classify using the Naïve Bayes method, follow these steps:

Load the e1071 and caret packages:

> library(e1071)
> library(caret)

Read the data:

> ep <- read.csv("electronics-purchase.csv")

Partition the data:

> set.seed(1000)
> train.idx <- createDataPartition(ep$Purchase, p = 0.67, list = FALSE)

Build the model:

> epmod <- naiveBayes(Purchase ~ ., data = ep[train.idx, ])

Look at the model:

> epmod

Predict for each case of the validation partition:

> pred <- predict(epmod, ep[-train.idx, ])

Generate and view the error matrix/classification-confusion matrix for the validation partition:

> tab <- table(ep[-train.idx, ]$Purchase, pred, dnn = c("Actual", "Predicted"))
> tab
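Under the hood, naiveBayes stores class priors and per-class conditional frequencies and multiplies them at prediction time. A hand-rolled base-R sketch on a made-up two-column table (the electronics-purchase data is not reproduced here, so the values are illustrative):

```r
# Naive Bayes by hand on toy data: posterior is proportional to
# prior * likelihood of the observed feature value
purchase <- c("yes", "yes", "no", "no", "no")
income   <- c("high", "high", "high", "low", "low")
prior <- prop.table(table(purchase))             # P(class)
lik   <- prop.table(table(purchase, income), 1)  # P(income | class)
# Unnormalized posterior for a new case with income == "high"
post <- prior * lik[, "high"]
round(post / sum(post), 3)  # no: 0.333, yes: 0.667
```

With several features, the per-feature likelihoods are simply multiplied together, which is the "naïve" conditional-independence assumption.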

#38 ReneeBK, posted 2015-09-13 08:13:18
Classifying using the KNN approach
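The post stops at the heading, so the recipe body is not reproduced here. As a minimal sketch of the technique, here is KNN with the class package (which is bundled with R) on toy 2-D points; all data below are made up:

```r
# Minimal k-nearest-neighbours classification on toy 2-D points
library(class)
train <- rbind(c(0, 0), c(0, 1), c(5, 5), c(6, 5))
cl    <- factor(c("a", "a", "b", "b"))
test  <- rbind(c(0.2, 0.5), c(5.5, 5.2))
knn(train, test, cl, k = 3)  # predicts "a" then "b"
```

Each test point is assigned the majority label among its k nearest training points in Euclidean distance, so the point near the origin gets "a" and the point near (5.5, 5.2) gets "b".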

#39 奇渥温·沙加, posted 2016-03-04 19:30:57

I got ripped off~~~

#40 blowing136, posted 2016-03-22 01:28:12

I got ripped off
