For each group, we will now use the Meetup.com API to download member information into a data frame called users. Among the code files that you downloaded for this chapter is a file called rdacb.getusers.R. Source this file into your R environment now and run the following code. For each group from our list of groups, this code uses the Meetup.com API to get the group's members. It generates a data frame with a set of group_id, user_id pairs. In the following command, replace <<apikey>> with your actual API key from the Getting ready section. Be sure to enclose the key in double quotes and also be sure that you do not have any angle brackets in the command. This command can take a while to execute because of the sheer number of web requests and the volume of data involved. If you get an error message, see the How it works... section for this recipe:
> source("rdacb.getusers.R")
> # in command below, substitute your api key for
> # <<apikey>> and enclose it in double-quotes
> members <- rdacb.getusers(groups, <<apikey>>)
This creates a data frame with the variables (group_id, user_id).
The members data frame now has information about the social network, and normally we would use it for all further processing. However, since it is very large, many of the steps in subsequent recipes would take a lot of processing time. For convenience, we reduce the size of the social network by retaining only members who belong to more than 16 groups. This step uses the data.table package; see Chapter 9, Work Smarter, Not Harder – Efficient and Elegant R Code for more details. If you would like to work with the complete network, execute the users <- members command and skip to step 8 without executing these two code lines:
> library(data.table)
> users <- setDT(members)[, .SD[.N > 16], by = user_id]
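The .SD[.N > 16] idiom filters out whole groups of rows at once: within each user_id, .N is the number of rows for that user and .SD is that user's subset of rows, so the expression keeps only users who appear in more than 16 rows. A minimal sketch of the same idiom with made-up data (toy is a name we introduce here for illustration):
> # made-up data: user 1 belongs to three groups, user 2 to one
> toy <- data.frame(user_id = c(1, 1, 1, 2),
+                   group_id = c(10, 20, 30, 10))
> # keep only users who appear in more than 2 rows;
> # the result contains just user 1's three rows
> setDT(toy)[, .SD[.N > 2], by = user_id]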
Generating error/classification-confusion matrices
You might build a classification model and want to evaluate it by comparing its predictions with the actual outcomes. You will typically do this on the holdout data. Getting an idea of how the model performs on the training data itself is also useful, but you should never use that as an objective measure of performance.
Getting ready
If you have not already downloaded the files for this chapter, do so now and ensure that the college-perf.csv file is in your R working directory. The file has data about a set of college students. The Perf variable has their college performance classified as High, Medium, or Low. The Pred variable contains a classification model's predictions of the performance level. The following code reads the data and converts the factor levels to a meaningful order—by default R orders factors alphabetically:
> cp <- read.csv("college-perf.csv")
> cp$Perf <- ordered(cp$Perf, levels =
+ c("Low", "Medium", "High"))
> cp$Pred <- ordered(cp$Pred, levels =
+ c("Low", "Medium", "High"))
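To verify that the reordering took effect, you can inspect the factor levels, which R now reports in the order we specified rather than alphabetically:
> levels(cp$Perf)
[1] "Low"    "Medium" "High"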
How to do it...
To generate error/classification-confusion matrices, follow these steps:
First create and display a two-way table based on the actual and predicted values:
> tab <- table(cp$Perf, cp$Pred,
+ dnn = c("Actual", "Predicted"))
> tab
This yields:
         Predicted
Actual     Low Medium High
  Low     1150     84   98
  Medium   166   1801  170
  High      35     38  458
Display the raw numbers as proportions or percentages. To get overall, table-level proportions, use:
> prop.table(tab)
         Predicted
Actual       Low  Medium    High
  Low    0.28750 0.02100 0.02450
  Medium 0.04150 0.45025 0.04250
  High   0.00875 0.00950 0.11450
We often find it more convenient to interpret row-wise or column-wise percentages. To get row-wise percentages rounded to one decimal place, pass 1 as the second argument to prop.table and round the result:
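A sketch of that call: the second argument to prop.table is the margin (1 computes proportions within each row), and multiplying by 100 and rounding turns them into percentages. The percentages below follow from the counts in the table above:
> round(prop.table(tab, 1) * 100, 1)
         Predicted
Actual     Low Medium High
  Low     86.3    6.3  7.4
  Medium   7.8   84.3  8.0
  High     6.6    7.2 86.3
Passing 2 as the margin would give column-wise percentages instead.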
Building, plotting, and evaluating – classification trees
How to do it...
This recipe shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams:
Load the rpart, rpart.plot, and caret packages:
> library(rpart)
> library(rpart.plot)
> library(caret)
Read the data:
> bn <- read.csv("banknote-authentication.csv")
Create data partitions. We need two partitions: training and validation. Rather than copying the data into the partitions, we will just keep the indices of the training cases and subset the data as and when needed:
> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)
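When a subset is needed, the index vector selects it directly. For example (bn.train and bn.valid are illustrative names, not part of the recipe):
> # rows listed in train.idx form the training set;
> # all remaining rows form the validation set
> bn.train <- bn[train.idx, ]
> bn.valid <- bn[-train.idx, ]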
Build the tree:
> mod <- rpart(class ~ ., data = bn[train.idx, ],
+     method = "class",
+     control = rpart.control(minsplit = 20, cp = 0.01))
View the text output (your result could differ if you did not set the random seed as in step 3):
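Printing the fitted model object produces the text representation of the tree, one line per node, showing the split condition, the number of cases, and the predicted class at each node:
> mod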