For each group, we will now use the Meetup.com API to download member information into a data frame called users. Among the code files that you downloaded for this chapter is a file called rdacb.getusers.R. Source this file into your R environment now and run the following code. For each group from our list of groups, this code uses the Meetup.com API to get the group's members. It generates a data frame with a set of group_id, user_id pairs. In the following command, replace <<apikey>> with your actual API key from the Getting ready section. Be sure to enclose the key in double quotes and also be sure that you do not have any angle brackets in the command. This command can take a while to execute because of the sheer number of web requests and the volume of data involved. If you get an error message, see the How it works... section for this recipe:
> source("rdacb.getusers.R")
> # in command below, substitute your api key for
> # <<apikey>> and enclose it in double-quotes
> members <- rdacb.getusers(groups, <<apikey>>)
This creates a data frame with the variables (group_id, user_id).
The members data frame now has information about the social network, and normally we would use it for all further processing. However, since it is very large, many of the steps in subsequent recipes would take a lot of processing time. For convenience, we reduce the size of the social network by retaining only members who belong to more than 16 groups. This step uses the data.table package; see Chapter 9, Work Smarter, Not Harder – Efficient and Elegant R Code for more details. If you would like to work with the complete network, execute the users <- members command and skip to step 8 without executing these two code lines:
> library(data.table)
> users <- setDT(members)[, .SD[.N > 16], by = user_id]
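The .SD[.N > 16] idiom filters out whole groups of rows at once: within each user_id, .N is the number of rows for that user and .SD is that user's subset of rows, so the expression keeps only users who appear in more than 16 rows. A minimal sketch of the same idiom with made-up data (toy is a name we introduce here for illustration):
> # made-up data: user 1 belongs to three groups, user 2 to one
> toy <- data.frame(user_id = c(1, 1, 1, 2),
+                   group_id = c(10, 20, 30, 10))
> # keep only users who appear in more than 2 rows;
> # the result contains just user 1's three rows
> setDT(toy)[, .SD[.N > 2], by = user_id]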
Generating error/classification-confusion matrices
You might build a classification model and want to evaluate it by comparing its predictions with the actual outcomes. You will typically do this on the holdout data. Getting an idea of how the model performs on the training data itself is also useful, but you should never use that as an objective measure of performance.
Getting ready
If you have not already downloaded the files for this chapter, do so now and ensure that the college-perf.csv file is in your R working directory. The file has data about a set of college students. The Perf variable has their college performance classified as High, Medium, or Low. The Pred variable contains a classification model's predictions of the performance level. The following code reads the data and converts the factor levels to a meaningful order—by default R orders factors alphabetically:
> cp <- read.csv("college-perf.csv")
> cp$Perf <- ordered(cp$Perf, levels =
+ c("Low", "Medium", "High"))
> cp$Pred <- ordered(cp$Pred, levels =
+ c("Low", "Medium", "High"))
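To verify that the reordering took effect, you can inspect the factor levels, which R now reports in the order we specified rather than alphabetically:
> levels(cp$Perf)
[1] "Low"    "Medium" "High"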
How to do it...
To generate error/classification-confusion matrices, follow these steps:
First create and display a two-way table based on the actual and predicted values:
> tab <- table(cp$Perf, cp$Pred,
+ dnn = c("Actual", "Predicted"))
> tab
This yields:
         Predicted
Actual     Low Medium High
  Low     1150     84   98
  Medium   166   1801  170
  High      35     38  458
Display the raw numbers as proportions or percentages. To get overall, table-level proportions, use:
> prop.table(tab)
         Predicted
Actual       Low  Medium    High
  Low    0.28750 0.02100 0.02450
  Medium 0.04150 0.45025 0.04250
  High   0.00875 0.00950 0.11450
We often find it more convenient to interpret row-wise or column-wise percentages. To get row-wise percentages rounded to one decimal place, pass 1 as the second argument to prop.table and round the result:
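A sketch of that call: the second argument to prop.table is the margin (1 computes proportions within each row), and multiplying by 100 and rounding turns them into percentages. The percentages below follow from the counts in the table above:
> round(prop.table(tab, 1) * 100, 1)
         Predicted
Actual     Low Medium High
  Low     86.3    6.3  7.4
  Medium   7.8   84.3  8.0
  High     6.6    7.2 86.3
Passing 2 as the margin would give column-wise percentages instead.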
Building, plotting, and evaluating – classification trees
How to do it...
This recipe shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams:
Load the rpart, rpart.plot, and caret packages:
> library(rpart)
> library(rpart.plot)
> library(caret)
Read the data:
> bn <- read.csv("banknote-authentication.csv")
Create data partitions. We need two partitions: training and validation. Rather than copying the data into the partitions, we will just keep the indices of the training cases and subset the data as and when needed:
> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)
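When a subset is needed, the index vector selects it directly. For example (bn.train and bn.valid are illustrative names, not part of the recipe):
> # rows listed in train.idx form the training set;
> # all remaining rows form the validation set
> bn.train <- bn[train.idx, ]
> bn.valid <- bn[-train.idx, ]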
Build the tree:
> mod <- rpart(class ~ ., data = bn[train.idx, ],
+     method = "class",
+     control = rpart.control(minsplit = 20, cp = 0.01))
View the text output (your result could differ if you did not set the random seed as in step 3):
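Printing the fitted model object produces the text representation of the tree, one line per node, showing the split condition, the number of cases, and the predicted class at each node:
> mod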