R Data Analysis Cookbook and R Data Visualization Cookbook - Page 2

#11

ReneeBK (user with no verified transactions), posted on 2015-9-6 10:32:25

Creating standard data summaries
In this recipe we summarize the data using the summary function.
Getting ready
If you have not already downloaded the files for this chapter, do it now and ensure that the auto-mpg.csv file is in your R working directory.
How to do it...
Read the data from auto-mpg.csv, which includes a header row and columns separated by the default "," symbol, and convert cylinders to a factor:
> auto <- read.csv("auto-mpg.csv", header = TRUE, stringsAsFactors = FALSE)
> # Convert cylinders to factor
> auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
Get the summary statistics:
> summary(auto)
       No             mpg        cylinders   displacement
 Min.   :  1.0   Min.   : 9.00   3cyl:  4   Min.   : 68.0
 1st Qu.:100.2   1st Qu.:17.50   4cyl:204   1st Qu.:104.2
 Median :199.5   Median :23.00   5cyl:  3   Median :148.5
 Mean   :199.5   Mean   :23.51   6cyl: 84   Mean   :193.4
 3rd Qu.:298.8   3rd Qu.:29.00   8cyl:103   3rd Qu.:262.0
 Max.   :398.0   Max.   :46.60              Max.   :455.0
   horsepower        weight      acceleration     model_year
 Min.   : 46.0   Min.   :1613   Min.   : 8.00   Min.   :70.00
 1st Qu.: 76.0   1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00
 Median : 92.0   Median :2804   Median :15.50   Median :76.00
 Mean   :104.1   Mean   :2970   Mean   :15.57   Mean   :76.01
 3rd Qu.:125.0   3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00
 Max.   :230.0   Max.   :5140   Max.   :24.80   Max.   :82.00
   car_name
 Length:398
 Class :character
 Mode  :character
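The output shows that summary() adapts to the type of each column. The following is a minimal sketch of the three cases on a small made-up data frame; the name toy and all of its values are hypothetical and are not taken from auto-mpg.csv.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  mpg       = c(18, 24.5, 31, 15),
  cylinders = factor(c(8, 4, 4, 8), levels = c(4, 8), labels = c("4cyl", "8cyl")),
  car_name  = c("car a", "car b", "car c", "car d"),
  stringsAsFactors = FALSE
)

# Numeric column: minimum, quartiles, median, mean, and maximum
num_summary <- summary(toy$mpg)

# Factor column: number of cases per level
fac_summary <- summary(toy$cylinders)

# Character column: only length, class, and mode
chr_summary <- summary(toy$car_name)

print(num_summary)
print(fac_summary)
print(chr_summary)
```

This is why the recipe converts cylinders to a factor before calling summary(): as a plain integer column it would be summarized with quartiles instead of counts.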


#12

ReneeBK (user with no verified transactions), posted on 2015-9-6 10:34:13

Extracting a subset of a dataset
In this recipe, we discuss two ways to subset data. The first approach uses the row and column indices/names, and the other uses the subset() function.
Getting ready
Download the files for this chapter and store the auto-mpg.csv file in your R working directory. Read the data using the following command:
> auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
The same subsetting principles apply for vectors, lists, arrays, matrices, and data frames. We illustrate with data frames.
How to do it...
The following steps extract a subset of a dataset:
Index by position. Get model_year and car_name for the first three cars:
> auto[1:3, 8:9]
> auto[1:3, c(8,9)]
Index by name. Get model_year and car_name for the first three cars:
> auto[1:3,c("model_year", "car_name")]
Retrieve all details for cars with the highest or lowest mpg, using the following code:
> auto[auto$mpg == max(auto$mpg) | auto$mpg == min(auto$mpg),]
Get mpg and car_name for all cars with mpg > 30 and cylinders == 6:
> auto[auto$mpg>30 & auto$cylinders==6, c("car_name","mpg")]
Get mpg and car_name for all cars with mpg > 30 and cylinders == 6 using partial name match for cylinders:
> auto[auto$mpg >30 & auto$cyl==6, c("car_name","mpg")]
Using the subset() function, get mpg and car_name for all cars with mpg > 30 and cylinders == 6:
> subset(auto, mpg > 30 & cylinders == 6, select=c("car_name","mpg"))
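One practical difference between the two approaches is how they treat missing values in the condition. The sketch below uses a made-up data frame (toy and its values are hypothetical) in which one mpg value is NA: logical indexing with [ returns an all-NA row for that case, whereas subset() drops it. Wrapping the condition in which() gives the indexing form the same behavior as subset().

```r
# Hypothetical toy data with one missing mpg value
toy <- data.frame(
  car_name  = c("car a", "car b", "car c", "car d"),
  mpg       = c(32, NA, 28, 36),
  cylinders = c(6, 6, 4, 6),
  stringsAsFactors = FALSE
)

# Logical indexing: the NA in the condition produces an all-NA row
by_index <- toy[toy$mpg > 30 & toy$cylinders == 6, c("car_name", "mpg")]

# subset(): rows where the condition evaluates to NA are dropped
by_subset <- subset(toy, mpg > 30 & cylinders == 6, select = c(car_name, mpg))

# which() converts the logical vector to row positions and skips NAs
by_which <- toy[which(toy$mpg > 30 & toy$cylinders == 6), c("car_name", "mpg")]

print(by_index)
print(by_subset)
print(by_which)
```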


#13

ReneeBK (user with no verified transactions), posted on 2015-9-6 10:36:05

Splitting a dataset
When we have categorical variables, we often want to create groups corresponding to each level and to analyze each group separately to reveal some significant similarities and differences between groups.
The split() function divides data into groups based on a factor or vector. The unsplit() function reverses the effect of split().
Getting ready
Download the files for this chapter and store the auto-mpg.csv file in your R working directory. Read the file using the read.csv command and save in the auto variable:
> auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
How to do it...
Split the data by cylinders using the following command:
> carslist <- split(auto, auto$cylinders)
How it works...
The split(auto, auto$cylinders) call returns a list of data frames, one for each distinct value of cylinders. To reference a data frame from the list, use the [[ notation. Here, carslist[1] is a list of length 1 containing the first data frame, which corresponds to three-cylinder cars, and carslist[[1]] is that data frame itself.
> str(carslist[1])
List of 1
 $ 3:'data.frame': 4 obs. of 9 variables:
  ..$ No          : int [1:4] 2 199 251 365
  ..$ mpg         : num [1:4] 19 18 23.7 21.5
  ..$ cylinders   : int [1:4] 3 3 3 3
  ..$ displacement: num [1:4] 70 70 70 80
  ..$ horsepower  : int [1:4] 97 90 100 110
  ..$ weight      : int [1:4] 2330 2124 2420 2720
  ..$ acceleration: num [1:4] 13.5 13.5 12.5 13.5
  ..$ model_year  : int [1:4] 72 73 80 77
  ..$ car_name    : chr [1:4] "mazda rx2 coupe" "maxda rx3" "mazda rx-7 gs" "mazda rx-4"
> names(carslist[[1]])
[1] "No"           "mpg"          "cylinders"    "displacement"
[5] "horsepower"   "weight"       "acceleration" "model_year"
[9] "car_name"
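Once the data is split, each group can be analyzed separately and, if needed, put back together. The following sketch shows both steps on a made-up data frame (toy and its values are hypothetical); it also contrasts [ and [[ on the resulting list.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  mpg       = c(19, 30, 25, 14, 16),
  cylinders = c(3, 4, 4, 8, 8)
)

# One data frame per distinct value of cylinders
groups <- split(toy, toy$cylinders)

# [ returns a list of length 1; [[ returns the data frame itself
class(groups[1])      # "list"
class(groups[["4"]])  # "data.frame"

# Analyze each group separately: mean mpg per cylinder count
mean_mpg <- sapply(groups, function(g) mean(g$mpg))
print(mean_mpg)

# unsplit() reassembles the rows in their original order
restored <- unsplit(groups, toy$cylinders)
print(restored)
```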


#14

ReneeBK (user with no verified transactions), posted on 2015-9-6 10:38:51

Creating random data partitions
Analysts need an unbiased evaluation of the quality of their machine learning models. To get this, they partition the available data into two parts. They use one part to build the machine learning model and retain the remaining data as "hold out" data. After building the model, they evaluate the model's performance on the hold out data. This recipe shows you how to partition data. It separately addresses the situation when the target variable is numeric and when it is categorical. It also covers the process of creating two partitions or three.
Getting ready
If you have not already done so, make sure that the BostonHousing.csv and boston-housing-classification.csv files from the code files of this chapter are in your R working directory. Then install and load the caret package and read the data using the following commands:
> install.packages("caret")
> library(caret)
> bh <- read.csv("BostonHousing.csv")
How to do it…
You may want to develop a model using some machine learning technique (like linear regression or KNN) to predict the median home value in Boston neighborhoods using the data in the BostonHousing.csv file. The MEDV variable will serve as the target variable.
Case 1 – numerical target variable and two partitions
To create a training partition with 80 percent of the cases and a validation partition with the rest, use the following code:
> trg.idx <- createDataPartition(bh$MEDV, p = 0.8, list = FALSE)
> trg.part <- bh[trg.idx, ]
> val.part <- bh[-trg.idx, ]
After this, the trg.part and val.part variables contain the training and validation partitions, respectively.
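Because the partition is drawn at random, every run selects different rows unless the random seed is fixed first. The sketch below illustrates this with base R's sample() on made-up data, so that it runs without caret; toy, its columns, and the seed values are assumptions for illustration. Unlike createDataPartition(), plain sample() does not take the distribution of the target variable into account.

```r
# Hypothetical stand-in for the housing data: 100 rows with a numeric target
set.seed(42)  # arbitrary seed, used only to make the toy data repeatable
toy <- data.frame(x = rnorm(100), MEDV = runif(100, min = 5, max = 50))

# Draw 80 percent of the row indices at random, without replacement
set.seed(1)
trg.idx <- sample(nrow(toy), size = round(0.8 * nrow(toy)))

trg.part <- toy[trg.idx, ]   # training partition
val.part <- toy[-trg.idx, ]  # validation partition (all remaining rows)

# Resetting the same seed selects exactly the same rows again
set.seed(1)
trg.idx.again <- sample(nrow(toy), size = round(0.8 * nrow(toy)))
identical(trg.idx, trg.idx.again)
```

The same idea applies to the caret version: calling set.seed() immediately before createDataPartition() makes the split repeatable.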
Case 2 – numerical target variable and three partitions
Some machine learning techniques require three partitions because they use two partitions just for building the model. Therefore, a third (test) partition contains the "hold-out" data for model evaluation.
Suppose we want a training partition with 70 percent of the cases and the rest divided equally between validation and test partitions. Use the following commands:
> trg.idx <- createDataPartition(bh$MEDV, p = 0.7, list = FALSE)
> trg.part <- bh[trg.idx, ]
> temp <- bh[-trg.idx, ]
> val.idx <- createDataPartition(temp$MEDV, p = 0.5, list = FALSE)
> val.part <- temp[val.idx, ]
> test.part <- temp[-val.idx, ]
Case 3 – categorical target variable and two partitions
Instead of a model to predict a numerical value like MEDV, you may need to create partitions for a classification application. The boston-housing-classification.csv file has a MEDV_CAT variable that categorizes the median values into HIGH or LOW and is suitable for a classification algorithm.
For a 70–30 split use the following commands:
> bh2 <- read.csv("boston-housing-classification.csv")
> trg.idx <- createDataPartition(bh2$MEDV_CAT, p=0.7, list = FALSE)
> trg.part <- bh2[trg.idx, ]
> val.part <- bh2[-trg.idx, ]
Case 4 – categorical target variable and three partitions
For a 70–15–15 split (training, validation, test) use the following commands:
> bh3 <- read.csv("boston-housing-classification.csv")
> trg.idx <- createDataPartition(bh3$MEDV_CAT, p=0.7, list = FALSE)
> trg.part <- bh3[trg.idx, ]
> temp <- bh3[-trg.idx, ]
> val.idx <- createDataPartition(temp$MEDV_CAT, p=0.5,list = FALSE)
> val.part <- temp[val.idx, ]
> test.part <- temp[-val.idx, ]
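For a categorical target, createDataPartition() samples within each class, so the partitions keep roughly the class proportions of the full data. The following is a hand-rolled base R sketch of that idea, not caret's implementation; stratified_idx() is a hypothetical helper, and toy with its 60/40 class mix is made-up data.

```r
# Hypothetical stand-in for the classification data: 60 LOW and 40 HIGH cases
toy <- data.frame(
  id       = 1:100,
  MEDV_CAT = rep(c("LOW", "HIGH"), times = c(60, 40)),
  stringsAsFactors = FALSE
)

set.seed(7)  # arbitrary seed for repeatability

# Sample a fraction p of the row positions separately within each class
stratified_idx <- function(classes, p) {
  rows_by_class <- split(seq_along(classes), classes)
  picked <- lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# 70 percent of each class goes to training
trg.idx  <- stratified_idx(toy$MEDV_CAT, 0.7)
trg.part <- toy[trg.idx, ]
temp     <- toy[-trg.idx, ]

# The remaining 30 percent is split evenly into validation and test
val.idx   <- stratified_idx(temp$MEDV_CAT, 0.5)
val.part  <- temp[val.idx, ]
test.part <- temp[-val.idx, ]

# Class counts in each partition
table(trg.part$MEDV_CAT)
table(val.part$MEDV_CAT)
table(test.part$MEDV_CAT)
```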


#15

ReneeBK (user with no verified transactions), posted on 2015-9-6 10:42:59

Generating standard plots such as histograms, boxplots, and scatterplots
Before even embarking on any numerical analyses, you may want to get a good idea about the data through a few quick plots. Although the base R system supports powerful graphics, we will generally turn to other plotting options like lattice and ggplot for more advanced plots. Therefore, we cover only the simplest forms of basic graphs.
Getting ready
If you have not already done so, download the data files for this chapter and ensure that they are available in your R environment's working directory and run the following commands:
> auto <- read.csv("auto-mpg.csv")
> auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
> attach(auto)
How to do it...
In this recipe, we cover histograms, boxplots, scatterplots and scatterplot matrices.
> hist(acceleration)
> boxplot(mpg, xlab = "Miles per gallon")
> plot(mpg ~ horsepower)
> pairs(~mpg+displacement+horsepower+weight)
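The same four functions accept titles, axis labels, and a data argument, which avoids the need for attach(). The sketch below uses R's built-in mtcars dataset as a stand-in so that it runs without auto-mpg.csv; the choice of mtcars and of its columns is an assumption made for illustration.

```r
# mtcars ships with R; it stands in here for the auto data
data(mtcars)

# Histogram with an explicit title and axis label;
# hist() invisibly returns the bin breaks and counts
h <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# One boxplot per cylinder count, using the formula interface
b <- boxplot(mpg ~ cyl, data = mtcars,
             xlab = "Cylinders", ylab = "Miles per gallon")

# Scatterplot through the formula interface, without attach()
plot(mpg ~ hp, data = mtcars, xlab = "Horsepower", ylab = "Miles per gallon")

# Scatterplot matrix of selected variables
pairs(~ mpg + disp + hp + wt, data = mtcars)
```

When run non-interactively (for example with Rscript), the plots go to the default file device rather than to the screen.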


#16

ReneeBK (user with no verified transactions), posted on 2015-9-6 10:48:34

Generating multiple plots on a grid

We often want to see plots side by side for comparisons. This recipe shows how we can achieve this.
Getting ready
If you have not already done so, download the data files for this chapter and ensure that they are available in your R environment's working directory. Once this is done, run the following commands:
> auto <- read.csv("auto-mpg.csv")
> auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
> attach(auto)
How to do it...
You may want to generate two side-by-side scatterplots from the data in auto-mpg.csv. Run the following commands:
> # first get old graphical parameter settings
> # (no.readonly = TRUE skips read-only parameters, which cannot be restored)
> old.par <- par(no.readonly = TRUE)
> # create a grid of one row and two columns
> par(mfrow = c(1,2))
> with(auto, {
+   plot(mpg ~ weight, main = "Weight vs. mpg")
+   plot(mpg ~ acceleration, main = "Acceleration vs. mpg")
+ })
> # reset par back to old value so that subsequent
> # graphic operations are unaffected by our settings
> par(old.par)
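When the grid is set up inside a function, the save-and-restore step can be made automatic. The sketch below relies on the fact that par() returns the previous values of the parameters it sets; side_by_side() is a hypothetical helper name, and the built-in mtcars dataset stands in for the auto data.

```r
# Draw two scatterplots side by side and restore the previous layout afterwards.
# side_by_side() is a hypothetical helper; mtcars is R's built-in dataset.
side_by_side <- function(data) {
  old.par <- par(mfrow = c(1, 2))  # par() returns the old values of what it sets
  on.exit(par(old.par))            # restore the layout even if a plot call fails
  plot(mpg ~ wt, data = data, main = "Weight vs. mpg")
  plot(mpg ~ qsec, data = data, main = "Quarter-mile time vs. mpg")
  invisible(NULL)
}

side_by_side(mtcars)

# The layout is back to what it was before the call
par("mfrow")
```

Using on.exit() keeps the restore step next to the change it undoes, so later plots are unaffected even when an error interrupts the function.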


#17

ReneeBK (user with no verified transactions), posted on 2015-9-6 10:50:38

Selecting a graphics device
R can send its output to several different graphic devices to display graphics in different formats. By default, R prints to the screen. However, we can save graphs in the following file formats as well: PostScript, PDF, PNG, JPEG, Windows metafile, Windows BMP, and so on.
Getting ready
If you have not already done so, download the data files for this chapter and ensure that the auto-mpg.csv file is available in your R environment's working directory and run the following commands:
> auto <- read.csv("auto-mpg.csv")
> auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
> attach(auto)
How to do it...
To send the graphic output to the computer screen, you have to do nothing special. For other devices, you first open the device, send your graphical output to it, and then close the device to close the corresponding file.
To create a PostScript file use:
> postscript(file = "auto-scatter.ps")
> boxplot(mpg)
> dev.off()
To create a PDF file use:
> pdf(file = "auto-scatter.pdf")
> boxplot(mpg)
> dev.off()
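The open, plot, close sequence can be checked end to end by confirming that the file exists after dev.off(). The following sketch writes to a temporary file and uses the built-in mtcars dataset as a stand-in; the variable name out.file and the 6 by 4 inch page size are assumptions chosen for illustration.

```r
# Write a plot to a PDF in a temporary location
# (tempfile() avoids cluttering the working directory)
out.file <- tempfile(fileext = ".pdf")

# Open the device; width and height are given in inches
pdf(file = out.file, width = 6, height = 4)

# Everything plotted now goes to the file instead of the screen
boxplot(mtcars$mpg, ylab = "Miles per gallon")

# Closing the device finalizes the file
dev.off()

# Confirm that the file was written
file.exists(out.file)
```

Forgetting dev.off() is a common cause of empty or unreadable output files, because the file is only completed when the device is closed.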


