Original poster: ipple7

R Data Analysis Cookbook and R Data Visualization Cookbook

11
ReneeBK (user with no verified transactions), posted 2015-9-6 10:32:25
  1. Creating standard data summaries

  2. In this recipe we summarize the data using the summary function.

  3. Getting ready

  4. If you have not already downloaded the files for this chapter, do it now and ensure that the auto-mpg.csv file is in your R working directory.

  5. How to do it...

  6. Read the data from auto-mpg.csv, which includes a header row and columns separated by the default "," symbol.

  7. Read the data from auto-mpg.csv and convert cylinders to factor:
  8. > auto <- read.csv("auto-mpg.csv", header = TRUE, stringsAsFactors = FALSE)
  9. > # Convert cylinders to factor
  10. > auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
  11. Get the summary statistics:
  12. > summary(auto)

  13.        No             mpg        cylinders   displacement
  14. Min.   :  1.0   Min.   : 9.00   3cyl:  4   Min.   : 68.0
  15. 1st Qu.:100.2   1st Qu.:17.50   4cyl:204   1st Qu.:104.2
  16. Median :199.5   Median :23.00   5cyl:  3   Median :148.5
  17. Mean   :199.5   Mean   :23.51   6cyl: 84   Mean   :193.4
  18. 3rd Qu.:298.8   3rd Qu.:29.00   8cyl:103   3rd Qu.:262.0
  19. Max.   :398.0   Max.   :46.60              Max.   :455.0
  20.    horsepower        weight      acceleration     model_year
  21. Min.   : 46.0   Min.   :1613   Min.   : 8.00   Min.   :70.00
  22. 1st Qu.: 76.0   1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00
  23. Median : 92.0   Median :2804   Median :15.50   Median :76.00
  24. Mean   :104.1   Mean   :2970   Mean   :15.57   Mean   :76.01
  25. 3rd Qu.:125.0   3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00
  26. Max.   :230.0   Max.   :5140   Max.   :24.80   Max.   :82.00
  27.    car_name
  28. Length:398
  29. Class :character
  30. Mode  :character
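For readers who do not have auto-mpg.csv at hand, here is a minimal sketch of the same steps on R's built-in mtcars data set. Using mtcars is an assumption made for illustration; it is not the book's data, and the variable names cars and cyl_counts are made up here.

```r
# Stand-in data: the built-in mtcars data set, not auto-mpg.csv
cars <- mtcars

# Convert cyl to a factor so that summary() reports counts per level
cars$cyl <- factor(cars$cyl, levels = c(4, 6, 8),
                   labels = c("4cyl", "6cyl", "8cyl"))

# Whole-data-frame summary: quartiles for numeric columns, counts for factors
print(summary(cars))

# summary() also works on a single column
cyl_counts <- summary(cars$cyl)
print(cyl_counts)
```

As in the recipe's output, the factor column is summarized as a count per level while the numeric columns get the five-number summary plus the mean.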

12
ReneeBK (user with no verified transactions), posted 2015-9-6 10:34:13
  1. Extracting a subset of a dataset

  2. In this recipe, we discuss two ways to subset data. The first approach uses the row and column indices/names, and the other uses the subset() function.

  3. Getting ready

  4. Download the files for this chapter and store the auto-mpg.csv file in your R working directory. Read the data using the following command:

  5. > auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
  6. The same subsetting principles apply for vectors, lists, arrays, matrices, and data frames. We illustrate with data frames.

  7. How to do it...

  8. The following steps extract a subset of a dataset:

  9. Index by position. Get model_year and car_name for the first three cars:
  10. > auto[1:3, 8:9]
  11. > auto[1:3, c(8,9)]
  12. Index by name. Get model_year and car_name for the first three cars:
  13. > auto[1:3,c("model_year", "car_name")]
  14. Retrieve all details for cars with the highest or lowest mpg, using the following code:
  15. > auto[auto$mpg == max(auto$mpg) | auto$mpg == min(auto$mpg),]
  16. Get mpg and car_name for all cars with mpg > 30 and cylinders == 6:
  17. > auto[auto$mpg>30 & auto$cylinders==6, c("car_name","mpg")]
  18. Get mpg and car_name for all cars with mpg > 30 and cylinders == 6 using partial name match for cylinders:
  19. > auto[auto$mpg >30 & auto$cyl==6, c("car_name","mpg")]
  20. Using the subset() function, get mpg and car_name for all cars with mpg > 30 and cylinders == 6:
  21. > subset(auto, mpg > 30 & cylinders == 6, select=c("car_name","mpg"))
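One difference between the two approaches shows up only when the condition contains missing values. The sketch below uses a small made-up data frame (not auto-mpg.csv) to show that logical indexing returns a row of NAs where the condition is NA, while which() and subset() leave such rows out.

```r
# Small made-up data frame with one missing mpg value
df <- data.frame(
  car_name  = c("a", "b", "c", "d"),
  mpg       = c(32, NA, 35, 20),
  cylinders = c(6, 6, 4, 6),
  stringsAsFactors = FALSE
)

# Logical indexing keeps a row of NAs where the condition is NA
by_index <- df[df$mpg > 30 & df$cylinders == 6, c("car_name", "mpg")]

# which() drops the NA positions, so only true matches remain
by_which <- df[which(df$mpg > 30 & df$cylinders == 6), c("car_name", "mpg")]

# subset() also treats NA in the condition as FALSE
by_subset <- subset(df, mpg > 30 & cylinders == 6, select = c(car_name, mpg))

print(by_index)
print(by_which)
print(by_subset)
```

If the data may contain NAs, subset() or which() is usually the safer choice for interactive work.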

13
ReneeBK (user with no verified transactions), posted 2015-9-6 10:36:05
  1. Splitting a dataset

  2. When we have categorical variables, we often want to create groups corresponding to each level and to analyze each group separately to reveal some significant similarities and differences between groups.

  3. The split() function divides data into groups based on a factor or vector. The unsplit() function reverses the effect of split().

  4. Getting ready

  5. Download the files for this chapter and store the auto-mpg.csv file in your R working directory. Read the file using the read.csv command and save in the auto variable:

  6. > auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
  7. How to do it...

  8. Split the data by cylinders using the following command:

  9. > carslist <- split(auto, auto$cylinders)
  10. How it works...

  11. The split(auto, auto$cylinders) function returns a list of data frames with each data frame corresponding to the cases for a particular level of cylinders. To reference a data frame from the list, use the [[ notation. Here, carslist[1] is a list of length 1 consisting of the first data frame that corresponds to three cylinder cars, and carslist[[1]] is the associated data frame for three cylinder cars.

  12. > str(carslist[1])
  13. List of 1
  14. $ 3:'data.frame': 4 obs. of  9 variables:
  15.    ..$ No          : int [1:4] 2 199 251 365
  16.    ..$ mpg         : num [1:4] 19 18 23.7 21.5
  17.    ..$ cylinders   : int [1:4] 3 3 3 3
  18.    ..$ displacement: num [1:4] 70 70 70 80
  19.    ..$ horsepower  : int [1:4] 97 90 100 110
  20.    ..$ weight      : int [1:4] 2330 2124 2420 2720
  21.    ..$ acceleration: num [1:4] 13.5 13.5 12.5 13.5
  22.    ..$ model_year  : int [1:4] 72 73 80 77
  23.    ..$ car_name    : chr [1:4] "mazda rx2 coupe" "maxda rx3" "mazda rx-7 gs" "mazda rx-4"

  24. > names(carslist[[1]])

  25. [1] "No"           "mpg"          "cylinders"    "displacement"
  26. [5] "horsepower"   "weight"       "acceleration" "model_year"
  27. [9] "car_name"
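The recipe mentions unsplit() but does not show it, nor what to do with the groups once they exist. A minimal sketch on the built-in mtcars data set (a stand-in, not the book's data; all variable names here are made up) might look like this:

```r
# Stand-in data: built-in mtcars, split by number of cylinders
parts <- split(mtcars, mtcars$cyl)

# [ returns a list of length 1; [[ returns the data frame itself
one_element_list <- parts["4"]
four_cyl <- parts[["4"]]

# Apply a function to every group, here the row count and mean mpg
group_sizes <- sapply(parts, nrow)
group_mean_mpg <- sapply(parts, function(d) mean(d$mpg))
print(group_sizes)
print(group_mean_mpg)

# unsplit() reassembles the groups in the original row order
restored <- unsplit(parts, mtcars$cyl)
```

Passing the same grouping vector to unsplit() that was given to split() is what lets R put every row back in its original position.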

14
ReneeBK (user with no verified transactions), posted 2015-9-6 10:38:51
  1. Creating random data partitions

  2. Analysts need an unbiased evaluation of the quality of their machine learning models. To get this, they partition the available data into two parts. They use one part to build the machine learning model and retain the remaining data as "hold out" data. After building the model, they evaluate the model's performance on the hold out data. This recipe shows you how to partition data. It separately addresses the situation when the target variable is numeric and when it is categorical. It also covers the process of creating two partitions or three.

  3. Getting ready

  4. If you have not already done so, make sure that the BostonHousing.csv and boston-housing-classification.csv files from the code files of this chapter are in your R working directory. You should also install and load the caret package and read the data using the following commands:

  5. > install.packages("caret")
  6. > library(caret)
  7. > bh <- read.csv("BostonHousing.csv")
  8. How to do it…

  9. You may want to develop a model using some machine learning technique (like linear regression or KNN) to predict the median home value in Boston neighborhoods using the data in the BostonHousing.csv file. The MEDV variable will serve as the target variable.

  10. Case 1 – numerical target variable and two partitions
  11. To create a training partition with 80 percent of the cases and a validation partition with the rest, use the following code:

  12. > trg.idx <- createDataPartition(bh$MEDV, p = 0.8, list = FALSE)
  13. > trg.part <- bh[trg.idx, ]
  14. > val.part <- bh[-trg.idx, ]
  15. After this, the trg.part and val.part variables contain the training and validation partitions, respectively.

  16. Case 2 – numerical target variable and three partitions
  17. Some machine learning techniques require three partitions because they use two partitions just for building the model. Therefore, a third (test) partition contains the "hold-out" data for model evaluation.

  18. To create a training partition with 70 percent of the cases and divide the rest equally between validation and test partitions, use the following commands:

  19. > trg.idx <- createDataPartition(bh$MEDV, p = 0.7, list = FALSE)
  20. > trg.part <- bh[trg.idx, ]
  21. > temp <- bh[-trg.idx, ]
  22. > val.idx <- createDataPartition(temp$MEDV, p = 0.5, list = FALSE)
  23. > val.part <- temp[val.idx, ]
  24. > test.part <- temp[-val.idx, ]
  25. Case 3 – categorical target variable and two partitions
  26. Instead of a model to predict a numerical value like MEDV, you may need to create partitions for a classification application. The boston-housing-classification.csv file has a MEDV_CAT variable that categorizes the median values into HIGH or LOW and is suitable for a classification algorithm.

  27. For a 70–30 split use the following commands:

  28. > bh2 <- read.csv("boston-housing-classification.csv")
  29. > trg.idx <- createDataPartition(bh2$MEDV_CAT, p=0.7, list = FALSE)
  30. > trg.part <- bh2[trg.idx, ]
  31. > val.part <- bh2[-trg.idx, ]
  32. Case 4 – categorical target variable and three partitions
  33. For a 70–15–15 split (training, validation, test) use the following commands:

  34. > bh3 <- read.csv("boston-housing-classification.csv")
  35. > trg.idx <- createDataPartition(bh3$MEDV_CAT, p=0.7, list = FALSE)
  36. > trg.part <- bh3[trg.idx, ]
  37. > temp <- bh3[-trg.idx, ]
  38. > val.idx <- createDataPartition(temp$MEDV_CAT, p=0.5,list = FALSE)
  39. > val.part <- temp[val.idx, ]
  40. > test.part <- temp[-val.idx, ]
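The partitions above are drawn at random, so they change on every run unless the seed is fixed with set.seed(). To illustrate what a class-balanced split does, here is a rough base R sketch that samples within each class. It is not caret's implementation: stratified_idx is a hypothetical helper written for this note, and the built-in iris data set stands in for the Boston housing files.

```r
# Hypothetical helper: sample a fraction p of row indices within each class.
# This mimics the idea of a stratified split; it is not caret's code.
stratified_idx <- function(y, p) {
  groups <- split(seq_along(y), y)
  picked <- lapply(groups, function(idx) {
    idx[sample.int(length(idx), round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)  # fix the random draws so the partitions are reproducible

# Stand-in data: built-in iris, with Species as the categorical target
trg.idx  <- stratified_idx(iris$Species, p = 0.7)
trg.part <- iris[trg.idx, ]
temp     <- iris[-trg.idx, ]

# Split the remainder in half for validation and test
val.idx   <- stratified_idx(temp$Species, p = 0.5)
val.part  <- temp[val.idx, ]
test.part <- temp[-val.idx, ]

print(table(trg.part$Species))
```

Calling set.seed() before createDataPartition() has the same effect on the recipe's own code: the same rows land in the same partition each time.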

15
ReneeBK (user with no verified transactions), posted 2015-9-6 10:42:59
  1. Generating standard plots such as histograms, boxplots, and scatterplots

  2. Before embarking on any numerical analyses, you may want to get a good idea about the data through a few quick plots. Although the base R system supports powerful graphics, we will generally turn to other plotting options like lattice and ggplot2 for more advanced plots. Therefore, we cover only the simplest forms of basic graphs.

  3. Getting ready

  4. If you have not already done so, download the data files for this chapter and ensure that they are available in your R environment's working directory and run the following commands:

  5. > auto <- read.csv("auto-mpg.csv")
  6. >
  7. > auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
  8. > attach(auto)
  9. How to do it...

  10. In this recipe, we cover histograms, boxplots, scatterplots and scatterplot matrices.

  11. > hist(acceleration)

  12. > boxplot(mpg, xlab = "Miles per gallon")


  13. > plot(mpg ~ horsepower)

  14. > pairs(~mpg+displacement+horsepower+weight)
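The recipe relies on attach(auto) so that column names can be used bare. A sketch of the same four plots without attach() is given below, using the data argument of the formula interface. The built-in mtcars data set is a stand-in here, not the book's file, and the axis labels are made up.

```r
# Stand-in data: built-in mtcars; columns are named explicitly, no attach()
h <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Formula interface with data = ... : one box per cylinder count
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")

# Scatterplot and scatterplot matrix, again via the formula interface
plot(mpg ~ hp, data = mtcars)
pairs(~ mpg + disp + hp + wt, data = mtcars)
```

Avoiding attach() keeps the search path clean and makes it obvious which data frame each column comes from.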

16
ReneeBK (user with no verified transactions), posted 2015-9-6 10:48:34

Generating multiple plots on a grid

  1. We often want to see plots side by side for comparisons. This recipe shows how we can achieve this.

  2. Getting ready

  3. If you have not already done so, download the data files for this chapter and ensure that they are available in your R environment's working directory. Once this is done, run the following commands:

  4. > auto <- read.csv("auto-mpg.csv")
  5. > auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
  6. > attach(auto)
  7. How to do it...

  8. You may want to generate two side-by-side scatterplots from the data in auto-mpg.csv. Run the following commands:

  9. > # first get old graphical parameter settings
  10. > old.par <- par(no.readonly = TRUE)
  11. > # create a grid of one row and two columns
  12. > par(mfrow = c(1,2))
  13. > with(auto, {
  14.    plot(mpg ~ weight, main = "Weight vs. mpg")
  15.    plot(mpg ~ acceleration, main = "Acceleration vs. mpg")
  16.   }
  17. )
  18. > # reset par back to old value so that subsequent
  19. > # graphic operations are unaffected by our settings
  20. > par(old.par)
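A slightly shorter way to save and restore the layout is to keep the value returned by the par() call that changes it. The sketch below assumes the built-in mtcars data set in place of auto-mpg.csv; grid_during and grid_after exist only to show the effect.

```r
# par() returns the previous values of the parameters it changes,
# so the result can be passed straight back to par() to restore them
old.par <- par(mfrow = c(1, 2))

plot(mpg ~ wt, data = mtcars, main = "Weight vs. mpg")
plot(mpg ~ qsec, data = mtcars, main = "Quarter-mile time vs. mpg")

grid_during <- par("mfrow")  # layout while the grid is active
par(old.par)                 # restore the earlier layout
grid_after <- par("mfrow")
```

Only the parameters that were actually changed are saved, so restoring them does not touch any read-only graphical parameters.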

17
ReneeBK (user with no verified transactions), posted 2015-9-6 10:50:38
  1. Selecting a graphics device

  2. R can send its output to several different graphics devices to produce graphics in different formats. By default, R draws on the screen. However, we can also save graphs in file formats such as PostScript, PDF, PNG, JPEG, Windows metafile, and Windows BMP.

  3. Getting ready

  4. If you have not already done so, download the data files for this chapter and ensure that the auto-mpg.csv file is available in your R environment's working directory and run the following commands:

  5. > auto <- read.csv("auto-mpg.csv")
  6. >
  7. > auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8), labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
  8. > attach(auto)
  9. How to do it...

  10. To send the graphic output to the computer screen, you have to do nothing special. For other devices, you first open the device, send your graphical output to it, and then close the device to close the corresponding file.

  11. To create a PostScript file and then a PDF file, use:

  12. > postscript(file = "auto-scatter.ps")
  13. > boxplot(mpg)
  14. > dev.off()

  15. > pdf(file = "auto-scatter.pdf")
  16. > boxplot(mpg)
  17. > dev.off()
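If the plotting code stops with an error before dev.off() runs, the file device stays open and the file is left incomplete. One way to guard against that is sketched below. save_pdf is a hypothetical helper written for this note, not part of the recipe, and the built-in mtcars data set stands in for auto-mpg.csv.

```r
# Hypothetical helper: open a PDF device, draw, and always close the device
save_pdf <- function(path, draw) {
  pdf(file = path)
  on.exit(dev.off())  # runs even if draw() stops with an error
  draw()
  invisible(path)
}

# Write to a temporary file so the working directory is left untouched
out_file <- tempfile(fileext = ".pdf")
save_pdf(out_file, function() boxplot(mtcars$mpg, ylab = "Miles per gallon"))

print(file.exists(out_file))
```

The same pattern should work for postscript() or any other file device, since each of them is closed with dev.off().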

20
Nicolle (user with no verified transactions, verified student), posted 2015-9-8 06:12:39

Classifying using Support Vector Machine

Notice: the author has been banned or deleted, so the content of this post is hidden automatically.
