Thread starter: NewOccidental

[Reader's Library] Mastering Scientific Computing with R

101
igs816 posted on 2015-5-9 16:12:58
Quoting Nicolle (2015-5-9 08:37):
I have already uploaded it.


102
Lisrelchen posted on 2015-5-18 00:51:59


104
ReneeBK posted on 2017-2-27 07:51:43
Analysis of variance

Analysis of variance (ANOVA) is used to fit data to a linear model when all explanatory variables are categorical. Each categorical explanatory variable is known as a factor, and each factor can have two or more levels. When a single factor has three or more levels, we use a one-way ANOVA to analyze the data. If we had a single factor with only two levels, we would use a Student's t-test instead. When there are two or more factors, we use a two-way or three-way ANOVA. You can easily perform an ANOVA in R using the aov() function.

In the first example, we will look at the effect of the dose of drugA on the level of fatigue reported by the 20 patients as follows:

> patient.fatigue <- read.table(header=TRUE, text='
   patients fatigue drugA_dose
1         1     low        0.2
2         2     low        0.2
3         3     med        0.2
4         4     med        0.2
5         5     med        0.2
6         6     low        0.4
7         7     low        0.4
8         8     low        0.4
9         9     med        0.4
10       10     med        0.4
11       11     med        0.8
12       12    high        0.8
13       13     med        0.8
14       14     med        0.8
15       15    high        0.8
16       16    high        1.2
17       17    high        1.2
18       18    high        1.2
19       19    high        1.2
20       20     med        1.2 ')

> attach(patient.fatigue)
> aov(drugA_dose ~ fatigue)
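As a minimal follow-up sketch (not in the excerpt above; fatigue.aov is a name introduced here), you would normally store the fit and print the ANOVA table with summary(), and TukeyHSD() then gives pairwise comparisons between the fatigue levels:

> # Store the fit so it can be inspected (assumed follow-up, not from the book)
> fatigue.aov <- aov(drugA_dose ~ fatigue)
> summary(fatigue.aov)      # F test for differences in mean dose across fatigue levels
> TukeyHSD(fatigue.aov)     # pairwise differences between the low/med/high groups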

105
ReneeBK posted on 2017-2-27 07:53:50
To use an alternative link function when fitting our data to a GLM, we can change the link argument as follows (here y ~ x stands in for your model formula):

glm(y ~ x, family = binomial(link = "probit"))

Now, let's go over a detailed example of how to fit your data to a GLM in R. In the first example, we will look at the effect of the dose of a compound on the death of 20 male and female mice. Let's take a look at this in the following lines of code:

> cmp1.ld <- read.table(header=TRUE, text='
   lethaldose sex numdead numalive
1           0   M       1       19
2           1   M       3       17
3           2   M       9       11
4           3   M      14        6
5           4   M      17        3
6           5   M      20        0
7           0   F       0       20
8           1   F       2       18
9           2   F       2       18
10          3   F       3       17
11          4   F       4       16
12          5   F       6       14
')
> attach(cmp1.ld)

We can plot the data to take a look at the relationship between the dose of the compound and the proportion of deaths by sex. Let's take a look at this in the following lines of code:

> proportion_dead <- numdead/20
> plot(proportion_dead ~ lethaldose, pch=as.character(sex))
The result is shown in the following plot (figure not reproduced in this post).

Generalized linear models

Now we will fit our data to a GLM. First, we combine the numbers of dead and alive mice into a counts matrix as follows:

> counts <- cbind(numdead, numalive)
> cmp1.ld.model <- glm(counts ~ sex * lethaldose, family=binomial)

We can summarize the results for the GLM with the summary() function as follows:

> summary(cmp1.ld.model)
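As a hedged follow-up not shown in the excerpt, one way to visualize the fitted model is to predict the death proportion over a grid of doses for each sex and overlay the curves on the earlier scatter plot (dose.grid, fit.M, and fit.F are names introduced here for illustration):

> dose.grid <- seq(0, 5, by = 0.1)
> fit.M <- predict(cmp1.ld.model, data.frame(lethaldose = dose.grid, sex = "M"), type = "response")
> fit.F <- predict(cmp1.ld.model, data.frame(lethaldose = dose.grid, sex = "F"), type = "response")
> plot(proportion_dead ~ lethaldose, pch = as.character(sex))  # observed proportions
> lines(dose.grid, fit.M, lty = 1)  # fitted dose-response curve, males
> lines(dose.grid, fit.F, lty = 2)  # fitted dose-response curve, females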

106
ReneeBK posted on 2017-2-27 07:55:04
Generalized additive models

Generalized additive models (GAMs) are a non-parametric extension of GLMs in which the linear predictor depends linearly on unknown smooth functions of some of the predictor variables. GAMs are typically used to let the data "speak for themselves", since you don't need to specify the functional form of the relationship between the response and the continuous explanatory variables beforehand. To fit your data to a GAM, you will need the gam() function from the mgcv package. It is similar to the glm() function, except that you wrap each explanatory variable you wish to smooth in s(). For example, if you wish to describe the relationship between y and smooths of three continuous explanatory variables x, z, and w, you would enter model <- gam(y ~ s(x) + s(z) + s(w)). Now let's go through a detailed example in R.

In this example, we will explore the relationship between the total number of pregnancies and a variety of measurements taken from 300 mice.

Let's first simulate a dataset for 300 mice with pregnancies, glucose, pressure, insulin, and weight as variables. Since the sample() and rnorm() functions generate pseudorandom numbers, you will get different results when you run the following code in R. Let's simulate our variables with the following lines of code:

> pregnancies <- sample(0:25, 300, replace=T)
> glucose <- sample(65:200, 300, replace=T)
> pressure <- sample(50:120, 300, replace=T)
> insulinD <- abs(rnorm(150, 450, 100))
> insulinN <- abs(rnorm(150, 65, 75))
> insulin <- c(insulinD, insulinN)
> weight <- sample(20:70, 300, replace=T)

Now let's use the gam() function from the mgcv package to explore the relationship between pregnancies and glucose, pressure, insulin, and weight as follows:

> library("mgcv")
> mouse.data.gam <- gam(pregnancies ~ s(glucose) + s(pressure) + s(insulin) + s(weight))

> summary(mouse.data.gam)
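A small addition not in the excerpt: mgcv's plot method draws the estimated smooth for each term, which is usually the first thing to inspect after summary():

> plot(mouse.data.gam, pages = 1, shade = TRUE)  # all four smooths on one page with confidence bands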

107
ReneeBK posted on 2017-2-27 07:56:30
Linear discriminant analysis

Linear discriminant analysis (LDA) is used to find the linear combinations of explanatory variables that give the best possible separation between the groups in our dataset. More specifically, LDA is used as a linear supervised classifier to form a classification model from the data provided. For example, we can use LDA to classify fish based on their length, weight, and speed underwater. Let's simulate a dataset based on the Ontario warm water fish species Bluegill, Bowfin, Carp, Goldeye, and Largemouth Bass.

> set.seed(459)
> Bluegill.length <- sample(seq(15, 22.5, by=0.5), 50, replace=T)
> Bluegill.weight <- sample(seq(0.2, 0.8, by=0.05), 50, replace=T)
> Bowfin.length <- sample(seq(46, 61, by=0.5), 50, replace=T)
> Bowfin.weight <- sample(seq(1.36, 3.2, by=0.5), 50, replace=T)
> Carp.length <- sample(seq(30, 75, by=1), 50, replace=T)
> Carp.weight <- sample(seq(0.2, 3.5, by=0.1), 50, replace=T)
> Goldeye.length <- sample(seq(25, 38, by=0.5), 50, replace=T)
> Goldeye.weight <- sample(seq(0.4, 0.54, by=0.01), 50, replace=T)
> Largemouth_Bass.length <- sample(seq(22, 55, by=0.5), 50, replace=T)
> Largemouth_Bass.weight <- sample(seq(0.68, 1.8, by=0.01), 50, replace=T)

> weight <- c(Bluegill.weight, Bowfin.weight, Carp.weight, Goldeye.weight, Largemouth_Bass.weight)

> length <- c(Bluegill.length, Bowfin.length, Carp.length, Goldeye.length, Largemouth_Bass.length)

> speed <- rnorm(50*5, 7.2, sd=1.8)

> fish <- c(rep("Bluegill", 50), rep("Bowfin", 50), rep("Carp", 50), rep("Goldeye", 50), rep("Largemouth_Bass", 50))
> fish.data <- data.frame(length, weight, speed, fish)

> str(fish.data)
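The excerpt stops at simulating the data; a minimal sketch of the actual discriminant analysis (fish.lda and fish.lda.pred are names introduced here, not from the book) would use lda() from the MASS package:

> library("MASS")
> fish.lda <- lda(fish ~ length + weight + speed, data = fish.data)
> fish.lda.pred <- predict(fish.lda)
> # Confusion table: true species versus predicted class
> table(fish.data$fish, fish.lda.pred$class)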

108
ReneeBK posted on 2017-2-27 07:59:41
Principal component analysis

Principal component analysis (PCA) is another exploratory method you can use to separate your samples into groups. PCA converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is widely used as a dimension-reduction technique to help visualize data. PCA is different from LDA in that it doesn't rely on class information to explore the relationship between the variable values and the sample groups. For example, let's perform a PCA to explore our simulated fish.data dataset. Before performing PCA, it is important to remember that the magnitude of the variables and any skews in the data will influence the resulting principal components, so we need to scale and transform our data.

First, we recommend you log-transform the data (if necessary). Then, run PCA using the prcomp() function as follows:

> fish.data.mx <- as.matrix(fish.data[, 1:3])
> fish.data.log <- log(fish.data.mx)
> fish.log.pca <- prcomp(fish.data.log, scale=T, center=T)
> summary(fish.log.pca)

Instead of the summary, you might want to see the standard deviation of each principal component and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables. To get this information, we can use the print() function as follows:

> print(fish.log.pca)

> plot(fish.log.pca, ylim=c(0, 2)) # plot not shown
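An optional extra step, added here as a sketch rather than from the excerpt: biplot() overlays the sample scores and the variable loadings in a single figure.

> biplot(fish.log.pca, cex = 0.6)  # scores and loadings on the first two components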

> library("ggplot2")
> loadings <- as.data.frame(fish.log.pca$rotation)

> # Add a column with the name of each variable to the loadings data frame.
> loadings$variables <- colnames(fish.data[, 1:3])

> # Plot figure with qplot(); geom = "col" replaces the older geom = "bar", stat = "identity" from earlier ggplot2 versions
> q <- qplot(x = variables, y = PC1, data = loadings, geom = "col")

> # Adjust axis label sizes
> q + theme(axis.title = element_text(face="bold", size=20), axis.text = element_text(size=18))
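As a hedged addition (fish.species is a helper factor introduced here, not from the book), a simple score plot colored by species shows how well the first two components separate the groups:

> fish.species <- as.factor(fish.data$fish)
> plot(fish.log.pca$x[, 1:2], col = fish.species, pch = 19)  # PC1 versus PC2 scores
> legend("topright", legend = levels(fish.species), col = seq_along(levels(fish.species)), pch = 19, cex = 0.7)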

109
ReneeBK posted on 2017-2-27 08:02:24
Clustering

An alternative approach to PCA is k-means (unsupervised) clustering, which partitions the data into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. We can perform k-means clustering with the kmeans() function and plot the results with the plot3d() function from the rgl package as follows:

> set.seed(44)
> cl <- kmeans(fish.data[, 1:3], 5)
> fish.data$cluster <- as.factor(cl$cluster)
> library("rgl")   # provides plot3d()
> plot3d(fish.log.pca$x[, 1:3], col=fish.data$cluster, main="k-means clusters")

> with(fish.data, table(cluster, fish))

We can also apply hierarchical clustering with hclust() and cut the resulting tree into five groups:

> di <- dist(fish.data[, 1:3], method="euclidean")
> tree <- hclust(di, method="ward")

> fish.data$hcluster <- as.factor((cutree(tree, k=5)-2) %% 3 +1)
> plot(tree, xlab="", cex=0.2)
> rect.hclust(tree, k=5, border="red")
> with(fish.data, table(hcluster, fish))
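One common check, added here as a sketch rather than from the excerpt (wss is a name introduced for illustration), is to plot the total within-cluster sum of squares against k to see whether five clusters is a reasonable choice:

> wss <- sapply(1:10, function(k) kmeans(fish.data[, 1:3], centers = k, nstart = 25)$tot.withinss)
> plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")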

110
ReneeBK posted on 2017-2-27 08:05:20
Structural Equation Modeling

We went over the matrix representation of SEM models in theory. Now, we will show how this can be practically done in R using the RAM matrix specifications mentioned earlier. We will use the political democracy dataset available in the lavaan package for this example. Notably, this dataset is ordered such that the first eight columns are the y variables and the last three are the x variables. Let's take a look at the following example:

library(lavaan)
data(PoliticalDemocracy)

We will create the covariance matrix of this dataset as follows:

pd.cov <- cov(PoliticalDemocracy)

Now, we will create each of our matrices A, S, F, and I. In a real SEM example, we would iteratively create new A and S matrices in an attempt to find the best values to match the implied covariance matrix to the observed one. Think of these as starting values for the matrices. In this case, we will choose starting values that closely match a final solution, so that we do not have to iterate dozens of times to find a good solution. Remember that we have 11 observed and three unobserved variables, for a total of 14 variables.

First, let's take a look at the A matrix of paths, which is a 14 x 14 matrix:

mat.A <- matrix(
  c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,1.5,0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0.5,0.5,0
  ), nrow = 14, byrow = TRUE
)
Then, let's take a look at the S matrix of residual variances and covariances, which is also a 14 x 14 matrix:

mat.S <- matrix(
  c(2, 0, 0, 0,.5, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 7, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 1, 0, 3, 0, 0, 0,.5, 0, 0, 0, 0, 0, 0,
   .5, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 2, 0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0,.5, 0, 1, 0, 3, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,.1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0,.1, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.5, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.5, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.2
  ), nrow = 14, byrow = TRUE
)
Next, let's take a look at the filter matrix, which selects the observed variables and is an 11 x 14 matrix:

mat.F <- matrix(
  c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
  ), nrow = 11, byrow = TRUE
)
Finally, the 14 x 14 identity matrix:

mat.I <- diag(rep(1, 14), nrow = 14)

We can write a function that performs all of the matrix operations of the McArdle-McDonald equation (the function arguments carry a .0 suffix to distinguish them from the matrices defined above), as follows:

RAM.implied.covariance <- function(A.0, S.0, F.0, I.0) {
  implied.covariance <- F.0 %*% solve(I.0 - A.0) %*% S.0 %*% t(solve(I.0 - A.0)) %*% t(F.0)
  return(implied.covariance)
}
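For reference, the expression this function implements is the McArdle-McDonald (RAM) equation, Sigma_implied = F (I - A)^(-1) S ((I - A)^(-1))' F', which is exactly the product F.0 %*% solve(I.0 - A.0) %*% S.0 %*% t(solve(I.0 - A.0)) %*% t(F.0) computed above.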
And, finally, we can estimate an implied covariance matrix based on our starting A and S matrices, which is shown in the following matrix rounded to 2 decimal places:

> round(RAM.implied.covariance(mat.A, mat.S, mat.F, mat.I), 2)
      [,1]  [,2]  [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,] 7.12  5.12  5.12 5.12 3.44 2.94 2.94 2.94 0.75  1.50  1.50
 [2,] 5.12 12.12  5.12 6.12 2.94 4.94 2.94 2.94 0.75  1.50  1.50
 [3,] 5.12  5.12 10.12 5.12 2.94 2.94 3.94 2.94 0.75  1.50  1.50
 [4,] 5.12  6.12  5.12 8.12 2.94 2.94 2.94 3.44 0.75  1.50  1.50
 [5,] 3.44  2.94  2.94 2.94 3.98 1.98 1.98 1.98 0.63  1.25  1.25
 [6,] 2.94  4.94  2.94 2.94 1.98 6.98 1.98 2.98 0.63  1.25  1.25
 [7,] 2.94  2.94  3.94 2.94 1.98 1.98 4.98 1.98 0.63  1.25  1.25
 [8,] 2.94  2.94  2.94 3.44 1.98 2.98 1.98 4.98 0.63  1.25  1.25
 [9,] 0.75  0.75  0.75 0.75 0.63 0.63 0.63 0.63 0.60  1.00  1.00
[10,] 1.50  1.50  1.50 1.50 1.25 1.25 1.25 1.25 1.00  2.10  2.00
[11,] 1.50  1.50  1.50 1.50 1.25 1.25 1.25 1.25 1.00  2.00  2.50
Compare the preceding matrix to the observed covariance matrix as follows:

> round(pd.cov, 2)
     y1    y2    y3    y4   y5    y6    y7    y8   x1   x2   x3
y1 6.88  6.25  5.84  6.09 5.06  5.75  5.81  5.67 0.73 1.27 0.91
y2 6.25 15.58  5.84  9.51 5.60  9.39  7.54  7.76 0.62 1.49 1.17
y3 5.84  5.84 10.76  6.69 4.94  4.73  7.01  5.64 0.79 1.55 1.04
y4 6.09  9.51  6.69 11.22 5.70  7.44  7.49  8.01 1.15 2.24 1.84
y5 5.06  5.60  4.94  5.70 6.83  4.98  5.82  5.34 1.08 2.06 1.58
y6 5.75  9.39  4.73  7.44 4.98 11.38  6.75  8.25 0.85 1.81 1.57
y7 5.81  7.54  7.01  7.49 5.82  6.75 10.80  7.59 0.94 2.00 1.63
y8 5.67  7.76  5.64  8.01 5.34  8.25  7.59 10.53 1.10 2.23 1.69
x1 0.73  0.62  0.79  1.15 1.08  0.85  0.94  1.10 0.54 0.99 0.82
x2 1.27  1.49  1.55  2.24 2.06  1.81  2.00  2.23 0.99 2.28 1.81
x3 0.91  1.17  1.04  1.84 1.58  1.57  1.63  1.69 0.82 1.81 1.98
As we can see, it is pretty close. Can we improve it? Yes, but it will take some trial and error with new values in the A (and subsequently the S) matrix. This is what optimizers do in statistical software, and the R packages that we described here do just that. Notably, there are actually an infinite number of possible solutions for many SEM models, so some constraining of values is often needed; here, constraining one path going from each of the latent variables to a manifest variable will do the trick.
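For comparison, here is a hedged sketch of letting lavaan's own optimizer fit the same three-factor structure directly (pd.model and pd.fit are names introduced here; this is the standard Political Democracy specification from the lavaan documentation, without the additional residual covariances):

pd.model <- '
  ind60 =~ x1 + x2 + x3        # latent exogenous factor
  dem60 =~ y1 + y2 + y3 + y4   # democracy in 1960
  dem65 =~ y5 + y6 + y7 + y8   # democracy in 1965
  dem60 ~ ind60                # structural paths
  dem65 ~ ind60 + dem60
'
pd.fit <- sem(pd.model, data = PoliticalDemocracy)
summary(pd.fit, fit.measures = TRUE)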
