Thread starter: NewOccidental

[Reader's Library] Mastering Scientific Computing with R

101
igs816 posted on 2015-5-9 16:12:58
Quoting Nicolle (2015-5-9 08:37):
I have already uploaded it.


102
Lisrelchen posted on 2015-5-18 00:51:59


104
ReneeBK posted on 2017-2-27 07:51:43
Analysis of variance

Analysis of variance (ANOVA) is used to fit data to a linear model when all explanatory variables are categorical. Each categorical explanatory variable is known as a factor, and each factor can have two or more levels. When a single factor has three or more levels, we use a one-way ANOVA to analyze the data. If we had a single factor with only two levels, we would use a Student's t-test instead. When there are two or more factors, we use a two-way or three-way ANOVA. You can easily perform an ANOVA in R using the aov() function.

In the first example, we will look at the effect of the dose of drugA on the level of fatigue reported by the 20 patients as follows:

> patient.fatigue <- read.table(header=TRUE, text='
   patients fatigue drugA_dose
1         1     low        0.2
2         2     low        0.2
3         3     med        0.2
4         4     med        0.2
5         5     med        0.2
6         6     low        0.4
7         7     low        0.4
8         8     low        0.4
9         9     med        0.4
10       10     med        0.4
11       11     med        0.8
12       12    high        0.8
13       13     med        0.8
14       14     med        0.8
15       15    high        0.8
16       16    high        1.2
17       17    high        1.2
18       18    high        1.2
19       19    high        1.2
20       20     med        1.2 ')

> attach(patient.fatigue)
> aov(drugA_dose ~ fatigue)
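As a minimal follow-up sketch (not in the excerpt above; fatigue.aov is a name introduced here), you would normally store the fit and print the ANOVA table with summary(), and TukeyHSD() then gives pairwise comparisons between the fatigue levels:

> # Store the fit so it can be inspected (assumed follow-up, not from the book)
> fatigue.aov <- aov(drugA_dose ~ fatigue)
> summary(fatigue.aov)      # F test for differences in mean dose across fatigue levels
> TukeyHSD(fatigue.aov)     # pairwise differences between the low/med/high groups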

105
ReneeBK posted on 2017-2-27 07:53:50
To use an alternative link function when fitting our data to a GLM, we can change the link argument as follows (here y ~ x stands in for your model formula):

glm(y ~ x, family = binomial(link = "probit"))

Now, let's go over a detailed example of how to fit your data to a GLM in R. In the first example, we will look at the effect of the dose of a compound on the death of 20 male and female mice. Let's take a look at this in the following lines of code:

> cmp1.ld <- read.table(header=TRUE, text='
   lethaldose sex numdead numalive
1           0   M       1       19
2           1   M       3       17
3           2   M       9       11
4           3   M      14        6
5           4   M      17        3
6           5   M      20        0
7           0   F       0       20
8           1   F       2       18
9           2   F       2       18
10          3   F       3       17
11          4   F       4       16
12          5   F       6       14
')
> attach(cmp1.ld)

We can plot the data to take a look at the relationship between the dose of the compound and the proportion of deaths by sex. Let's take a look at this in the following lines of code:

> proportion_dead <- numdead/20
> plot(proportion_dead ~ lethaldose, pch=as.character(sex))
The result is shown in the following plot (figure not reproduced in this post).

Generalized linear models

Now we will fit our data to a GLM. First, we combine the numbers of dead and alive mice into a counts matrix as follows:

> counts <- cbind(numdead, numalive)
> cmp1.ld.model <- glm(counts ~ sex * lethaldose, family=binomial)

We can summarize the results for the GLM with the summary() function as follows:

> summary(cmp1.ld.model)
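As a hedged follow-up not shown in the excerpt, one way to visualize the fitted model is to predict the death proportion over a grid of doses for each sex and overlay the curves on the earlier scatter plot (dose.grid, fit.M, and fit.F are names introduced here for illustration):

> dose.grid <- seq(0, 5, by = 0.1)
> fit.M <- predict(cmp1.ld.model, data.frame(lethaldose = dose.grid, sex = "M"), type = "response")
> fit.F <- predict(cmp1.ld.model, data.frame(lethaldose = dose.grid, sex = "F"), type = "response")
> plot(proportion_dead ~ lethaldose, pch = as.character(sex))  # observed proportions
> lines(dose.grid, fit.M, lty = 1)  # fitted dose-response curve, males
> lines(dose.grid, fit.F, lty = 2)  # fitted dose-response curve, females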

106
ReneeBK posted on 2017-2-27 07:55:04
Generalized additive models

Generalized additive models (GAMs) are a non-parametric extension of GLMs in which the linear predictor depends linearly on unknown smooth functions of some of the predictor variables. GAMs are typically used to let the data "speak for themselves", since you don't need to specify the functional form of the relationship between the response and the continuous explanatory variables beforehand. To fit your data to a GAM, you will need the gam() function from the mgcv package. It is similar to the glm() function, except that you wrap each explanatory variable you wish to smooth in s(). For example, if you wish to describe the relationship between y and smooths of three continuous explanatory variables x, z, and w, you would enter model <- gam(y ~ s(x) + s(z) + s(w)). Now let's go through a detailed example in R.

In this example, we will explore the relationship between the total number of pregnancies and a variety of measurements taken from 300 mice.

Let's first simulate a dataset for 300 mice with pregnancies, glucose, pressure, insulin, and weight as variables. Since the sample() and rnorm() functions generate pseudorandom numbers, you will get different results when you run the following code in R. Let's simulate our variables with the following lines of code:

> pregnancies <- sample(0:25, 300, replace=T)
> glucose <- sample(65:200, 300, replace=T)
> pressure <- sample(50:120, 300, replace=T)
> insulinD <- abs(rnorm(150, 450, 100))
> insulinN <- abs(rnorm(150, 65, 75))
> insulin <- c(insulinD, insulinN)
> weight <- sample(20:70, 300, replace=T)

Now let's use the gam() function from the mgcv package to explore the relationship between pregnancies and glucose, pressure, insulin, and weight as follows:

> library("mgcv")
> mouse.data.gam <- gam(pregnancies ~ s(glucose) + s(pressure) + s(insulin) + s(weight))

> summary(mouse.data.gam)
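A small addition not in the excerpt: mgcv's plot method draws the estimated smooth for each term, which is usually the first thing to inspect after summary():

> plot(mouse.data.gam, pages = 1, shade = TRUE)  # all four smooths on one page with confidence bands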

107
ReneeBK posted on 2017-2-27 07:56:30
Linear discriminant analysis

Linear discriminant analysis (LDA) is used to find the linear combinations of explanatory variables that give the best possible separation between the groups in our dataset. More specifically, LDA is used as a linear supervised classifier to form a classification model from the data provided. For example, we can use LDA to classify fish based on their length, weight, and speed underwater. Let's simulate a dataset based on the Ontario warm water fish species Bluegill, Bowfin, Carp, Goldeye, and Largemouth Bass.

> set.seed(459)
> Bluegill.length <- sample(seq(15, 22.5, by=0.5), 50, replace=T)
> Bluegill.weight <- sample(seq(0.2, 0.8, by=0.05), 50, replace=T)
> Bowfin.length <- sample(seq(46, 61, by=0.5), 50, replace=T)
> Bowfin.weight <- sample(seq(1.36, 3.2, by=0.5), 50, replace=T)
> Carp.length <- sample(seq(30, 75, by=1), 50, replace=T)
> Carp.weight <- sample(seq(0.2, 3.5, by=0.1), 50, replace=T)
> Goldeye.length <- sample(seq(25, 38, by=0.5), 50, replace=T)
> Goldeye.weight <- sample(seq(0.4, 0.54, by=0.01), 50, replace=T)
> Largemouth_Bass.length <- sample(seq(22, 55, by=0.5), 50, replace=T)
> Largemouth_Bass.weight <- sample(seq(0.68, 1.8, by=0.01), 50, replace=T)

> weight <- c(Bluegill.weight, Bowfin.weight, Carp.weight, Goldeye.weight, Largemouth_Bass.weight)

> length <- c(Bluegill.length, Bowfin.length, Carp.length, Goldeye.length, Largemouth_Bass.length)

> speed <- rnorm(50*5, 7.2, sd=1.8)

> fish <- c(rep("Bluegill", 50), rep("Bowfin", 50), rep("Carp", 50), rep("Goldeye", 50), rep("Largemouth_Bass", 50))
> fish.data <- data.frame(length, weight, speed, fish)

> str(fish.data)
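The excerpt stops at simulating the data; a minimal sketch of the actual discriminant analysis (fish.lda and fish.lda.pred are names introduced here, not from the book) would use lda() from the MASS package:

> library("MASS")
> fish.lda <- lda(fish ~ length + weight + speed, data = fish.data)
> fish.lda.pred <- predict(fish.lda)
> # Confusion table: true species versus predicted class
> table(fish.data$fish, fish.lda.pred$class)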

108
ReneeBK posted on 2017-2-27 07:59:41
Principal component analysis

Principal component analysis (PCA) is another exploratory method you can use to separate your samples into groups. PCA converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is widely used as a dimension-reduction technique to help visualize data. PCA is different from LDA in that it doesn't rely on class information to explore the relationship between the variable values and the sample groups. For example, let's perform a PCA to explore our simulated fish.data dataset. Before performing PCA, it is important to remember that the magnitude of the variables and any skews in the data will influence the resulting principal components, so we need to scale and transform our data.

First, we recommend you log-transform the data (if necessary). Then, run PCA using the prcomp() function as follows:

> fish.data.mx <- as.matrix(fish.data[, 1:3])
> fish.data.log <- log(fish.data.mx)
> fish.log.pca <- prcomp(fish.data.log, scale=T, center=T)
> summary(fish.log.pca)

Instead of the summary, you might want to see the standard deviation of each principal component and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables. To get this information, we can use the print() function as follows:

> print(fish.log.pca)

> plot(fish.log.pca, ylim=c(0, 2)) # plot not shown
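An optional extra step, added here as a sketch rather than from the excerpt: biplot() overlays the sample scores and the variable loadings in a single figure.

> biplot(fish.log.pca, cex = 0.6)  # scores and loadings on the first two components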

> library("ggplot2")
> loadings <- as.data.frame(fish.log.pca$rotation)

> # Add a column with the name of each variable to the loadings data frame.
> loadings$variables <- colnames(fish.data[, 1:3])

> # Plot figure with qplot(); geom = "col" replaces the older geom = "bar", stat = "identity" from earlier ggplot2 versions
> q <- qplot(x = variables, y = PC1, data = loadings, geom = "col")

> # Adjust axis label sizes
> q + theme(axis.title = element_text(face="bold", size=20), axis.text = element_text(size=18))
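As a hedged addition (fish.species is a helper factor introduced here, not from the book), a simple score plot colored by species shows how well the first two components separate the groups:

> fish.species <- as.factor(fish.data$fish)
> plot(fish.log.pca$x[, 1:2], col = fish.species, pch = 19)  # PC1 versus PC2 scores
> legend("topright", legend = levels(fish.species), col = seq_along(levels(fish.species)), pch = 19, cex = 0.7)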

109
ReneeBK posted on 2017-2-27 08:02:24
Clustering

An alternative approach to PCA is k-means (unsupervised) clustering, which partitions the data into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. We can perform k-means clustering with the kmeans() function and plot the results with the plot3d() function from the rgl package as follows:

> set.seed(44)
> cl <- kmeans(fish.data[, 1:3], 5)
> fish.data$cluster <- as.factor(cl$cluster)
> library("rgl")   # provides plot3d()
> plot3d(fish.log.pca$x[, 1:3], col=fish.data$cluster, main="k-means clusters")

> with(fish.data, table(cluster, fish))

We can also apply hierarchical clustering with hclust() and cut the resulting tree into five groups:

> di <- dist(fish.data[, 1:3], method="euclidean")
> tree <- hclust(di, method="ward")

> fish.data$hcluster <- as.factor((cutree(tree, k=5)-2) %% 3 +1)
> plot(tree, xlab="", cex=0.2)
> rect.hclust(tree, k=5, border="red")
> with(fish.data, table(hcluster, fish))
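One common check, added here as a sketch rather than from the excerpt (wss is a name introduced for illustration), is to plot the total within-cluster sum of squares against k to see whether five clusters is a reasonable choice:

> wss <- sapply(1:10, function(k) kmeans(fish.data[, 1:3], centers = k, nstart = 25)$tot.withinss)
> plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")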

110
ReneeBK posted on 2017-2-27 08:05:20
Structural Equation Modeling

We went over the matrix representation of SEM models in theory. Now, we will show how this can be practically done in R using the RAM matrix specifications mentioned earlier. We will use the political democracy dataset available in the lavaan package for this example. Notably, this dataset is ordered such that the first eight columns are the y variables and the last three are the x variables. Let's take a look at the following example:

library(lavaan)
data(PoliticalDemocracy)

We will create the covariance matrix of this dataset as follows:

pd.cov <- cov(PoliticalDemocracy)

Now, we will create each of our matrices A, S, F, and I. In a real SEM example, we would iteratively create new A and S matrices in an attempt to find the best values to match the implied covariance matrix to the observed one. Think of these as starting values for the matrices. In this case, we will choose starting values that closely match a final solution, so that we do not have to iterate dozens of times to find a good solution. Remember that we have 11 observed and three unobserved variables, for a total of 14 variables.

First, let's take a look at the A matrix of paths, which is a 14 x 14 matrix:

mat.A <- matrix(
  c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,1.5,0,  0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0.5,0.5,0
  ), nrow = 14, byrow = TRUE
)
Then, let's take a look at the S matrix of residual variances and covariances, which is also a 14 x 14 matrix:

mat.S <- matrix(
  c(2, 0, 0, 0,.5, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 7, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 1, 0, 3, 0, 0, 0,.5, 0, 0, 0, 0, 0, 0,
   .5, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 2, 0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0,.5, 0, 1, 0, 3, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,.1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0,.1, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.5, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.5, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.2
  ), nrow = 14, byrow = TRUE
)
Next, let's take a look at the filter matrix, which selects the observed variables and is an 11 x 14 matrix:

mat.F <- matrix(
  c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
  ), nrow = 11, byrow = TRUE
)
Finally, the 14 x 14 identity matrix:

mat.I <- diag(rep(1, 14), nrow = 14)

We can write a function that performs all of the matrix operations of the McArdle-McDonald equation (the function arguments carry a .0 suffix to distinguish them from the matrices defined above), as follows:

RAM.implied.covariance <- function(A.0, S.0, F.0, I.0) {
  implied.covariance <- F.0 %*% solve(I.0 - A.0) %*% S.0 %*% t(solve(I.0 - A.0)) %*% t(F.0)
  return(implied.covariance)
}
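For reference, the expression this function implements is the McArdle-McDonald (RAM) equation, Sigma_implied = F (I - A)^(-1) S ((I - A)^(-1))' F', which is exactly the product F.0 %*% solve(I.0 - A.0) %*% S.0 %*% t(solve(I.0 - A.0)) %*% t(F.0) computed above.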
And, finally, we can estimate an implied covariance matrix based on our starting A and S matrices, which is shown in the following matrix rounded to 2 decimal places:

> round(RAM.implied.covariance(mat.A, mat.S, mat.F, mat.I), 2)
      [,1]  [,2]  [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,] 7.12  5.12  5.12 5.12 3.44 2.94 2.94 2.94 0.75  1.50  1.50
 [2,] 5.12 12.12  5.12 6.12 2.94 4.94 2.94 2.94 0.75  1.50  1.50
 [3,] 5.12  5.12 10.12 5.12 2.94 2.94 3.94 2.94 0.75  1.50  1.50
 [4,] 5.12  6.12  5.12 8.12 2.94 2.94 2.94 3.44 0.75  1.50  1.50
 [5,] 3.44  2.94  2.94 2.94 3.98 1.98 1.98 1.98 0.63  1.25  1.25
 [6,] 2.94  4.94  2.94 2.94 1.98 6.98 1.98 2.98 0.63  1.25  1.25
 [7,] 2.94  2.94  3.94 2.94 1.98 1.98 4.98 1.98 0.63  1.25  1.25
 [8,] 2.94  2.94  2.94 3.44 1.98 2.98 1.98 4.98 0.63  1.25  1.25
 [9,] 0.75  0.75  0.75 0.75 0.63 0.63 0.63 0.63 0.60  1.00  1.00
[10,] 1.50  1.50  1.50 1.50 1.25 1.25 1.25 1.25 1.00  2.10  2.00
[11,] 1.50  1.50  1.50 1.50 1.25 1.25 1.25 1.25 1.00  2.00  2.50
Compare the preceding matrix to the observed covariance matrix as follows:

> round(pd.cov, 2)
     y1    y2    y3    y4   y5    y6    y7    y8   x1   x2   x3
y1 6.88  6.25  5.84  6.09 5.06  5.75  5.81  5.67 0.73 1.27 0.91
y2 6.25 15.58  5.84  9.51 5.60  9.39  7.54  7.76 0.62 1.49 1.17
y3 5.84  5.84 10.76  6.69 4.94  4.73  7.01  5.64 0.79 1.55 1.04
y4 6.09  9.51  6.69 11.22 5.70  7.44  7.49  8.01 1.15 2.24 1.84
y5 5.06  5.60  4.94  5.70 6.83  4.98  5.82  5.34 1.08 2.06 1.58
y6 5.75  9.39  4.73  7.44 4.98 11.38  6.75  8.25 0.85 1.81 1.57
y7 5.81  7.54  7.01  7.49 5.82  6.75 10.80  7.59 0.94 2.00 1.63
y8 5.67  7.76  5.64  8.01 5.34  8.25  7.59 10.53 1.10 2.23 1.69
x1 0.73  0.62  0.79  1.15 1.08  0.85  0.94  1.10 0.54 0.99 0.82
x2 1.27  1.49  1.55  2.24 2.06  1.81  2.00  2.23 0.99 2.28 1.81
x3 0.91  1.17  1.04  1.84 1.58  1.57  1.63  1.69 0.82 1.81 1.98
As we can see, it is pretty close. Can we improve it? Yes, but it will take some trial and error with new values in the A (and subsequently the S) matrix. This is what optimizers do in statistical software, and the R packages that we described here do just that. Notably, there are actually an infinite number of possible solutions for many SEM models, so some constraining of values is often needed; here, constraining one path going from each of the latent variables to a manifest variable will do the trick.
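For comparison, here is a hedged sketch of letting lavaan's own optimizer fit the same three-factor structure directly (pd.model and pd.fit are names introduced here; this is the standard Political Democracy specification from the lavaan documentation, without the additional residual covariances):

pd.model <- '
  ind60 =~ x1 + x2 + x3        # latent exogenous factor
  dem60 =~ y1 + y2 + y3 + y4   # democracy in 1960
  dem65 =~ y5 + y6 + y7 + y8   # democracy in 1965
  dem60 ~ ind60                # structural paths
  dem65 ~ ind60 + dem60
'
pd.fit <- sem(pd.model, data = PoliticalDemocracy)
summary(pd.fit, fit.measures = TRUE)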
