Analysis of variance (Anova) is used to fit data to a linear model when all explanatory variables are categorical. Each of these categorical explanatory variables is known as a factor, and each factor can have two or more levels. When a single factor is present with three or more levels, we use a one-way Anova to analyze the data. If we had a single factor with only two levels, we would use a Student's t-test instead. When there are two or more factors, we would use a two-way or three-way Anova. You can easily perform an Anova in R using the aov() function.
In the first example, we will look at the effect of the dose of drugA on the level of fatigue reported by 20 patients, as follows:
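The original drugA data is not reproduced in this excerpt, so the following is only a minimal sketch: we simulate a hypothetical drugA.df data frame with a numeric fatigue score and a four-level dose factor for 20 patients, and then fit it with aov():
# Hypothetical data: 20 patients, fatigue scores, and four dose levels
drugA.df <- data.frame(
  fatigue = rnorm(20, mean = rep(c(2, 4, 6, 8), each = 5)),
  dose = factor(rep(c("0mg", "5mg", "10mg", "20mg"), each = 5))
)
drugA.aov <- aov(fatigue ~ dose, data = drugA.df)
summary(drugA.aov) # one-way Anova table for the effect of dose on fatigue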
To use an alternative link function when fitting our data to a GLM, we can change the link argument as follows:
glm(y ~ x, family=binomial(link="probit"))
Now, let's go over a detailed example of how to fit your data to a GLM in R. In the first example, we will look at the effect of the dose of a compound on mortality in male and female mice, with 20 mice of each sex tested at each dose. Let's take a look at this in the following lines of code:
> cmp1.ld <- read.table(header=TRUE, text='
lethaldose sex numdead numalive
1 0 M 1 19
2 1 M 3 17
3 2 M 9 11
4 3 M 14 6
5 4 M 17 3
6 5 M 20 0
7 0 F 0 20
8 1 F 2 18
9 2 F 2 18
10 3 F 3 17
11 4 F 4 16
12 5 F 6 14
')
> attach(cmp1.ld)
We can plot the data to take a look at the relationship between the dose of the compound and the proportion of deaths for each sex. Let's take a look at this in the following lines of code:
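The plotting and model-fitting code is not reproduced in this excerpt; a minimal sketch using the columns attached from cmp1.ld (the plotting symbols and the interaction term are assumptions) might look like this:
# Proportion of deaths at each dose, with different symbols for each sex
p.dead <- numdead / (numdead + numalive)
plot(lethaldose, p.dead, pch = ifelse(sex == "M", 16, 1),
     xlab = "Dose of compound", ylab = "Proportion of deaths")
legend("topleft", legend = c("M", "F"), pch = c(16, 1))

# Fit a binomial GLM: the response is a two-column matrix of deaths and survivals
cmp1.glm <- glm(cbind(numdead, numalive) ~ lethaldose * sex, family = binomial)
summary(cmp1.glm)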
Generalized additive models (GAMs) are a non-parametric extension of GLMs in which the linear predictor depends linearly on unknown smooth functions of some predictor variables. GAMs are typically used to let the data "speak for themselves", since you don't need to specify the functional form of the relationship between the response and the continuous explanatory variables beforehand. To fit your data to a GAM, you can use the gam() function from the mgcv package. It is similar to the glm() function, except that you wrap s() around each explanatory variable you wish to smooth. For example, to describe the relationship between y and smooths of three continuous explanatory variables x, z, and w, you would enter model <- gam(y ~ s(x) + s(z) + s(w)). Now let's go through a detailed example in R.
In this example, we will explore the relationship between diabetes and a variety of measurements (including the total number of pregnancies) taken from 300 mice.
Let's first simulate a dataset for 300 mice with pregnancies, glucose, pressure, insulin, and weight as explanatory variables. Since the sample() and rnorm() functions generate pseudorandom numbers, you will get different results when you run the following code in R (unless you set a seed with set.seed()). Let's simulate our explanatory variables with the following lines of code:
> pregnancies <- sample(0:25, 300, replace=TRUE)
> glucose <- sample(65:200, 300, replace=TRUE)
> pressure <- sample(50:120, 300, replace=TRUE)
> insulinD <- abs(rnorm(150, 450, 100)) # higher insulin values for the first 150 mice
> insulinN <- abs(rnorm(150, 65, 75)) # lower insulin values for the remaining 150 mice
> insulin <- c(insulinD, insulinN)
> weight <- sample(20:70, 300, replace=TRUE)
Now, let's use the gam() function from the mgcv package to explore the relationship between diabetes and pregnancies, glucose, pressure, insulin, and weight, as follows:
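The original code is not reproduced in this excerpt, and no diabetes variable was simulated above, so the sketch below assumes that the first 150 mice (those assigned the higher insulin values) are diabetic and the remaining 150 are not:
library(mgcv)

# Assumed diabetes indicator: 1 for the first 150 mice, 0 for the rest
diabetes <- c(rep(1, 150), rep(0, 150))

# Fit a GAM with a smooth term for each explanatory variable
mice.gam <- gam(diabetes ~ s(pregnancies) + s(glucose) + s(pressure) +
                  s(insulin) + s(weight), family = binomial)
summary(mice.gam)
plot(mice.gam, pages = 1) # plot all the estimated smooths on one page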
Linear discriminant analysis (LDA) is used to find the linear combinations of explanatory variables that give the best possible separation between the groups in our dataset. More specifically, LDA is used as a linear supervised classifier to form a classification model from the data provided. For example, we can use LDA to classify fish based on their length, weight, and speed underwater. Let's simulate a dataset based on the Ontario warm water fish species Bluegill, Bowfin, Carp, Goldeye, and Largemouth Bass.
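The simulation and model-fitting code is not shown in this excerpt; a minimal sketch, with made-up group means for length, weight, and speed and the lda() function from the MASS package, might look like this:
library(MASS) # provides lda()

# Hypothetical measurements: the group means below are made up for illustration
species <- c("Bluegill", "Bowfin", "Carp", "Goldeye", "Largemouth Bass")
n <- 50 # fish per species
fish.data <- data.frame(
  length = rnorm(5 * n, mean = rep(c(20, 55, 60, 35, 40), each = n), sd = 3),
  weight = rnorm(5 * n, mean = rep(c(300, 2500, 3000, 700, 1500), each = n), sd = 50),
  speed = rnorm(5 * n, mean = rep(c(10, 20, 15, 25, 30), each = n), sd = 2),
  species = factor(rep(species, each = n))
)

# Fit the linear discriminant model
fish.lda <- lda(species ~ length + weight + speed, data = fish.data)
fish.lda # group means and coefficients of the linear discriminants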
Principal component analysis (PCA) is another exploratory method you can use to separate your samples into groups. PCA converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is widely used as a dimension reduction technique to help visualize data. PCA is different from LDA because it doesn't rely on class information to explore the relationship between the variable values and the sample groups. For example, let's perform a PCA to explore our simulated fish.data dataset. Before performing PCA, it is important to remember that the magnitude of the variables and any skews in the data will influence the resulting principal components. So, we need to scale and transform our data.
First, we recommend that you log-transform the data (if necessary). Then, run PCA using the prcomp() function, as follows:
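Using the simulated fish.data object sketched earlier (the selection of the first three columns as the continuous measurements is an assumption), this might look as follows:
# Log-transform the continuous measurements, then run PCA with centering and scaling
fish.log <- log(fish.data[, 1:3])
fish.log.pca <- prcomp(fish.log, center = TRUE, scale. = TRUE)
summary(fish.log.pca) # proportion of variance explained by each principal component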
Instead of the summary, you might want to see the standard deviation of each principal component, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables. To get this information, we can use the print() function as follows:
> print(fish.log.pca)
> plot(fish.log.pca, ylim=c(0, 2)) #plot not shown
An alternative approach to PCA is k-means (unsupervised) clustering, which partitions the data into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. We can perform k-means clustering with the kmeans() function and plot the results with plot3d() as follows:
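The clustering code is not reproduced in this excerpt; a minimal sketch, continuing from the fish.log object above and assuming the rgl package (which supplies plot3d()), might be:
library(rgl) # provides plot3d() for interactive 3D plots

# Partition the scaled log-measurements into 5 clusters (one per simulated species)
fish.km <- kmeans(scale(fish.log), centers = 5, nstart = 25)

# Colour each fish by its cluster assignment
plot3d(fish.log$length, fish.log$weight, fish.log$speed,
       col = fish.km$cluster, size = 5,
       xlab = "log(length)", ylab = "log(weight)", zlab = "log(speed)")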
We went over the matrix representation of SEM models in theory. Now, we will show how this can be practically done in R using the RAM matrix specifications mentioned earlier. We will use the political democracy dataset available in the lavaan package for this example. Notably, this dataset is ordered such that the first eight columns are the y variables and the last three are the x variables. Let's take a look at the following example:
library(lavaan)
data(PoliticalDemocracy)
We will create the covariance matrix of this dataset, as follows:
pd.cov <- cov(PoliticalDemocracy)
Now, we will create each of our matrices: A, S, F, and I. In a real SEM example, we would iteratively create new A and S matrices in an attempt to find the best values to match the implied covariance matrix to the observed one. Think of these as starting values for the matrices. In this case, we will choose starting values that fairly closely match a final solution, so that we do not have to iterate dozens of times to find a good solution. Remember that we have 11 observed and three unobserved variables, for a total of 14 variables.
First, let's take a look at the A matrix of paths, which is a 14 x 14 matrix:
mat.A <- matrix(
c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,1.5,0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0.5,0.5,0
), nrow = 14, byrow = TRUE
)
Then, let's take a look at the S matrix of residual variances or covariances, which is also a 14 x 14 matrix:
mat.S <- matrix(
c(2, 0, 0, 0,.5, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 7, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 3, 0, 0, 0,.5, 0, 0, 0, 0, 0, 0,
.5, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 2, 0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0,.5, 0, 1, 0, 3, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,.1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,.1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.5, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.5, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,.2
), nrow = 14, byrow = TRUE
)
Next, let's take a look at the filter matrix, which selects the observed variables and is an 11 x 14 matrix:
mat.F <- matrix(
c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
), nrow = 11, byrow = TRUE
)
Finally, let's create the 14 x 14 identity matrix:
mat.I <- diag(rep(1, 14), nrow = 14)
We can write a function that will perform all of the matrix operations of the McArdle-McDonald equation (the variables inside the function have a 0 in their names), as follows:
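The function itself is not reproduced in this excerpt; a minimal sketch of the McArdle-McDonald equation, implied covariance = F (I - A)^-1 S ((I - A)^-1)' F', with argument names ending in 0, might be:
# McArdle-McDonald equation:
# implied covariance = F0 %*% solve(I0 - A0) %*% S0 %*% t(solve(I0 - A0)) %*% t(F0)
mcardle.mcdonald <- function(A0, S0, F0, I0) {
  inv0 <- solve(I0 - A0) # (I - A)^-1
  F0 %*% inv0 %*% S0 %*% t(inv0) %*% t(F0)
}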
And, finally, we can estimate the implied covariance matrix based on our starting A and S matrices and print it rounded to 2 decimal places:
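Using the mcardle.mcdonald() sketch above:
pd.implied <- mcardle.mcdonald(mat.A, mat.S, mat.F, mat.I)
round(pd.implied, 2) # compare with round(pd.cov, 2)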
As we can see, it is pretty close. Can we improve it? Yes, but it will take some trial and error with new values in the A (and subsequently the S) matrix. This is what optimizers do in statistical software, and the R packages that we described here do just that. Notably, there are actually an infinite number of possible solutions for many SEM models, so some constraining of values is often needed. In this case, we would want to constrain a few particular path values; constraining one path going from each of the latent variables to a manifest variable will do the trick.