R is a simple, effective, and comprehensive programming language and environment that is gaining ever-increasing popularity among data analysts.
This book provides you with the necessary skills to successfully carry out complete geospatial data analyses, from data import to presentation of results.
Learning R for Geospatial Analysis is composed of step-by-step tutorials, starting with the language basics before proceeding to cover the main GIS operations and data types. Visualization of spatial data is vital, whether during the various analysis steps or as the final product, and this book shows you how to get the most out of R's visualization capabilities. The book culminates with examples of cutting-edge applications utilizing R's strengths as a statistical and graphical tool.
The data.frame class is the basic class to represent tabular data in R. A data.frame object is essentially a collection of vectors, all with the same length. However, the vectors do not have to be of the same type. They may also include one-dimensional objects that are not strictly vectors, such as Date or factor objects (see the previous chapter). Therefore, data.frame objects are particularly suitable to represent data with different variables in columns and different cases in rows. Thus, variables may be of different types; for example, a table storing climatic data may have one character variable to store meteorological station names, another Date variable to represent measurement dates, and a third numeric variable to represent the measured values such as rainfall amounts or temperatures.
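As a rough illustration of the climatic table described above, the following sketch builds a data.frame with one character, one Date, and one numeric column. The table name climate, the column names, and all values are hypothetical and are not taken from the book's dataset; stringsAsFactors = FALSE is passed so that text stays as character in R versions before 4.0 as well.

```r
# Hypothetical stand-in data: three stations, one measurement each
climate = data.frame(
  station = c("Station A", "Station B", "Station C"),          # character
  date = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")), # Date
  rainfall = c(12.5, 0, 30.1),                                 # numeric
  stringsAsFactors = FALSE  # keep text as character in older R versions
)

# One class per column: the columns differ in type but share the same length
# station: "character", date: "Date", rainfall: "numeric"
sapply(climate, class)
```

Each column keeps its own type, while every row describes a single case (here, one measurement).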
One way to create a data.frame object is to combine several vectors that are already present in the R environment. This can be achieved with the data.frame function with the arguments being the names of the vector objects we would like to combine. Let's take a look at the following examples:
> num = 1:4
> lower = c("a","b","c","d")
> upper = c("A","B","C","D")
> df = data.frame(num, lower, upper)
> df
  num lower upper
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D
Here, we created a data.frame object named df by combining the vectors num, lower, and upper. The previously independent vectors now form the columns of df. As we can see, the names of the columns appear on the first line of the printed output of a data.frame object. These are the names of the original vectors: num, lower, and upper. Rows have names as well; these are automatically set to the character values 1, 2, 3, and 4 (as they appear to the left of the first column in the printed output).
Another common method to create a data.frame object is to read tabular data from the disk. For example, we can read a CSV file using the read.csv function (which was briefly mentioned earlier). The first parameter of this function, and the only one with no default, is file, which indicates the path to the CSV file. For example, the following expression reads the contents of the 343452.csv file and assigns it to a data.frame object called dat (remember that directories should be separated with \\ or /):
> dat = read.csv("C:\\Data\\343452.csv")
The 343452.csv file contains monthly records of precipitation, minimum temperature, and maximum temperature from Spain for a period of 30 years. It was downloaded from the NOAA climatic archive and is provided as is. Since we will use data from this file in several of our examples, in this and the upcoming chapters, let's examine its contents. Because the table is very large, we can print only its first several rows with the head function, for example, head(dat). Similarly, the tail function prints the last several rows.
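Since the contents of 343452.csv are not reproduced here, the following sketch shows head and tail on a hypothetical 100-row table named long (its columns and values are made up). The same calls would apply to dat.

```r
# Hypothetical stand-in for a long table; the real dat comes from 343452.csv
long = data.frame(id = 1:100, value = (1:100) / 10)

head(long)         # first six rows (six is the default)
tail(long)         # last six rows
head(long, n = 3)  # first three rows only
```

The n argument controls how many rows are returned, which is handy when six rows are too many or too few for a quick look.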
We can get the number of rows and columns in our data.frame object using the nrow and ncol functions, respectively. For example, our small table df has four rows and three columns, while dat (containing the monthly climatic data) has 28,536 rows and nine columns:
> nrow(df)
[1] 4
> ncol(df)
[1] 3
> nrow(dat)
[1] 28536
> ncol(dat)
[1] 9
If the table is not too long, we can print its contents and see how many columns (or rows) there are, according to the column (or row) names. However, it is generally advisable to get the properties of an object using functions (such as ncol), rather than typing a specific number manually (such as 9). This way, our code will be transferable to the analysis of any object, not just the specific object we are currently working on.
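To make the point about transferable code concrete, here is a small sketch. The helper describe_size is a hypothetical name introduced only for this illustration; it relies on nrow and ncol rather than on hard-coded numbers, so it works on any table passed to it.

```r
# The small table from the earlier example
df = data.frame(num = 1:4,
                lower = c("a", "b", "c", "d"),
                upper = c("A", "B", "C", "D"),
                stringsAsFactors = FALSE)

# Hypothetical helper: reports the size of whatever table it receives
describe_size = function(x) {
  paste(nrow(x), "rows and", ncol(x), "columns")
}

describe_size(df)  # "4 rows and 3 columns"
```

Had we typed 4 and 3 by hand, the message would be wrong as soon as the table changed.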
We can get the lengths of both row and column dimensions using the dim function. If our argument is a data.frame object (we will see later that the dim function works with other classes as well; such a function is called a generic function in R terminology), a vector of length 2 is returned with the first element being the number of rows and the second being the number of columns, as follows:
> dim(dat)
[1] 28536 9
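As a brief sketch of dim accepting more than one class, the following applies it to the small table df and to a matrix (matrices are introduced later in the text; the one used here is only a placeholder).

```r
df = data.frame(num = 1:4,
                lower = c("a", "b", "c", "d"),
                upper = c("A", "B", "C", "D"),
                stringsAsFactors = FALSE)

dim(df)                     # 4 3 (rows, then columns)
dim(matrix(1:6, ncol = 3))  # 2 3
```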
We can also get the names of the rows and columns (getting column names is often more useful) as a character vector using the functions rownames and colnames; for example, colnames(dat) returns the column names of dat.
Assignment into column names can be made to replace the existing names with new ones. For example, to change the name of the third column from ELEVATION to Elev, we can use the colnames(dat)[3] = "Elev" expression. Similarly, we can convert all column names of the data.frame object from uppercase to lowercase with the expression colnames(dat) = tolower(colnames(dat)), so that they are easier to type.
It is frequently useful to examine the structure of a given object using the str function. This function (which is also generic) prints the structure of its argument, showing the data types of its components and the relations between them. In the case of a data.frame object, a list of the column names and types is printed, along with the table dimensions and the first several values (or all values, if the table is very short). For example, the output of str(df) for the small table df shows that we have a table with three columns (the variables) and four rows (the observations). It also shows that the first column is numeric (more precisely, integer) and the last two are character vectors.
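A minimal sketch using the small table df from the earlier example:

```r
df = data.frame(num = 1:4,
                lower = c("a", "b", "c", "d"),
                upper = c("A", "B", "C", "D"),
                stringsAsFactors = FALSE)

rownames(df)  # "1" "2" "3" "4"
colnames(df)  # "num" "lower" "upper"
```

Note that the automatically assigned row names are returned as characters, not numbers.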
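The following sketch applies both renaming steps to a hypothetical two-row table named tab. Only the ELEVATION name in the third position comes from the text; the other column names and all values are assumptions made for the illustration.

```r
# Hypothetical stand-in table with uppercase column names
tab = data.frame(NAME = c("x", "y"),
                 TPCP = c(10, 20),
                 ELEVATION = c(5, 700),
                 stringsAsFactors = FALSE)

# Replace the third column name only
colnames(tab)[3] = "Elev"
colnames(tab)  # "NAME" "TPCP" "Elev"

# Replace all column names with their lowercase versions
colnames(tab) = tolower(colnames(tab))
colnames(tab)  # "name" "tpcp" "elev"
```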
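A short sketch of the call on the small table df follows. The exact spacing of the printed output is not reproduced here; the comments only summarize what it reports.

```r
df = data.frame(num = 1:4,
                lower = c("a", "b", "c", "d"),
                upper = c("A", "B", "C", "D"),
                stringsAsFactors = FALSE)

# Reports a data.frame with 4 observations of 3 variables:
# num is of type int, while lower and upper are of type chr
str(df)
```

The types shown for lower and upper assume that strings were not converted to factors, which is why stringsAsFactors = FALSE is set explicitly.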
There are two principal ways to create a subset of a data.frame object. The first involves accessing separate columns, using the column names, with the $ operator. The second involves providing two vectors (of indices, names, or logical values), one for the rows and one for the columns, with the [ operator.
Using the $ operator, we can gain access to separate columns in a data.frame object. To do this, we simply insert the name of the data.frame to the left of the $ operator and the name of the required column to the right, as follows:
> df$num
[1] 1 2 3 4
> df$lower
[1] "a" "b" "c" "d"
> df$upper
[1] "A" "B" "C" "D"
Since the columns of a data.frame object are basically vectors, we can apply all the previously presented vector methods to columns of a data.frame object the same way we would to independent vectors. For example, we can replace the -9999 values (which mark the missing data) with NA, for each of the three measured variables in dat, as follows:
> dat$tpcp[dat$tpcp == -9999] = NA
> dat$mmxt[dat$mmxt == -9999] = NA
> dat$mmnt[dat$mmnt == -9999] = NA
The only difference from how we did this operation in the previous chapter is the dat$ part. This means that we refer to columns of the data.frame object (dat), rather than independent vectors. Now, let's convert the tpcp values to mm units and the mmxt and mmnt values to degrees Celsius by dividing each value in the respective columns by 10; for tpcp, for example, the expression is dat$tpcp = dat$tpcp / 10.
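The conversion can be sketched on a hypothetical three-row table named dat_demo, which stands in for dat (the values are made up; only the column names tpcp, mmxt, and mmnt come from the text). The same three expressions, with dat in place of dat_demo, would apply to the full table.

```r
# Hypothetical stand-in for dat, with values stored in tenths of a unit
dat_demo = data.frame(tpcp = c(125, NA, 0),
                      mmxt = c(153, 201, NA),
                      mmnt = c(-12, 87, 45))

# Divide each column by 10; NA values remain NA
dat_demo$tpcp = dat_demo$tpcp / 10
dat_demo$mmxt = dat_demo$mmxt / 10
dat_demo$mmnt = dat_demo$mmnt / 10

dat_demo$tpcp  # 12.5 NA 0.0
```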
As previously shown, we can assign new values to a column of a table (or to a subset of a column) using the $ operator. If the column name we assign does not exist in the table, a new column will be created to accommodate the data. Let's take a look at the following examples:
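A minimal sketch on the small table df follows. The column names word and num10 are hypothetical and are chosen only for this illustration.

```r
df = data.frame(num = 1:4,
                lower = c("a", "b", "c", "d"),
                upper = c("A", "B", "C", "D"),
                stringsAsFactors = FALSE)

# word does not exist in df yet, so a new column is created
df$word = c("one", "two", "three", "four")

# A new column can also be calculated from an existing one
df$num10 = df$num * 10

colnames(df)  # "num" "lower" "upper" "word" "num10"
```

The assigned vector needs to match the number of rows in the table (or be recyclable to it); here both new columns have four values.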
A data.frame object can be written to a CSV file with the write.csv function. The first two (and most important) parameters of this function indicate the data.frame object we would like to save and the path to the new file (including the new filename). The first has no default; if the path is omitted, the output is printed to the console rather than written to a file, so in practice we specify both. For example, the expression write.csv(df, "C:\\Data\\df.csv") writes the data.frame object df to the df.csv file in the C:\Data directory.
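So that it can run on any machine, the following sketch writes df to a temporary directory instead of C:\Data and then reads the file back. Using tempdir for the location is an assumption made for portability, not the path used in the text.

```r
df = data.frame(num = 1:4,
                lower = c("a", "b", "c", "d"),
                upper = c("A", "B", "C", "D"),
                stringsAsFactors = FALSE)

# Build a path to df.csv inside the session's temporary directory
path = file.path(tempdir(), "df.csv")

# Write the table; by default the row names are written as an extra first column
write.csv(df, path)

# Read the file back to check what was saved
back = read.csv(path, stringsAsFactors = FALSE)
nrow(back)  # 4
```

If the extra column of row names is not wanted, write.csv accepts row.names = FALSE.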
A matrix object is a two-dimensional collection of elements, all of the same type (as opposed to a data.frame object; see the previous chapter), where the number of elements in all rows (and, naturally, all columns) is identical. Matrix objects have many uses in R. For example, certain functions take matrices as their arguments (such as the focal function to filter rasters) or return matrices (such as the extract function to extract raster values; we will meet both these functions in the subsequent chapters).
A matrix object can be created with the matrix function by specifying its values (in the form of a vector) and dimensions as follows:
> matrix(1:6, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
The first four parameters of the matrix function are as follows:
data: The vector of values for the matrix (for example, 1:6)
nrow: The number of rows
ncol: The number of columns (for example, 3)
byrow: Whether the matrix is filled column by column (FALSE, which is the default value) or row by row (TRUE)
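The effect of the byrow parameter can be sketched by filling the same values in both directions (the object names by_col and by_row are arbitrary):

```r
# Default: values fill the first column, then the second, and so on
by_col = matrix(1:6, ncol = 3)
by_col[1, ]  # 1 3 5

# byrow = TRUE: values fill the first row, then the second
by_row = matrix(1:6, ncol = 3, byrow = TRUE)
by_row[1, ]  # 1 2 3
```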
The nrow and ncol parameters determine the number of rows and columns, respectively. We can specify either one of these parameters, and the other will be calculated taking into account the overall number of elements. Let's take a look at the following example:
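A minimal sketch with six values, specifying only one of the two dimensions each time (the object names m1 and m2 are arbitrary):

```r
# Only nrow is given: six values in two rows imply three columns
m1 = matrix(1:6, nrow = 2)
dim(m1)  # 2 3

# Only ncol is given: six values in two columns imply three rows
m2 = matrix(1:6, ncol = 2)
dim(m2)  # 3 2
```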