Using R for Introductory Statistics
Second Edition
By John Verzani
CUNY/College of Staten Island New York, USA
About this book
This is a second edition of a book that introduces R alongside the introductory statistics curriculum. The first edition found its niche with individuals looking to get started with both areas outside of a classroom environment. It is the hope that this second edition will be even more useful for that task.
The book was first published in 2004, when R was at version 2.0.0. Now, as of writing, R is past version 3.0.0 (3.1.0 and climbing). In that time so much has changed. For example:
• The number of R users has grown enormously. A recent survey ranked R the 15th most used programming language.
• The number of add-on packages for R has grown four- or five-fold to over 5,000. The depth and range of applications has grown considerably.
• The number of books including material on R has grown at least ten-fold.
• The internet has developed many additional R communities beyond the initial mailing list. Two key additions are the question and answer site stackoverflow.com, which has nearly 50,000 questions tagged with “r”, and the blog aggregator r-bloggers.com, which has over 13,000 blog entries related to R.
Basically, the amount of material out there related to learning and using R is now enormous. This book doesn’t try to canvass even a sliver of it; rather, it tries to guide the reader through the basics of R so that it is possible to take advantage of the contributions made by the R community. Though R, like other programming languages, has a reputation for having a steep learning curve, we try to break this down into small, task-oriented steps.
In this edition we place a greater emphasis on more idiomatic R. For a small example, despite the greater familiarity of using = for the assignment operator, we now use the <- operator. Another example comes in Chapter 4, where we resist the temptation to illustrate some data manipulations with the widely used plyr package and instead utilize similar functions from base R. For our limited demands, the corner cases that led to the desire for a plyr-type approach are not present, and we believe that it is good to start with a grounding in the functionality provided by base R.
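As a rough illustration of both points, the following sketch uses the <- operator and two base R functions for a grouped summary. The choice of tapply(), aggregate(), and the built-in mtcars data set is an assumption made for this example; it is not necessarily what Chapter 4 uses.

```r
# Assignment with the idiomatic <- operator (= would also work at top level)
x <- c(2, 4, 6, 8)
mean(x)

# A grouped summary in base R, using the built-in mtcars data set:
# mean miles per gallon for each number of cylinders
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl

# The same summary returned as a data frame, via the formula interface
mpg_df <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_df
```

Both calls split the data by group, apply a function to each piece, and combine the results, which is the pattern that plyr-type packages generalize.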
We also try to avoid as many of the pitfalls as possible for new R users by encouraging the use of RStudio, a feature-rich, cross-platform development environment for interacting with R. RStudio has very good integration with R’s help system and its administrative tools; it has an integrated debugger, a powerful editor, and much more. Though relatively new to the R community, the company has already made an enormous contribution.
This book was written using the excellent knitr package for R. This package allows one to embed R code into a document with ease. The formatting of code blocks follows a convention championed by the knitr author. We think it makes the code much easier to read and, hence, to reason about. It also encourages thinking of interacting with R using a script, rather than the command line directly. This style of usage is facilitated by RStudio.
In addition to changes with R, the teaching of introductory statistics (by which we mean a non-calculus approach to inferential statistics) has changed in the last decade or so. For example, primarily due to the widespread availability of computational resources but also for pedagogical reasons, there have been pushes to include resampling approaches, permutation methods, and Bayesian analysis in the first-year course. The topics of this text hew closely to the traditional ones, but we have added a bit on these computer-intensive approaches, in particular, to motivate the more traditional approach. We continue with an emphasis on realistic data and examples (which required updating some now not-so-topical examples) and we rely on visualization techniques to gather insight. Fortunately, the R language makes such inclusion quite easy.
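To give a flavor of what a computer-intensive approach looks like in R, here is a minimal sketch of a permutation test for a difference in two group means. The measurements are made-up numbers chosen only for illustration, and the variable names are hypothetical; the text's own treatment may differ.

```r
set.seed(1)  # make the random shuffles reproducible

# Hypothetical measurements for two groups (made-up, for illustration only)
a <- c(12.1, 14.3, 13.8, 15.2, 14.9)
b <- c(11.0, 12.5, 11.9, 13.1, 12.2)

observed <- mean(a) - mean(b)
pooled <- c(a, b)
n_a <- length(a)

# Reassign the group labels at random many times, recomputing the
# difference in means for each reshuffling
perm_diffs <- replicate(2000, {
  idx <- sample(length(pooled), n_a)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: the share of shuffles at least as extreme as observed
p_value <- mean(abs(perm_diffs) >= abs(observed))
p_value
```

The simulation replaces a probability calculation: if the group labels did not matter, the observed difference should look like a typical reshuffled one.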
Organization The text has three main parts. The first five chapters introduce the basics of exploratory data analysis and data manipulation in R. The approach is a little slower than it need be. We postpone until Chapter 4 the details of using R’s data frames. These are the primary means to store multivariate data in R, and in Chapters 4 and 5 we demonstrate many tools that act on data frames to make data investigation very convenient. However, most of these techniques are a bit more abstract, so in the first chapters we emphasize a more direct, easier-to-learn approach, albeit sometimes more tedious. Almost all of this material was rewritten for the second edition.
Chapters 6 through 10 cover the core of statistical inference. We added the material in Chapter 7 to introduce the major themes of inference using computation, rather than probability calculations, to give insight into questions on inference.
Chapters 11 through 13 introduce the topic of analyzing statistical models with R, covering the regression model and its specialization to the analysis of variance, before ending with a brief introduction to the logistic model and non-linear models. The goal is to cover the main introduction to this topic and to show that the basic interface R provides extends naturally to cover a wide variety of models.
The appendix on programming discusses some of the details of writing programs in the R language. In the main part of the text, user-written functions are fairly straightforward, so this material is just supplemental.
The UsingR package The book has an accompanying package, UsingR. This package is available from CRAN, R’s repository of user-contributed packages. Installation should be painless. The package contains the data sets mentioned in the text (data(package="UsingR")), answers to selected problems (answers()), a few demonstrations (demo()), the errata (errata()), and sample code from the text.
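A typical first session with the package might look like the sketch below. It assumes an internet connection and a configured CRAN mirror (RStudio sets one by default); the requireNamespace() guard is an addition for this example so that the package is downloaded only once.

```r
# One-time installation from CRAN (needs an internet connection)
if (!requireNamespace("UsingR", quietly = TRUE)) {
  install.packages("UsingR")
}

library(UsingR)            # load the package for the current session
data(package = "UsingR")   # list the data sets mentioned in the text
```

After the package is loaded, the answers() and errata() functions mentioned above become available.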
Thanks The author would like to thank Chapman & Hall/CRC Press: not just the editors who have pushed for this new edition, but the company as a whole for its work on numerous titles on R-related topics. In a similar manner, the author would like to thank statistics.com. They offer a variety of R-related courses, including one that features this text. The feedback from the students of that course has been important guidance in the redrafting of parts of this text. Finally and most importantly, the author would like to warmly acknowledge the continued support he has received from his family on this and other projects.
John Verzani
February, 2014