1 Comparison of Batches
Multivariate statistical analysis is concerned with analyzing and understanding data in high
dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations of a variable vector
$X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions:
$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$,
and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$
random variables:
$X = (X_1, X_2, \ldots, X_p)$,
where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable. How do we begin to
analyze this kind of data? Before we investigate questions on what inferences we can reach
from the data, we should think about how to look at the data. This involves descriptive
techniques. Questions that we could answer by descriptive techniques are:
Are there components of X that are more spread out than others?
Are there some elements of X that indicate subgroups of the data?
Are there outliers in the components of X?
How “normal” is the distribution of the data?
Are there “low-dimensional” linear combinations of X that show “non-normal” behavior?
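As a rough illustration of the first and third questions, the sketch below computes the spread of each component and flags possible outliers component by component. It is not the method developed later in this chapter: the 1.5 × IQR fence is one common boxplot convention, the quartiles use Python's "inclusive" interpolation rule, and the data and the helper names (column, spread_per_component, outlier_flags) are hypothetical and serve only this example.

```python
import statistics


def column(data, j):
    """Return the j-th component of every observation as a list."""
    return [row[j] for row in data]


def spread_per_component(data):
    """Sample standard deviation of each of the p components."""
    p = len(data[0])
    return [statistics.stdev(column(data, j)) for j in range(p)]


def outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: k = 1.5 is a common boxplot convention, not a rule
    taken from the text above.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical data: n = 6 observations x_i, each with p = 2 components.
data = [
    (1.0, 10.0),
    (2.0, 11.0),
    (3.0, 12.0),
    (4.0, 13.0),
    (5.0, 14.0),
    (6.0, 100.0),
]

spreads = spread_per_component(data)
flags = [outlier_flags(column(data, j)) for j in range(len(data[0]))]

print("spread per component:", spreads)
print("outlier flags per component:", flags)
```

In this made-up example the second component is far more spread out than the first, and only its last observation falls outside the fences. A component-wise check of this kind cannot reveal outliers that are unusual only in combination, which is one reason multivariate displays are needed.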
One difficulty of descriptive methods for high-dimensional data is the human perceptual
system. Point clouds in two dimensions are easy to understand and to interpret. With
modern interactive computing techniques we can view real-time 3D rotations
and thus also perceive three-dimensional data. A “sliding technique” as described in
Härdle and Scott (1992) may give insight into four-dimensional structures by presenting
dynamic 3D density contours as the fourth variable is changed over its range.
A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to
5, unless the high-dimensional structure can be mapped into lower-dimensional components
(Klinke and Polzehl, 1995). Features like clustered subgroups or outliers, however, can be
detected using a purely graphical analysis.
In this chapter, we investigate the basic descriptive and graphical techniques allowing simple
exploratory data analysis. We begin the exploration of a data set using boxplots. A boxplot
is a simple univariate device that detects outliers component by component and that can
compare distributions of the data among different groups. Next, several multivariate techniques
are introduced (Flury faces, Andrews’ curves and parallel coordinate plots) which
provide graphical