The Analysis of Superlarge Datasets
David Banks (Statistical Engineering Division, NIST)
Datasets have grown large and multivariate. Automated process
monitors in the semiconductor industry typically produce records with
hundreds of thousands of observations on dozens of variables.
Similarly, a satellite can transmit hundreds of images each day, the
IRS must process millions of complex tax forms, and supermarket
scanners record nearly all grocery purchases in most large cities.
This colossal scale poses serious obstacles to statistical analysis.
In particular, it creates four new problem areas:
1) preanalysis of superlarge datasets
2) compression and summarization
3) triage to determine which datasets repay the
cost of analysis
4) index creation.
This article reviews the issues in the first, third, and fourth of
these areas, and, where possible, suggests solution strategies.
Preanalysis
Banks and Parmigiani (1992) define preanalysis as all the things that
must be done before the data can be submitted to the scrutiny that the
researcher originally planned. In actuality, there is no sharp
division between late preanalysis and early conventional analysis, and
much of EDA might fall under the preanalysis umbrella.
In superlarge datasets, the preanalysis must be automated
or semiautomated. This is because no human eye can scan
such datasets to make a sanity check.
Although good preanalysis ultimately depends upon the specific
data one has in hand, a common preanalysis strategy
applies to many situations.
Banks and Parmigiani (1992) suggest a twelve-step program for
the preanalysis of multivariate time series data.
Some of the steps are:
1. Put all data into a common format.
2. Create a time stamp for each set of observations.
3. Classify missing data (e.g., intentionally missing,
missing for a known cause, missing for an unknown cause).
4. Check the sample sizes against the values that should be
present; this can discover missing data that were missed
in the previous step.
5. Look for impossible values or values inconsistent with other
values.
6. Synchronize the data, so that all measurements pertain
to the same product (e.g., in plate glass manufacture,
the features of the product made at noon today depend
upon the tank temperature 24 hours earlier; thus the
tank temperature needs to be lagged forward 24 hours,
to correspond to the current glass).
7. Create a missing value chart, to show patterns of missing
data that may be present.
8. Use imputation or some other approach to "fill in" the
data that are missing (note: this will tend to cause one
to underestimate the uncertainty in the analysis). I
recommend local linear interpolation over more clever
imputation methods for this task; a sketch of this step
appears below, after the list.
9. Create an extreme value chart, showing data that are
peculiar (say, three standard deviations away from the
average value).
10. Outlier detection determines which of the extreme values
will be deleted and replaced by an imputation. One does
this for fear that the outliers might make the analysis
unrobust. If possible, look also for data that are
outliers in a multivariate sense (e.g., large Mahalanobis
distance).
11. Descriptive statistics enable one to review summaries of the
data that will guide new analysis. For example, one might
look at the maximum and minimum values of each variable,
or use Q-Q plots to assess normality.
12. Begin elementary EDA. Make boxplots, scatterplots, and
so forth to determine what kinds of more sophisticated
analyses will be warranted.
Finally, there should always be a thirteenth rule:
Get an area expert to review all that's been done, to ensure
that no damage has been done to the data by these various
tinkerings.
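
As a minimal sketch of step 8, here is one way to do the local linear
interpolation in Python with numpy. The function name, the use of NaN
as the missing-value code, and the end-point handling are my own
illustrative assumptions, not part of the twelve-step program.

import numpy as np

def fill_missing_linear(values):
    """Fill NaN gaps in a time-ordered series by local linear interpolation.

    Interior gaps are interpolated between the nearest observed neighbours;
    leading and trailing gaps are held at the nearest observed value.
    """
    values = np.asarray(values, dtype=float)
    observed = ~np.isnan(values)
    if not observed.any():
        raise ValueError("no observed values to interpolate from")
    idx = np.arange(len(values))
    filled = values.copy()
    # np.interp interpolates linearly between the observed points and
    # clamps to the end values outside the observed range.
    filled[~observed] = np.interp(idx[~observed], idx[observed], values[observed])
    return filled

# Example: a short series with two missing readings.
# fill_missing_linear([1.0, np.nan, 3.0, 4.0, np.nan, 6.0]) -> [1. 2. 3. 4. 5. 6.]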
Compression
Compression may be necessary because the dataset is too large and
would swamp the computer analysis one wanted to perform.
Also, compression is useful when there is too much data to
permanently store, so one seeks a summary. Barnsley (1988)
is a point-of-entry to the image compression literature.
Triage
The last 15 years have seen an enormous number of new statistical
techniques proposed for multivariate nonparametric analysis.
Each of these techniques performs well in some cases, but none
is dominant. One reason for this is that each of the new techniques
is tuned to notice some special kind of locally-low dimensional
structure.
In order to use these methods, one should first check whether,
locally, one's data have simple structure. For example, suppose one
drew points on a piece of paper, crumpled it up, and then handed it to
Persi Diaconis, who made the paper disappear, leaving only the points
visible. If one looked at the points casually, the crumpling would
have made them seem a three-dimensional blob. But if one looked more
microscopically, one would notice that in small regions, the points
lie almost exactly upon a two-dimensional surface.
An approach to the problem of assessing average local dimensionality
is to take a hypersphere of radius r (for small r) and place it at
random in the data cloud. Then one does a principal components
analysis of the data falling inside the hypersphere, and counts how
many axes are needed to account for, say, 80% of the total variation.
This number is p_1. Then
one finds a new random location for the hypersphere, repeats the
process m times, and ultimately averages the p_1, ..., p_m to
estimate the local dimensionality. See Banks and Olszewski (1997).
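
The following is a minimal Python/numpy sketch of this procedure, not
the authors' own code. It centers each hypersphere at a randomly chosen
observation (one convenient way to place it at random in the data
cloud), and the function name, the 80% default, and the minimum-points
rule are illustrative choices.

import numpy as np

def average_local_dimension(X, r, m=100, var_frac=0.80, min_pts=10, seed=None):
    """Estimate average local dimensionality by repeated local PCA.

    Drops a radius-r ball at a randomly chosen observation, runs PCA on the
    points inside the ball, and counts how many principal axes are needed to
    reach var_frac of the local variance; the estimate is the mean count over
    m successful placements.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    counts, attempts = [], 0
    while len(counts) < m and attempts < 50 * m:
        attempts += 1
        center = X[rng.integers(len(X))]
        local = X[np.linalg.norm(X - center, axis=1) <= r]
        if len(local) < min_pts:          # too few points for a stable PCA
            continue
        eigvals = np.sort(np.linalg.eigvalsh(np.cov(local, rowvar=False)))[::-1]
        frac = np.cumsum(eigvals) / eigvals.sum()
        counts.append(int(np.searchsorted(frac, var_frac) + 1))
    if not counts:
        raise ValueError("radius r too small: no ball captured min_pts points")
    return float(np.mean(counts))

# Points drawn on a crumpled two-dimensional sheet embedded in three
# dimensions should give an answer near 2 for suitably small r.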
If the average local dimensionality is relatively small, even though
the apparent dimensionality may be large, then there is a chance that
one of the new analytical tools will be pertinent. But when the
average local dimensionality is not small, then it is hard to imagine
that any statistical analysis will have much success in uncovering a
complex model with highly multivariate interactions.
Indexing
When one is faced with too much data, a useful thing to do
is to find some way of organizing the data to reflect
its variation. For example, if one were given all of the IRS
returns for 1997 and asked to make some kind of statistical sense
of the data, then it would be enormously useful to begin by getting
a sense of the possible range of variation in the data.
In particular, one might want to look at 20 returns that are
widely spaced (with respect to a user-defined metric) in the
space of all returns---presumably one would see the family with
25 children, the single mother, the two-paycheck household, and
so forth.
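
One simple way to pull out such widely spaced observations, sketched
below in Python/numpy purely for illustration, is greedy max-min
(farthest-point) selection under the user-defined metric; the function
name and the Euclidean default are my assumptions.

import numpy as np

def widely_spaced_sample(X, k=20, dist=None, seed=None):
    """Greedy max-min selection: repeatedly add the observation that is
    farthest (under dist) from everything already chosen."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)   # default: Euclidean metric
    chosen = [int(rng.integers(len(X)))]            # arbitrary starting observation
    # distance from every observation to its nearest chosen representative
    d_min = np.array([dist(x, X[chosen[0]]) for x in X])
    while len(chosen) < k:
        nxt = int(np.argmax(d_min))                 # most isolated observation so far
        chosen.append(nxt)
        d_min = np.minimum(d_min, [dist(x, X[nxt]) for x in X])
    return chosen                                   # indices of the k exemplars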
To make index creation work well, the metric must be chosen with
an eye to a human's sense of distance. Ideally, one would have an
area expert go through a preliminary sample, declaring rough
distances based on experience, and then develop a mathematical
program to find the metric that best accords with the expert's
judgments.
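
As a rough illustration of such a mathematical program, and under
assumptions of my own (a diagonal weighted Euclidean metric,
least-squares fitting on squared distances, hypothetical function
names), one could fit per-variable weights to the expert's judged
pairs like this:

import numpy as np

def fit_weighted_metric(X, pairs, expert_dist):
    """Fit nonnegative per-variable weights w so that the weighted Euclidean
    distance sqrt(sum_k w_k * (x_ik - x_jk)^2) approximates the expert's
    declared distance for each judged pair (i, j)."""
    X = np.asarray(X, dtype=float)
    D = np.array([(X[i] - X[j]) ** 2 for i, j in pairs])  # one row per judged pair
    target = np.asarray(expert_dist, dtype=float) ** 2    # match squared distances
    w, *_ = np.linalg.lstsq(D, target, rcond=None)        # unconstrained least squares
    return np.clip(w, 0.0, None)      # crude nonnegativity; a constrained fit is cleaner

def weighted_distance(a, b, w):
    return float(np.sqrt(np.sum(w * (np.asarray(a) - np.asarray(b)) ** 2)))

The fitted weights then define the distance used both for selecting
widely spaced observations, as above, and for any later indexing.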
In terms of getting a rapid understanding of complex data, it
is enormously useful to have a list of observations whose
ambit includes virtually all kinds of behaviour found in the
superlarge dataset. The construction of such an index has not
been, to my knowledge, seriously addressed.
References
Banks, D. L. and Olszewski, R. (1997). "Estimating Local
Dimensionality," to appear in _Proceedings of the Statistical
Computing Section of the American Statistical Association_.
Banks, D. L. and Parmigiani, G. (1992). "Preanalysis of Superlarge
Datasets," _Journal of Quality Technology_, 24, 115-129.
Barnsley, M. (1988). _Fractals Everywhere_. Academic Press, NY.