The Analysis of Superlarge Datasets

David Banks (Statistical Engineering Division, NIST)


Datasets have grown large and multivariate.  Automated process
monitors in the semiconductor industry typically produce records with
hundreds of thousands of observations on dozens of variables.
Similarly, a satellite can transmit hundreds of images each day, the
IRS must process millions of complex tax forms, and supermarket
scanners record nearly all grocery purchases in most large cities.

This colossal scale poses serious obstacles to statistical analysis.
In particular, it creates four new problem areas:

    1)  preanalysis of superlarge datasets
    2)  compression and summarization
    3)  triage to determine which datasets repay the
        cost of analysis
    4)  index creation.

This article reviews the issues in the first, third, and fourth of
these areas, and occasionally suggests solution strategies.


Preanalysis

Banks and Parmigiani (1992) define preanalysis as all the things that
must be done before the data can be submitted to the scrutiny that the
researcher originally planned.  In actuality, there is no sharp
division between late preanalysis and early conventional analysis, and
much of EDA might fall under the preanalysis umbrella.

For superlarge datasets, the preanalysis must be automated or
semiautomated, because no human eye can scan such datasets to make
a sanity check.  Although good preanalysis ultimately depends upon
the specific data one has in hand, a common preanalysis strategy
applies to many situations.

Banks and Parmigiani (1992) suggest a twelve-step program for
the preanalysis of multivariate time series data.
Some of the steps are:

  1.  Put all data into common format.
  2.  Create a time stamp for each set of observations.
  3.  Classify missing data (e.g., intentionally missing,
      missing for a known cause, missing for an unknown cause, etc.)
  4.  Check the sample sizes against the values that should be
      present; this can discover missing data that were missed
      in the previous step.
  5.  Look for impossible values or values inconsistent with other
      values.
  6.  Synchronize the data, so that all measurements pertain
      to the same product (e.g., in plate glass manufacture,
      the features of the product made at noon today depend
      upon the tank temperature 24 hours earlier; thus the
      tank temperature needs to be lagged forward 24 hours,
      to correspond to the current glass; a code sketch of
      this lagging follows the list).
  7.  Create a missing value chart, to show patterns of missing
      data that may be present.
  8.  Use imputation or some other approach to "fill in" the
      data that are missing (note:  this will tend to cause one
      to underestimate the uncertainty in the analysis).  I
      recommend local linear interpolation over more clever
      imputation methods for this task.
  9.  Create an extreme value chart, showing data that are
      peculiar (say, three standard deviations away from the
      average value).
 10.  Outlier detection determines which of the extreme values
      will be deleted and replaced by an imputation.  One does
      this for fear that the outliers might undermine the
      robustness of the analysis.  If possible, look also for
      data that are outliers in a multivariate sense (e.g.,
      large Mahalanobis distance).
 11.  Descriptive statistics enable one to review summaries of the
      data that will guide new analysis.  For example, one might
      look at the maximum and minimum values of each variable,
      or use Q-Q plots to assess normality.
 12.  Begin elementary EDA.  Make boxplots, scatterplots, and
      so forth to determine what kinds of more sophisticated 
      analyses will be warranted.
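
As an illustration of step 6, the following is a minimal sketch in
Python using pandas.  The 24-hour lag follows the plate glass
example above, but the column names and the numbers are hypothetical,
chosen only for illustration.

    import pandas as pd

    # Hypothetical process data, indexed by time stamp (cf. step 2).
    frame = pd.DataFrame(
        {"tank_temp": [1510.0, 1512.5, 1509.0, 1511.0],
         "glass_quality": [0.92, 0.95, 0.91, 0.94]},
        index=pd.date_range("1997-06-01 12:00", periods=4, freq="D"),
    )

    # Step 6:  shift the tank temperature forward 24 hours, so that
    # each row pairs today's glass with yesterday's tank temperature.
    frame["tank_temp_lagged"] = frame["tank_temp"].shift(
        freq=pd.Timedelta(hours=24))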

Finally, there should always be a thirteenth rule:  Get an area
expert to review everything that has been done, to ensure that these
various tinkerings have not damaged the data.
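
Steps 8 through 10 can be sketched in the same spirit.  This is only
a sketch:  it assumes the data sit in a numeric pandas DataFrame, and
the three-standard-deviation rule, the squared-distance cutoff, and
the function name are choices made for illustration.

    import numpy as np
    import pandas as pd

    def fill_and_flag(frame, sd_limit=3.0):
        """Fill in missing values, then flag extreme observations."""
        # Step 8:  local linear interpolation of missing values;
        # backfill handles any gap at the very start of the series.
        filled = frame.interpolate(method="linear").bfill()

        # Step 9:  extreme value chart -- values more than sd_limit
        # standard deviations from each variable's mean.
        z = (filled - filled.mean()) / filled.std()
        extreme = z.abs() > sd_limit

        # Step 10:  multivariate outliers by Mahalanobis distance.
        # The cutoff sd_limit**2 is crude; a chi-squared quantile on
        # the number of variables would be more principled.
        centered = filled - filled.mean()
        inv_cov = np.linalg.pinv(np.cov(centered.T))
        d2 = np.einsum("ij,jk,ik->i", centered.values, inv_cov,
                       centered.values)
        outlying = pd.Series(d2, index=filled.index) > sd_limit ** 2

        return filled, extreme, outlying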
     

Compression

Compression may be necessary because the dataset is so large that it
would swamp the computer analysis one wanted to perform.  Compression
is also useful when there is too much data to store permanently, so
one seeks a summary.  Barnsley (1988) is a point of entry to the
image compression literature.


Triage

The last 15 years have seen an enormous number of new statistical
techniques proposed for multivariate nonparametric analysis.
Each of these techniques performs well in some cases, but none
is dominant.  One reason for this is that each of the new techniques
is tuned to notice some special kind of locally low-dimensional
structure.

In order to use these methods, one should first check whether,
locally, one's data have simple structure.  For example, suppose one
drew points on a piece of paper, crumpled it up, and then handed it to
Persi Diaconis, who made the paper disappear, leaving only the points
visible.  If one looked at the points casually, the crumpling would
have made them seem like a three-dimensional blob.  But if one looked more
microscopically, one would notice that in small regions, the points
lie almost exactly upon a two-dimensional surface.

An approach to the problem of assessing average local dimensionality
is to take a hypersphere of radius r (for small r) and place it at
random in the data cloud.  Then one does a principal components
analysis of the data falling inside the hypersphere, and counts how
many axes are needed to account for, say, 80% of the total variation.
Call this number p_1.  Then one finds a new random location for the
hypersphere and repeats the process, obtaining p_2, ..., p_m after m
placements; the average of p_1, ..., p_m estimates the average local
dimensionality.  See Banks and Olszewski (1997).
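
A minimal sketch of this procedure in Python (using numpy) follows.
Centering each hypersphere on a randomly chosen data point, rather
than at an arbitrary location in the cloud, and the 80% threshold are
simplifying assumptions in this sketch; Banks and Olszewski (1997)
give the full procedure.

    import numpy as np

    def local_dimensionality(X, r, m=200, threshold=0.80, seed=None):
        """Average number of principal axes needed to explain
        `threshold` of the variance within radius-r neighborhoods."""
        rng = np.random.default_rng(seed)
        counts = []
        for _ in range(m):
            # Place the hypersphere at a randomly chosen data point.
            center = X[rng.integers(len(X))]
            local = X[np.linalg.norm(X - center, axis=1) <= r]
            if len(local) <= X.shape[1]:
                continue  # too few points for a stable local PCA
            # Principal components via eigenvalues of the local
            # covariance matrix, largest first.
            eigvals = np.linalg.eigvalsh(np.cov(local.T))[::-1]
            explained = np.cumsum(eigvals) / eigvals.sum()
            counts.append(int(np.searchsorted(explained, threshold)) + 1)
        return float(np.mean(counts)) if counts else float("nan")

With r chosen small relative to the spread of the data, the value
returned by local_dimensionality(X, r) can be compared with the
apparent dimensionality X.shape[1].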

If the average local dimensionality is relatively small, even though
the apparent dimensionality may be large, then there is a chance that
one of the new analytical tools will be pertinent.  But when the
average local dimensionality is not small, then it is hard to imagine
that any statistical analysis will have much success in uncovering a
complex model with highly multivariate interactions.


Indexing

When one is faced with too much data, a useful thing to do is to
find some way of organizing the data to reflect their variation.  For
example, if one were given all of the IRS
returns for 1997 and asked to make some kind of statistical sense
of the data, then it would be enormously useful to begin by getting
a sense of the possible range of variation in the data.
In particular, one might want to look at 20 returns that are
widely spaced (with respect to a user-defined metric) in the
space of all returns---presumably one would see the family with
25 children, the single mother, the two-paycheck household, and
so forth.
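
One way to extract such a widely spaced subset is a greedy max-min
rule:  start from one record, then repeatedly add the record whose
minimum distance to the records already chosen is largest.  The
sketch below is only an illustration of this idea, with Euclidean
distance standing in for the user-defined metric and the records
assumed to be coded already as numeric vectors.

    import numpy as np

    def widely_spaced(X, k=20, metric=None, seed=None):
        """Greedy max-min selection of k mutually distant records."""
        if metric is None:
            metric = lambda a, b: np.linalg.norm(a - b)  # stand-in metric
        rng = np.random.default_rng(seed)
        chosen = [int(rng.integers(len(X)))]      # arbitrary starting record
        # Distance from every record to its nearest chosen record.
        dist = np.array([metric(x, X[chosen[0]]) for x in X])
        while len(chosen) < k:
            nxt = int(np.argmax(dist))            # farthest from the chosen set
            chosen.append(nxt)
            dist = np.minimum(dist, [metric(x, X[nxt]) for x in X])
        return chosen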

To make index creation work well, the metric must be chosen with
an eye to a human's sense of distance.  Ideally, one would have an
area expert go through a preliminary sample, declaring rough
distances based on experience, and then develop a mathematical
program to find the metric that best accords with the expert's
judgments.  
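
One concrete formulation, offered only as an illustration of how such
a mathematical program might look, is to fit nonnegative weights w_k
so that the weighted squared Euclidean distance
sum_k w_k (x_ik - x_jk)^2 comes as close as possible, in a
least-squares sense, to the squares of the expert's declared
distances.

    import numpy as np
    from scipy.optimize import nnls

    def fit_weighted_metric(X, pairs, expert_dist):
        """Nonnegative weights w so that sum_k w[k] * (X[i] - X[j])[k]**2
        approximates the expert's squared distance for each judged pair."""
        A = np.array([(X[i] - X[j]) ** 2 for i, j in pairs])
        b = np.asarray(expert_dist, dtype=float) ** 2
        w, _ = nnls(A, b)
        return w

The fitted weights define the metric
d(x, y) = sqrt(sum_k w_k (x_k - y_k)^2), which could then drive the
selection of widely spaced records described above.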

In terms of getting a rapid understanding of complex data, it
is enormously useful to have a list of observations whose
ambit includes virtually all kinds of behaviour found in the
superlarge dataset.  The construction of such an index has not
been, to my knowledge, seriously addressed.


References

Banks, D. L. and Olszewski, R. (1997).  "Estimating Local
Dimensionality," to appear in _Proceedings of the Statistical
Computing Section of the American Statistical Association_.

Banks, D. L. and Parmigiani, G.  (1992).  "Preanalysis of Superlarge
Datasets," _Journal of Quality Technology_, 24, 115-129.

Barnsley, M. (1988).  _Fractals Everywhere_.  Academic Press, NY.