Datasets have grown large and multivariate. Automated process monitors in the semiconductor industry typically produce records with hundreds of thousands of observations on dozens of variables. Similarly, a satellite can transmit hundreds of images each day, the IRS must process millions of complex tax forms, and supermarket scanners record nearly all grocery purchases in most large cities. This colossal scale poses serious obstacles to statistical analysis. In particular, it creates four new problem areas: 1) preanalysis of superlarge datasets; 2) compression and summarization; 3) triage to determine which datasets repay the cost of analysis; and 4) index creation. This article reviews the issues in these areas, emphasizing the first, third, and fourth, and sometimes makes suggestions for solution strategies.

Preanalysis

Banks and Parmigiani (1992) define preanalysis as all the things that must be done before the data can be submitted to the scrutiny that the researcher originally planned. In practice there is no sharp division between late preanalysis and early conventional analysis, and much of EDA might fall under the preanalysis umbrella. In superlarge datasets the preanalysis must be automated or semiautomated, because no human eye can scan such datasets to make a sanity check. Although good preanalysis ultimately depends upon the specific data one has in hand, a common strategy applies to many situations. Banks and Parmigiani (1992) suggest a twelve-step program for the preanalysis of multivariate time series data. Some of the steps are:

1. Put all data into a common format.
2. Create a time stamp for each set of observations.
3. Classify missing data (e.g., intentionally missing, missing for a known cause, missing for an unknown cause, etc.).
4. Check the sample sizes against the values that should be present; this can discover missing data that were overlooked in the previous step.
5. Look for impossible values or values inconsistent with other values.
6. Synchronize the data, so that all measurements pertain to the same product (e.g., in plate glass manufacture, the features of the product made at noon today depend upon the tank temperature 24 hours earlier; thus the tank temperature needs to be lagged forward 24 hours to correspond to the current glass).
7. Create a missing-value chart, to show patterns of missing data that may be present.
8. Use imputation or some other approach to fill in the data that are missing (note: this will tend to cause one to underestimate the uncertainty in the analysis). I recommend local linear interpolation over more clever imputation methods for this task.
9. Create an extreme-value chart, showing data that are peculiar (say, three standard deviations away from the average value).
10. Use outlier detection to determine which of the extreme values will be deleted and replaced by an imputation; one does this for fear that the outliers might make the analysis unrobust. If possible, look also for data that are outliers in a multivariate sense (e.g., large Mahalanobis distance). (Steps 7 through 10 are sketched in code after this list.)
11. Compute descriptive statistics; these enable one to review summaries of the data that will guide the new analysis. For example, one might look at the maximum and minimum values of each variable, or use Q-Q plots to assess normality.
12. Begin elementary EDA. Make boxplots, scatterplots, and so forth to determine what kinds of more sophisticated analyses will be warranted.
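Steps 7 through 10 are mechanical enough to automate directly. The following is a minimal sketch, assuming the observations sit in a time-ordered pandas DataFrame with one numeric column per variable; the function names, the three-standard-deviation rule, and the chi-squared cutoff for the Mahalanobis distance are illustrative choices of mine, not part of the Banks and Parmigiani (1992) program.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def missing_value_chart(df):
    """Step 7: tabulate the distinct patterns of missingness and how often each occurs."""
    patterns = df.isna().apply(tuple, axis=1)
    return patterns.value_counts()

def fill_missing(df):
    """Step 8: fill gaps in each variable by linear interpolation between the
    nearest observed neighbours (rows are assumed to be in time order)."""
    return df.interpolate(method="linear", limit_direction="both")

def extreme_value_chart(df, k=3.0):
    """Step 9: flag values more than k standard deviations from the variable's mean."""
    z = (df - df.mean()) / df.std()
    return z.abs() > k

def mahalanobis_outliers(df, level=0.975):
    """Step 10: flag rows that are extreme in a multivariate sense, using the
    squared Mahalanobis distance from the sample mean and a chi-squared cutoff."""
    x = df.to_numpy(dtype=float)
    centred = x - x.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(x, rowvar=False))  # pseudo-inverse tolerates collinearity
    d2 = np.einsum("ij,jk,ik->i", centred, inv_cov, centred)
    return pd.Series(d2 > chi2.ppf(level, df=x.shape[1]), index=df.index)
```

Whatever these checks flag should go back to the area expert of the thirteenth rule below before anything is deleted or imputed.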
Finally, there should always be a thirteenth rule: get an area expert to review all that has been done, to ensure that no damage has been done to the data by these various tinkerings.

Compression

Compression may be necessary because the dataset is too large and would swamp the computer analysis one wanted to perform. Compression is also useful when there is too much data to store permanently, so one seeks a summary. Barnsley (1988) is a point of entry to the image compression literature.

Triage

The last 15 years have seen an enormous number of new statistical techniques proposed for multivariate nonparametric analysis. Each of these techniques performs well in some cases, but none is dominant. One reason for this is that each of the new techniques is tuned to notice some special kind of locally low-dimensional structure. In order to use these methods, one should first check whether, locally, one's data have simple structure. For example, suppose one drew points on a piece of paper, crumpled it up, and then handed it to Persi Diaconis, who made the paper disappear, leaving only the points visible. If one looked at the points casually, the crumpling would have made them seem a three-dimensional blob. But if one looked more microscopically, one would notice that in small regions the points lie almost exactly upon a two-dimensional surface.

An approach to assessing average local dimensionality is to take a hypersphere of radius r (for small r) and place it at random in the data cloud. Then one does a principal components analysis of the data that fall within the sphere, and counts how many axes are needed to account for, say, 80% of the total variation; call this number p_1. Then one finds a new random location for the hypersphere, repeats the process m times, and ultimately averages p_1, ..., p_m to estimate the local dimensionality. See Banks and Olszewski (1997). (This procedure is sketched in code below, after the Indexing discussion.)

If the average local dimensionality is relatively small, even though the apparent dimensionality may be large, then there is a chance that one of the new analytical tools will be pertinent. But when the average local dimensionality is not small, it is hard to imagine that any statistical analysis will have much success in uncovering a complex model with highly multivariate interactions.

Indexing

When one is faced with too much data, a useful thing to do is to find some way of organizing the data to reflect its variation. For example, if one were given all of the IRS returns for 1997 and asked to make some kind of statistical sense of the data, it would be enormously useful to begin by getting a sense of the possible range of variation in the data. In particular, one might want to look at 20 returns that are widely spaced (with respect to a user-defined metric) in the space of all returns: presumably one would see the family with 25 children, the single mother, the two-paycheck household, and so forth. To make index creation work well, the metric must be chosen with an eye to a human's sense of distance. Ideally, one would have an area expert go through a preliminary sample, declaring rough distances based on experience, and then develop a mathematical program to find the metric that best accords with the expert's judgments. In terms of getting a rapid understanding of complex data, it is enormously useful to have a list of observations whose ambit includes virtually all kinds of behaviour found in the superlarge dataset. The construction of such an index has not been, to my knowledge, seriously addressed.
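The widely spaced observations described in the Indexing section above can be found, at least approximately, by a greedy farthest-point rule once a metric has been chosen. In the sketch below the Euclidean default merely stands in for the expert-calibrated metric discussed above, and the function name is mine.

```python
import numpy as np

def widely_spaced_index(X, k=20, dist=None, seed=None):
    """Greedily choose k observations that are widely spaced under dist(a, b):
    start from a random observation, then repeatedly add the observation whose
    distance to its nearest already-chosen observation is largest."""
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a - b))   # placeholder metric
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    chosen = [int(rng.integers(len(X)))]
    nearest = np.array([dist(x, X[chosen[0]]) for x in X])  # distance to nearest chosen point
    while len(chosen) < min(k, len(X)):
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        nearest = np.minimum(nearest, [dist(x, X[nxt]) for x in X])
    return chosen
```

For data on the scale of a year's tax returns one would run this on a manageable subsample, since the greedy rule costs on the order of n times k metric evaluations.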
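The local-dimensionality estimate described in the Triage section above can also be sketched briefly, assuming the data are the rows of a numeric array. Centring the spheres at randomly chosen observations, the 80% variance threshold, and the function name are illustrative choices of mine; the details in Banks and Olszewski (1997) may differ.

```python
import numpy as np

def average_local_dimensionality(X, r, m=100, var_explained=0.80, min_points=10, seed=None):
    """Drop a sphere of radius r at random into the data cloud, run a principal
    components analysis on the points inside it, count how many axes are needed
    to explain a fraction var_explained of the variation, and average that
    count over m placements of the sphere."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    counts, attempts = [], 0
    while len(counts) < m:
        attempts += 1
        if attempts > 100 * m:
            raise ValueError("r is too small: the spheres rarely capture enough points")
        centre = X[rng.integers(len(X))]          # land the sphere on a random observation
        local = X[np.linalg.norm(X - centre, axis=1) <= r]
        if len(local) < min_points:
            continue                               # too few points for a stable PCA
        eigvals = np.linalg.eigvalsh(np.cov(local, rowvar=False))[::-1]  # largest first
        cumulative = np.cumsum(eigvals) / eigvals.sum()
        counts.append(int(np.searchsorted(cumulative, var_explained)) + 1)
    return float(np.mean(counts))
```

The radius r has to be tuned: a sphere so small that it rarely captures enough points, or so large that it covers most of the cloud, says little about local structure.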
References

Banks, D. L., and Olszewski, R. (1997). "Estimating Local Dimensionality," to appear in _Proceedings of the Statistical Computing Section of the American Statistical Association_.

Banks, D. L., and Parmigiani, G. (1992). "Preanalysis of Superlarge Datasets," _Journal of Quality Technology_, 24, 115-129.

Barnsley, M. (1988). _Fractals Everywhere_. Academic Press, New York.