DIMACS Special Focus on Data Analysis and Mining Seminar Series



Speaker: Amy Braverman
Affiliation: Jet Propulsion Laboratory, Caltech
Topic: Reducing Size and Complexity of Large
   Geophysical Data Sets


This talk discusses a procedure for compressing large data sets,
particularly geophysical ones like those obtained from remote sensing
satellite instruments. Data are partitioned by space and time, and a
penalized clustering algorithm applied to each subset independently.
The algorithm is based on the entropy-constrained vector quantizer
(ECVQ) of Chou, Lookabaugh and Gray (1989). In each subset ECVQ
trades off error against data reduction to produce a set of
representative points that stand in for the original observations.
Since data are voluminous, a preliminary set of representatives is
determined from a sample, then the full subset is clustered by
assigning each observation to the nearest representative point. After
replacing the initial representatives by the centroids of these final
clusters, the new representatives and their associated counts
constitute a compressed version, or summary, of the raw data. Since
the initial representatives are derived from a sample, the final
summary is subject to sampling variation. A statistical model for the
relationship between compressed and raw data provides a framework for
assessing this variability, and other aspects of summary quality. The
procedure is being used to produce low-volume summaries of
high-resolution data products obtained from the Multi-angle Imaging
SpectroRadiometer (MISR), one instrument aboard the NASA's Terra
satellite. MISR produces approximately 2 TB per month of radiance and
geophysical data. Practical considerations for this application are
discussed, and a sample analysis using compressed MISR data presented.

2. Johannes Gehrke, Cornell University Bias Correction in Classification Tree Construction We address the problem of bias in split variable selection in classification tree construction. A split criterion is unbiased if the selection of a split variable X is based only on the strength of the dependency between X and the class label, regardless of other characteristics (such as the size of the domain of X); otherwise the split criterion is biased. In the talk, we address the following four issues: (1) We give a definition that allows us to quantify the extent of the bias of a split criterion, (2) we show that the p-value of any split criterion is a nearly unbiased criterion, (3) we give theoretical and experimental evidence that the correction is successful, and (4) we demonstrate the power of our method by correcting the bias of the gini gain.
3. Data Depth: Nonparametric Multivariate Analysis and Its Applications Regina Liu, Rutgers University Abstract: The recent advances in computer technology have facilitated greatly the collection of large scale high dimensional data, and statisticians face increasingly the task of analyzing large multivariate datasets. The classical multivariate analysis is well developed, but its applicability is much restricted by its intrinsic elliptical nature. Recently, an alternative multivariate methodology based on the concept of data depth has been developed. This methodology is completely nonparametric. In other words, it requires no specification of distributions or models. A data depth is a measure of how deep or how central a given point is with respect to a multivariate distribution. It gives rise to a new set of parameters which can easily quantify the many complex multivariate features of the underlying distribution. These parameters can also be expressed by simple graphs. Furthermore, the center outward ranking of the sample points provided by a data depth leads to a systematic nonparametric multivariate inference scheme. Applications of data depth and the depth ordering include constructions of confidence regions, determinations of P-values for testing multivariate hypotheses, regression, multivariate rank tests, and multivariate statistical quality control. Some of these applications will be demonstrated in an aviation safety analysis with some airline performance data collected by the FAA from 1993 to 1998. Finally, some of the difficulties in constructing implementable algorithms for computing data depth and related estimators will be described.
4. Understanding High Dimensional Classifiers Ofer Melnik, DIMACS, Rutgers University Many models and classifiers can appear as black boxes. That is, once they are trained or fitted, it is hard to understand what they say about the data they were trained on. This difficulty stems from the number of model parameters, their interdependence and the typically high-dimensionality of the training data. Decision Region Connectivity Analysis (DRCA) is a method to analyze classifiers that use decision regions to generalize, to peer into their black-box. The method's strongest suits are its model independence and dimensionality independence. Thus, it can be used to compare completely different model types across their greatest commonality, their method of generalization. In this talk I'll start by presenting the core DRCA method and then continue with some newer research threads. These threads may include, non-visual analysis, the graph as a blue print for model construction, extensions to multi-class analysis, multi-class model construction, looking at the decision regions of boosting, and others.
5. Title: Algorithms for Document Retrieval Problems Speaker: S. Muthu Muthukrishnan Affiliation: AT&T Research Abstract: We are given a collection of text documents (strings) which may be preprocessed. Given a document retrieval query with a pattern string, consider 1. Occurrence Listing: List all locations in the documents where the pattern occurs. 2. Document Listing: List all distinct documents that contain (one or more occurrences of) the pattern. In 1973, Weiner solved the first problem in bounds linear in all parameters, hence optimally. Surprisingly, the second problem has been open even though it is closely related to the first. Arguably, the second problem is far more natural. I will give an optimal algorithm for the second problem. Also, I will describe algorithms for a number of problems of similar flavor: find documents that contain at least K occurrences of the pattern (motivation: mining significant patterns in text databases), find documents that contain occurrences of one or more patterns at most distance K apart (motivation: eg., NEAR query in SQL database, repeats in Computational Biology), etc. Techniques involve range searching data structures with colored objects.

Document last modified on January 8, 2002.