DIMACS Special Focus on Data Analysis and Mining Seminar Series

Abstracts:


1.

Speaker: Amy Braverman
Affiliation: Jet Propulsion Laboratory, Caltech
Topic: Reducing Size and Complexity of Large
   Geophysical Data Sets

ABSTRACT

This talk discusses a procedure for compressing large data sets,
particularly geophysical ones like those obtained from remote sensing
satellite instruments. Data are partitioned by space and time, and a
penalized clustering algorithm applied to each subset independently.
The algorithm is based on the entropy-constrained vector quantizer
(ECVQ) of Chou, Lookabaugh and Gray (1989). In each subset ECVQ
trades off error against data reduction to produce a set of
representative points that stand in for the original observations.
Since data are voluminous, a preliminary set of representatives is
determined from a sample, then the full subset is clustered by
assigning each observation to the nearest representative point. After
replacing the initial representatives by the centroids of these final
clusters, the new representatives and their associated counts
constitute a compressed version, or summary, of the raw data. Since
the initial representatives are derived from a sample, the final
summary is subject to sampling variation. A statistical model for the
relationship between compressed and raw data provides a framework for
assessing this variability, and other aspects of summary quality. The
procedure is being used to produce low-volume summaries of
high-resolution data products obtained from the Multi-angle Imaging
SpectroRadiometer (MISR), one instrument aboard the NASA's Terra
satellite. MISR produces approximately 2 TB per month of radiance and
geophysical data. Practical considerations for this application are
discussed, and a sample analysis using compressed MISR data presented.


2.

Johannes Gehrke, Cornell University
Bias Correction in Classification Tree Construction

We address the problem of bias in split variable selection in
classification tree construction. A split criterion is unbiased if the
selection of a split variable X is based only on the strength of
the dependency between X and the class label, regardless of other
characteristics (such as the size of the domain of X); otherwise
the split criterion is biased. In the talk, we address the following
four issues: (1) We give a definition that allows us to quantify
the extent of the bias of a split criterion, (2) we show that the
p-value of any split criterion is a nearly unbiased
criterion, (3) we give theoretical and experimental evidence that
the correction is successful, and (4) we demonstrate the power of
our method by correcting the bias of the gini gain.


3.

Data Depth:  Nonparametric Multivariate Analysis and 
                    Its Applications
Regina Liu, Rutgers University

Abstract:

The recent advances in computer technology have facilitated 
greatly the collection of large scale high dimensional data, 
and statisticians face increasingly the task of analyzing large 
multivariate datasets. The classical multivariate analysis 
is well developed, but its applicability is much
restricted by its intrinsic elliptical nature. Recently, an 
alternative multivariate methodology based on the concept of 
data depth has been developed. This methodology is completely 
nonparametric. In other words, it requires no specification of
distributions or models.

A data depth is a measure of how deep or how central a 
given point is with respect to a multivariate distribution. It 
gives rise to a new set of parameters which can easily quantify 
the many complex multivariate features of the underlying 
distribution. These parameters can also be expressed by simple 
graphs. Furthermore, the center outward ranking of the sample 
points provided by a data depth leads to a systematic nonparametric 
multivariate inference scheme. Applications of data depth and the 
depth ordering include constructions of confidence regions, 
determinations of P-values for testing multivariate hypotheses, 
regression, multivariate rank tests, and multivariate statistical 
quality control. Some of these applications will be demonstrated 
in an aviation safety analysis with some airline performance data 
collected by the FAA from 1993 to 1998.

Finally, some of the difficulties in constructing implementable
algorithms for computing data depth and related estimators will 
be described.


4.

Understanding High Dimensional Classifiers
Ofer Melnik, DIMACS, Rutgers University

Many models and classifiers can appear as black boxes. That is, once
they are trained or fitted, it is hard to understand what they say
about the data they were trained on. This difficulty stems from the
number of model parameters, their interdependence and the typically
high-dimensionality of the training data.

Decision Region Connectivity Analysis (DRCA) is a method to analyze
classifiers that use decision regions to generalize, to peer into
their black-box. The method's strongest suits are its model
independence and dimensionality independence. Thus, it can be used to
compare completely different model types across their greatest
commonality, their method of generalization.

In this talk I'll start by presenting the core DRCA method and then
continue with some newer research threads. These threads may include,
non-visual analysis, the graph as a blue print for model construction,
extensions to multi-class analysis, multi-class model construction,
looking at the decision regions of boosting, and others.


5.

Title: Algorithms for Document Retrieval Problems
Speaker: S. Muthu Muthukrishnan
Affiliation: AT&T Research

Abstract:

We are given a collection of text documents (strings)
which may be preprocessed. Given a document retrieval
query with a pattern string, consider

1. Occurrence Listing: List all locations in the
documents where the pattern occurs.
2. Document Listing: List all distinct documents that
contain (one or more occurrences of) the pattern.

In 1973, Weiner solved the first problem in bounds
linear in all parameters, hence optimally. Surprisingly,
the second problem has been open even though
it is closely related to the first. Arguably, the second
problem is far more natural.

I will give an optimal algorithm for the second problem.
Also, I will describe algorithms for a number of problems
of similar flavor: find documents that contain at least K
occurrences of the pattern (motivation: mining
significant patterns in text databases), find documents
that contain occurrences of one or more patterns at most
distance K apart (motivation: eg., NEAR query in SQL
database, repeats in Computational Biology), etc.
Techniques involve range searching data structures with
colored objects.
Document last modified on January 8, 2002.