DIMACS Mini Workshop:
Exploring Large Data Sets Using Classification, Consensus, and Pattern Recognition Techniques (May 29-30, 1997)

From the Organizers

Data mining of massive data sets has drawn great interest to combinatorial methods of data analysis and, in particular, to combinatorial clustering. The initial ideas of clustering are so simple that practitioners in many fields develop such models themselves, without any support from computer science experts. At the same time, the theory of these methods is now highly developed and opens new horizons for their application, especially when the database is very large and not well studied.

The main goal of the Mini Workshop "Exploring Large Data Sets Using Classification, Consensus, and Pattern Recognition Techniques" (May 29-30, 1997, DIMACS Center, Rutgers University) was to bring together for discussion methodological researchers in clustering and practitioners who need or use clustering methods. Some related models, for instance consensus, were also presented at the Mini Workshop.

This Technical Report contains extended abstracts of talks from the Workshop (some appear only as the original abstracts). It is divided into two parts. The first contains papers focused mostly on methodological aspects of exploratory analysis. The second contains papers that mostly describe concrete applications. We have also included the original abstracts from authors who did not send us an extended one.


The fields of Classification Theory and Pattern Recognition have been maturing over the past 30 years into powerful collections of theory-based data analysis techniques. During this development, methods from discrete mathematics and theoretical computer science have had greater and greater impact. It has now become clear that combinatorial methods for data analysis, especially combinatorial clustering, have the potential to significantly affect data mining and other approaches to the analysis of massive data sets.

The main goal of the Mini Workshop "Exploring Large Data Sets Using Classification, Consensus, and Pattern Recognition Techniques" (May 29-30, 1997, at DIMACS Center, Rutgers University) was to bring methodological researchers together with practitioners to investigate problem areas where classification, consensus, and pattern recognition might be developed more specifically for exploring various types of large data sets.

This Technical Report contains abstracts of the talks presented. Some of the abstracts were 'extended' just for this Report. It is divided into two sections. The first section contains papers focused mostly on methodological aspects of exploratory analysis. The second section contains papers which mostly describe applications.

Table of Contents

David Banks
The Analysis of Superlarge Datasets (abstract)

Moses Charikar
Incremental Clustering and Dynamic Information Retrieval (abstract)

Jaime Cohen and Martin Farach
Pivot Algorithms for Clustering (postscript file)

Corinna Cortes and Daryl Pregibon
Tracking STARS in the Universe (postscript file)

Lenore Cowen
Approximate Distance Methods for Clustering High-Dimensional Data (abstract)

Dan Daly and Anne M. Chaka
Predicting Intake Valve Deposits: A Joint QSAR Project Between LZ and Purdue University Employing Neural Networks and First Principles Modeling (abstract)

Nate Dean and Kiran Chilakamarri
A Measure for Analyzing Group Interaction (abstract)

Oya Ekin, Peter Hammer and Alexander Kogan
Convexity in Logical Analysis of Data (LAD) (postscript file)

Saveli Goldberg
Inference Engine the Systems of the Dr. Watson Type (Microsoft Word file)

Pictorial Methods with Applications to Monitoring, Diagnostics and Control in Industrial Processes (html file)

Leonid Gurvits
Traditional and not-so-Traditional Applications of VC-dimension and its Generalizations (abstract)

Pierre Hansen and Nenad Mladenovic
Large Scale Clustering by Variable Neighborhood Search (abstract)

Haym Hirsh
Learning to Recommend (abstract)

Sorin Istrail and R. Ravi
Multiple Alignment of Biomolecular Sequences and Voting Paradoxes (abstract)

O.K. Kedrov
Algorithm of Multichannel On-Line Detection of Seismic Signals at Three-Component Station (gzipped postscript file)

Christopher Landauer
"Thar She Blows!": Analysis of Yellowstone Geyser Eruptions (postscript file)

Yann LeCun, Yoshua Bengio, Leon Bottou, Corinna Cortes and Vladimir Vapnik
Neural Networks and Other Numerical Learning Techniques for Pattern Recognition in Large Data Sets (abstract)

Vyacheslav Mazur and Alexander Genkin
Pareto Distributions in Business Modeling (abstract)

Alex Meystel
Algorithms of Unsupervised Learning for Organizing and Interpreting Large Data Sets (Microsoft Word file)

Boris Mirkin
Approximation clustering as a framework for solving challenging problems in processing of massive data sets (gzipped postscript file)

David Ozonoff
Environmental Epidemiology: A Natural Application for Discrete Mathematics (Microsoft Word file)

William Shannon
Clustering in Large Biomedical Databases (postscript file)

Mark Stitson and J. Weston
Function Approximation Using SV Machines (abstract)

Simon Streltsov
Traffic Behavior Pattern Analysis (abstract)

Vladimir Yancher and Alexander Genkin
Multidimensional visualization using rectangles for business applications (postscript file)