### DIMACS MiniWorkshop Exploring Large Data Sets Using Classification, Consensus, and Pattern Recognition Techniques: Abstracts

#### May 29-30, 1997 DIMACS Center, CoRE Building, Rutgers University Organizers: F.R. McMorris and Ilya Muchnik

Jaime Cohen
Dept. of Computer Science, Rutgers University
"Pivot Algorithms for Clustering"

We consider the problem of fitting an n by n distance matrix D by a tree metric T. This problem is NP-hard for most reasonable distance functions between D and T. Recently, an approximation algorithm was discovered which achieves a factor of 3 approximation to the L-infinity best fitting tree. We call this method the Single Pivot (SP) heuristic. We will describe an extension of the SP heuristic, the Double Pivot (DP) heuristic, and show that DP achieve good results compared with the Neighbor Joining heuristic. Practical aspects of the method will be presented and a list of open problems will be discussed.

(Joint work with Prof. Martin Farach.)

Lenore Cowen
Dept. of Math. Sc., Johns Hopkins University
"Approximate Distance Methods for Clustering High-Dimensional Data"

Suppose \$x_1, \ldots, x_n\$ are a collection of \$n\$ data points in \$\Re^{d}\$. The goal is to cluster this data into \$c\$ classes. We are interested in this problem in the unsupervised case (where nothing is known a priori about the class labels, and we are essentially looking for shape and structure in a ``point cloud'') as well as in the supervised case, where some partial training data indicates correct class labels for some of the points, and the goal is to construct a rule that correctly classifies the remainder. For both these problems, there is a wealth of classical statistical literature for small \$d\$, but as \$d\$ gets large, they are considered extremely hard-- leading to what Hartigan/Gordon call ``the curse of dimensionality''.

There are many senses in which the curse of dimensionality is real, whether one takes a geometric approach based on orthogonal projections, or one focuses instead on capturing inter-point distances. A recent paper of Linial, London and Rabinovich, investigated algorithmic constructions for embedding points from \$\Re^d\$ into a lower dimensional Euclidean space, while approximately preserving all inter-point distances.\footnote{Many of their results generalize to other distance metrics.} Their results are based on work of Bourgain, and Johnson and Lindenstrauss in the 1980s who showed that independent of the dimension of the original space, such points can be embedded in an \$O(\log n)\$ dimensional space, with at most a \$O(\log n)\$ distortion in any of the pairwise distances-- and in some cases this is tight.

These results allows a tradeoff of approximation for dimension-- in the case where \$\log n\$ is smaller than \$d\$, if we can tolerate the distortion, we reduce the curse of dimensionality. However, even if we can tolerate the distortion, for many applications, \$O(\log n)\$ dimensions is still too high to apply classical clustering and classification methods. Instead we propose a parametirized version of approximate distance dimension reduction, called ADC (for approximate distance clustering) where we introduce a family of maps based on distances to small random subsets, that preserve some, but not all pairwise distances at once. Note that if we are looking at a high-dimensional data set that clusters, we are mainly interested in preserving the inter-class distances, if the intra-class distances collapse, this only increases the clustering in the low-dimensional space.

The methods involve both the combinatorics of these random subsets, and resulting rank sum statistics, and a theoretical computer science approach to randomized algorithms.

(Joint work with C.E. Priebe)

Boris Mirkin
DIMACS, Rutgers University
"Approximation clustering as a framework for solving
challenging problems in processing of massive data sets"

Approximation clustering is an extension of what is usually called ``square-error clustering'' onto various cluster structures (such as single clusters or hierarchies) and criteria. It appears to be not only a mathematical device to support, specify and extend K-Means and other fast clustering techniques, but also a framework for developing workable concepts and fast procedures for such challenging problems in machine learning and data processing as approximate concept learning, feature selection, constructive induction, knowledge discovery, data compression/decompression, and design and use of multiresolution hierarchies. It appears, for instance, that the cluster constructions involved parallel to such image/signal processing concepts as wavelet and quadtree; however, the framework suggested seems to offer some potential improvements based on relaxing ``continuity and homogeneity'' restrictions of classical theories.

Peter Hammer
RUTCOR, Rutgers University
"Convexity in Logical Analysis of Data (LAD)"

The paper presents a polynomial solution to the important LAD problem of constructing a convex Boolean function which agrees with a given set of observations. This solution uses a complete combinatorial and algebraic characterization of convex functions that implies the uniqueness of their prime disjunctive normal form representation.

(Joint work with Oya Ekin and Alex Kogan)

Pierre Hansen
"Large scale clustering by variable neighborhood search"

Traditional clustering algorithms such as KMEANS or HMEANS give poor results when applied to large data sets with many clusters. A new metaheuristic, Variable Neighboorhood Search, is applied to large scale clustering with various criteria: minimum sum of squares, minimum sum of stars (the p-median problem) and minimum sum of continuous stars (the multisource Weber problem). Computational results on problems with up to several thousand entities is reported on.

Yann LeCun
AT&T Labs
"Neural Networks and other numerical learning techniques
for pattern recognition in large data sets"

Yann LeCun, Yoshua Bengio, Leon Bottou,
AT&T Labs Research, Red Bank, NJ

Among all the machine learning algorithms proposed in recent years, only a few can cope with high-dimensional input vectors and very large numbers of samples. Learning processes can usually be described as minimizing an objective function that measures the performance of the machine. Learning algorithms are characterized by the type of the objective function they minimize (discrete, continuous, convex), and the method they use to find the minimum (combinatorial, gradient-based, direct). Gradient-based learning algorithms such as Neural Networks and Support Vector Machines have been successfully applied to pattern recognition problems with hundred of input variables and hundreds of thousands of samples. Neural Networks minimize a non-convex objective functions with gradient-based techniques, while Support Vector Machines minimize a quadratic objective function with linear constraints. Both methods can learn high-dimensional, non-linear decision surfaces with tens of thousands of free parameters. However, large databases are often very redundant, and the relevant information is difficult to extract. Emphasizing methods, such as boosting and active set techniques, address that problem by increasing the relative statistical weight of difficult or non-typical samples. Applications of those methods to pattern recognition, fault prediction, and database marketing will be presented.

( Joint work with Yoshua Bengio, Leon Bottou, Corinna Cortes, and Vladimir Vapnik)

Simon Streltsov
Alphatech Inc.
"Traffic behavior pattern analysis"

Moving target indication (MTI) sensors provide surveillance information about ground traffic. We are interested in discovering and explaining unusual traffic patterns observed over long periods of time.

A priori knowledge about traffic patterns is limited and clutter (background traffic) may have a complex structure. We demonstrate how clustering techniques can be used in order to identify different classes of objects and understand interactions between them, and discuss arising problems in search and generalization of associations and feature selection.

Nate Dean
Bell Labs
"A Measure for Analyzing Group Interaction"

We propose a measure that ranks groups based on their interaction with the rest of some predefined universe. The measure is easy to evaluate, and we explore possible applications. We also derive formulas for the mean and variance of this measure that can be computed quickly and hence used effectively in an interactive data analysis environment. We outline a computational method for higher order moments of this measure, but this computation involves a deeper analysis of the associated network structure.

joint work with Kiran Chlakamarri

Cross/Z Intl., CA
"Multidimensional visualization using rectangles for business applications"

The problem of data and model visualization in multidimensional attribute space is pending. We introduce approach to build and visualize multidimensional rectangles. The idea is to approximate the given set of points in the attribute space by a set of reasonably large rectangular selections having standard presentation form and covering desirable area that cannot be described in another way. A kind of decision tree building approach is suggested. With the interactive visualization capability the user can locate qualified prospect with the same degree of accuracy as existing DBMS tools (find the right customers in the right place at the right time) but much faster.

This is the joint work with Alex Genkin, IITP, Rus. Acad. Sci.

Bill Shannon
Washington University School of Medicine, MO
"Clustering in Large Biomedical Databases"

Health services research (HSR) is concerned with measuring the performance and quality of health care delivery strategies (e.g., managed care versus fee for service; hospital based versus clinic based) for policy decision making. Performance and quality are often measured in terms of health care resource utilization and how it relates to patient outcomes. Intrinsic in HSR is the concept of 'case mix' which are clusters of patients with similar disease severity, health care access, etc. Large research-oriented academic medical centers have more severe disease case mix than do small regional hospitals, and so will have patients with worse outcomes and higher resource utilization. Direct comparisons of academic with non-academic centers will be invalid if case mix is not controlled. Important sources of data for HSR and case mix identification are electronic medical records and administrative claims databases. In this talk I will discuss our experiences to date with large medical databases, and argue why automatic clustering and classification strategies are essential for controlling patient case mix.

Mark Stitson
University of London
"Function approximation using SV Machines"

Function approximation using SV Machines
by M. O. Stitson and J. Weston

Support Vector Machines are a new type of universal learning machines that implement SRM inductive inference. These machines can construct approximations of multi-dimensional functions in various sets of functions: splines, Fourier expansions, radial basis function expansion etc. SV machines effectively control the trade off between the accuracy of approximation and the complexity of the approximating function.

In this talk we will discuss practical implementation issues of SV machines for large data bases, consider examples of function approximation in different sets of functions and demonstrate the effectiveness of controlling complexity versus accuracy of an approximation. We will also discuss some examples of solving real-life problems using SV machines.

(Joint work with J. Weston)

Moses Charikar
Stanford University
"Incremental Clustering and Dynamic Information Retrieval"

Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. The goal is to efficiently maintain k clusters of small diameter as new points are inserted. We propose a model called incremental clustering based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance, and which we believe should also perform well in practice. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.

This is joint work with Chandra Chekuri, Tomas Feder and Rajeev Motwani.

Alex Meystel
NIST, Washington, DC
"Algorithms of Unsupervised Learning for Organizing
and Interpreting Large Data Sets"

Unsupervised learning is performed by using computational algorithm which treats a large data set as an external environment in which the goal is supposed to be achieved and the proper rules of the goal achievement should be learned. As a byproduct of the learning process, the data end up to be organized in a multigranular (multiscale) system of "world representation" which allows for effective interpretation. The latter is done via tracing connections between two multigranular hierarchies: for the concepts and for the temporal changes (if the time factor is involved). The algorithm performs recursively the goal oriented clustering of the available information. The structure of final "world representation" is intimately linked with mechanisms of "behavior generation" as far as the goal of analysis is specified.

Dan Daly
The Lubrizol Corporation, OH
"Predicting Intake Valve Deposits: A Joint QSAR Project Between LZ and
Purdue University Employing Neural Networks and First Principles Modeling"

A QSAR study was conducted for the prediction of Intake Valve Deposit (IVD) formation in a BMW engine test developed for gasoline quality evaluation. The analysis performed were: Stepwise Linear Regression, neural network models using descriptors from Stepwise Linear Regression analysis as inputs, neural network models using inputs from Principal Component Analysis and neural network models using inputs from a Partial Least Square Analysis. The best neural network based on the root mean squared error is one using four principal components as input features and three hidden neurons. This model was an improvement over the accuracy of previous linear and neural net models. In an attempt to reduce this error even further, workers from Lubrizol and Purdue calculated a set of 'meta descriptors', which are based on current theories of IVD formation and yielded a neural net model derived on first principles.

Haym Hirsh
Dept. of Computer Science, Rutgers University
"Learning to Predict a User's Next Command"

Although most users develop their own distinctive and often idiosyncratic modes of interactions with a command-line shell, for any particular user the user's interactions often reflect many systematic underlying patterns of use. This talk will describe work on developing a personal assistant for UNIX that learns to recognize such systematic patterns of use and makes them available to the user. We have explored a range of learning mechanisms for predicting a user's next command, and evaluated our methods on command histories from 77 people, consisting of a total of more than 150,000 commands from a 2-6 month period. Our best prediction method obtains an accuracy of over 45%

David Ozonoff
Boston University
"Environmental Epidemiology: A Natural Application for Discrete Mathematics"

Epidemiologic studies of environmental influences on health involve a comparison of rates, ratios or proportions of an attribute or attributes in two or more populations. Mathematical models and tools used in epidemiology traditionally have been confined to statistical techniques used to evaluate the role of chance or to provide a quantitative description of the comparisons (e.g., hypothesis testing, regression modeling, statistical inference). Recently dynamical systems models of disease evolution (most often continuous) have been introduced, usually for infectious disease problems (e.g., maliaria, AIDS). However, this talk will focus attention on other questions that arise naturally in epidemiology which might usefully be formulated as problems in discrete mathematics. Some involve the large datasets that commonly occur, others are a result of the fact that people occur as integers (counts), still others from the kinds of algebraic structures (e.g., posets and lattices) that arise in certain applications. This will be a general survey of the kinds of problems that occur, framed in a way to suggest applications for discrete mathematics.

David Banks
Dept. of Statistics, Carnegie Mellon University
"Superlarge Datasets: Three Problem Areas"

Massive datasets engender unique problems that aren't simply scale-ups of those encountered in the analysis of smaller samples. Three examples of such issues are:

1.) Pre-analysis. The task of preparing a superlarge dataset is harder and different than that faced in classical statistics. It becomes necessary to automate the kind of routine data audits people usually make by eye, so that outliers, missing values, and unforeseen patterns can be detected without direct human involvement.

2.) The Curse of Dimensionality. Despite some impressive recent results, the practical truth is that high-dimensional data defeats analysis. However, the most successful new methods rely on a common, but unenunciated, strategy---the discovery of locally low-dimensional structure. On a related topic, it is also necessary to estimate the local sparsity of the data.

3.) Index Creation. Analysts need to develop a way of indexing massive datasets, so that it is possible to examine a set of "snapshots" that capture the range of behavior in the data, where the kind of behavior is described in some general, context-specific manner. This problem is most easily appreciated in the context of databases that consist of large numbers of images, such as are obtained from the Voyager spacecraft.

This talk details these problems, and describes solutions (or at least solution strategies) for each. The first two topics are more mature than the last.

Vyacheslav Mazur
Cross/Z Intl.

Some variables in the business data sets, especially the variables of econometric nature, belong to Pareto distribution law and its close relatives. These variables are very inconvenient to process using conventional statistics technique; some of them have infinite variance. The report describes how to recognize these variables and how to use them in modeling in the fully automated manner. Approach is implemented in the QO Voyager (TM) by Cross/Z, Intl.

This is the joint work with Alex Genkin, IITP, Rus. Acad. Sci.

Corinna Cortes
AT&T Labs
"Tracking STARS in the Universe"

We draw certain analogies between computational astronomy and mega-business applications as regards the ability to track large numbers of objects or entities. In astronomy the objects could be stars while in the world of telecommunications, STARS might be Standard Telephone Alphanumeric Recording Symbols (i.e. telephone numbers). In both cases the universe contains many such entities (trillions of stars and billions of STARS). In the business world, an understanding of STARS could be helped by questionnaires. But the quality of self-reported data is typically very low, so it is necessary to adopt the practice of astronomers only to rely on observed data. In other words, we will not and cannot design experiments that will provide information on a STAR but simply wait for the STAR to expose itself.

Observations on stars and STARS are continuous, though not every object is sighted on every day. In fact, most entities appear quite infrequently while others seemto be highly visible every day. Scientific and business needs dictate that we understand each and every entity, in the very least, for cataloging purposes. Telecommunication network traffic is used to stimulate the discussion. In this case we are concerned with monitoring the behavior of tens of millions of STARS based on observing transactions between STARS that number in the hundreds of millions per day. Thus we consider dynamic statistical models with many millions of parameters, each of which needs to be meaningfully defined and understood. The discussion focuses on data access and management, statistical modeling, and presentation issues, especially regarding the user interface to the fitted models.

This is the joint work with Daryl Pregibon.

Christopher Landauer
The Aerospace Corporation
""Thar She Blows!": Analysis of Yellowstone Geyser Eruptions"

This talk is about analyzing a data set that has been collected off and on for over a hundred years. This data records the geyser eruptions in Yellowstone National Park, and while it is not massive in the sense of terabytes or more, it makes up for that in anomalies: it is very heterogeneous, partly missing and somewhat erroneous (almost all data collection is by volunteers who trek out to the geyser basins and watch). It affords a very good (and stringent) test of any data analysis method.

The purpose of the data collection and analysis project is to support models of the causal interconnections among the geysers and predictive models of the eruptions (including times and other characteristics). Currently both are based on informal and anecdotal descriptions of their temporal relationships. The models are further complicated by the fact that occasional earthquakes (Yellowstone is in a _very_ active seismic area) may change the subterranean connectivity, and definitely change some of the eruption characteristics.

We will describe some of the data, some preliminary models that have been developed before, and show how the many different techniques available can be used (or not) in this context. This is very much a "work in progress".

Sorin Istrail
Sandia National Laboratories
"Multiple Alignment of Biomolecular Sequences and Voting Paradoxes"

The notion of multiple sequence alignment appeared for the first time in a 1963 paper by L. Pauling and E. Zuckerkandl entitled: "Chemical Paleogenetics: Molecular Restoration Studies of Extinct Forms of Life." Multiple sequence alignment is today one of the most vibrant and active research areas of computational molecular biology. Although it plays a founding role in the development of the field, and a vital role in the recent breakthrough discoveries, the research in this area is facing challenges in the development of methods that are simultaneously: well-defined, biologically meaningful, and computationally feasible.

The situation resembles in spirit the mathematical economics line of research pioneered by the axiomatic approach of Kenneth J. Arrow, who asked whether one can design a system of voting that is at the same time: rational, decisive and fair. Voting -- as studied by economists, political scientists, and philosophers -- and multiple sequence alignment are methods of aggregation of a set of individual patterns into a collective pattern that represents in a fair and meaningful way the set of individual patterns. The 1972 Nobel Prize for Economics was given to Arrow for his discovery of the Voting Impossibility Theorem: "No voting method could be simultaneously rational decisive and fair!"

In this talk we will present preliminary results on a similar axiomatic approach for understanding the intrinsic difficulties in multiple sequence alignment, for the particular case when no phylogenetic tree information is available. We will review the literature, focusing on the search for guiding principles that are well-agreed upon for creating biologically significant alignments. We have identified three basic Axioms, proposed in the literature, which in our view capture the essence of the competing difficulties: well-definedness, biological meaningfulness, and computational feasibility. No alignment method in the literature satisfies simultaneously these three axioms. We will present (1) a new alignment method, the Perron-Frobenius Alignment, that appears to be the closest in the literature to satisfying the three axioms, and (2) a new unifying framework for combinatorial optimization alignment problems, based on "fair" objective functions. We will conclude by discussing our attempts of proving a "Molecular Restoration Impossibility Theorem."

(Joint work with R. Ravi, Carnegie Mellon)

Leonid Gurvits
NECI,Princeton and DIMACS,Rutgers
of VC-dimension and its generalizations"

In a first part of the talk I will discuss briefly traditional applications of VC-dimension: uniform laws in statistics and why VC-dimension is so crucial for the Learning in a PAC setting.

In a second(larger) part , I will discuss, among not-so-traditional applications, connections between VC-dimension and its real-valued generalizations to interpolations, graph theory, approximation, functional analysis, control theory, measure theory, infinite combinatorics and analysis of stoping times of some algorithms.

Also, I will present one general idea how to bound scale-sensitive dimension using Moments Inequalities.

Previous: Participation
Next: Registration
Index
DIMACS Homepage
Contacting the Center