Classification Society of North America 2006 Meeting on Network Data Analysis and Data Mining:
Applications in Biology, Computer Science, Intrusion Detection, and Other areas

May 10 - 13, 2006
DIMACS Center, Rutgers University, Piscataway, NJ

Mel Janowitz, DIMACS,
David Banks, Duke University, (IMS Representative)
Program Committee:
David Banks, Duke University,
Stanley L. Sclove, University of Illinois at Chicago,
William Shannon, Washington University School of Medicine,
The Classification Society of North America (CSNA)

This meeting will be held partly as a joint meeting with the DIMACS workshop on Clustering Problems in Biological Networks May 9 - 11, 2006.
The CSNA meeting is co-sponsored by The Institute of Mathematical Statistics.


Sanjoy Dasgupta, UCSD

Title: Active learning of linear separators

In the well-studied setting of "supervised learning", the goal is to learn a classifier from *labeled* data: for instance, given a collection of credit card transactions which have been labeled as legitimate or fraudulent, to learn a rule which will accurately classify future transactions.

The problem with this model is that often, *unlabeled* data is easy to come by but labels are expensive. For instance, if you're building a speech recognizer, it's easy enough to get raw speech samples -- just walk around with a microphone -- but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes.

In the field of "active learning", the goal is as always to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in the regular supervised learning framework (that is, when all points automatically come labeled).

I'll give detailed background on active learning, and then show that for a wide range of data distributions, it is possible to learn linear separators using exponentially fewer labels than would be needed in supervised learning.

Mustapha Aouchiche, École Polytechnique de Montréal
Gilles Caporossi, Gerad and HEC Montréal and
Pierre Hansen, Gerad and HEC Montréal

Title: Proximity and Remoteness in Social Networks

The transmission of a vertex v in a graph, which may model a social network, is equal to the sum of distances from v to all other vertices. For ease of comparison with other graph invariants such as diameter, radius, maximum, minimum or average degree and index, it is convenient to divide the transmission by the order of the graph minus one. We call proximity the smallest so normalized transmission of the graph and remoteness the largest one. Using the system AutoGraphiX 2 a systematic comparison between proximity, remoteness and a series of graph invariants was performed. Particular attention was given to indices of centrality. This systematic exploration led to a large number of algebraic or structural conjectures. The easiest ones are proved automatically by the system. Many others are proved interactively or in an unaided way. Several conjectures remain open.

Casimir Kulikowski, Akshay Vashisht, Ilya Muchnik, Department of Computer Science and DIMACS, Rutgers University

Title: Clustering and Machine Learning over Genomic Databases with Underlying (Network or Tree) Structure

The rapid increase in available genomic datasets, especially from complete genomes, raises the urgent need for automatic methods of clustering and machine learning to discover relationships between protein groups based on sequence, structure, and functional information. The state of the art is often summarized in the form of network or tree structures encoding putative evolutionary or functional relationships between the proteins. These can serve as important tools for biasing and increasing the efficiency of cluster searches, and for their comparative analysis, interpretation, and validation.

Our group has developed a novel multipartite graph cluster extraction procedure which has been demonstrated to work efficiently in extracting orthologs across large collections of complete genomes as well as incomplete genomic sequence sets. Comparative results against expertly manually curated COG and KOG clusters yields high degrees of agreement. Used complementarily with related techniques for paralog extraction within genomes also developed within our group, we illustrate how these methods can provide powerful tools for extracting structured clusters (of conserved proteins) by rapid automatic analysis of massive sequence databases.

Christina Leslie, Columbia University

Title: Multi-class protein classification using string kernels and adaptive codes

One of the central problems in computational biology is the classification of protein sequences into functional and structural families based on sequence homology. Traditional approaches to this task include alignment-based algorithms and probabilistic models like profile hidden Markov models for protein families. However, these methods are less successful for remote homology detection, where one wants to predict a structural relationship between sequences that have diverged over a long evolutionary distance.

Our approach to protein remote homology detection and fold recognition is to use modern discriminative methods from machine learning, namely support vector machine classifiers (SVMs), combined with kernels specialized for biological sequence data. To this end, we introduce novel and efficient-to-compute string kernels that incorporate biologically motivated notions of inexact string matching, based on shared approximate occurrences of short subsequences ("k-mers"). Extending these ideas to take advantage of abundant unlabeled data -- large databases of protein sequences whose structural class is unknown -- we then define cluster kernels and profile-based string kernels that outperform all currently available remote homology detection methods. In the case of profile kernels, we can also interpret the SVM classifier by extracting discriminative motif regions that suggest conserved structural subunits in a protein superfamily.

Finally, to deal with the large underlying multi-class problem -- there are hundreds of protein folds and over 1000 superfamilies in widely used structural classification systems -- we introduce an adaptive code approach for learning to weight prediction scores from different binary SVM classifiers. Our adaptive code learning approach significantly outperforms 1-vs-all for this problem and also allows us to use optimize with respect to different loss functions. We are building and will soon release a full multi-class fold prediction server, called SVM-fold, which will incorporate both our string kernel and our adaptive code learning work.

David Madigan, Rutgers University

Title: The Power of the Prior: Improving Predictive Performance with Extra-data Knowledge

Standard approaches to classification and regression fail to utilize extra-data knowledge. Such knowledge can include class-specific descriptors, physical constraints, and exchangeability judgments. Using a number of examples, this talk will demonstrate the benefits and costs of harnessing such knowledge. The Bayesian paradigm proves especially convenient in this context.

Fionn Murtagh, University of London

Title: Identifying and Exploiting Ultrametricity

We begin with pervasive ultrametricity due to high dimensionality and/or spatial sparsity. How extent or degree of ultrametricity can be quantified leads us to the discussion of varied practical cases when ultrametricity can be partially or locally present in data. We show how the ultrametricity can be assessed in text or document collections, in time series signals, and in chemical information retrieval. We conclude with a discussion of ultrametricity in astrophysics, relating to observational cosmology.

Panos M. Pardalos, University of Florida

Title: Data Mining and Network Models of Massive Datasets

In this talk we discuss new research directions in data mining - using network-based models for data analysis and decision making. In many practical situations, a real-world dataset can be represented as a large graph (network) with certain attributes associated with its vertices and edges. These attributes may contain specific information characterizing the given application, which often provides a new insight into the internal structure and patterns of the data. The considered examples include telecommunications, social networks, biomedicine, and finance.

Salvatore J. Stolfo, Columbia University

Title: Unsupervised Anomaly detection in Computer Security

Anomaly detection has been an active area of research in computer security for some time. There are now several commercial products that include anomaly detectors of one sort or another for network and host-based intrusion detection. We present an overview of anomaly detection used in computer security, and provide two examples of content-based anomaly detectors designed to detect suspicious network packets that may indicate a zero-day exploit. The PAYL Payload Anomaly Detector is based upon an efficient statistical analysis of content flows. Recent work has shown that PAYL may be blinded by mimicry attack. The Anagram anomaly detector is designed to resist such attacks. Either sensor is intended to be used not as a standalone detector but one of a correlated set of sensors.


Salvatore J. Stolfo is Professor of Computer Science at Columbia University. He received his Ph.D. from NYU Courant Institute in 1979 and has been on the faculty of Columbia ever since. (See He has published extensively in the areas of parallel computing, AI Knowledge-based systems, Data Mining and most recently Computer Security and Intrusion Detection Systems. His research has been supported by DARPA, NSF, ONR, and numerous companies and state and federal agencies.

Previous: Invited Speakers
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on April 11, 2006.