This meeting will be held partly as a joint meeting with the DIMACS workshop on
Clustering Problems in Biological Networks May 9 - 11, 2006.
The CSNA meeting is co-sponsored by The Institute of Mathematical Statistics.
Title: Active learning of linear separators
In the well-studied setting of "supervised learning", the goal is to learn a classifier from *labeled* data: for instance, given a collection of credit card transactions which have been labeled as legitimate or fraudulent, to learn a rule which will accurately classify future transactions.
The problem with this model is that often, *unlabeled* data is easy to come by but labels are expensive. For instance, if you're building a speech recognizer, it's easy enough to get raw speech samples -- just walk around with a microphone -- but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes.
In the field of "active learning", the goal is as always to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in the regular supervised learning framework (that is, when all points automatically come labeled).
I'll give detailed background on active learning, and then show that for a wide range of data distributions, it is possible to learn linear separators using exponentially fewer labels than would be needed in supervised learning.
Title: Proximity and Remoteness in Social Networks
The transmission of a vertex v in a graph, which may model a social network, is equal to the sum of distances from v to all other vertices. For ease of comparison with other graph invariants such as diameter, radius, maximum, minimum or average degree and index, it is convenient to divide the transmission by the order of the graph minus one. We call proximity the smallest so normalized transmission of the graph and remoteness the largest one. Using the system AutoGraphiX 2 a systematic comparison between proximity, remoteness and a series of graph invariants was performed. Particular attention was given to indices of centrality. This systematic exploration led to a large number of algebraic or structural conjectures. The easiest ones are proved automatically by the system. Many others are proved interactively or in an unaided way. Several conjectures remain open.
Title: Clustering and Machine Learning over Genomic Databases with Underlying (Network or Tree) Structure
The rapid increase in available genomic datasets, especially from complete genomes, raises the urgent need for automatic methods of clustering and machine learning to discover relationships between protein groups based on sequence, structure, and functional information. The state of the art is often summarized in the form of network or tree structures encoding putative evolutionary or functional relationships between the proteins. These can serve as important tools for biasing and increasing the efficiency of cluster searches, and for their comparative analysis, interpretation, and validation.
Our group has developed a novel multipartite graph cluster extraction procedure which has been demonstrated to work efficiently in extracting orthologs across large collections of complete genomes as well as incomplete genomic sequence sets. Comparative results against expertly manually curated COG and KOG clusters yields high degrees of agreement. Used complementarily with related techniques for paralog extraction within genomes also developed within our group, we illustrate how these methods can provide powerful tools for extracting structured clusters (of conserved proteins) by rapid automatic analysis of massive sequence databases.
Title: Multi-class protein classification using string kernels and adaptive codes
One of the central problems in computational biology is the classification of protein sequences into functional and structural families based on sequence homology. Traditional approaches to this task include alignment-based algorithms and probabilistic models like profile hidden Markov models for protein families. However, these methods are less successful for remote homology detection, where one wants to predict a structural relationship between sequences that have diverged over a long evolutionary distance.
Our approach to protein remote homology detection and fold recognition is to use modern discriminative methods from machine learning, namely support vector machine classifiers (SVMs), combined with kernels specialized for biological sequence data. To this end, we introduce novel and efficient-to-compute string kernels that incorporate biologically motivated notions of inexact string matching, based on shared approximate occurrences of short subsequences ("k-mers"). Extending these ideas to take advantage of abundant unlabeled data -- large databases of protein sequences whose structural class is unknown -- we then define cluster kernels and profile-based string kernels that outperform all currently available remote homology detection methods. In the case of profile kernels, we can also interpret the SVM classifier by extracting discriminative motif regions that suggest conserved structural subunits in a protein superfamily.
Finally, to deal with the large underlying multi-class problem -- there are hundreds of protein folds and over 1000 superfamilies in widely used structural classification systems -- we introduce an adaptive code approach for learning to weight prediction scores from different binary SVM classifiers. Our adaptive code learning approach significantly outperforms 1-vs-all for this problem and also allows us to use optimize with respect to different loss functions. We are building and will soon release a full multi-class fold prediction server, called SVM-fold, which will incorporate both our string kernel and our adaptive code learning work.
Title: The Power of the Prior: Improving Predictive Performance with Extra-data Knowledge
Standard approaches to classification and regression fail to utilize extra-data knowledge. Such knowledge can include class-specific descriptors, physical constraints, and exchangeability judgments. Using a number of examples, this talk will demonstrate the benefits and costs of harnessing such knowledge. The Bayesian paradigm proves especially convenient in this context.
Title: Identifying and Exploiting Ultrametricity
We begin with pervasive ultrametricity due to high dimensionality and/or spatial sparsity. How extent or degree of ultrametricity can be quantified leads us to the discussion of varied practical cases when ultrametricity can be partially or locally present in data. We show how the ultrametricity can be assessed in text or document collections, in time series signals, and in chemical information retrieval. We conclude with a discussion of ultrametricity in astrophysics, relating to observational cosmology.
Title: Data Mining and Network Models of Massive Datasets
In this talk we discuss new research directions in data mining - using network-based models for data analysis and decision making. In many practical situations, a real-world dataset can be represented as a large graph (network) with certain attributes associated with its vertices and edges. These attributes may contain specific information characterizing the given application, which often provides a new insight into the internal structure and patterns of the data. The considered examples include telecommunications, social networks, biomedicine, and finance.
Title: Unsupervised Anomaly detection in Computer Security
Anomaly detection has been an active area of research in computer security for some time. There are now several commercial products that include anomaly detectors of one sort or another for network and host-based intrusion detection. We present an overview of anomaly detection used in computer security, and provide two examples of content-based anomaly detectors designed to detect suspicious network packets that may indicate a zero-day exploit. The PAYL Payload Anomaly Detector is based upon an efficient statistical analysis of content flows. Recent work has shown that PAYL may be blinded by mimicry attack. The Anagram anomaly detector is designed to resist such attacks. Either sensor is intended to be used not as a standalone detector but one of a correlated set of sensors.
Salvatore J. Stolfo is Professor of Computer Science at Columbia University. He received his Ph.D. from NYU Courant Institute in 1979 and has been on the faculty of Columbia ever since. (See http://www.cs.columbia.edu/~sal.) He has published extensively in the areas of parallel computing, AI Knowledge-based systems, Data Mining and most recently Computer Security and Intrusion Detection Systems. His research has been supported by DARPA, NSF, ONR, and numerous companies and state and federal agencies.