Applications in Biology, Computer Science, Intrusion Detection, and Other Areas

DIMACS Center, Rutgers University, Piscataway, NJ

**Organizers:**

- **Mel Janowitz**, DIMACS, melj@dimacs.rutgers.edu
- **David Banks**, Duke University, banks@stat.duke.edu (IMS Representative)

**Program Committee:**

- **David Banks**, Duke University, banks@stat.duke.edu
- **Stanley L. Sclove**, University of Illinois at Chicago, slsclove@uic.edu
- **William Shannon**, Washington University School of Medicine, shannon@ilya.wustl.edu

This meeting will be held partly as a joint meeting with the DIMACS workshop on
Clustering Problems in Biological Networks, May 9-11, 2006.

The CSNA meeting is co-sponsored by The Institute of Mathematical Statistics.

Title: Information-Based Clustering

Existing clustering methods in computational biology implicitly invoke several nontrivial assumptions about the structure of data. Thus the strength of a particular clustering technique depends on how well these assumptions match the true generative model of the data. Here, we address the clustering problem from an information theoretic perspective that avoids many of these assumptions. In particular, we motivate a cost function expressing the tradeoff between the average intra-cluster similarity and the compression of the data. Our formulation obviates the need for defining a cluster "prototype," does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures nonlinear relations. We also address the mathematical problems associated with extracting information theoretic quantities from finitely sampled biological datasets. We apply this approach to different domains, including the yeast stress response module microarray expression profiles, and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures.
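To make the similarity/compression tradeoff concrete, here is a minimal sketch in Python. It scores a hard clustering by its size-weighted average intra-cluster similarity minus a temperature-weighted coding cost (the entropy of the cluster-size distribution), and optimizes it by greedy coordinate ascent. This is a deliberately simplified stand-in for the information-theoretic functional described in the abstract, not the authors' exact method; the function names and the entropy-based compression term are assumptions for illustration.

```python
import numpy as np

def cost(sim, labels, k, temperature=0.1):
    """Score a hard clustering: size-weighted average intra-cluster
    similarity minus temperature * entropy of the cluster sizes.
    Simplified stand-in for the information-theoretic tradeoff."""
    n = len(labels)
    s_avg, entropy = 0.0, 0.0
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue
        p = idx.size / n
        s_avg += p * sim[np.ix_(idx, idx)].mean()
        entropy -= p * np.log2(p)
    return s_avg - temperature * entropy

def greedy_cluster(sim, k, temperature=0.1, sweeps=20):
    """Greedy coordinate ascent: repeatedly move each item to the
    cluster that maximizes the cost above."""
    n = sim.shape[0]
    labels = np.arange(n) % k  # deterministic starting assignment
    for _ in range(sweeps):
        for i in range(n):
            scores = []
            for c in range(k):
                labels[i] = c
                scores.append(cost(sim, labels, k, temperature))
            labels[i] = int(np.argmax(scores))
    return labels
```

Note that the input is only a pairwise similarity matrix: no cluster prototype is ever computed, which mirrors the prototype-free character of the approach, although the real method goes further by using collective rather than pairwise similarities.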

Title: Protein Cluster Analysis via Directed Diffusion

Graph-theoretical approaches are useful for elucidating the modular compositions of protein-protein interaction networks, which are known to consist of regions of increased network connectivity (clusters) corresponding to known molecular complexes or functional pathways. In this work, we introduce the concept of local spectral search as a graph-based methodology for cluster analysis.

Based on a set of known samples within a target set, we aim to identify the complete target set. We derive both an expansion scheme (from the known set of samples to the entire set) and a rigorous clustering criterion that allows us to identify the target cluster. We apply the proposed scheme to a protein interaction network, where we infer the set of proteins related to a particular function based on a small number of proteins in that set.
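The expansion step can be pictured as spreading probability mass from the known seed proteins over the interaction graph and ranking the remaining nodes by the resulting score. The sketch below uses a personalized-PageRank-style walk with restart on the seeds; this is an illustrative stand-in for seed-set expansion by diffusion, not the paper's exact local spectral scheme, and all parameter names are assumptions.

```python
import numpy as np

def diffuse_from_seeds(adj, seeds, steps=10, restart=0.2):
    """Spread probability mass from seed nodes over the graph via a
    random walk with restart; higher scores suggest cluster membership."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = adj / np.where(deg > 0, deg, 1)[:, None]  # row-stochastic walk matrix
    r = np.zeros(n)
    r[list(seeds)] = 1.0 / len(seeds)  # restart distribution on the seeds
    p = r.copy()
    for _ in range(steps):
        p = restart * r + (1 - restart) * (P.T @ p)
    return p

# Toy "network": two triangles joined by one bridge edge; seed node 0
# sits in the left triangle, so its triangle should score highest.
adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
scores = diffuse_from_seeds(adj, seeds=[0])
```

A clustering criterion in the spirit of the abstract would then look for a sharp drop in the sorted scores to decide where the target cluster ends.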

Title: Learning and Classification in Biological Data

Biological data are often relational in nature, and recent machine learning techniques in relational learning and network classification are therefore applicable to such data. In this talk, I explore classifying biological data using only the implicit and explicit relations between the instances, i.e., the network structure, while ignoring any other attributes of the objects. I will introduce the concept of network learning, present NetKit, a network learning toolkit for statistical relational learning, and show its use on biological data in a network classification framework.
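One of the simplest classifiers in this within-network setting is a weighted-vote relational neighbor scheme: an unlabeled node's class score is the weighted average of its neighbors' scores, with known labels held fixed. The sketch below illustrates that idea only; it is not NetKit's code, and the function name and parameters are assumptions.

```python
import numpy as np

def wvrn_predict(adj, labels, iterations=10):
    """Relational-neighbor classification using only network structure:
    unlabeled nodes repeatedly take the weighted mean of their
    neighbors' class scores; labeled nodes stay clamped."""
    n = adj.shape[0]
    known = np.array([l is not None for l in labels])
    score = np.full(n, 0.5)  # prior P(class = 1) for unlabeled nodes
    score[known] = [float(l) for l in labels if l is not None]
    for _ in range(iterations):
        for i in range(n):
            if known[i]:
                continue
            w = adj[i]
            if w.sum() > 0:
                score[i] = (w @ score) / w.sum()
    return score
```

On a chain of five nodes with the endpoints labeled 1 and 0, the scores interpolate between the two labels, which is exactly the "classification from structure alone" behavior the talk describes.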

Title: Classification vs. Clustering: Analyzing Gene Functionality

In this work we contrast supervised relational learning and unsupervised clustering on the 2005 ILP challenge domain of classifying yeast gene functionality. We represent the domain as a large network of genes and proteins, where the edges capture similarity information that was calculated using BLAST. We present a statistical propositionalisation approach to relational classification and contrast its classification performance with clustering approaches that use different pairwise similarities. We show that the advantage of the supervised approach is its ability to aggregate information from multiple sources and to optimize the relative weighting of the information.
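A crude way to picture propositionalisation is to flatten each node's relational neighborhood into a fixed-length feature vector, one aggregate per edge type, so that a standard classifier can then learn the relative weighting of the sources. The sketch below is a hypothetical simplification of that idea, not the paper's statistical method; the aggregation (mean neighbor label per relation) is an assumption chosen for brevity.

```python
import numpy as np

def propositionalise(adjs, labels):
    """Flatten relational data: for each node, aggregate each relation
    (edge type) into a single feature, here the mean label of the
    node's neighbors under that relation (0.5 if it has none)."""
    n = len(labels)
    feats = np.zeros((n, len(adjs)))
    for r, adj in enumerate(adjs):
        for i in range(n):
            nbrs = np.flatnonzero(adj[i])
            feats[i, r] = labels[nbrs].mean() if nbrs.size else 0.5
    return feats
```

A downstream linear model trained on these columns would assign a higher weight to an informative relation (e.g. a strong BLAST-similarity edge type) than to a noisy one, mirroring the aggregation-and-weighting advantage the abstract claims for the supervised approach.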


Document last modified on April 13, 2006.