Classification Society of North America 2006 Meeting on Network Data Analysis and Data Mining:
Applications in Biology, Computer Science, Intrusion Detection, and Other Areas

May 10 - 13, 2006
DIMACS Center, Rutgers University, Piscataway, NJ

Mel Janowitz, DIMACS
David Banks, Duke University (IMS Representative)
Program Committee:
David Banks, Duke University
Stanley L. Sclove, University of Illinois at Chicago
William Shannon, Washington University School of Medicine
The Classification Society of North America (CSNA)

This meeting will be held partly as a joint meeting with the DIMACS workshop on Clustering Problems in Biological Networks, May 9 - 11, 2006.
The CSNA meeting is co-sponsored by The Institute of Mathematical Statistics.


Gurinder Singh Atwal, Department of Physics, Lewis Sigler Institute for Integrative Genomics, Princeton University

Title: Information-Based Clustering

Existing clustering methods in computational biology implicitly invoke several nontrivial assumptions about the structure of data. Thus the strength of a particular clustering technique depends on how well these assumptions match the true generative model of the data. Here, we address the clustering problem from an information-theoretic perspective that avoids many of these assumptions. In particular, we motivate a cost function expressing the tradeoff between the average intra-cluster similarity and the compression of the data. Our formulation obviates the need for defining a cluster "prototype," does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures nonlinear relations. We also address the mathematical problems associated with extracting information-theoretic quantities from finitely sampled biological datasets. We apply this approach to different domains, including the yeast stress response module microarray expression profiles, and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures.
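
The tradeoff named in this abstract can be illustrated with a small sketch. It is only one possible reading, not the speaker's formulation: it assumes a precomputed pairwise similarity matrix, equal weight for every item, soft assignments P(C|i), and a tradeoff parameter called `temperature` here. All of these names are hypothetical, and the abstract itself says the method also supports collective (non-pairwise) similarity.

```python
import math

def clustering_objective(sim, p_c_given_i, temperature):
    """Score a soft clustering as (avg_similarity, information, objective).

    sim          -- N x N list of pairwise similarities (illustrative assumption)
    p_c_given_i  -- N rows, each a list of K assignment probabilities P(C|i)
    temperature  -- hypothetical weight on the compression term
    """
    n = len(sim)
    k = len(p_c_given_i[0])
    p_i = 1.0 / n  # assumption: every item is equally likely

    # Cluster marginals P(C) = sum_i P(C|i) P(i)
    p_c = [sum(p_c_given_i[i][c] for i in range(n)) * p_i for c in range(k)]

    avg_sim = 0.0
    info = 0.0
    for c in range(k):
        if p_c[c] == 0.0:
            continue  # empty cluster contributes nothing
        # P(i|C) by Bayes' rule
        p_i_given_c = [p_c_given_i[i][c] * p_i / p_c[c] for i in range(n)]
        # Expected similarity of two items drawn independently from cluster c
        s_c = sum(p_i_given_c[i] * p_i_given_c[j] * sim[i][j]
                  for i in range(n) for j in range(n))
        avg_sim += p_c[c] * s_c
        # Mutual information I(C; i) in bits measures how little we compress
        for i in range(n):
            q = p_c_given_i[i][c]
            if q > 0.0:
                info += p_i * q * math.log2(q / p_c[c])

    return avg_sim, info, avg_sim - temperature * info
```

With a low `temperature`, assignments that split the items into coherent groups score higher than lumping everything together; raising it favors fewer, coarser clusters.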

Yosi Keller, Stephane Lafon, and Michael Krauthammer, Department of Applied Mathematics, Yale University

Title: Protein Cluster Analysis via Directed Diffusion

Graph-theoretical approaches are useful for elucidating the modular compositions of protein-protein interaction networks, which are known to consist of regions of increased network connectivity (clusters) corresponding to known molecular complexes or functional pathways. In this work, we introduce the concept of local spectral search as a graph-based methodology for cluster analysis.

Based on a set of known samples within a target set, we aim to identify the complete target set. We derive both an expansion scheme (from the known set of samples to the entire set) and a rigorous clustering criterion that allows us to identify the target cluster. We apply the proposed scheme to a protein interaction network, where we infer the set of proteins related to a particular function based on a small number of proteins in that set.
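
The general idea of growing a cluster from a few known members can be sketched as follows. This is not the authors' local spectral search or their clustering criterion: as a stand-in it uses a random walk with restart for the expansion and a conductance sweep to decide where the cluster ends. The function names, the `restart` and `steps` parameters, and the toy graph in the usage note are all hypothetical.

```python
def diffuse_from_seeds(adj, seeds, restart=0.3, steps=50):
    """Random walk with restart on an undirected graph {node: set(neighbors)}."""
    seed_mass = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    score = dict(seed_mass)
    for _ in range(steps):
        # Part of the mass jumps back to the seeds at every step
        nxt = {v: restart * seed_mass[v] for v in adj}
        for u in adj:
            if not adj[u]:
                continue
            share = (1.0 - restart) * score[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        score = nxt
    return score

def conductance(adj, subset):
    """Edges leaving the subset divided by the smaller side's total degree."""
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_in = sum(len(adj[u]) for u in subset)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom else float("inf")

def expand_cluster(adj, seeds):
    """Rank nodes by degree-normalized diffusion score, keep the best prefix."""
    score = diffuse_from_seeds(adj, seeds)
    order = sorted(adj, key=lambda v: score[v] / max(len(adj[v]), 1),
                   reverse=True)
    best, best_phi = set(), float("inf")
    prefix = set()
    for v in order[:-1]:  # skip the trivial prefix containing every node
        prefix.add(v)
        phi = conductance(adj, prefix)
        if phi < best_phi:
            best, best_phi = set(prefix), phi
    return best
```

On a toy graph of two triangles joined by a single edge, calling `expand_cluster` with one node of the first triangle as the seed recovers that whole triangle, because cutting the single bridging edge gives the lowest conductance.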

Sofus Macskassy, Fetch Technologies, Inc.

Title: Learning and Classification in Biological Data

Biological data are often relational in nature, so recent machine learning techniques in relational learning and network classification apply to them. In this talk, I explore classifying biological data using only the implicit and explicit relations between instances, i.e., the network structure, while ignoring any other attributes of the objects. I will introduce the concept of network learning and NetKit, a network learning toolkit for statistical relational learning, and show its use on biological data in a network classification framework.
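
Classification from network structure alone can be illustrated with a simple neighbor-voting scheme. This sketch is not NetKit's API and is not taken from the talk; the function `neighbor_vote`, its `rounds` parameter, and the node and class names in the usage note are hypothetical. Each unlabeled node repeatedly takes the weighted average of its neighbors' class beliefs, and no node attributes are used.

```python
def neighbor_vote(edges, labels, rounds=10):
    """Label nodes using only weighted links.

    edges  -- {node: {neighbor: weight}}, every neighbor is also a key
    labels -- {node: class} for the nodes whose class is known
    """
    classes = sorted(set(labels.values()))
    uniform = {c: 1.0 / len(classes) for c in classes}

    # Known nodes start certain; unknown nodes start undecided
    belief = {}
    for v in edges:
        if v in labels:
            belief[v] = {c: 1.0 if labels[v] == c else 0.0 for c in classes}
        else:
            belief[v] = dict(uniform)

    for _ in range(rounds):
        updated = {}
        for v in edges:
            total = sum(edges[v].values())
            if v in labels or total == 0:
                updated[v] = belief[v]  # keep known or isolated nodes fixed
                continue
            # Weighted average of the neighbors' current beliefs
            updated[v] = {
                c: sum(w * belief[u][c] for u, w in edges[v].items()) / total
                for c in classes
            }
        belief = updated

    return {v: max(classes, key=lambda c: belief[v][c]) for v in edges}
```

On a four-node chain whose two ends are labeled with different classes, each interior node ends up with the class of the labeled end it is closer to.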

Claudia Perlich, IBM T.J. Watson Research Center

Title: Classification vs Clustering, Analyzing Gene Functionality

In this work we contrast supervised relational learning and unsupervised clustering on the 2005 ILP challenge domain of classifying yeast gene functionality. We represent the domain as a large network of genes and proteins, where the edges capture similarity information calculated using BLAST. We present a statistical propositionalisation approach to relational classification and contrast its classification performance with clustering approaches that use different pairwise similarities. We show that the advantage of the supervised approach is its ability to aggregate information from multiple sources and to optimize the relative weighting of that information.

Document last modified on April 13, 2006.