Classification Society of North America 2006 Meeting on Network Data Analysis and Data Mining:
Applications in Biology, Computer Science, Intrusion Detection, and Other Areas

May 10 - 13, 2006
DIMACS Center, Rutgers University, Piscataway, NJ

Organizers:
Mel Janowitz, DIMACS, melj@dimacs.rutgers.edu
David Banks, Duke University, banks@stat.duke.edu (IMS Representative)
Program Committee:
David Banks, Duke University, banks@stat.duke.edu
Stanley L. Sclove, University of Illinois at Chicago, slsclove@uic.edu
William Shannon, Washington University School of Medicine, shannon@ilya.wustl.edu
The Classification Society of North America (CSNA)

This meeting will be held partly as a joint meeting with the DIMACS Workshop on Clustering Problems in Biological Networks, May 9 - 11, 2006.
The CSNA meeting is co-sponsored by The Institute of Mathematical Statistics.


Session Title: Clustering and classification in computational biology

Session Organizer: Rebecka Jornsten, Rutgers University

Session Papers and Abstracts:

Title: Predicting and analyzing protein interaction networks

Author: Mona Singh, Princeton University

Abstract:

Protein-protein interactions play a central role in many cellular functions. In this talk, I will discuss methods that my group has been developing for (1) predicting protein physical interactions and (2) uncovering protein function via graph-theoretic analysis of large-scale protein interaction networks.

About the author: Mona Singh's research interests are in computational molecular biology, as well as its interface with machine learning and algorithms. She is particularly interested in developing computational methods for deciphering genomic data at the level of proteins, especially algorithms for genome-level analysis of protein structure, function and interactions.


Title: Nonparametric Pathway-Based Regression Models for Analysis of Genomic Data

Author: Hongzhe Li, University of Pennsylvania

Abstract:

High-throughput genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called metadata. In microarray data analysis, such metadata are currently explored in post hoc ways by gene set enrichment analysis but have hardly been utilized in the modeling step. We propose to develop and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathway-based regression (NPR) analysis to efficiently integrate genomic data and metadata. Such NPR models consider multiple pathways simultaneously, allow complex interactions among genes within the pathways, and can be applied to identify pathways and genes within pathways that are related to variations of the phenotypes. These methods also offer an alternative way to mitigate the problem of a large number of potential interactions, by limiting analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways and genes within pathways. Application to a gene expression data set on breast cancer distant metastasis indicated that the Wnt, apoptosis, and cell-cycle-regulated pathways are more likely to be related to the risk of distant metastasis among lymph-node-negative breast cancer patients. We also observed that incorporating the pathway information gave better prediction of cancer recurrence.

About the author: Hongzhe Li is a Professor of Biostatistics and a graduate faculty member of the Genomics and Computational Biology Graduate Group at the University of Pennsylvania School of Medicine. His research interests include survival analysis methods for mapping genes for complex traits, statistical methods for analysis of microarray time course gene expression data, survival analysis methods for linking genomics data to censored survival outcomes, and methods for genetic networks and pathways analysis.


Title: Clustering and Classification of Functional Data using the Interval Band Depth

Author: Rebecka Jornsten, Rutgers University

Abstract:

Non-parametric clustering and classification of functional data often takes the route of analyzing the problem as a multivariate data set. If the data is first projected onto a set of basis functions and the coefficients treated as multivariate data, the notion of functional smoothness can be preserved. However, if the functional data itself is treated as multivariate data, the coordinates are exchangeable and the notion of functional smoothness is lost. In many practical applications, finding an appropriate basis to work with is not easy. For example, in time-course gene expression data the sampling process is highly irregular and sparse. We therefore propose a non-parametric method that works on the raw data, but is tailored to handle functions. We introduce a non-parametric clustering and classification method based on a new concept of data depth: the Interval Band Depth (IBD). We define a curve A(x,d) as the proportion of data objects x(i) with supnorm(x-x(i)) less than d. Then, the IBD of observation x is defined as the reciprocal of the area above the curve A(x,d). Note that for an observation x near a center of mass, IBD(x) is large; for an outlying observation, IBD(x) is small. In the univariate case, the maximizer of IBD(x) corresponds to the median. IBD is a computationally simple concept of depth, and provides both an estimate of a center of a distribution (a multivariate median) and a sense of dispersion. In contrast to most other notions of depth, it is not strictly zero outside the convex hull of the data but maintains a notion of distance everywhere. Moreover, IBD can be altered to reflect various functional similarities by substituting the supnorm in A(x,d) with other measures, e.g. the longest consecutive range of |x-x(i)| less than d. We introduce a simple and fast clustering and classification method using IBD. The data objects are partitioned such that functions in a cluster all obtain maximum IBD with respect to each other, and minimum IBD with respect to functions in other clusters. IBD provides not only an allocation, a cluster center, and a measure of cluster dispersion, but also a measure of clustering confidence for each observation. We apply our method to various benchmark data sets, as well as a time course gene expression data set from injured spinal cord. We discuss the analysis outcome and compare with other methods.
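
Illustrative sketch (not part of the talk): the depth defined in this abstract can be computed directly. The code below assumes that every curve is observed on a common grid, that the area above A(x,d) is taken over d from 0 to infinity, and that the reference sample may contain x itself; the function name and these choices are assumptions made for illustration only. Under that reading, the area above the empirical step curve A(x,d) equals the mean sup-norm distance from x to the data objects, so IBD(x) is the reciprocal of that mean.

```python
import numpy as np


def interval_band_depth(x, data):
    """Sketch of IBD(x): reciprocal of the area above A(x, d), d in [0, inf).

    x    : 1-D array, one curve sampled on a common grid (assumption).
    data : 2-D array, one reference curve per row on the same grid.
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    # Sup-norm distance from x to each reference curve.
    dists = np.max(np.abs(data - x), axis=1)
    # A(x, d) is the fraction of curves with distance < d. For this step
    # function, the area above the curve equals the mean of the distances.
    area_above = dists.mean()
    # A zero area (x coincides with every curve) is mapped to infinite depth.
    return np.inf if area_above == 0 else 1.0 / area_above


# Univariate example: the depth is evaluated at each sample point.
sample = np.array([[0.0], [1.0], [5.0]])
depths = [interval_band_depth(row, sample) for row in sample]
deepest = sample[int(np.argmax(depths))]
```

In this univariate example the deepest point is the sample median, which agrees with the remark in the abstract that the maximizer of IBD(x) corresponds to the median.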

About the author: Rebecka Jornsten is an Assistant Professor in the Department of Statistics, Rutgers University. Her research interests are in clustering and model selection methodologies for the analysis of genomic data and neuron morphology. She collaborates with several research groups on campus: the Hart lab at the Keck Center for Collaborative Neuroscience, Rutgers; the Firestein lab in MCB, Rutgers; and the Nowakowski lab at UMDNJ.



Session Title: Weights and Metrics for Cluster Analysis

Session Organizer: Jon Kettenring, Drew University

Session Papers and Abstracts: ( * denotes speaker)

Title: A New Variable Weighting and Selection Procedure for K-means Cluster Analysis

Authors: Douglas Steinley1 * and Michael Brusco2

1 Department of Psychological Sciences, 
 University of Missouri-Columbia, Columbia, MO 65211
2 Department of Marketing, College of Business, Florida State University,
 Tallahassee, FL 32306

Abstract:

A variance-to-range ratio variable weighting procedure is proposed. We show how this weighting method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the variable weighting technique.


Title: Strategies for Scaling and Weighting Variables in Cluster Analysis

Authors: R. Gnanadesikan1, J. R. Kettenring2, and Srinivas Maloor3 *

1 Professor Emeritus, Department of Statistics, Rutgers University,
 Piscataway, NJ 08854
2 Charles A. Dana Research Institute for Scientists Emeriti (RISE), 
 Drew University, Madison, NJ 07940
3 Department of Electrical and Computer Engineering, Rutgers University,
 Piscataway, NJ 08854
 

Abstract:

Cluster analysis is a widely used practical approach to partitioning objects into groups of similar objects. Though the intuitive idea of clustering is clear enough, the steps involved in actually carrying out such an analysis raise many unresolved conceptual issues. One such issue is that multivariate (interval) data often pose a problem, in that the variables are not commensurate. Since the outcome of a cluster analysis is sensitive to the scales of measurement of the input data, many practitioners resort to standardizing the data prior to the analysis. One such naive approach is "autoscaling", i.e., dividing each variable by its total standard deviation, so as to put all variables on an "equal footing". This approach ignores the inherent cluster structure and actually proves counterproductive. In this paper, we propose some simple intuitive alternatives which we call "equalizers". In addition, we consider letting the data suggest weights or "highlighters" that emphasize those variables with most promise for revealing the latent cluster structure. The methods vary in degree of complexity from very simple weights based on order statistics to more complicated iterative ones. The results indicate that, in many situations, the new methods are much better than the most popular method, autoscaling.
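
Illustrative sketch (not part of the paper): autoscaling as described in this abstract divides each variable by its standard deviation computed over all observations, ignoring any cluster structure. The function name, the use of the sample standard deviation (ddof=1), and the toy data are assumptions made for illustration; the "equalizers" and "highlighters" proposed by the authors are not specified in the abstract and are not reproduced here.

```python
import numpy as np


def autoscale(X):
    """Divide each column of X by its total (pooled over all rows) standard deviation."""
    X = np.asarray(X, dtype=float)
    return X / X.std(axis=0, ddof=1)


# Toy data: column 0 separates two groups, column 1 is unstructured noise.
X = np.array([
    [0.0, 1.0],
    [0.1, -1.0],
    [-0.1, 0.5],
    [10.0, -0.5],
    [10.1, 1.0],
    [9.9, -1.0],
])
scaled = autoscale(X)

# Range of each column before and after scaling.
range_before = X.max(axis=0) - X.min(axis=0)
range_after = scaled.max(axis=0) - scaled.min(axis=0)
```

Because the between-group separation inflates the total standard deviation of column 0, autoscaling shrinks that column far more than the noise column. After scaling, the noise column spans a range comparable to the column that carries the group structure, which is one way to see why the abstract calls the approach counterproductive.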


Title: An Improved Distance Measure Between the Expression Profiles Linking Co-Expression and Co-Regulation in Mice

Authors: Ryung S. Kim1, 4 *, Hongkai Ji2, Wing H. Wong3

1 Department of Neurology, Harvard Medical School, Boston, MA 02115
2 Department of Statistics, Harvard University, Cambridge, MA 02138
3 Department of Statistics, Stanford University, Stanford, CA 94305 
4 Department of Medical Oncology, Dana-Farber Cancer Institute, Boston,
 MA 02115

Abstract:

Background: Many statistical algorithms combine microarray expression data and genome sequence data to identify transcription factor binding motifs in lower eukaryotic genomes. Finding cis-regulatory elements in higher eukaryote genomes, however, remains a challenge, as searching in the promoter regions of genes with similar expression patterns often fails. The difficulty is partially attributable to the poor performance of the similarity measures for comparing expression profiles. The widely accepted measures are inadequate for distinguishing genes transcribed from distinct regulatory mechanisms in the complicated genomes of higher eukaryotes.

Results: By defining the regulatory similarity between a gene pair as the number of common known transcription factor binding motifs in the promoter regions, we compared the performance of several expression distance measures on seven mouse expression data sets. We propose a new distance measure that accounts for both the linear trends and fold-changes of expression across the samples.

Conclusions: The study reveals that the proposed distance measure for comparing expression profiles enables us to identify genes with a large number of common regulatory elements, because it reflects the inherent regulatory information better than widely accepted distance measures such as the Pearson correlation or cosine correlation with or without log transformation.

Citation: Kim RS, Ji H, Wong WH, An improved distance measure between the expression profiles linking co-expression and co-regulation in mouse, BMC Bioinformatics, 2006;7:44
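
Illustrative sketch (not part of the paper): the two baseline measures named in the conclusions can be written as distances between expression profiles. Expressing them as one minus the correlation is an assumption made here for illustration, as are the function names and toy profiles; the new distance measure proposed by the authors is not defined in the abstract and is not reproduced.

```python
import numpy as np


def pearson_distance(a, b):
    """One minus the Pearson correlation between two expression profiles."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.corrcoef(a, b)[0, 1]


def cosine_distance(a, b):
    """One minus the cosine of the angle between two expression profiles."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Two toy profiles with the same linear trend but a different baseline level.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [11.0, 12.0, 13.0, 14.0]
d_pearson = pearson_distance(profile_a, profile_b)
d_cosine = cosine_distance(profile_a, profile_b)
```

The Pearson-based distance is essentially zero for these profiles because it depends only on the linear trend, while the cosine-based distance is positive because it is sensitive to the shift in level. Neither baseline directly measures fold-change, which the abstract says the proposed measure takes into account alongside linear trend.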



Session Title: Networks and Classification

Session Organizer: Stanley Wasserman, Indiana University

Session Papers and Abstracts:

Title: Computational Framework for Analysis of Dynamic Social Networks

Author: Tanya Y. Berger-Wolf, University of Illinois - Chicago

Abstract:

Finding patterns of social interaction within a population has wide-ranging applications, including disease modeling, cultural and information transmission, phylogeography, conservation, and behavioral ecology. Social interactions are often modeled with networks. A key characteristic of social interactions is their continual change. However, most past analyses of social networks are essentially static in that all information about the time that social interactions take place is discarded. In this paper, we propose a new mathematical and computational framework that enables analysis of dynamic social networks and that explicitly makes use of information about when social interactions occur. We present several algorithms for obtaining information about the structure of dynamic social networks in this framework and pose many open questions. This research is joint with Jared Saia.


Title: Path-based Sampling and Inference in the Internet: Implications of Network Structure

Author: Eric D. Kolaczyk, Boston University

Abstract:

It is understood that, generally speaking, the statistical analysis of data can be impacted in fundamental ways by the manner in which the data are obtained. That is, sampling design and methodology can have important, and sometimes subtle, effects on statistical inferences. The current proliferation of networks and network analysis across the sciences brings with it new challenges on the topic of sampling and its implications. One such challenge is that of drawing inferences from path-based sampling in the Internet. We present some of our recent work looking at this issue in two contexts: (i) inference of Internet structural attributes, and (ii) inference of global network traffic summaries. In both cases, one finds that there are important interactions between network structure and inferential accuracy.


Title: Clusterwise p* regression for social networks

Author: Douglas Steinley, University of Missouri

Abstract:

Heterogeneity among subjects in social network analysis is an often overlooked problem, both in terms of network structure and external attribute information. As with standard linear multiple regression, the application of p* (an exponential family of random graphs) models in the presence of heterogeneous subjects can result in failure to uncover meaningful relationships that may function differently among subsets of the observations. Here we describe a procedure that is able to partition the network data by: (a) searching for homogeneous sub-networks, (b) searching for subsets that are homogeneous with respect to attribute data, or (c) using a weighted combination of (a) and (b). This research is joint with Stanley Wasserman.


Title: Goodness of fit of social network models

Author: David Hunter, Penn State University

Abstract:

Curved exponential family models represent a useful generalization of exponential random graph models (ERGMs), also known as p* models. This talk presents some background of curved EF models, then describes methods for evaluating goodness-of-fit of these models to real network data. It is argued that curved EF models that use the recently-developed geometrically weighted edgewise shared partner (GWESP), geometrically weighted dyadic shared partner (GWDSP), and geometrically weighted degree (GWD) network statistics provide dramatically better fit to certain actual social network data than several well-studied models in the literature.



Session Title: Author Identification

Session Organizer: Paul Kantor, Rutgers University

Session Papers and Abstracts:

Title: Identifying Authors and Authors' Styles

Author: David L. Hoover, Professor of English & Webmaster NYU English Department

Abstract:

Recent years have seen a rapid increase in activity in the areas of authorship attribution and statistical stylistics. Earlier research typically applied Principal Components analysis or cluster analysis to a small number (typically fewer than 100) of the most frequent words of the texts in question (mainly or exclusively function words). These words occur at such high frequencies that they comprise a large proportion of the running words (tokens) of a text, and their frequencies are assumed to be resistant to intentional authorial manipulation. In the past five years, innovations in technique have led to the analysis of much larger numbers of words: the 1,000 to 4,000 most frequent words often produce more accurate results for large (novel-sized) texts, in spite of the fact that most of them are content words. These large numbers of words are also attractive because analyses based on them account for almost all of the text: the 4,000 most frequent words typically account for more than 90% of the tokens. Other innovations involve the elimination of personal pronouns, the automatic culling of words that are extremely frequent in only one text, and an expansion of analytic techniques beyond Principal Components analysis and cluster analysis. These innovations have produced more accurate, stable, and consistent results that are more resistant to problems of style variation within the works of a single author and have also suggested new methods of analyzing style variation and authorial style.


Title: Simulated Entity Resolution: DIMACS Work on the KDD Challenge of 2005

Authors: Aynur Dayanik, Ph.D., Dmitriy Fradkin, Paul Kantor, David Lewis, David Madigan, Fred Roberts. Rutgers University.

Abstract:

We describe an entity resolution problem studied for the KDD Challenge in 2005. This problem aims to judge whether one author is the same as another. Using abstracts and author information from the life sciences, the goal is to recognize the identity of an individual, even though that name is shared with other individuals. We tackled this problem using various clustering methods, document similarities, and fusion of methods.


Title: The Words of Our Lives: Analyzing Age- and Sex-Linked Language Variation in the Blogosphere

Authors: Shlomo Argamon, Moshe Koppel, James W. Pennebaker, Jonathan Schler, Illinois Institute of Technology.

Abstract:

Analysis of a corpus of tens of thousands of blogs indicates significant differences in writing style and content between male and female authors as well as among authors of different ages. Such differences can be exploited to determine an unknown author's age and gender on the basis of a blog's vocabulary. Multiple analytic methodologies, including discriminant and factor analysis, point to a surprising similarity of age-linked and sex-linked variation in language usage patterns.


Document last modified on April 28, 2006.