DIMACS Workshop on Data Mining Techniques in Bioinformatics

October 30 - 31, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:
Mona Singh, Princeton University, mona@cs.princeton.edu
Mark Gerstein, Yale University, Mark.Gerstein@yale.edu
Presented under the auspices of the Special Focus on Data Analysis and Mining and the Special Focus on Computational Molecular Biology.


Ziv Bar-Joseph, MIT

Title: Computational discovery of gene modules and regulatory networks

Recent advances in high-throughput experimental methods in molecular biology hold great promise. DNA microarrays have been used to measure the expression levels of thousands of genes, and more recently microarrays have been exploited to measure genome-wide protein-DNA binding events. While useful, these datasets present many computational challenges. High-throughput biological datasets are often very noisy and contain many missing points. In addition, each of these data sources measures only one type of activity in the cell. Principled computational methods are required in order to make full use of each of these datasets, and to combine them to infer genetic interaction networks. In this talk I will describe an algorithm that efficiently combines complementary large-scale expression and protein-DNA binding data to discover co-regulated modules of genes. I will then present extensions to this algorithm that use time series expression data to automatically infer and validate a dynamic sub-network for the cell cycle system.

Harmen Bussemaker, Columbia University

Title: Integrative modeling of microarray data for mRNA expression and transcription factor occupancy

Functional genomics studies are yielding information about regulatory processes in the cell at an unprecedented scale. We present a computational approach that integrates genome-wide transcription factor occupancy data with a library of mRNA expression data to define the regulatory network of the cell. Applying our method to S. cerevisiae, we find that on average 58% of the genes whose promoter region is bound by a transcription factor are functional targets. We are able to assign directionality to transcription factors that control divergently transcribed genes sharing the same promoter region. By naturally taking into account the combinatorial nature of transcriptional control, our approach overcomes limitations of transcription factor deletion experiments.

Andrea Califano, Columbia University

Title: Global Search for Genetic Associations by Pattern Discovery: Methods and Examples

Linkage-based parametric and non-parametric methods have proven successful in localizing the genetic factors of Mendelian traits. However, the dissection of complex inheritance of common phenotypic traits requires new analytical approaches to reveal the small-to-medium effects of multiple susceptibility loci. Although single-locus analysis is straightforward and the statistics to evaluate significance have been adequately formulated, it increasingly lacks the power to dissect the genetic complexity of common heterogeneous diseases associated with small individual effects and gene-gene interactions.

This talk presents a pattern discovery and corresponding statistical analysis framework that has been validated in several functional genomics contexts, from the functional classification of proteins, to the analysis of cancer microarray data, to the dissection of complex, heterogeneous traits. The approach is based on the global, exhaustive discovery of functional genomic patterns that cosegregate with a given molecular or clinical phenotype and may therefore be useful in its dissection. Such patterns include arbitrarily distant markers, possibly spanning several different chromosomes and are therefore ideally suited to a whole-genome analysis approach. The underlying deterministic pattern discovery algorithm can efficiently comb through very large data sets involving hundreds of patients and thousands of markers and the significance of the discovered patterns is assessed using a variety of statistical tests against both theoretical and simulated distributions.

We will first give examples of the application of this approach to the functional analysis of proteins and to the classification of lymphoma and brain tumors using microarray data. We will then address the issue of complex genetic disease and discuss the whole-genome analysis of Hirschsprung and schizophrenia data.

Mark Gerstein, Yale University

Title: Computational Proteomics: Predicting Protein Function on a Genome-scale

P Harrison, J Qian, R Jansen, V Alexandrov, P Bertone, R Das, D Greenbaum, W Krebs, Y Liu, H Hegyi, N Echols, J Lin, C Wilson, A Drawid, Z Zhang, Y Kluger, N Lan, N Luscombe, S Balasubramanian

My talk will address two major post-genomic challenges: trying to predict protein function on a genomic scale and interpreting intergenic regions. I will approach both of these through analyzing the properties and attributes of proteins in a database framework. The work on predicting protein function will discuss the strengths and limitations of a number of approaches: (i) using sequence similarity; (ii) using structural similarity; (iii) clustering microarray experiments; and (iv) data integration. The last approach involves systematically combining information from the other three and holds the most promise for the future. For the sequence analysis, I will present a similarity threshold above which functional annotation can be transferred, and for the microarray analysis, I will present a new method of clustering expression timecourses that finds "time-shifted" relationships. In the second part of the talk, I will survey the occurrence of pseudogenes in several large eukaryotic genomes, concentrating on grouping them into families and functional categories and comparing these groupings with those of existing "living" genes. In particular, we have found that duplicated pseudogenes tend to have a very different distribution than one would expect if they were randomly derived from the population of genes in the genome. They tend to lie on the end of chromosomes, have an intermediate composition between that of genes and intergenic DNA, and, most importantly, have environmental-response functions. This suggests that they may be resurrectable protein parts, and there is a potential mechanism for this in yeast.
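The idea of a "time-shifted" relationship between two expression timecourses can be illustrated with a minimal sketch. It is not the clustering method presented in the talk: it simply assumes that a shifted relationship can be detected by recomputing the Pearson correlation at several lags and keeping the best one. The function names, the toy profiles, and the `max_lag` parameter are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def best_lag(x, y, max_lag):
    """Return (lag, r) maximizing the correlation of x[t] with y[t + lag]."""
    best = (0, pearson(x, y))
    for lag in range(1, max_lag + 1):
        # Drop the last `lag` points of x and the first `lag` points of y
        # so that x[t] is paired with y[t + lag].
        r = pearson(x[:-lag], y[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Toy profiles (hypothetical): gene_b repeats gene_a one time point later.
gene_a = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3]
gene_b = [1, 0, 1, 2, 3, 2, 1, 0, 1, 2]

lag, r = best_lag(gene_a, gene_b, max_lag=3)
print(f"best lag = {lag}, correlation = {r:.3f}")
```

On these toy profiles the unshifted correlation is low, while the comparison at a lag of one time point recovers the relationship, which is the kind of signal an ordinary simultaneous clustering would miss.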



P Harrison, H Hegyi, P Bertone, N Echols, T Johnson, S Balasubramanian, N Luscombe, M Gerstein. "Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome." Nucleic Acids Res 29: 818-30 (2001).

J Qian, M Dolled-Filhart, J Lin, H Yu, M Gerstein. "Beyond synexpression relationships: Local Clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions." J Mol Biol 314: 1053-1066 (2001).

R Jansen, D Greenbaum, M Gerstein. "Relating whole-genome expression data with protein-protein interactions." Genome Research 12: 37-46 (2002).

P Harrison, H Hegyi, P Bertone, N Echols, T Johnson, S Balasubramanian, N Luscombe, M Gerstein. "Molecular fossils in the human genome: Identification and analysis of pseudogenes in chromosomes 21 and 22." Genome Research 12: 273-281 (2002).

Z Zhang, P Harrison, M Gerstein. "Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome." Genome Res 12: 1466-82 (2002).

Y Kluger, R Basri, J T Chang, M Gerstein. "Spectral biclustering of microarray cancer data: co-clustering genes and conditions." Genome Res 13: 703-16 (2003).

AM Edwards, B Kus, R Jansen, D Greenbaum, J Greenblatt, M Gerstein. "Bridging structural biology and genomics: assessing protein interaction data with known complexes." Trends Genet 18: 529-36 (2002).

R Jansen, H Yu, D Greenbaum, Y Kluger, NJ Krogan, S Chung, A Emili, M Snyder, JF Greenblatt, M Gerstein. "A Bayesian networks approach for predicting protein-protein interactions from genomic data." Science 302: 449-53 (2003).

J Qian, J Lin, NM Luscombe, H Yu, M Gerstein. "Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data." Bioinformatics 19: 1917-26 (2003).


Zoltan N. Oltvai, Northwestern University

Title: Organization of cellular networks

One of the most important goals of post-genomic biology is the elucidation of the fundamental logic and constraints that determine the systemic behavior of a cell or microorganism. In their totality, the combined interaction of various cellular constituents forms a complex cellular network whose functional organization is an important determinant of all cellular behaviors.

Our research aims to understand the organization of this network from various perspectives, and with a combination of experimental and theoretical approaches. In our recent work we have demonstrated that the large-scale organization of metabolic, transcriptional, and protein interaction networks appears identical, all possessing a robust and error-tolerant scale-free topology with hierarchically embedded modularity. Within E. coli the uncovered hierarchical modularity closely overlaps with known metabolic functions, suggesting that the identified network architecture may be generic to system-level cellular organization.

Supported by NIH GM 62449 and DOE LAB01-21

Andrey Rzhetsky, Columbia University

Title: On truth, pathways and interactions

I will give an overview of our effort to automatically extract pathway information from a large number of full-text research articles (the GeneWays system), to automatically curate the extracted information, and to combine the literature-derived information with sequence and experimental (such as yeast two-hybrid) data using a probabilistic approach.

Eric Siggia, Rockefeller University

Title: Predicting patterns of transcriptional regulation in early fly embryos

Computational approaches to deciphering the regulatory information in the genome will be illustrated for the early patterning of the fly. The comparison of multiple related species allows a first look at how these regulatory networks evolved.

Mona Singh, Princeton University

Title: Large-scale, high-confidence predictions of bZIP protein interactions

A major challenge in bioinformatics is to develop methods for predicting protein-protein interactions at the genomic scale. We approach this problem by focusing on a common and well-studied interaction motif, the parallel two-stranded coiled coil. Testing on coiled-coil interactions among nearly all human and yeast bZIP transcription factors, our method identifies 70% of strong interactions while maintaining that 92% of predictions are correct. Furthermore, cross-validation testing shows that including the bZIP experimental data significantly improves performance. Our method can be used to predict bZIP interactions in other genomes, and is a promising approach for predicting coiled-coil interactions more generally. While whole- and cross-genomic approaches to predicting protein partners have had some success, our work is the first to demonstrate an interaction interface for which large-scale, high-confidence computational predictions can be made.

(Joint work with Jessica Fong, Princeton University and Amy Keating, MIT)
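The two figures quoted in the abstract correspond to the standard notions of recall (the fraction of true strong interactions that are identified) and precision (the fraction of predictions that are correct). The sketch below shows only the arithmetic; the counts are hypothetical values chosen to reproduce percentages of that size and are not data from the study.

```python
def recall_and_precision(true_positives, false_negatives, false_positives):
    """Return (recall, precision) computed from confusion counts."""
    recall = true_positives / (true_positives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    return recall, precision


# Hypothetical counts, for illustration only (not results from the talk):
# 100 true strong interactions, 70 of them predicted, plus 6 wrong predictions.
recall, precision = recall_and_precision(
    true_positives=70, false_negatives=30, false_positives=6
)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Raising the score threshold for calling an interaction would typically trade recall for precision, which is why both numbers are reported together.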

Gustavo Stolovitzky, IBM

Title: Can we identify cellular processes that behave differentially in related cancers using gene expression data?

Differences between the cancer states of a cell can be characterized by alterations of important cellular processes such as cell proliferation, apoptosis, and DNA-damage repair, and also by less well characterized processes within the cell. Some of these alterations involve modifications of the expression of genes that participate in these processes. From this simple observation it follows that the expression of genes involved in processes responsible for the differences in related cancers should indicate which processes behave differentially. A systematic list of these processes should inform us about the biology that differentiates these related states in a given cancer. We explore various means to find those differences using machine learning techniques. We apply these ideas to a data set of two subtypes of Chronic Lymphocytic Leukemia, one with a benign, the other with a more malignant disease course.

Sarah Teichmann, University of Cambridge

Title: Evolution of Multi-Domain Proteins

Two thirds of all prokaryote proteins and eighty percent of eukaryote proteins are multi-domain proteins. The composition and interaction of the domains within a multi-domain protein determine its function. Using structural assignments to the proteins in completely sequenced genomes, we have insight into the domain architectures of a large fraction of all multi-domain proteins. Thus we can investigate the patterns of pairwise domain combinations, as well as the existence of evolutionary units larger than individual protein domains. Structural assignments provide us with the sequential arrangement of domains along a polypeptide chain. In order to fully understand the structure and function of a multi-domain protein, we also need to know the geometry of the domains relative to each other in three dimensions. By studying multi-domain proteins of known three-dimensional structure, we can gain insight into the conservation of domain geometry, and the prediction of the structures of domain assemblies.

Olga Troyanskaya, Princeton University

Title: Integrating heterogeneous data sources for gene function prediction

With the Saccharomyces cerevisiae genome sequenced, gene function annotation remains a key challenge in yeast systems biology. A variety of high-throughput functional experimental techniques are available in S. cerevisiae, from classical methods such as affinity precipitation to advanced high-throughput techniques such as gene expression microarrays. An integrated analysis of heterogeneous data produced by these studies can provide accurate gene function predictions which can further be tested experimentally.

I will describe a system we developed to address these issues and present its evaluation and some novel functional predictions. Our system, called MAGIC (Multi-source Association of Genes by Integration of Clusters), is a general framework that uses formal Bayesian reasoning to integrate heterogeneous types of high-throughput biological data for accurate gene function prediction in S. cerevisiae. The system formally incorporates expert knowledge about relative accuracies of data sources in order to combine them within a normative framework and provides a belief level with its output that allows the user to vary the stringency of predictions. We applied MAGIC to S. cerevisiae genetic and physical interactions, microarray, and transcription factor binding site data and assessed the biological relevance of gene groupings using Gene Ontology annotations produced by the Saccharomyces Genome Database. We found that by creating functional groupings based on heterogeneous data types, MAGIC improved accuracy of the groupings compared to microarray analysis alone and provided high-probability novel functional predictions for many unknown proteins.


Troyanskaya OG, Dolinski K, Owen AB, Altman RB, and Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in S. cerevisiae). Proc Natl Acad Sci USA 100(14): 8348-53, 2003.

See: http://www.cs.princeton.edu/~ogt/
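As a rough illustration of how Bayesian reasoning can turn several evidence sources into a single belief level, the sketch below combines per-source likelihood ratios with a prior. It is not the MAGIC system: it assumes the sources are conditionally independent (a naive Bayes simplification), and the source names, likelihood ratios, prior, and threshold are all hypothetical stand-ins for the expert-assigned accuracies mentioned in the abstract.

```python
def posterior_probability(prior, likelihood_ratios):
    """Combine a prior with independent likelihood ratios via Bayes' rule.

    Each ratio is P(evidence | related) / P(evidence | unrelated).
    Assumes the evidence sources are conditionally independent.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Hypothetical evidence for one gene pair; values are illustrative only.
evidence = {
    "coexpression_cluster": 3.0,
    "physical_interaction": 8.0,
    "shared_tf_binding_site": 2.0,
}

belief = posterior_probability(prior=0.01, likelihood_ratios=evidence.values())

# The user can vary the stringency by moving this (hypothetical) threshold.
threshold = 0.25
print(f"belief = {belief:.3f}, predicted related: {belief >= threshold}")
```

A more reliable data source would be given a larger likelihood ratio, so it moves the belief further than a noisy one; raising the threshold keeps only the predictions with the strongest combined support.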

Cathy Wu, Georgetown University

Title: PIR Integrated Bioinformatics System for Functional Genomics and Proteomics

The completion of the human genome sequence marked the beginning of a new era of biological research in which scientists have begun to tackle gene functions and other complex regulatory processes using global-scale data generated at various levels of biological organization. Meanwhile, new bioinformatics methods allow inference of protein function using associative analysis of functional properties to complement the traditional sequence homology-based methods. To fully exploit such high-throughput data requires bioinformatics infrastructures that support both data integration and associative analysis.

The Protein Information Resource (PIR, http://pir.georgetown.edu/) is a public bioinformatics resource that supports genomic and proteomic research and scientific studies. PIR recently joined the European Bioinformatics Institute and Swiss Institute of Bioinformatics to establish UniProt, an international resource of protein knowledge that unifies the PIR, Swiss-Prot, and TrEMBL databases. Central to the PIR/UniProt functional annotation of proteins is the PIRSF (SuperFamily) classification system that provides classification of whole proteins into a network structure to reflect their evolutionary relationships. This framework is supported by the iProClass integrated database of protein family, function, and structure, which provides value-added descriptions of all UniProt proteins with rich links to over 50 other databases of protein family, function, pathway, interaction, modification, structure, genome, ontology, literature, and taxonomy.

Coupling protein classification and data integration allows associative studies of protein family, function, and structure. Domain- or structural classification-based searches allow identification of protein families sharing domains or structural fold classes. Functional convergence and divergence are revealed by the relationships between the enzyme classification and protein family classification. With the underlying taxonomic information, protein families that occur in given lineages can be identified. Combining phylogenetic pattern and biochemical pathway information for protein families allows identification of alternative pathways. The systematic approach for protein family curation using integrative data facilitates functional inference for uncharacterized proteins as a basis for further analysis of protein functional evolution, and its relationship to the co-evolution of metabolic pathways, cellular networks, and organisms.

The PIR is supported by grant U01 HG02712 from the National Institutes of Health, and grants DBI-0138188 and ITR-0205470 from the National Science Foundation.

Document last modified on October 24, 2003.