DIMACS Workshop on Data Mining Techniques in Bioinformatics
October 30 - 31, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ
- Organizers:
- Mona Singh, Princeton University, mona@cs.princeton.edu
- Mark Gerstein, Yale University, Mark.Gerstein@yale.edu
Presented under the auspices of the
Special Focus on Data Analysis and Mining and
the Special Focus on Computational Molecular Biology.
Abstracts:
Ziv Bar-Joseph, MIT
Title: Computational discovery of gene modules and regulatory networks
Recent advances in high-throughput experimental methods in molecular biology
hold great promise. DNA microarrays have been used to measure the expression
levels of thousands of genes, and more recently microarrays have been
exploited to measure genome-wide protein-DNA binding events. While useful,
these datasets present many computational challenges. High-throughput
biological datasets are often very noisy and contain many missing points. In
addition, each of these data sources measures only one type of activity in
the cell. Principled computational methods are required in order to make
full use of each of these datasets, and to combine them to infer genetic
interaction networks. In this talk I will describe an algorithm that
efficiently combines complementary large-scale expression and protein-DNA
binding data to discover co-regulated modules of genes. I will then present
extensions to this algorithm that use time series expression data to
automatically infer and validate a dynamic sub-network for the cell cycle
system.
Harmen Bussemaker, Columbia University
Title: Integrative modeling of microarray data for mRNA expression and transcription factor occupancy
Functional genomics studies are yielding information about regulatory
processes in the cell at an unprecedented scale. We present a
computational approach that integrates genomewide transcription factor
occupancy data with a library of mRNA expression data to define the
regulatory network of the cell. Applying our method to S. cerevisiae
we find that on average 58% of the genes whose promoter region is
bound by a transcription factor are functional targets. We are able to
assign directionality to transcription factors that control
divergently transcribed genes sharing the same promoter region. By
naturally taking into account the combinatorial nature of
transcriptional control, our approach overcomes limitations of
transcription factor deletion experiments.
Andrea Califano, Columbia University
Title: Global Search for Genetic Associations by Pattern Discovery: Methods and Examples
Linkage-based parametric and non-parametric methods have proven
successful in localizing the genetic factors of Mendelian
traits. However, the dissection of complex inheritance of common
phenotypic traits requires new analytical approaches to reveal the
small-to-medium effects of multiple susceptibility loci. Although
single-locus analysis is straightforward and the statistics to
evaluate significance have been adequately formulated, it increasingly
lacks the power to dissect the genetic complexity of common
heterogeneous diseases associated with small individual effects and
gene-gene interactions.
This talk presents a pattern discovery and
corresponding statistical analysis framework that has been validated
in several functional genomics contexts, from the functional
classification of proteins, to the analysis of cancer microarray data,
to the dissection of complex, heterogeneous traits. The approach is
based on the global, exhaustive discovery of functional genomic
patterns that cosegregate with a given molecular or clinical phenotype
and may therefore be useful in its dissection. Such patterns include
arbitrarily distant markers, possibly spanning several different
chromosomes and are therefore ideally suited to a whole-genome
analysis approach. The underlying deterministic pattern discovery
algorithm can efficiently comb through very large data sets involving
hundreds of patients and thousands of markers and the significance of
the discovered patterns is assessed using a variety of statistical
tests against both theoretical and simulated distributions.
We will first give examples of the application of this approach to the
functional analysis of proteins and to the classification of lymphoma
and brain tumors using microarray data. We will then address the issue
of complex genetic disease and discuss the whole-genome analysis of
Hirschsprung and schizophrenia data.
Mark Gerstein, Yale University
Title: Computational Proteomics: Predicting Protein Function on a Genome-scale
P Harrison, J Qian, R Jansen, V Alexandrov, P Bertone, R Das,
D Greenbaum, W Krebs, Y Liu, H Hegyi, N Echols, J Lin, C Wilson,
A Drawid, Z Zhang, Y Kluger, N Lan, N Luscombe, S Balasubramanian
My talk will address two major post-genomic challenges: trying to
predict protein function on a genomic scale and interpreting
intergenic regions. I will approach both of these through analyzing
the properties and attributes of proteins in a database framework. The
work on predicting protein function will discuss the strengths and
limitations of a number of approaches: (i) using sequence similarity;
(ii) using structural similarity; (iii) clustering microarray
experiments; and (iv) data integration. The last approach involves
systematically combining information from the other three and holds
the most promise for the future. For the sequence analysis, I will
present a similarity threshold above which functional annotation can
be transferred, and for the microarray analysis, I will present a new
method of clustering expression timecourses that finds "time-shifted"
relationships. In the second part of the talk, I will survey the
occurrence of pseudogenes in several large eukaryotic genomes,
concentrating on grouping them into families and functional categories
and comparing these groupings with those of existing "living" genes.
In particular, we have found that duplicated pseudogenes tend to have
a very different distribution than one would expect if they were
randomly derived from the population of genes in the genome. They
tend to lie at the ends of chromosomes, have an intermediate
composition between that of genes and intergenic DNA, and, most
importantly, have environmental-response functions. This suggests that
they may be resurrectable protein parts, and there is a potential
mechanism for this in yeast.
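The idea of a "time-shifted" relationship between two expression timecourses can be illustrated with a small sketch. This is not the clustering method presented in the talk; it swaps in a much simpler technique, a scan over lagged Pearson correlations, and the function names, the `max_shift` and `min_overlap` parameters, and the toy profiles are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def best_shift(a, b, max_shift=3, min_overlap=4):
    """Return (shift, r) maximizing the correlation between a and a lagged copy of b.

    A positive shift means profile b follows profile a by that many time points.
    Returns None if no shift leaves at least min_overlap overlapping points.
    """
    best = None
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            xa, xb = a[:len(a) - s], b[s:]
        else:
            xa, xb = a[-s:], b[:len(b) + s]
        if len(xa) < min_overlap:
            continue
        r = pearson(xa, xb)
        if best is None or r > best[1]:
            best = (s, r)
    return best

# Hypothetical toy profiles: gene_b repeats gene_a two time points later.
gene_a = [0, 1, 4, 9, 4, 1, 0, 0, 0, 0]
gene_b = [0, 0, 0, 1, 4, 9, 4, 1, 0, 0]
shift, r = best_shift(gene_a, gene_b)  # shift is 2 for these toy profiles
```

A plain correlation at zero lag would miss the relationship between these two profiles, which is the motivation for scanning over shifts.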
See: http://bioinfo.mbb.yale.edu
References:
P Harrison, H Hegyi, P Bertone, N Echols, T Johnson, S Balasubramanian,
N Luscombe, M Gerstein.
"Digging for dead genes: an analysis of the characteristics of
the pseudogene population in the Caenorhabditis elegans genome."
Nucleic Acids Res 29: 818-30 (2001).
J Qian, M Dolled-Filhart, J Lin, H Yu, M Gerstein.
"Beyond synexpression relationships: Local Clustering of
time-shifted and inverted gene expression profiles identifies new,
biologically relevant interactions."
J Mol Biol 314: 1053-1066 (2001).
R Jansen, D Greenbaum, M Gerstein.
"Relating whole-genome expression data with protein-protein
interactions."
Genome Res 12: 37-46 (2002).
P Harrison, H Hegyi, P Bertone, N Echols, T Johnson, S Balasubramanian,
N Luscombe, M Gerstein.
"Molecular fossils in the human genome:
Identification and analysis of pseudogenes in chromosomes 21 and 22."
Genome Res 12: 273-281 (2002).
Z Zhang, P Harrison, M Gerstein.
"Identification and analysis of over 2000 ribosomal protein pseudogenes
in the human genome."
Genome Res 12: 1466-82 (2002).
Y Kluger, R Basri, J T Chang, M Gerstein.
"Spectral biclustering of microarray cancer data: co-clustering genes and
conditions."
Genome Res 13: 703-16 (2003).
AM Edwards, B Kus, R Jansen, D Greenbaum, J Greenblatt, M Gerstein.
"Bridging structural biology and genomics: assessing protein interaction
data with known complexes."
Trends Genet 18: 529-36 (2002).
R Jansen, H Yu, D Greenbaum, Y Kluger, NJ Krogan, S Chung, A Emili, M Snyder,
JF Greenblatt, M Gerstein.
"A Bayesian networks approach for predicting protein-protein
interactions from genomic data."
Science 302: 449-53 (2003).
J Qian, J Lin, NM Luscombe, H Yu, M Gerstein.
"Prediction of regulatory networks: genome-wide identification of
transcription factor targets from gene expression data."
Bioinformatics 19: 1917-26 (2003).
See: http://bioinfo.mbb.yale.edu/papers/
http://partslist.org
http://pseudogene.org
http://genecensus.org
http://molmovdb.org
Zoltan N. Oltvai, Northwestern University
Title: Organization of cellular networks
One of the most important goals of post-genomic biology is the
elucidation of the fundamental logic and constraints that determine
the systemic behavior of a cell or microorganism. In their totality,
the combined interaction of various cellular constituents forms a
complex cellular network whose functional organization is an important
determinant of all cellular behaviors.
Our research aims to understand the organization of this network from
various perspectives, and with a combination of experimental and
theoretical approaches. In our recent work we have demonstrated that
the large-scale organization of metabolic, transcriptional, and
protein interaction networks appears identical, all possessing a robust
and error-tolerant scale-free topology with hierarchically embedded
modularity. Within E. coli the uncovered hierarchical modularity
closely overlaps with known metabolic functions, suggesting that the
identified network architecture may be generic to system-level
cellular organization.
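A scale-free topology is one whose degree distribution follows a power law, P(k) ~ k^(-gamma), so that a few highly connected hubs coexist with many sparsely connected nodes. The sketch below shows one crude way to inspect this on an edge list: tabulate the degree distribution and fit a straight line on log-log axes. The function names and the toy edge list are hypothetical, and a least-squares fit on log-log axes is only a rough illustration, not the analysis used in the work described.

```python
from collections import Counter
from math import log

def degree_distribution(edges):
    """Return {k: fraction of nodes with degree k} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n_nodes = len(degree)
    return {k: c / n_nodes for k, c in sorted(counts.items())}

def loglog_slope(distribution):
    """Least-squares slope of log P(k) against log k (needs at least two distinct k).

    For an exact power law P(k) ~ k^(-gamma) the slope equals -gamma.
    """
    xs = [log(k) for k in distribution]
    ys = [log(p) for p in distribution.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical toy network with made-up node names.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(degree_distribution(toy_edges))  # {1: 0.25, 2: 0.5, 3: 0.25}
```

A real network would need far more nodes before the fitted slope says anything meaningful about its topology.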
Supported by NIH GM 62449 and DOE LAB01-21
Andrey Rzhetsky, Columbia University
Title: On truth, pathways and interactions
I will give an overview of our effort to automatically extract pathway
information from a large number of full-text research articles (GeneWays
system), automatically curate the extracted information, and combine the
literature-derived information with sequence and experimental (such as yeast
two-hybrid) data using a probabilistic approach.
Eric Siggia, Rockefeller University
Title: Predicting patterns of transcriptional regulation in early fly
embryos
Computational approaches to deciphering the regulatory information in the
genome will be illustrated for the early patterning of the fly. The comparison
of multiple related species allows a first look at how these regulatory
networks evolved.
Mona Singh, Princeton University
Title: Large-scale, high-confidence predictions of bZIP protein interactions
A major challenge in bioinformatics is to develop methods for
predicting protein-protein interactions at the genomic scale. We
approach this problem by focusing on a common and well-studied
interaction motif, the parallel two-stranded coiled coil. Testing on
coiled-coil interactions among nearly all human and yeast bZIP
transcription factors, our method identifies 70% of strong
interactions while ensuring that 92% of its predictions are
correct. Furthermore, cross-validation testing shows that including
the bZIP experimental data significantly improves performance. Our
method can be used to predict bZIP interactions in other genomes, and
is a promising approach for predicting coiled-coil interactions more
generally. While whole- and cross-genomic approaches to predicting
protein partners have had some success, our work is the first to
demonstrate an interaction interface for which large-scale,
high-confidence computational predictions can be made.
(Joint work with Jessica Fong, Princeton University and Amy Keating, MIT)
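The two figures quoted above correspond to what are usually called recall (the fraction of known strong interactions that are recovered) and precision (the fraction of predictions that are correct). A minimal sketch of how the two are computed from sets of unordered protein pairs follows; the protein names and counts are invented toy values and do not reproduce the study's data.

```python
def pair(a, b):
    """Unordered protein pair, so that (a, b) and (b, a) compare equal."""
    return frozenset((a, b))

def precision_recall(predicted, known):
    """Return (precision, recall) for a set of predicted pairs against known pairs."""
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall

# Hypothetical toy data: 10 known interactions, 8 predictions, 7 of them correct.
known = {pair(f"P{i}", f"Q{i}") for i in range(10)}
predicted = {pair(f"P{i}", f"Q{i}") for i in range(7)} | {pair("P0", "Q9")}

precision, recall = precision_recall(predicted, known)  # 0.875 and 0.7
```

Raising the score threshold for calling an interaction typically trades recall for precision, which is why both numbers are reported together.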
Gustavo Stolovitzky, IBM
Title: Can we identify cellular processes that behave differentially in
related cancers using gene expression data?
Differences between cancer states of a cell can be characterized by
alterations of important cellular processes such as cell proliferation,
apoptosis, DNA-damage repair, and also by less well characterized
processes within the cell. Some of these alterations involve modifications
of the expression of genes that participate in these processes. From this
simple observation it follows that the expression of genes involved in
processes responsible for the differences in related cancers should be
indicative of which processes behave differentially. A systematic list of
these processes should shed light on the biology that differentiates these
related states in a given cancer. We explore various means to find those
differences using machine learning techniques. We apply these ideas to a
data set of two subtypes of Chronic Lymphocytic Leukemia, one with a
benign, the other with a more malignant disease course.
Sarah Teichmann, University of Cambridge
Title: Evolution of Multi-Domain Proteins
Two thirds of all prokaryote proteins, and eighty percent of eukaryote
proteins are multi-domain proteins. The composition and interaction of
the domains within a multi-domain protein determine its
function. Using structural assignments to the proteins in completely
sequenced genomes, we have insight into the domain architectures of a
large fraction of all multi-domain proteins. Thus we can investigate
the patterns of pairwise domain combinations, as well as the existence
of evolutionary units larger than individual protein domains.
Structural assignments provide us with the sequential arrangement of
domains along a polypeptide chain. In order to fully understand the
structure and function of a multi-domain protein, we also need to know
the geometry of the domains relative to each other in three
dimensions. By studying multi-domain proteins of known
three-dimensional structure, we can gain insight into the conservation
of domain geometry, and the prediction of the structures of domain
assemblies.
Olga Troyanskaya, Princeton University
Title: Integrating heterogeneous data sources for gene function prediction
With the Saccharomyces cerevisiae genome sequenced, gene function
annotation remains a key challenge in yeast systems biology. A variety
of high-throughput functional experimental techniques are available in
S. cerevisiae, from classical methods such as affinity precipitation
to advanced high-throughput techniques such as gene expression
microarrays. An integrated analysis of heterogeneous data produced by
these studies can provide accurate gene function predictions which can
further be tested experimentally.
I will describe a system we developed to address these issues and
present its evaluation and some novel functional predictions. Our
system, called MAGIC (Multi-source Association of Genes by Integration
of Clusters), is a general framework that uses formal Bayesian reasoning
to integrate heterogeneous types of high-throughput biological data for
accurate gene function prediction in S. cerevisiae. The system
formally incorporates expert knowledge about relative accuracies of data
sources in order to combine them within a normative framework and
provides a belief level with its output that allows the user to vary the
stringency of predictions. We applied MAGIC to S. cerevisiae genetic
and physical interactions, microarray, and transcription factor binding
site data and assessed the biological relevance of gene groupings using
Gene Ontology annotations produced by the Saccharomyces Genome
Database. We found that by creating functional groupings based on
heterogeneous data types, MAGIC improved accuracy of the groupings
compared to microarray analysis alone and provided high-probability
novel functional predictions for many unknown proteins.
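The general idea of weighting data sources by their reliability and reporting a belief level that the user can threshold can be sketched with a much simpler model than the one described above. The code below is not MAGIC: it substitutes a naive Bayes combination that treats the sources as independent, and every name and probability in it (the prior, the per-source reliabilities, the gene labels) is a hypothetical value chosen for illustration.

```python
def posterior_related(prior, reliabilities, evidence):
    """Posterior probability that two genes are functionally related.

    reliabilities maps source -> (P(observed | related), P(observed | unrelated)).
    evidence maps source -> True/False (whether that source links the pair).
    Sources are assumed independent given the relationship (naive Bayes).
    """
    odds = prior / (1.0 - prior)
    for source, observed in evidence.items():
        p_if_related, p_if_unrelated = reliabilities[source]
        if observed:
            odds *= p_if_related / p_if_unrelated
        else:
            odds *= (1.0 - p_if_related) / (1.0 - p_if_unrelated)
    return odds / (1.0 + odds)

def confident_pairs(pair_evidence, prior, reliabilities, threshold):
    """Keep the gene pairs whose belief level is at least the chosen threshold."""
    return [
        genes
        for genes, evidence in pair_evidence.items()
        if posterior_related(prior, reliabilities, evidence) >= threshold
    ]

# Hypothetical expert-assigned reliabilities; none of these numbers come from the study.
PRIOR = 0.05
RELIABILITIES = {
    "coexpression": (0.6, 0.2),
    "physical_interaction": (0.5, 0.02),
    "shared_tf_binding": (0.4, 0.1),
}
PAIR_EVIDENCE = {
    ("geneA", "geneB"): {"coexpression": True, "physical_interaction": True, "shared_tf_binding": True},
    ("geneA", "geneC"): {"coexpression": True, "physical_interaction": False, "shared_tf_binding": False},
}

print(confident_pairs(PAIR_EVIDENCE, PRIOR, RELIABILITIES, threshold=0.5))
```

In this toy setting, coexpression alone barely moves the belief above the prior, while agreement among all three sources pushes it well past the threshold, which mirrors the motivation for integrating heterogeneous data rather than relying on microarrays alone.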
References:
Troyanskaya OG, Dolinski K, Owen AB, Altman RB, and Botstein D. A
Bayesian framework for combining heterogeneous data sources for gene
function prediction (in S. cerevisiae). Proc Natl Acad Sci USA 100(14):
8348-53, 2003.
See: http://www.cs.princeton.edu/~ogt/
Cathy Wu, Georgetown University
Title: PIR Integrated Bioinformatics System for Functional Genomics and
Proteomics
The completion of the human genome sequence marked the beginning of a
new era of biological research in which scientists have begun to tackle
gene functions and other complex regulatory processes using
global-scale data generated at various levels of biological
organization. Meanwhile, new bioinformatics methods allow inference of
protein function using associative analysis of functional properties
to complement the traditional sequence homology-based methods. To
fully exploit such high-throughput data requires bioinformatics
infrastructures that support both data integration and associative
analysis.
The Protein Information Resource (PIR, http://pir.georgetown.edu/) is
a public bioinformatics resource that supports genomic and proteomic
research and scientific studies. PIR recently joined the European
Bioinformatics Institute and Swiss Institute of Bioinformatics to
establish UniProt, an international resource of protein knowledge that
unifies the PIR, Swiss-Prot, and TrEMBL databases. Central to the
PIR/UniProt functional annotation of proteins is the PIRSF
(SuperFamily) classification system that provides classification of
whole proteins into a network structure to reflect their evolutionary
relationships. This framework is supported by the iProClass integrated
database of protein family, function, and structure, which provides
value-added descriptions of all UniProt proteins with rich links to
over 50 other databases of protein family, function, pathway,
interaction, modification, structure, genome, ontology, literature,
and taxonomy.
Coupling protein classification and data integration allows
associative studies of protein family, function, and
structure. Domain- or structural classification-based searches allow
identification of protein families sharing domains or structural fold
classes. Functional convergence and divergence are revealed by the
relationships between the enzyme classification and protein family
classification. With the underlying taxonomic information, protein
families that occur in given lineages can be identified. Combining
phylogenetic pattern and biochemical pathway information for protein
families allows identification of alternative pathways. The systematic
approach for protein family curation using integrative data
facilitates functional inference for uncharacterized proteins and serves as a basis for
further analysis of protein functional evolution, and its relationship
to the co-evolution of metabolic pathways, cellular networks, and
organisms.
The PIR is supported by grant U01 HG02712 from the National Institutes
of Health, and grants DBI-0138188 and ITR-0205470 from the National
Science Foundation.
Document last modified on October 24, 2003.