DIMACS Workshop on Integration of Diverse Biological Data
June 21-22, 2001
DIMACS Center, Rutgers University
Presented under the auspices of the Special Focus on Computational Molecular Biology.
- Andrea Califano (co-chair), First Genetic Trust, firstname.lastname@example.org
- Conrad Gilliam (co-chair), Columbia University, email@example.com
- Fred S. Roberts, Rutgers University, firstname.lastname@example.org
Multidimensional Scaling of Massive Data Sets
Dimitris K. Agrafiotis, 3-Dimensional Pharmaceuticals, Inc.
Multidimensional scaling (MDS) is a collection of statistical techniques
that attempt to embed a set of patterns described by means of a
dissimilarity matrix into a low-dimensional display plane in a way that
preserves their original pairwise relationships as closely as possible.
Unfortunately, current MDS algorithms are notoriously slow, and their use
is limited to small data sets. In this paper, we present a family of
algorithms that combine nonlinear mapping techniques with neural networks,
and make possible the scaling of very large data sets that are intractable
with conventional methodologies. The method employs a nonlinear mapping
algorithm to project a small random sample, and then 'learns' the
underlying transform using one or more multi-layer perceptrons. The
distinct advantage of this approach is that it captures the nonlinear
mapping relationship in an explicit function, and allows the scaling of
additional patterns as they become available, without the need to
reconstruct the entire map. A novel encoding scheme is described, allowing
this methodology to be used with a wide variety of input data
representations and similarity functions. The advantages of this approach
are illustrated using examples from the field of combinatorial chemistry.
It is shown that in the case of combinatorial libraries, it is possible
to predict the coordinates of the products on the nonlinear map from pertinent
features of their respective building blocks, and thus limit the
computationally expensive steps of virtual synthesis and descriptor
generation to only a small fraction of products. In effect, the method
provides an explicit mapping function from reagents to products, and allows
the vast majority of compounds to be projected without constructing them explicitly.
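The sample-then-learn scheme described above can be sketched as follows, using scikit-learn's MDS in place of the authors' nonlinear mapping algorithm; the data, sample size, and network architecture are all illustrative:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # hypothetical high-dimensional patterns

# Step 1: embed a small random sample with a conventional (slow) MDS.
sample = rng.choice(len(X), size=200, replace=False)
Y_sample = MDS(n_components=2, random_state=0).fit_transform(X[sample])

# Step 2: 'learn' the underlying transform with a multi-layer perceptron.
net = MLPRegressor(hidden_layer_sizes=(30,), max_iter=2000, random_state=0)
net.fit(X[sample], Y_sample)

# Step 3: project all remaining patterns through the explicit learned function,
# without rebuilding the map.
Y_all = net.predict(X)
```

The explicit function `net` is what lets additional patterns be scaled as they become available.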
Cluster Analysis and the Development of a Multi-Genome Database of Tandem Repeats
Gary Benson, The Mount Sinai School of Medicine
This research deals with problems involved in the on-going development of a
multi-genome database of tandem repeats (TRDB). A tandem repeat is an
occurrence of two or more adjacent, often approximate copies of a
sequence of nucleotides. Tandem repeats 1) are known to cause or be associated
with a variety of human diseases, 2) can cause phenotypic variation or
loss-of-function in proteins, and 3) can modify gene expression, possibly by
adopting unusual structural conformations or by acting as transcription factor
binding sites. Tandem repeats are the primary component of major chromosomal
structures including the centromere, telomere and heterochromatin, and are
involved in chromosome condensation. Because tandem repeats often exhibit copy
number polymorphism, they are useful markers for genetic linkage analysis, DNA
fingerprinting and evolutionary studies.
While tandem repeats form one of the major classes of repeats in genomic DNA,
information about them remains incomplete, fragmented and difficult for
researchers to access. Development of TRDB involves generating primary
information about the repeats and collecting ancillary annotation
information. Central issues of TRDB development are:
Clustering of repeats into families. A family is a set of
repeats which have similar patterns but occur in different genomic locations
or in different genomes. We discuss partition type clustering based on distance
between profile representations of the repeats. A profile is a sequence
of discrete distributions. The sequence has length equal to the fundamental
pattern size of the repeat and each distribution represents the frequency of
the four nucleotides in a column of the aligned repeat copies. Use of profile
representations has necessitated the development of new distance measures for
comparing discrete distributions.
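The abstract's distance measures are new and not specified here; as a stand-in, the following sketch averages the standard Hellinger distance column-wise over two equal-length profiles (the frequencies are toy values, not real repeat data):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def profile_distance(P, Q):
    """Average per-column distance between two profiles of equal pattern size.

    Each profile is an (L, 4) array; row i gives the A/C/G/T frequencies
    observed in column i of the aligned repeat copies."""
    return np.mean([hellinger(p, q) for p, q in zip(P, Q)])

P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
Q = np.array([[0.10, 0.70, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
d = profile_distance(P, Q)
```

A real measure would also have to handle profiles of different pattern sizes and circular rotations of the pattern, which this sketch ignores.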
Functional clustering. Complementary to profile clustering is clustering
based on functional properties of the repeats. Functional data are generated
by new sequence analysis methods, by accessing existing data sources for
annotation features, and by biological experiments. These data include, for individual
repeats: 1) 'genomic environment': adjacent genes and localization to intron,
exon, untranslated or intergenic regions, 2) known or predicted copy number
polymorphism, and 3) potential transcription factor binding sites; for repeat
families: 4) internal homogeneity of the family, 5) distribution in the genome,
6) similarity of flanking sequence, and 7) association with protein families.
Integrated data visualization and selection (IDVS) tool. To assist
in the validation of clustering, functions of this tool include 1) a query system
to select repeats/families and associations/properties based on predefined
data requests, and 2) the creation of views for repeat, family and genome properties.
Functional Classification of Protein Families by Top-down Clustering from
Sequence and Structural Data
Andrea Califano, First Genetic Trust
Given a set of proteins, two important problems in biology are the inference of
biologically and functionally related subsets and the identification of functional
regions and residues. The former is typically performed by unsupervised, bottom-up
clustering, using sequence similarity as a measure of relatedness. The latter
is typically performed as an independent step, starting from protein sets
determined a priori, either manually or computationally. Semantically, however,
the two processes are inextricably linked, since protein families are usually
characterized by corresponding functional regions and residues.
This paper introduces a high-performance, unsupervised clustering system that
accomplishes both tasks simultaneously. Potential functional regions, inferred
using the SPLASH pattern discovery algorithm, are first filtered using statistical
criteria and then used to determine functionally related protein subsets. To
achieve increased accuracy and sensitivity, the regular expression patterns
discovered by the algorithm are converted into more sensitive and accurate profile
Hidden Markov Models (HMM). This is the first reported system where potential
functional regions are exhaustively and automatically identified from a set of proteins.
The inference of functional relationship is performed via a general and flexible
model which integrates both sequence and structural information. The resulting
classification system is organized into structures of varying complexity, ranging
from a tree to an acyclic graph. Since the relationships correspond to conservation
of functional regions, these structures are expected to be representative of the
functional relationships formed throughout the evolutionary processes.
To test the system's ability to deal with complex taxonomies, comparative results
on the G-Protein Coupled Receptor (GPCR) superfamily are reported. This includes
more than 150 functionally independent subfamilies. As shown in the results, the
amino acids that are highly conserved in the discovered patterns are very likely
to correspond to functional residues. Several hundred functional residues reported
in the literature, based on mutagenesis experiments, have been analyzed in the
context of the reported patterns. This shows that the system can be used as a
highly predictive aid in planning such experiments.
Modeling Tumors as Complex Dynamic BioSystems
Thomas Deisboeck, Harvard Medical School
There is growing evidence that malignant tumors behave as complex
dynamic biosystems rather than as unorganized cell masses. If this is true,
tumors need to be experimentally studied and ultimately treated as such
systems. This requires the integration of multi-modality data sets. A
promising approach is the cross-disciplinary combination of novel
experimental assays and computational modeling. The talk will describe the
concept leading to the development of such an experimental device as well as
its input into computer visualizations and simulations using cellular automata
and agent-based modeling. Potential clinical applications of this ongoing
work will be discussed.
Functional Characterization in the Post-Genomic Era by Means of Declarative
Query Access to Diverse Data and Applications
(Joint work with A.S. Kosky, Gene Logic, Inc. and L.A. Laroco, Jr.)
Barbara A. Eckman, GlaxoSmithKline
To perform functional characterization in genomic sequence it is necessary to
integrate data from a variety of locations (within an organization or across
the Internet) in a variety of formats (traditional databases, flat files, web
sites). In addition to simply retrieving data, as in traditional DBMSs, it is
necessary to perform specialized data analysis to discover patterns of biological
interest. Integrating arbitrary analysis with query execution permits filtering,
organizing, and enhancing data retrieved by wide-ranging multi-database queries,
and increases data mining efficiency by enabling analyses to be performed only
on datasets of interest.
TINet (Target Informatics Net) is a readily extensible data integration system
developed at GlaxoSmithKline (GSK), based on the Object-Protocol Model (OPM)
multi-database middleware system of Gene Logic, Inc. Data sources currently
integrated include: the Mouse Genome Database (MGD) and Gene Expression Database
(GXD), GenBank, SwissProt, PROSITE, PubMed, GeneCards, and GSK proprietary
relational and SRS databases. Analytic tools are integrated either as data
source servers (e.g., runtime BLAST and GCG motifs searches) or as special-purpose
class methods (e.g., regular expression pattern-matching over BLAST HSP
alignments and retrieving partial sequences derived from GenBank primary
structure annotations). All data sources and methods are accessible through
an SQL-like query language or a GUI, so that when new investigations arise no
additional programming beyond query specification is required. The power and
flexibility of this approach are illustrated in such integrated queries as:
1) "Find homologues in genomic sequence to all novel genes cloned and reported
in the scientific literature within the past three months that are linked to
the MeSH term 'neoplasms'"; 2) "Using a neuropeptide precursor query sequence,
return only HSPs where the target genomic sequences conserve the G?[KR][KR]
motif at the appropriate points in the HSP alignment"; and 3) "Of the human
genomic sequences annotated as channels having exon boundaries in GenBank,
return only those with valid putative donor/acceptor sites and start/stop codons."
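Query 2 above can be sketched with ordinary regular-expression matching over HSP target sequences. Reading '?' as a single arbitrary residue is an assumption about the motif syntax, and the sequences below are toy examples, not real alignments:

```python
import re

# Assumed reading of G?[KR][KR]: 'G', then any residue, then two residues
# each of which is lysine (K) or arginine (R).
MOTIF = re.compile(r"G.[KR][KR]")

def conserves_motif(target_segment):
    """Filter: keep an HSP only if its target sequence contains the motif."""
    return MOTIF.search(target_segment) is not None

hsps = ["MALWGSKKTR", "MALWGSQQTR"]   # toy target segments
kept = [h for h in hsps if conserves_motif(h)]
```

In TINet this kind of filter runs as a special-purpose class method over BLAST HSP alignments, composed with the declarative query rather than hand-coded per investigation.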
Integrative Genomics: Surveys of a Finite Parts List
(Joint work with P. Harrison, J. Qian, V. Alexandrov, P. Bertone, R. Das, D. Greenbaum,
R. Jansen, W. Krebs, N. Echols, J. Lin, C. Wilson and A. Drawid)
Mark Gerstein, Yale University
My talk will focus on analyzing genomes and functional genomics data
in terms of the finite list of protein "parts". I use the term "part"
rather broadly, and depending on context, it can either be a protein
fold or family. I will touch on some of the following topics: (i) How
one can compare different genomes in terms of the occurrence of
parts. (ii) How one can do the exact same operation on the
pseudogenome -- the total complement of pseudogenes in an organism.
(iii) How this idea can be further extended to compare the
representation of parts in the genome versus the transcriptome.
P Harrison , N Echols , M Gerstein (2001). "Digging for Dead Genes: An
Analysis of the Characteristics of the Pseudogene Population in
the C. elegans Genome." Nuc. Acids. Res. (in press).
A Drawid , R Jansen , M Gerstein (2000). "Genome-wide analysis
relating expression level with protein subcellular
localization." Trends Genet 16: 426-30.
A Drawid , M Gerstein (2000). "A Bayesian system integrating
expression data with sequence patterns for localizing proteins:
comprehensive application to the yeast genome." J Mol Biol 301:
J Lin , M Gerstein (2000). "Whole-genome trees based on the occurrence
of folds and orthologs: implications for comparing genomes on
different levels." Genome Res 10: 808-18.
R Jansen , M Gerstein (2000). "Analysis of the yeast transcriptome
with structural and functional categories: characterizing highly
expressed proteins." Nucleic Acids Res 28: 1481-8.
M Gerstein (1998). "Patterns of protein-fold usage in eight microbial
genomes: a comprehensive structural census." Proteins 33: 518-34.
Application of the Support Vector Machine to Detect an Association between a
Disease or Trait and Multiple SNP Variations
MyungHo Kim, Genomics Collaborative, Inc.
With the completion of the human genome sequence announced, it is evident that
the interpretation of DNA sequences is an immediate task. Understanding their
function and signals will require both improving present sequence analysis
tools and developing new ones. In this spirit, we attack one of the fundamental
questions: which set of SNP (single nucleotide polymorphism) variations is
related to a specific disease or trait? Since individuals are known to differ
only at SNP locations, and the total number of SNPs is less than 5 million,
finding an association between SNP variations and a disease or trait is
believed to be an essential step not only for genetic research but also for
drug design and discovery. In this paper, we present a method for detecting
whether there is an association between multiple SNP variations and a trait
or disease.
Here is the basic scheme.
1. Assume that there is no environmental factor.
2. Suggest a vector representation of multiple SNP variations.
3. Apply the Support Vector Machine, which has been attracting considerable
attention in the machine learning community.
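Steps 2 and 3 of the scheme can be sketched as follows. The 0/1/2 minor-allele-count encoding is one plausible vector representation, not necessarily the author's, and both genotypes and phenotype are simulated:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Step 2 (assumed encoding): each SNP genotype as a minor-allele count
# (0, 1 or 2), so an individual with m SNPs becomes a length-m vector.
n, m = 60, 25
genotypes = rng.integers(0, 3, size=(n, m))

# Toy phenotype for illustration: disease status driven by the first two SNPs.
disease = (genotypes[:, 0] + genotypes[:, 1] >= 3).astype(int)

# Step 3: train a Support Vector Machine on the genotype vectors.
clf = SVC(kernel="linear").fit(genotypes, disease)
train_acc = clf.score(genotypes, disease)
```

A real study would of course evaluate on held-out individuals rather than training accuracy, and would have to confront the environmental factors assumed away in step 1.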
A Bayesian Framework for Combining Gene Predictions
(Joint work with V. Pavlovic, A. Garg and S. Kasif)
Pedro Moreno, Cambridge Research Laboratory
Gene identification and gene discovery in new genomic sequences is one of the
most timely computational questions addressed by bioinformatics scientists.
This computational research has resulted in several systems that have been used
successfully in many whole-genome analysis projects. As the number of such
systems grows, the need for a rigorous way to combine the predictions becomes
apparent.
In this presentation we provide a Bayesian network framework for combining gene
predictions from multiple systems. The framework allows us to treat the problem
as combining the advice of multiple experts. Previous work in the area used
relatively simple ideas such as majority voting. We describe the application
of a family of combiners of increasing statistical complexity. In particular,
we introduce, for the first time, the use of Hidden Input/Output Markov models
for combining gene predictions.
We apply the framework to the analysis of the Adh region in Drosophila that has
been carefully studied in the context of gene finding and used as a basis for
the GASP competition. Our preliminary results suggest that the probabilistic
network solution appears promising, resulting in a significant improvement in
exon-level accuracy over the best single predictor.
The main challenge in combination of gene prediction programs is the fact that
the systems are relying on similar features such as codon usage and as a result
the predictions are often correlated. We show that our approach can nevertheless
improve prediction accuracy, and that it provides a systematic and flexible framework
for incorporating multiple sources of evidence into gene prediction systems.
We also note that the approach we described is in principle applicable to other
predictive tasks, such as promoter or transcription-element recognition, and to
combining different sources of functional genomics data.
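The simple majority-voting baseline mentioned above can be sketched per position; the per-base labels from three hypothetical gene finders are toy data:

```python
import numpy as np

# Per-base labels from three hypothetical gene finders over one sequence:
# 1 = predicted exonic, 0 = intronic/intergenic.
pred_a = np.array([1, 1, 0, 0, 1, 0])
pred_b = np.array([1, 0, 0, 1, 1, 0])
pred_c = np.array([1, 1, 1, 0, 1, 1])

# Majority vote: call a base exonic when at least 2 of 3 experts agree.
votes = pred_a + pred_b + pred_c
combined = (votes >= 2).astype(int)
```

The Bayesian network combiners described in the talk replace this fixed rule with learned conditional dependencies, which is what lets them cope with correlated experts.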
Gene Finding in Eukaryotes
Mihaela Pertea, Johns Hopkins University and The Institute for Genomic Research
The gene finding research community has focused considerable effort on human
and bacterial genome analysis. This has left some small eukaryotes without a
system to address their needs. We focused our attention on this category of
organisms, and designed several algorithms to improve the accuracy of gene
detection for them. We considered three alternatives for gene searching.
The first one identifies a coding region by searching signals surrounding the
coding region. This technique is used by GeneSplicer - a program that predicts
putative locations for the splice sites. A second alternative is to identify a
protein-coding region by analyzing the nucleotide distribution within it.
Complex gene finders like GlimmerM combine both the above alternatives to
discover genes. The third alternative carefully combines the predictions of
existing gene finders to produce a significantly improved gene detection system.
GeneSplicer is a new, flexible system for detecting splice sites in the genomic
DNA of various eukaryotes. The system has been tested successfully using DNA
from two reference organisms: the model plant Arabidopsis thaliana and human.
It was compared to six programs representing the leading splice site detectors
for each of these species: NetPlantGene, NetGene2, HSPL, NNSPLICE, GENIO and
SpliceView. In each case GeneSplicer performed comparably to the best alternative,
in terms of both accuracy and computational efficiency.
The basis of GlimmerM is a dynamic programming algorithm that considers all
combinations of possible exons for inclusion in a gene model, and chooses the
best of these combinations. The decision about what gene model is best is a
combination of the strength of the splice sites and the score of the exons
produced by an interpolated Markov model (IMM). The system, which is freely
available at http://www.tigr.org/softlab, has been trained for Plasmodium
falciparum, Arabidopsis thaliana, and Oryza sativa (rice), and should also work
well on closely related organisms.
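A toy version of the exon-chaining dynamic program at the heart of GlimmerM might look like the following; the intervals and scores are illustrative, whereas the real system derives each exon's score from splice-site strength and the IMM coding score:

```python
def best_gene_model(exons):
    """Pick the highest-scoring chain of non-overlapping candidate exons.

    Each exon is (start, end, score).  This is weighted-interval-scheduling
    style DP: for each exon, extend the best compatible earlier chain."""
    exons = sorted(exons, key=lambda e: e[1])
    best = []  # best[i] = (total score, chain) for chains ending at exons[i]
    for i, (s, e, sc) in enumerate(exons):
        prev_score, prev_chain = 0.0, []
        for j in range(i):
            if exons[j][1] < s and best[j][0] > prev_score:
                prev_score, prev_chain = best[j]
        best.append((prev_score + sc, prev_chain + [(s, e, sc)]))
    return max(best, key=lambda t: t[0]) if best else (0.0, [])

score, chain = best_gene_model([(0, 100, 2.0), (50, 150, 3.5), (160, 300, 1.0)])
```

Here the chain (50, 150) followed by (160, 300) beats any chain containing the overlapping exon (0, 100). The quadratic loop would be replaced by a faster search in a production gene finder.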
We developed a combiner algorithm that gains from the diversity of three or
more gene finders. The combiner was tested on three gene finders developed
specifically for the Arabidopsis genome: GENSCAN, GeneMark.HMM and GlimmerA -
the GlimmerM version for A. thaliana. These gene finders are the result of
years of development, and improving upon these systems is quite difficult. The
combiner algorithm not only succeeds at this, but it also offers a real possibility
of further improvements if and when the underlying gene finders are improved.
Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements
(Joint work with P. Sudarsannam and G.M. Church)
Yitzhak Pilpel, Harvard University
The recent availability of microarray data has led to the
development of several computational approaches for studying genome-wide
transcriptional regulation. However, few studies have addressed the
combinatorial nature of transcription, a well-established phenomenon
in eukaryotes. We have developed a new computational method that
analyzes microarray data to discover synergistic motif combinations
in the promoters of S. cerevisiae. Our method suggests causal
relationships between each motif in a combination and the observed
expression patterns. In addition to identifying novel motif
combinations that affect expression patterns during the cell cycle,
sporulation, and various stress response conditions, we have also
discovered regulatory cross-talk between several of these processes.
We have generated motif synergy maps that provide a global view of
the transcription networks in the cell. The maps are highly connected
suggesting that a small number of transcription factors are
responsible for a complex set of expression patterns in diverse
conditions. This approach should be important for modeling
transcriptional regulatory networks in more complex eukaryotes.
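One simple way to operationalize "synergy" (an illustration, not the authors' actual statistic) is to compare the expression coherence of genes carrying a motif pair with that of genes carrying one motif alone; the expression profiles below are synthetic:

```python
import numpy as np

def coherence(profiles):
    """Mean pairwise correlation of expression profiles: one hypothetical
    measure of how tightly co-expressed a gene set is."""
    c = np.corrcoef(profiles)
    iu = np.triu_indices_from(c, k=1)
    return c[iu].mean()

rng = np.random.default_rng(2)
base = np.sin(np.linspace(0, 2 * np.pi, 12))      # shared cell-cycle-like signal
both = base + 0.1 * rng.normal(size=(8, 12))      # genes with motifs A and B
only_a = rng.normal(size=(8, 12))                 # genes with motif A alone

# Positive synergy: the pair constrains expression more than the single motif.
synergy = coherence(both) - coherence(only_a)
```

Repeating such a comparison over all motif pairs, with appropriate significance testing, is the kind of computation that yields a motif synergy map.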
Unweaving Regulatory Networks: Automated Extraction from Literature and
Andrey Rzhetsky, Columbia University
In the first part of the talk I will describe our on-going
effort to build a natural language processing system that extracts information
on interactions between genes and proteins from research articles. In the
second part of the talk I will introduce an algorithm for predicting
molecular networks from sequence data and stochastic models of birth of
An integrative platform for expression and sequence data
(Joint work with S. Bergling, I. Crignon, U. Dengler, S. Grzybek, J. Lange,
J. Rahuel, M. Reinhardt and J. Zhue)
Sven Schuierer, Novartis
Experimental high-throughput techniques such as sequencing and protein
and microarray expression experiments are creating an enormous amount of
data. The rapidly changing environment of genomic research involves a
wide range of technologies which lead to a large number of heterogeneous
data sources and access methods. System integration is essential in this setting.
To meet this challenge we have developed the integrative platform DEMON
(Differential Expression & Annotation Monitor) which combines expression
and sequence data in one database. The core features of DEMON are:
- a generic interface to data and algorithms
- a flexible, open architecture
- a scheduling mechanism for automated high-throughput data processing,
  e.g. for the analysis of sequence and expression data.
DEMON provides tools for the analysis of microarray expression which are
linked to sequence annotation and sequence classification information.
The sequence annotation is pre-computed, which gives users immediate
access to the results of a number of different sequence similarity searches.
Furthermore, DEMON allows different sequence types to be queried in a uniform
manner by building a common coordinate system to which related DNA, RNA
and protein sequences are mapped.
NSF Funding Availability For Data Integration Efforts
Sylvia Spengler, National Science Foundation
The National Science Foundation has a variety of opportunities for
individuals seeking support for data integration activities. These include
a variety of cross-directorate activities as well as Programs in BIO and
CISE. I will give an overview of the opportunities as well as discuss ways to
make future calls rapidly available.
What can be "learned" from gene expression arrays?
Gustavo Stolovitzky, IBM
One important application of gene expression arrays is functional annotation. When
cells are treated under different conditions, genes will change their profile of
expression according to their cellular role, and in principle this profile can
be learned using machine learning techniques. When these algorithms are used along
with prior knowledge of gene function, one might expect to learn the expression
signatures of different functional classes. Support Vector Machines and other
machine learning algorithms have been applied for this purpose [Brown et al.,
PNAS 97, 262-267 (2000)], and this work will be reviewed.
We have explored the use of a supervised learning scheme that uses artificial
neural networks (NN) for the purpose of functional annotation. We considered 100
functional classes catalogued in the MIPS (Munich Information Center for Protein
Sequences) database, and attempted to learn their signature, using the gene
expression data previously used by Eisen [PNAS 95, 14863-14868 (1998)]. We found
that only a small subset (less than 10%) of these functional classes can be
learned. We explored the features that make a class "learnable". For one of the
best learned classes, the TCA cycle, we did a systematic analysis of the False
Positives and False Negatives arising from a cross-validation scheme, and found
that they can be accounted for in terms of metabolic pathways related to the TCA
cycle using the KEGG database of biochemical pathways.
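A minimal sketch of such a supervised annotation scheme, using synthetic expression data and a single hypothetical functional class in place of the MIPS classes and the Eisen data:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

# Toy stand-in for an expression matrix: 120 genes x 10 conditions, where the
# 30 members of one functional class share a common expression signature.
signature = rng.normal(size=10)
in_class = signature + 0.3 * rng.normal(size=(30, 10))
background = rng.normal(size=(90, 10))
X = np.vstack([in_class, background])
y = np.array([1] * 30 + [0] * 90)

# Cross-validated NN predictions, from which false positives and false
# negatives can be examined (as done for the TCA cycle class in the talk).
clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
pred = cross_val_predict(clf, X, y, cv=5)
false_pos = int(np.sum((pred == 1) & (y == 0)))
false_neg = int(np.sum((pred == 0) & (y == 1)))
```

A class is "learnable" in this setting roughly when its members share a signature strong enough to separate them from the background, which is the property the talk investigates.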
A Heart Failure Knowledgebase Combining Experimental Data with Tools for
Integrative Biological Modeling
(Joint work with W. Baumgartner Jr., P. Helm, D. Scollan, C. Yung and T. Suzek)
Raimond L. Winslow, The Whitaker Biomedical Engineering Institute Center for
Computational Medicine and Biology, and Department of Computer Science
Heart failure, the most common cardiovascular disorder, is characterized
by ventricular dilatation, decreased myocardial contractility and cardiac
output. Prevalence in the general population is over 4.5 million, and
increases with age to levels as high as 10%. New cases number approximately
400,000 per year. Patient prognosis is poor, with mortality roughly 15% at
one year, increasing to 80% at six years subsequent to diagnosis. It is now
the leading cause of Sudden Cardiac Death in the U.S., accounting for
nearly half of all such deaths.
For the past six years, we have worked with experimental colleagues in the
NIH-funded Specialized Center of Research in Sudden Cardiac Death to
achieve a more comprehensive understanding of the origins and treatment of
heart failure. We have done so by undertaking a range of experimental
studies which include: a) large-scale measurement of mRNA levels in cardiac
tissue; b) patch-clamp and whole-cell recording in individual isolated
myocytes; c) imaging of cardiac micro-anatomical structure; and d)
electrophysiological recordings of electrical activity; in both normal and
failing hearts. At the same time, we have formulated computational models,
ranging from the level of individual ion channels to single cells and whole
heart, and have used these models to investigate the relationship between
altered patterns of gene expression and mechanisms of arrhythmia in heart failure.
In this talk, we will describe how computational models have been applied
to reach specific conclusions regarding the mechanisms by which
life-threatening arrhythmias arise in heart failure. We will also describe
a more general conclusion emerging from this work - that heart failure is a
complex disease characterized by changes in expression of hundreds of genes
involved in many different cellular sub-systems. In our view, understanding
the functional significance of these changes is a challenging problem that
will require the development of a heart failure knowledgebase comprised of:
a) regional gene expression data; b) regional cellular electrophysiological
data; c) cardiac micro-anatomical data; and d) whole-heart electrical
mapping data; obtained from normal versus failing hearts. In addition, an
interface that supports the exploration of these diverse data sources for
the purpose of model development is required. We will describe our initial
efforts at creating key components of this heart failure knowledgebase.
(supported by NIH HL60133, the NIH Specialized Center of Research on Sudden
Cardiac Death P50 HL52307, the Whitaker Foundation, and IBM Corporation.)
Document last modified on June 11, 2001.