DIMACS Workshop on Whole Genome Comparison

February 28 - March 2, 2001
DIMACS Center, Rutgers University, Piscataway, NJ

Organizer:: Dannie Durand, Carnegie Mellon University, durand@cmu.edu
Program Committee: Joseph H. Nadeau, Case Western Reserve University School of Medicine, jhn4@po.cwru.edu; Steven L. Salzberg, The Institute for Genomic Research, salzberg@tigr.org; David Sankoff, University of Montreal, sankoff@ere.umontreal.ca
Presented under the auspices of the Special Year on Computational Molecular Biology.
Abstracts:


1.

High-Performance Algorithm Engineering for Gene-Order Phylogenies

David A. Bader, Bernard M.E. Moret, Tandy Warnow, Stacia K. Wyman, and Mi Yan
University of New Mexico and University of Texas at Austin

email: dbader@eece.unm.edu
URL:   http://www.eece.unm.edu/~dbader

Abstract:

    Some organisms have a single chromosome or contain
single-chromosome organelles (mitochondria or chloroplasts), the
evolution of which is mostly independent of the evolution of the
nuclear genome.  Given a particular strand from a single chromosome
(whether linear or circular), we can infer the ordering of the genes
along with the directionality of the genes, thus representing each
chromosome by an ordering of oriented genes.  The evolutionary process
that operates on the chromosome may include inversions and
transpositions, which change the order in which genes occur in the
genome as well as their orientation; and also insertions, deletions,
or duplications, which change the number of times and the positions in
which a gene occurs.

    A natural optimization problem for phylogeny reconstruction from
such data attempts to reconstruct a tree with a minimum number of
permitted evolutionary events.  Versions of this problem are all known
or conjectured to be NP-hard.  Another approach is to estimate
leaf-to-leaf distances between all genomes and then use a standard
distance-based method (such as neighbor-joining) to construct the
tree.  Such approaches are fast and may prove valuable in
reconstructing the underlying tree, but cannot recover the ancestral
gene orders.

    Sankoff and his colleagues developed a technique, breakpoint
analysis, for reconstructing a phylogeny based on gene-order data and
implemented it in the code `BPAnalysis'.  Within a framework that
enumerates all trees, BPAnalysis uses an iterative heuristic to label
the internal nodes with signed gene orders.  The outer loop enumerates
all (2n-5)!! leaf-labelled trees on `n' leaves, an exponentially large
value.  The inner loop runs an unknown number of iterations (until
convergence), with each iteration solving an instance of the NP-hard
Traveling Salesperson problem (TSP) at each internal node.  Our
simulation studies suggested that the approach worked well for many
datasets, but that the BPAnalysis software was too slow.  For
instance, we estimated that BPAnalysis would require several centuries
of computation on a typical workstation in order to solve a collection
of chloroplast data from 12 species of Campanulaceae (plus one
outgroup), each genome consisting of 105 gene segments.

    We therefore decided to produce a new software suite, `GRAPPA'
(Genome Rearrangements Analysis under Parsimony and other Phylogenetic
Algorithms), based on the approach of Blanchette and Sankoff, but
easily extensible to other search strategies and using
high-performance algorithm engineering to make it fast enough to be
useful.  High-performance algorithm engineering is a combination of
low-level algorithmic improvements, data structures improvements, and
coding strategies that combine to eliminate bottlenecks in the code,
balance its computational tasks, and make it cache-sensitive.  We also
used massive parallelism to gain an additional speedup.  The code for
GRAPPA can be obtained from http://www.cs.unm.edu/~moret/GRAPPA/ and
is freely available under the GPL.


Algorithm Engineering.

    We improved bounding, both within the exact TSP solver and at the
level of tree enumeration---the latter by considering circular
orderings rather than trees.  We condensed genomes---that is, we
grouped subsequences of gene fragments that appear in all genomes into
a single `supergene,' thereby considerably reducing the size of
instances.  This condensing is done initially in a static pass and
then dynamically after each alteration of the set of candidate labels.
We scored trees incrementally in constant time.  We took advantage of
the special nature of the instances of the TSP generated in the
reduction to speed up the exact solution of these instances.  We paid
special attention to the initialization phase, in which an initial
label is assigned to each node of the current tree topology.


    Algorithm engineering suggests a refinement cycle in which the
behavior of the current implementation is studied in order to identify
problem areas, which are then eliminated through algorithmic or coding
changes.  Constant profiling is the key to such an approach, because
the identity of the principal `culprits' in time consumption changes
after each improvement, so that attention must shift to different
parts of the code during the process---including revisiting already
improved code for further improvements.

    The original BPAnalysis has a significant memory footprint and
poor locality: it requires over 60MB of space and 12MB for its working
set when running on our Campanulaceae dataset.  GRAPPA has a tiny
memory footprint (1.8MB on the same dataset) and mostly good locality
(nearly all of our storage is in arrays preallocated in the main
routine), which enables it to run almost completely in cache (the
working set size is 600KB).  Cache locality can be improved by
returning to a FORTRAN-style of programming, in which records
(structures/classes) are avoided in favor of separate arrays, in which
simple iterative loops that traverse an array linearly are preferred
over pointer dereferencing, in which code is replicated to process
each array separately, etc.


Results.

    GRAPPA is highly flexible and includes various TSP solvers.  In
order to estimate the raw speed of each method, we ran all methods on
real and synthetic datasets, letting them process thousands to
hundreds of thousands of trees until a fixed time bound was attained.
The high processing rate of our exact solver (we have observed rates
from 100 to 5,000 times faster than BPAnalysis) makes it possible to
solve problems with 10--12 genomes on a single processor within
minutes to hours.  We then parallelized the code and used LosLobos,
the 512-processor Linux supercluster at The University of New Mexico's
Albuquerque High-Performance Computing Center, to achieve a
million-fold speedup on the Campanulaceae dataset: a complete
breakpoint analysis (with inversion distances) for the 13 genomes in
that set ran in less than 1.5 hours!


Ongoing and Future Developments.

    Our current implementation also permits (in the same running time)
a search for the minimum inversion phylogeny. We are extending GRAPPA
to be able to handle a larger variety of events (deletions,
insertions, duplications, transpositions, etc.).  We have also
developed new algorithmic strategies for solving the median phylogeny,
based upon fixed-parameter approaches in which we bound the number of
events on each edge.  Our talk will present these new developments, as
well as describe in detail the algorithmic engineering used in
designing and improving GRAPPA.  We will also show how the techniques
we used in making the code efficient and in parallelizing it are
applicable to a wide variety of algorithmic approaches in phylogeny
and other areas of computational biology. Finally, we will give a
short overview of the applicability of high-performance computing to
computational biology.



2.

Human and Mouse Gene Structure:
Finding Genes by Cross-Species Sequence Comparison

Serafim Batzoglou
Whitehead Institute

Abstract:

A fundamental task in analyzing genomes is to identify the
genes. This is relatively straightforward for organisms with
compact genomes (such as bacteria, and yeast) because exons
tend to be large and the introns are either non-existent, or
short. The challenge is much greater for larger genomes (such
as those of mammals) because the exonic "signal" is scattered
in a vast sea of non-genic "noise".

We explore a powerful new approach to gene recognition, by
using cross-species comparison, that is, by simultaneously
analyzing homologous loci from two related species.
Cross-species sequence comparison can help highlit important
functional elements such as exons, because such elements tend
to be more strongly conserved by evolution than random
genomic sequence. We focus on the ability to accurately 
identify coding exons by comparison of syntenic human and 
mouse genomic sequence.

We first undertake a systematic comparison of the genomic
structure of 117 orthologous gene pairs from human and mouse,
to understand the extent of conservetion ofthe number, length,
and sequence of exons and introns. We then use these results
to develop algorithms for cross-species gene recognition,
consisting of GLASS, a new alignment program to provide
global alignments of large genomic regions by using a
hierarchical approach, and ROSETTA, a program to identify
coding exons in both species based on the coincidence of 
genomic structure.



3.

The Central Role of DNA Structural Properties in Genomic Regulation 

Craig J. Benham   
Department of Biomathematical Sciences 
Mount Sinai School of Medicine 
New York, NY


Standard methods to computationally predict the locations of regulatory regions
within DNA sequences commonly search for consensus motifs.  Yet several classes
of DNA regulatory regions have no associated consensus sequence.  And even in
cases where a clear consensus sequence is known, its presence may be necessary,
but not sufficient, for function.  So searches based on such a consensus
commonly have unacceptably high false positive rates.  Further, string-based
methods do not illuminate the mechanisms of regulation.  In many cases other
attributes, not readily discernible from sequence character strings, are also
required for activity.  This talk will consider one such effect - the extent to
which the B-form helical duplex structure of DNA is destabilized by the imposed
stresses that are stringently regulated in vivo.

Both formally exact and more efficient but approximate statistical mechanical
methods have been developed to analyze the stress-induced destabilization of
duplex DNA sequences.  This phenomenon is complicated to analyze because the
imposed stress globally couples together the transition behaviors of all the
base pairs experiencing it. When applied to the analysis of specific DNA
sequences, the predictions of these methods agree in a quantitatively precise
manner with experimental measurements of the locations and extents of local
strand separations.  This high degree of accuracy is achieved even though there
are no free parameters to be fit.  This quantitative precision justifies their
use to predict the duplex destabilization properties of other DNA base
sequences, on which experiments have not been performed. 

These methods have recently been extended to analyze long genomic sequences,
including complete chromosomes,  In this talk I will describe the computational
method, and present some results found in the analysis of the complete E. coli
chromosome.  

The sites of predicted duplex destabilization within this genome are closely
associated with specific DNA transcriptional regulatory elements.  The
associations between predicted destabilized sites and two classes of putative
promoters have been assessed.  These are the set of experimentally
characterized promoter, and the set of divergent ORFs.  The associations that
would be expected at random also have been determined.  In both cases the
association with actual promoters is much stronger than would be expected at
random, often more than 40 standard deviations above the mean for random
association. 

In a collaboration with Dr. Wes Hatfield (UC-Irvine) we are investigating the
IHF-regulated ilvPG  promoter.  We have shown that it is activated by a
mechanism involving the transmission of stress-induced destabilization.  The
theoretical analysis of all other known IHF-regulated genes finds that they all
have destabilization properties that suggest that they also could be regulated
by the same mechanism.  Moreover, a complete analysis of the entire genome
finds a total of 125 ORFs with the catenation of properties needed for this
mechanism.  These results are currently being experimentally tested using
expression arrays.  This is the first collaborative investigation of a global
mechanism of regulation.

This work shows that stress-induced strand separation may play central roles in
the initiation of gene expression. Indeed, recent results suggest that it may
be among the most archaic regulators of this essential biological process. 
Speculations will be presented regarding the reasons for this importance.

The results obtained to date suggest that accurate new computational methods
can be developed to search DNA sequences for specific classes of regulatory
regions based on their characteristic destabilization properties. The progress
will be described on integrating these destabilization methods into a
promoter-prediction algorithm.



4.

Title: Analysis of large-scale genome rearrangement and duplication in real
       sequence data: experience with Arabidopsis thaliana and the draft
       human sequence

Speaker: Daniel G. Brown
	 Whitehead Institute / MIT Center for Genome Research and
	 Department of Computer Science, University of Waterloo

Abstract:

The identification of large-scale evolutionary events such as whole-genome
duplication and genome rearrangement can be difficult long after they
occur.  This problem is particularly worrisome in the case of whole-genome
duplication, as one copy of most duplicated genes become pseudogenes
or diverge substantially.  The problem becomes more confused when multiple
rounds of duplication are a possibility, and yet more complicated as we
analyze genomes for which gene annotation data is uncertain or incomplete,
such as the draft human genome sequence, and are made still more
complicated by tandem duplications, difficulties identifying which genes
are actually paralogous, and by a host of other sources of error which are
not a problem in theoretical models.

We discuss our experiences studying genome rearrangement in 90% of the
genome of Arabidopsis thaliana and in the draft sequence of the human
genome.  Using methods we have developed to resolve most of these
difficulties, we have been able to identify multiple rounds of likely
duplication in the Arabidopsis genome, and have approximately dated the
time of these duplications.  In the human genome, the uncertain annotation
and difficulty in knowing the exact order of the draft contigs has made
things much more complicated, and we have developed new methods to deal
with this data set; here, we present our preliminary results looking for
duplicated gene sets.

Joint work with Todd Vision and Steve Tanksley (Arabidopsis); and with
Tarjei Mikkelsen, Serafim Batzoglou, Simon Kasif and Eric Lander (human
genome)



5.

Title: "Large-scale sequence comparison by locality-sensitive hashing"
Speaker: Jeremy Buhler, University of Washington

Abstract:

In annotating very long genomic DNA sequences, it is useful to compare
these sequences to find high-scoring pairwise local alignments, which
often represent conserved biological features.  Comparisons may be
performed either between two or more long sequences, such as
orthologous stretches of human and mouse DNA, or within one sequence,
as when finding repetitive elements in a chromosome.

Typically, algorithms for discovering local alignments on a large
scale rely on the presence of sufficiently long runs of matching
nucleotides in any alignment of interest.  We describe a new
randomized algorithm, LSH-ALL-PAIRS, that lifts this restriction by
directly finding ungapped alignments with up to a specified fraction
of substitutions, regardless of whether they contain a long exact
substring match.  Our algorithm adapts the technique of
locality-sensitive hashing, previously used to solve high-dimensional
computational geometry problems, to compute local alignments.  Our
implementation of LSH-ALL-PAIRS scales to sequences tens of megabases
in length while discovering similarities with as little as 60-70\%
nucleotide identity.  The algorithm can be effectively parallelized by
iterating its core randomized trial simultaneously on multiple
processors.



6.

The Space of RNA Encodings of a Target Protein
Barry Cohen, State University of New York at Stony Brook

This talk describes work on the space of RNA encodings for a
target protein. This question is of interest from two
angles.

First, it sheds light on evolutionary mechanisms at the
molecular level, which have been at work for billions of
years. We ask whether evolution through natural selection,
the familiar force which over the course of many generations
shapes species and adapts them to their environment, also
adapts and shapes RNA molecules? This speaks to our
curiosity about the forces which mold living things.

Second, this work is of practical importance. To the extent
that we identify characteristics of particular RNA molecules
which nature selects for or against, we may modify those
characteristics to achieve desired changes in an organism.
The desired changes could have medical or agricultural or
industrial significance.

RNA and protein molecules are both linear sequences, or
strings, over a small alphabet. Each set of three
consecutive letters in an RNA sequence (called a codon)
codes for a letter of protein. Each letter in the protein
alphabet has, on average, three possible precursor codons.
There are, therefore, exponentially many RNA sequences which
code for a given target protein. This large set of RNA
sequences are the protein's RNA ``space." These spaces are
not only theoretical and virtual. Nature actively explores
the space of possible RNA encodings for each protein through
random mutations within a population over evolutionary time.

Why, over the course of eons of evolution, has nature chosen
one of these possible encodings? Has it, indeed, ``chosen,"
or is the encoding we see simply a snapshot of a particular
moment in a random walk through this space, without
functional significance?

We conjecture that the shape and stability of an RNA
molecule enhance or impede the chemical machinery which
translates it into a protein, and so impact on the chances
of the organism for survival and reproduction. Accordingly,
we expect the stability of RNA molecules found in nature to
differ from those of randomly composed RNA molecules from
the same space.

We present the following results:

  Extensive simulations showing that naturally occurring 
  RNA coding sequences differ systematically from randomly
  composed sequences in the same space, providing
  evidence for the conjecture that they are shaped by
  natural selection.

  An efficient algorithm to find the most stable RNA
  sequence in a ``space": this dynamic programming algorithm
  provides an exact solution in $O(n^3)$ time and $O(n^2)$
  space.

  A proof that the corresponding least-stable RNA sequence
  problem is NP-complete. But various heuristics show
  promise of giving reasonably good answers.



7.

Computational, Spatial Analysis of Gene Duplication in the Mouse Genome
Dannie Durand, Carnegie Mellon University

Duplication of large genomic regions, ranging from chromosomal
segments to the entire genome, is believed to have played a crucial
role in developmental innovation in vertebrates.  While many
isolated examples of evolution through gene duplication in mammals
exist, the large-scale events - duplication, rearrangement and gene
loss - that led to the spatial organization of modern mammalian
genomes are poorly understood.  I describe a computational,
genome-wide approach to finding duplicated genes in mouse and
discuss methods for reconstructing the history of large-scale
duplications.



8.

Automating Comparative Map Construction 

Debra Goldberg, Center for Applied Mathematics, Cornell University, 
Susan McCouch, Department of Plant Breeding, Cornell University, 
Jon Kleinberg, Department of Computer Science, Cornell University
            
Comparative maps are a powerful tool for aggregating genetic information 
about related organisms, for predicting the location of orthologous genes, 
for understanding chromosome evolution and inferring phylogenetic 
relationships and for examining hypotheses about the evolution of gene 
families and gene function in diverse organisms.  Construction of any 
genetic map is laborious, but compiling comparative maps across multiple 
species requires a large investment of manual effort on the part of 
biologists.  Additionally, biologists from different labs often use 
differing rules which are often not quantified or made precise in the 
literature, making it difficult to understand how the various comparative 
maps relate to one another.  

We have formalized the concept of a comparative map in a way which is
broadly applicable.  One of the primary goals of comparative mapping 
efforts is taken to be the identification of common ancestral linkage 
groups, and consider homeologous (syntenic) segments to be modern 
remnants of such common ancestral segments.  Another goal is to reduce 
volumes of data into a form more easily used and understood.  To achieve 
these goals, we offer a rigorous yet simple definition for the process 
of constructing comparative maps that is based on fundamental principles.  
The first fundamental principle is the comparative map should be accurate, 
in that the data should be explained well by the map.  Second, it should 
represent a high-level global view of the relationships between the two 
genomes; in other words, there should be relatively few distinct segments 
in the labeling, so that a large volume of marker data can be distilled 
into a concise representation from which further hypotheses can be made 
at a global level.  We call this the parsimony criterion.  Consistent 
with providing a representation at a global scale of resolution, the 
labeling need not ``explain'' the presence of every marker.  Some 
homologous markers are the result of microevolutionary events which are 
not indicative of common ancestral linkage groups, some are the results 
of errors or ambiguities in the data, and others indeed represent evidence 
of common ancestral linkage groups, but we do not assign too much weight 
to any single piece of evidence.

From this we have developed robust, efficient algorithms which automate 
the identification of homeologous (syntenic) segments and construction of 
comparative maps.  Our algorithms find an optimal balance of parsimony and 
accuracy using dynamic programming techniques.  We begin with a simple 
linear model, but discover that it implicitly notices (and penalizes for) 
certain long-range dependencies which are biologically important.  We solve 
this problem with a stack model, which explicitly handles these long-range 
dependencies in a biologically intuitive manner.

We have incorporated this algorithm into a tool which is now available in 
a preliminary version.  For input, it requires the positions of the 
markers of one species, as well as the location of homologs to each marker 
in the second species.  Output is given both graphically and in text form.  
Only a single parameter is required, which carries a simple biological 
explanation.  Our program allows comparative maps to be constructed in a 
few minutes.  Results have been evaluated for diverse pairs of species, 
and closely approximate prior manual expert analyses.  




9.

Amino Acid Substitution Matrices for Compositionally Biased Bacterial
Genomes 

Susan Abramski (1), Marek Kimmel (1) and George Weinstock (2) 

(1) Rice University, Houston, TX 
(2) University of Texas at Houston and Baylor College of Medicine,
Houston, TX 

Introduction. Amino acid substitution matrices are tools used to determine
the most likely way a protein has or might evolve and to determine if two
proteins might be evolutionary relatives. Substitution matrices can reveal
similarities that indicate a relationship between two proteins that were
thought to be unconnected. Such relationships are particularly important
in the construction of phylogenetic trees and in understanding the
selection pressures that organisms have adapted to. The BLOSUM matrices
(Hennikoff and Henikoff 1991) are among the most widely used matrices in
protein comparison applications. BLOSUM matrices are presented as series
of matrices, indexed by integer numbers, with each matrix of the series
appropriate for a different evolutionary distance. Recently, a number of
bacterial genomes were completely sequenced. This allows constructing
substitution matrices for this group of organisms, and even for particular
subgroups of bacteria. One of the characteristics of bacterial genomes is
a frequent compositional bias of DNA. The purpose of this study was to
explore how different are the substitution matrices based on bacterial
genomes with different DNA compositional biases. The outcome may impact
the efficiency with which homologous sequences in different bacterial
species are found, as well as allow some insight into bacterial molecular
evolution. 

Methods. The BLOSUM matrices in this experiment were designed to account
for DNA compositional bias. If a genome contains more than 50 percent
guanine (G) and cytosine (C), it is GCbiased (or GCrich). If it contains
more than 50 percent adenine (A) and thymine (T), it is AT biased.
Matrices were constructed in much the same way as the original Henikoff
BLOSUM matrices, but instead of producing just one series intended for all
organisms regardless of compositional bias, two series of matrices were
constructed. One series was in tended to model the amino acid mutations
that take place in GC rich organisms, the other series to model the
mutations in AT rich organisms. A script was written to parse .fna files
from Genbank and calculate the percentage of GC in each bacteria. The
script was run on the following complete bacterial genomes contained in
the NCBI ftp directory genbank/genomes/bacteria as of June 27, 2000:
Aeropyrum pernix K1, Archaeoglobus fulgidus, Aquifex aeolicus, Borrelia
burgdorferi, Bacillus subtilis, Chlamydia trachomatis, Chlamydia
pneumoniae, Escherichia coli, Haemophilus influenzae Rd, Helicobacter
pyori 26695, Mycoplasma genitalium, Methanococcus jannaschii, Mycoplasma
pneumoniae, Methoanobacterium thermoautotrophicum, Mycobacterium
tuberculosis H37Rv, Pyrococcus horikoshii, Rickettsia prowazekii strain
Madrid E, Synechocystis PCC6803, and Treponema pallidum. The seven ATrich
genomes used were Borrelia burgdorferi, Mycoplasma genitalium, Rickettsia
prowazekki, Haemophilus influenzae, Helicobacter pylori, Methanococcus
jannaschii, and Mycoplasma pneumoniae, and the three CG rich genomes were
Mycobacterium tuberculosis, Escherichia coli, and Treponema pallidum. The
asymmetry is caused by the paucity of CGrich genomes among those
completely sequenced. More than 90 proteins that are highly conserved in
bacteria across evolutionary distance were chosen as the basis for the
matrices. These proteins are important for life functions in bacteria.
Amino acid mutations in these proteins are more likely to be reflections
of evolutionary pressures than mutations in less essential proteins. Most
of the chosen proteins fell into the categories of ribosomal proteins,
aminoacyl tRNA synthases, involvement in protein translation, and
involvement in RNA synthesis. A script was written to search for the de
sired proteins (protein search terms) in the .faa files of AT rich genomes
and to extract those protein sequences. The script created output in the
form of input for the Protomat package (Henikoff and Henikoff 1992).
Blocks of multiple local alignments of proteins from AT rich genomes were
made using the MotifJ component of the Protomat package. MotifJ was run on
all .lis files in the protomat directory in mode 3, which is the
noninteractive mode. The .blk files produced contained all of the blocks
extracted from the protein sequences of AT rich genomes. These blocks were
assembled into a blocks database called ATrich blocks. The BLOSUM package
was run repeatedly with the ATrich blocks database with a minimum block
strength of zero a maximum block strength of 9999, and the clustering
percentage was varied among 30, 40, 50, 62, 70, 80, 90, 100, and n (no)
clustering. The scale was set at 2 bits, or n=2. 

Results and Discussion. The BLOSUM matrices for the ATrich genomes are
very close to the original BLOSUM matrices, constructed without regard to
DNAcomposition bias. The matrices for the GCrich genomes differ in some
entries, but the differences may not be statistically significant (only
three GC rich genomes were used). The above result seemed unexpected, so
additional analyses were carried out. Since bacterial genomes contain
little noncoding DNA, the AT/GC biases are present in genes. One
possibility was that the 90 most conserved genes that we included, do not
exhibit these biases. Another possibility was that the AT/GC bias in these
genes is confined to the silent residues in codons. To test these
hypotheses, we compared the amino acid composition of the amino acid
composition of the ATrich genomes to that of the GCrich genomes. The
results indicated that proteins in these two groups of organisms have
different amino acid compositions, consistent with the AT/GC bias. The
outcome of the study is at the same time reassuring and surprising. It
seems that the same rules of amino acid substitution are valid for
apparently different genomes of eubacteria, a very diverse domain of life.
However, it seems unexpected that almost identical rules produced genomes
with different AT/GC biases. Further study of the problem seems justified. 

Acknowledgments. Tom McNeill helped a lot with writing the scripts in
PERL. 

References 
Henikoff, S. and Henikoff, J.G. (1991) Nucleic Acids Res. 19: 6565. 
Henikoff, S. and Henikoff, J.G. (1992) Proc. Natl. Acad. Sci. USA 89:
10915. 



10.

Have functional nucleic acid sequences evolved characteristic base
composition biases: a comparative study of genes and genomes

Rob Knight and Erik Schultes

Introduction

Only a miniscule fraction of the possible nucleic acid sequences actually
perform a catalytic or biological function. Here, we ask whether
functional sequences are isotropically distributed over the space of
possible nucleotide compositions, or whether functional sequences have
characteristic nucleotide composition biases. We apply this analysis to a
variety of biological and artificial sequences. Although differences in
mutation spectra between different organisms greatly affect the
composition of their genomes, functionally constrained sequences still
fall in only a small region of the possible composition space.

Here we focus on three specific questions:

1. How much can sequences with conserved function, such as rRNA and
orthologous protein-coding genes, vary in composition under mutation
pressure?

2. Organisms differ greatly in the composition of their coding regions,
although there is a large core of orthologous proteins conserved across
all three domains of life. Do these compositional differences arise
because different genomes favor proteins from superfamilies that have
particular amino acid compositions (e.g. olfactory proteins), or do
genome-wide mutational processes affect all genes including the core set?

3. Do unrelated catalytic sequences share compositional characteristics?
In particular, do artificially selected ribozymes have the same biases as
natural ribozymes (rRNA, RNase P, etc.)?

Methodology

LSU and SSU rRNA sequences and structural alignments were downloaded from
the rRNA database at http://rrna.uia.ac.be/. Codon usages for
protein-coding sequences were downloaded from CUTG at
http://www.kazusa.or.jp/codon/, based on GenBank Release 120.0. Aptamer
and ribozyme sequences were compiled from the literature by the authors.
Other sequences, including complete genomes and various classes of
functional RNA molecules, were downloaded from NCBI
http://www.ncbi.nlm.nih.gov/ or from other publicly available sources.

Compositional data (fraction of U or T, C, A, and G) for each sequence
were extracted by ISO C programs written by the authors. Where structural
information was available, we extracted compositional data for the
different types of feature (stem, loop, helix) individually. Similarly,
for protein-coding regions, we tabulated data for the first, second, and
third codon positions.

We used the program MAGE, freely available from
http://www.faseb.org/protein/kinemages/MageSoftware.html, for 3D data
visualization. We plotted the composition of particular types of sequences
in a tetrahedron, with U, C, A, and G as the four vertices, as previously
described [Schultes, 1997 #2361; Schultes, 1999 #2385]. We used Microsoft
Excel to quantitate patterns that we identified visually in the simplex.

Preliminary Results

1. In a large sample of rRNA (1320 LSU sequences and 13 215 SSU sequences,
varying in composition from 12% to 75% G+C), we find that base pairing
constrains helical regions of RNA to lie parallel to Chargaff's Axis
(where A=U and G=C), with characteristic offsets to account for
non-Watson-Crick pairing, e.g. G-U interactions; however, loops and bulges
are generally biased towards purines. These trends are replicated
independently in the three domains of life, and in plastids, suggesting
functional, rather than hereditary, constraints. However, in mitochondrial
rRNA, Chargaffs Rule is violated: G and C are uncorrelated, as are A and
U. This implies either extensive, uncharacterized base modification, or
that base pairing is greatly reduced in mitochondrial (vs. nuclear) rRNA.
We also find that the nucleotides annotated as loops and bulges differ
significantly in composition. We plan to compare these features with
analogous regions of artificially selected aptamers and ribozymes, and
with other natural ribozymes.

2. From a sample of complete mitochondrial genomes from 133 species, we
find that the compositions of coding regions, rRNA, tRNA, and unannotated
regions are highly correlated. This supports the idea that nucleotide bias
due to mutation greatly affects even sequences with conserved functions.
Additionally, we show that even individual protein-coding genes, such as
ATPase 8 and Cox3, vary greatly in sequence. Interestingly, the amino acid
composition of these conserved sequences is largely predictable from the
pattern of mutations. Interestingly, although bacterial and eukaryotic
genomes follow the intra-strand Chargaffs Rule where G=C and A=T [Sueoka,
1992 #2268], mitochondrial genomes are not bound by this constraint.

3. We are currently compiling sequences of other classes of functional RNA
for comparison with the LSU and SSU rRNA sequences, and hope to have this
comparison complete by the time of the meeting. Additionally, we hope to
repeat our mitochondrial analysis for all available complete nuclear
genomes, and to repeat the rRNA analysis for several families of
orthologous protein-coding genes.



11.

Determination of vertebrate gene families

              Antje Krause (1*), Georgia Panopoulou (2),
             Steffen Hennig (2), and Martin Vingron (1,2)

     (1) Deutsches Krebsforschungszentrum, Theoretische Bioinformatik,
         Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany
     (2) Max-Planck Institut fuer Molekulare Genetik,
         Ihnestrasse 73, D-14195 Berlin, Germany
     (*) EMail: A.Krause@dkfz.de

We present a clustering of invertebrate and vertebrate sequence data into
well separated gene families as a prerequisite for the analysis of
vertebrate evolution and functional annotation.

Starting point for our work was a data set of about 14,000 EST sequences
from amphioxus. The cephalochordate amphioxus is the last invertebrate
chordate that separated from the vertebrate lineage and for which there is
a living representative. Major duplication events during vertebrate
evolution are assumed to have happened after the divergence of amphioxus.
To prove this hypothesis one depends on well separated vertebrate gene
families having only one orthologous representative in the invertebrates.

Our clustering approach is based on the idea of Tatusov and Koonin, now
widely known as the "two-way-blast" approach and realized in the COGs
database.

As a basis we took the set of predicted protein sequences of the three
completely sequenced invertebrate organisms Saccharomyces cerevisiae
(6,358 sequences from MIPS), Caenorhabditis elegans (19,099 sequences from
wormpep20), and Drosophila melanogaster (13,675 sequences from FlyBase).

Our clustering approach starts with an all-against-all search of the
complete sets of invertebrate protein sequences. We use only the best hit of
each sequence in both other invertebrate sequence sets for clustering,
independently of the score or E-value. Sequence pairs having each other as
best hit are supposed to be orthologous and hits in only one direction to be
paralogs. Following this procedure with three different invertebrate
organisms, the number of false positives can be extremely reduced and a
quality for each cluster can be determined based on the structure and
connectivity of the orthologous group.

Different topologies can be observed in the resulting cluster set:
1. 1,418 orthologous groups are built of all three invertebrate organisms.
2. 2,048 groups contain only C.elegans and Drosophila, 136 groups only
    C.elegans and Yeast, and 248 groups only Drosophila and Yeast.
3. 411 groups contain multiple representatives of one organism. Families
    ending up in these clusters are e.g., HSP70, tubulin alpha, tubulin beta,
    multidrug resistance proteins, dynein heavy chain, myosin heavy chain,
    and actin.

In a second step we added vertebrate, agnathe, and cephalochordate protein
sequences from SWISS-PROT, PIR, and TrEMBL, and consensus sequences from
the human UniGene data set, our inhouse mouse GeneNest EST clustering, and
the amphioxus project. After searching these sequences against the
invertebrate sequence set, again only best hits were used and a sequence
added to an orthologous group only if all invertebrate orthologs of this
group were matched.

For 2,918 of all orthologous groups we were able to include at least one
vertebrate protein sequence. For another 685 groups without a vertebrate
protein match we found potentially related vertebrate EST consensus
sequences. 931 of the amphioxus consensus sequences are successfully
assigned to 645 of the orthologous groups.

The Clusters of Orthologous and Paralogous SEquences (COPSE) are available
for querying and browsing at: http://www.dkfz.de/tbi/services/copse/form



12.

A Gossip-Based Approach to Syntenic Distance
David Liben-Nowell, MIT

Recently, there has been considerable interest in the definition and
exploration of computational models that measure the genetic similarity
between species.  Many of these fit within the following family of models:
one identifies a set of fundamental transformations that can alter a
genome; the distance between two genomes G1 and G2 is then the length of
the shortest sequence of these steps that transforms G1 into G2.

Two of the most commonly-studied models within this paradigm are the
reversal distance and transposition distance.  Both are based on
transformations that alter the linear order of genes within a chromosome,
by reversing a contiguous segment of the chromosome or by moving a segment
to another point in the genome.  However, the order of genes within
chromosomes is often unknown; furthermore, in multichromosomal species, it
is not clear how to weight these intrachromosomal reorderings versus
interchromosomal events that move genes from one chromosome to another.

These types of considerations motivated Ferretti, Nadeau, and Sankoff [2]
to define the syntenic distance model of genomic distance.  This model
ignores any information about gene order within chromosomes, focusing
purely on the assignment of genes to chromosomes.  As such, the only
relevant transformations are those which result in interchromosomal
movement of genes: fusions, where two chromosomes join into one; fissions,
where one chromosome splits into two; and translocations, where two
chromosomes exchange subsets of their genes.

In this work, we describe two recent results on the syntenic distance.
In joint work with Jon Kleinberg, we have shown that the syntenic diameter
--- the syntenic distance between two maximally different species --- is
exactly 2n-4.  This result is based upon a surprising connection between
syntenic distance and gossip problems in communication networks, in which
we are interested in modeling the flow of information among people in a
network.

Further exploiting the connection between syntenic distance and gossip
problems, we have given a more efficient algorithm to exactly compute the
syntenic distance between two species.  Computing this distance exactly is
NP-Complete [1], but our algorithm is reasonably efficient in the
interesting special case when the syntenic distance is small.  Our
algorithm requires O(2^{O(d log d)}) time when the syntenic distance is d,
improving the best previous runningtime of O(2^{O(d^2)}), due to DasGupta
et al [1].

THE SYNTENIC DIAMETER.  The diameter problem for a genomic distance
measure is todetermine the maximum distance between any pair of genomes.
For rearrangement-based models, progress on the diameter problem has often
led to insights into structural properties of the distance measure on
general instances; unfortunately, these diameter questions have often
turned out to be difficult to resolve.

In recent work with Jon Kleinberg, we have shown that the syntenic
diameter of the space of n-chromosome genomes is exactly 2n-4.  Our proof
is based on the relationship between syntenic diameter and the following
gossip problem, popular among mathematicians in the 1970s.  There are n
people, each of whom initially knows a unique piece of information.  The
gossipers communicate by telephone calls, and whenever two speak, they
share all of the information that they know.  We aim to minimize the
number of phone calls before each person knows all of the information.  A
number of researchers have given independent proofs that 2n-4 calls are
necessary and sufficient to achieve this goal.  (See [3].)

For the syntenic diameter problem, a monotonicity property immediately
identifies the genomes G1 and G2 that are maximally separated: every
chromosome in G1 shares a gene with every chromosome of G2.  Solving the
syntenic diameter question reduces to determining the syntenic distance
between these genomes.  One can view the syntenic distance problem between
genomes G1 and G2 as follows.  Let the target of a chromosome C in G1
denote the set of chromosomes of G2 which share a gene with C.  The goal
for the syntenic distance problem is to exchange genes among the
chromosomes of G1 so that the target of each chromosome becomes a unique
single chromosome of G2.  Intuitively, the operation of a phone call for a
gossip problem is essentially a "reverse translocation": in a phone call,
two gossipers take their information sets A and B and exchange information
so that both know (A union B); in a translocation, two chromosomes with
targets contained in (A union B) exchange genes so that they have targets
A and B afterwards.

This intuition can be formalized to prove that 2n-4 moves are necessary
and sufficient to transform any n-chromosome genome G1 into any other
n-chromosome genome G2.

EXACT ALGORITHMS FOR SYNTENIC DISTANCE.  As with many rearrangement-based
genomic distance measures, exactly computing the syntenic distance has
been shown to be NP-Complete [1].  We would still like to be able to
exactly compute syntenic distances, especially in the interesting special
case when the distance d is small.  DasGupta et al [1] give an exact
algorithm that requires O(2^{O(d^2)}) time to compute the syntenic
distance.

Based upon a generalization of the gossip problem above --- where gossiper
i has a specified set of information S(i) that he wishes to learn --- we
are able to give a faster exact algorithm to compute syntenic distance,
requiring O(2^{O(d log d)}) time.  We define a collection of these
\emph{incomplete gossip} problems, and calculate the number of phone calls
necessary to inform each person i of information S(i) via brute force.
The tight connection with these gossip problems allows us to exactly
compute syntenic distance using these values.  Intuitively, the speed-up
in our algorithm is derived from the following.  The algorithm of DasGupta
et al essentially enumerates all possible sequences of transformations of
length d, and checks whether any of these sequences transform G1 into G2.
The vast majority of the time spent in this algorithm is on
translocations: for a translocation, we must select not only the input
chromosomes, but also for each gene g in either input chromosome we must
specify which output chromosome will contain g.  In the gossip-based
approach, we only need to select which people participate in each call.
Once we have selected the participants, both learn whatever new
information the other knows; there is no choice of different output sets.


[1] B. DasGupta, T. Jiang, S. Kannan, M. Li, and Z. Sweedyk.  On the
complexity and approximation of syntenic distance.  Discrete Applied
Mathematics, 88(1-3):59-82, November 1998.

[2] Vincent Ferretti, Joseph H. Nadeau, and David Sankoff.  Original
synteny.  In 7th Annual Symposium on Combinatorial Pattern Matching, pages
159-167, 1996.

[3] S. Hedetniemi, S. Hedetniemi, and A. Liestman.  A survey of gossiping
and broadcasting in communication networks.  Networks 18:319-349, 1988.



13.

DETECTING AND CHARACTERIZING FUNCTIONAL CHANGE IN PROTEINS: A GENOMIC 
PERSP ECTIVE

David A. Liberles

Department of Biochemistry and Stockholm Bioinformatics Center, 
Stockholm University, 10691 Stockholm, Sweden

adaptive evolution/phenotype-genotype-structure correlation

As biology enters the genomic era, new approaches to correlate 
genomic sequence, tertiary protein structure, and organismal phenotype 
are needed. A database (The Adaptive Evolution Database or TAED) of genes 
that appear to have undergone adaptive evolution or changes in function 
has been developed by comparing all chordate sequences found in Genbank.  
This analysis is based upon an estimation of the ratio of nonsynonymous 
to synonymous nucleotide substitution rates as a basis for detecting 
adaptivity or functional change.  

Cross-referencing of this database with other biological databases
allows the big picture of how proteins evolve functionally and structurally
and what this means for the organisms that contain them to emerge.  For 
example at a structural level, an automatable approach to mapping mutations 
onto tertiary structures is combined with a phylogenetic analysis of sequence
evolution.  This approach is applied first to a test case, 
deoxyribonucleoside kinase to explain changes in specificity from a structural
perspective and to relate them to the global picture of nucleoside metabolism 
and its evolution.  This approach is then expanded to every other example 
in TAED with a homologue within 180 PAM units in the Protein Databank (PDB).  

Other case studies of adaptive evolution that emerge from TAED are also 
presented.  The data presented here is a first step towards a comprehensive 
comparison of sequence-structure-function-phenotype analysis.  The future 
development of more refined methods for gene family characterization and 
more automated methods for the end point analysis of data will augment 
this type of approach.  The power of combining phylogenetic, biological, 
and structural approaches to the detection of adaptivity is demonstrated 
and will also become increasingly important as more metazoan genomes are
sequenced. 



14.

GENE DUPLICATION AND THE ORIGIN OF EVOLUTIONARY NOVELTIES
Michael Lynch, Department of Biology, University of Oregon, Eugene, OR

Gene duplication has generally been viewed as a necessary source of 
material for the origin of evolutionary novelties, but it is unclear how 
often gene duplicates arise and how frequently they evolve new functions. 
Preliminary answers to these questions can now be acquired from the 
information inherent in completely sequenced genomes. Observations for 
several eukaryotic species suggest that duplicate genes arise at a very 
high rate, on average 0.01/gene/million years. Most duplicated genes in 
eukaryotes experience a brief period of relaxed selection early in their 
history, with a moderate fraction of them evolving in an effectively 
neutral manner during this period. However, the vast majority of eukaryotic 
gene duplicates are silenced within a few million years, with the few 
survivors subsequently experiencing strong purifying selection. The 
situation appears to be rather different in prokaryotes, where the death 
rate of duplicates is elevated and few changes at the protein level are 
tolerated. Although it has frequently been argued that the only mechanism 
for the permanent preservation of a duplicate gene is neofunctionalization, 
there is little evidence that even a moderate fraction of such genes evolve 
new functions. Instead, many duplicate genes appear to become preserved by 
subfunctionalization, with each member of the pair losing a key function of 
the ancestral gene. By releasing genes from pleiotropic constraints, 
subfunctionalization may open up new evolutionary pathways, thereby 
promoting the evolution of novel phenotypes. In addition, the random 
silencing of duplicate genes results in microchromosomal rearrangements, 
giving rise to interspecies genomic incompatibilities, and thereby 
promoting the passive origin of new species.


15.
Strategy for the Multiple Whole Genome
Alignment of four strains of E.coli
using modified suffix arrays

Bob Mau, Aaron Darling, Nicole Perna, and Fred Blattner

Abstract

We present a time-and-space efficient algorithm that aligns large
sub-intervals of DNA sequence of closely related organism into segments
we call backbone. Backbone represents sequence that can be putatively
identified with the genome of the most recent common ancestor of the
aligned organisms.

The method is ideally suited for aligning a group of genomes possessing
large regions of high sequence similarity, punctuated with numerous instances
horizontally transferred sequence. Not surprisingly, this fits the profile
of Enterobacteriaceae, the primary focus of the Blattner Lab's sequencing
effort.Our methodology automatically identifies and places large relative
inversions, translocations, and inverted translocations. A recent implemen-
tation has successfully aligned four strains of E.coli: K-12 MG1655, K-12
W3110, O157:H7 EDL933(Enterohaemorraghic PEC), and CFT073(Uropathogenic EC).
The prototype software finished in under 5 CPU minutes and used less than
7 Megs of memory. The critical software and algorithmic constructs will be
described in some detail. Key is a novel representation of the genomic match
coordinates that partitions maximal matches into disjoint groups called
Canonical Maximal Exact Match(MEM) Equivalence Classes.

We sketch how to finish this "rough draft" into a full fledged multiple
alignment.Furthermore, we describe how some not-so-minor modifications will
allow extension of the basic method to include more divergent genomes.

16.

Comparing genomic DNA sequences: solved and unsolved problems
Webb Miller, Pennsylvania State University

Among the many uses of the forthcoming genomic sequence data from the mouse is
elucidation of the human genomic sequence.  The mouse sequence will be 
extremely
valuable for completing the catalog of all human genes, and its potential for
computational identification of other functional genomic regions, such as sites
regulating gene expression, is unparalleled.  A new World-Wide Web server,
called PipMaker (http://bio.cse.psu.edu), produces accurate alignments with
user-friendly visual overviews for any two long genomic DNA sequences.
Moreover, one of the two sequences being compared can be in "working draft"
form, and the same software can compare genomic sequences from other fairly
similar species, such as C. elegans and C. briggsae.  The following needs
remain to be met.  (1) Improved software that aligns two genomic sequences and
has a rigorous statistical basis.  (2) An industrial-strength gene prediction
system that effectively combines genomic sequence comparisons, intrinsic
sequence properties, and results from searching databases of proteins sequences
and ESTs. (3) Reliable and automatic software for aligning three or more
genomic sequences. (4) Better methods for displaying and browsing genomic
sequence alignments. (5) Improved datasets and protocols for evaluating the
correctness and performance of genomic alignment software.



17.

Joseph Nadeau, Case Western Reserve University School of Medicine
Genetic, expression and evolutionary mapping of a physiological pathway

Anomalies in folate and homocysteine metabolism are associated with neural
tube defects, cardiovascular disease, cancer,  neurological deficits, and a
variety of other birth defects and adult disorders.  The known mutated
genes in these pathways do not account for the frequency or variety of
consequences of disrupted folate and homocysteine metabolism.  We therefore
are taking genomic approaches to understand the genetic control of folate
and homocysteine metabolism. To identify mouse models for studying the
involvement of these pathways in various birth defects and adult diseases,
we documented significant variation in homocysteine levels among a panel of
inbred strains. We then mapped most of the genes that are directly involved
in folate and homocysteine metabolism, characterized the expression of
these genes in organs and tissues that are at risk in folate and
homocysteine metabolism, identified clusters of genes whose expression
levels change in systematic ways across these organs and tissues, examined
the evolutionary changes in the content of these metabolic pathways in
selected organisms whose genomes have been completely sequenced, and
described the patterns of DNA sequence variation for these genes in these
organisms. Each of these approaches has raised interesting analytical and
computional issues that are only partly resolved. This talk will therefore
identify some new research problems delaing primarily with physiological
pathways and will suggest some possible approaches to their resolution.



18.

Computational Functional genomics
Matteo Pelligrini, Protein Pathways Inc.

The availability of 75 fully sequenced genomes provides us new insights into
the function of many uncharacterized proteins.  We implement four methods,
Phylogenetic Profiles, Rosetta Stone, Gene Neighbor, and Operon
reconstruction, to predict the function of proteins.  These methods allow us
to link together functionally related proteins by searching for conserved
properties across genomes.  The full set of links for a complete genome is
captured in a protein functional network.  These networks are used to assign
protein functions and reconstruct metabolic pathways.



19.

Title: Large-scale genome duplications in the model plant Arabidopsis
and in the human genome

Steven L. Salzberg, Ph.D.
The Institute for Genomic Research

Our group has developed an efficient algorithm for comparison of whole
genomes and chromosomes, and using this system we have uncovered a
number of previously unobserved features of microbial genomes and
eukaryotic chromosomes.  The MUMmer system (Delcher et al., 1999) can
align DNA sequences of essentially any size, including entire
eukaryotic chromosomes, in just a few minutes on a powerful,
large-memory computer.  We have used this system to align all
microbial genomes against each other, and to align all the chromosomes
of Arabidopsis thaliana, Drosophila melanogaster, and most recently of
the human genome.  Two surprising discoveries have emerged recently.
The first is that the genomes of a number of bacteria can be aligned
in both directions: that is, one of the genomes is globally similar at
the DNA level to both the forward and the reverse strand of the other.
When both alignments are displayed together, an X-shaped pattern is
clearly visible.  We have discovered X-alignments between the genomes
of E. coli and V. cholerae, S. pyogenes and S. pneumoniae, 
M. tuberculosis and M. leprae, and others.  The likely explanation for this
phenomenon is much more frequent large-scale chromosome reversals than
have previously been suspected.

Second, our large-scale alignments of human and Arabidopsis have
revealed that large-scale chromosome duplications appear to be very
common evolutionary events.  In Arabidopsis, the duplications cover
approximately 60% of the genome in just 24 large, non-overlapping
blocks (The Arabidopsis Genome Initiative, 2000).  Our detailed analysis
of the data reveals the strong likelihood that all blocks were
duplicated simultaneously, lending support to the theory of a
whole-genome duplication.  Extensive gene loss followed this
duplication, leaving only about 25-30% of the genes with paralogs in
both copies of each duplicated segment.  In the human genome, our
analysis (Venter et al., 2001) reveals a large number of 
segmental duplications of varying ages.  This talk will present an
overview of this analysis with suggestions for further research
necessary to untangle this highly complex picture.

A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson,
O. White, and S.L. Salzberg (1999).  Alignment of whole genomes.
Nucleic Acids Research, 27:11, 2369-2376.

The Arabidopsis Genome Initiative (2000).  Analysis
of the genome sequence of the flowering plant Arabidopsis
thaliana.  Nature 408, 796-815.

J.C. Venter et al. (2001).  The sequence of the human
genome.  Science, in press.



20.

David Sankoff, University of Montreal
Conserved Clusters Versus Conserved Segments

Two independent sets of recent observations on the sequences of small
genomes pertain to the prevalence of short inversion as a gene order
rearrangement process and to the lack of conservation of gene order
within conserved gene clusters (in contrast to the familiar patterns of
conserved segments in the comparison of higher eukaryotes). We propose
a model of inversion where the key parameter is the length of the inverted
fragment.  We show that there is a qualitative difference in the pattern of
evolution when the inversion length is small with respect to the cluster
size and when it is large. This suggests an explanation of the lack of
parallel gene order in conserved clusters and raises questions about the
statistical validity of putative functionally selected gene clusters if these
have only been tested against inappropriate null hypotheses.



21.

Title: Detection of related genes in prokaryotes using syntenic regions

Authors: Nalvo F. de Almeida Jr.(*) and Joao C. Setubal(**)

(*) Department of Computer Science and Statistics
    Federal University of Mato Grosso do Sul, Brazil
    nalvo@dct.ufms.br

(**) Institute of Computing, University of Campinas, Brazil
     setubal@ic.unicamp.br

Extended abstract

In this work we explore the synteny between prokaryotic genomes. One
of our goals is to obtain clues for the function of hypothetical
genes; another is to gain a better understanding of changes in gene
organization between different species. Our method is based on a
careful pairwise comparison of genes in whole genomes. 

We first build a binary similarity matrix M based on BLAST [1]
e-values of genome G genes against genome H genes and vice versa (we
use default BLAST parameter values, but do not use the low-complexity
region filter). An entry m_{gh} in M is 1, for gene g in G and gene h
in H, if and only if the e-values resulting from having g as query
against h, and h as query against g, are both <= e-5.  We refer to
each m_{gh} != 0 as a _match_. We then look for physically (in each
genome) consecutive matches (called _runs_) and for runs that are
within k of each other, where k measures the maximum number of
intervening genes in one genome, or in the other, or in both. Runs are
found by a simple scanning of match lists. Note that there are two
types of runs: parallel (both sets of genes are in increasing order in
their respective genomes) and anti-parallel (one is increasing and the
other is decreasing). Note also that a run is more general than an
operon. In an operon, a series of consecutive genes is transcribed as
a unit, and hence must all be on the same strand. Our concept of run
allows consecutive genes in one or in the other genome to be in
different strands. When we find runs of the same type that are within
k and such that the flanking genes in each run have consistent strands
we join them. This join operation adds the intervening genes in both
genomes to the run. We use the term _clusters_ to designate joined
runs.

We designate a pair of genes (g,h), g in G and h in H, inside a
cluster, such that nor g nor h belongs to a run, as a candidate
related pair (CRP). Some of these pairs will be genes that are
evolutionary related, but whose homology cannot be detected by the
usual similarity methods. Some of the criteria that can be used to build
the case that the genes g and h in a CRP are indeed related are:

- Relative size. One would expect that functionally related genes have
  approximately the same size (the ratio of the smaller to the larger
  being no less than, say, 80%), but this is by no means a general
  rule.

- Consistent strands. The joining of runs impose a constraint on the
  strands of genes belonging to the cluster. In a parallel run, the
  strands of each pair (g,h) belonging to the run must match. In an
  anti-parallel run, the strands must be of opposite sign (taking one
  strand as + and the other as -). We would expect the same
  consistency to occur in CRPs that are evolutionary related, but
  again there may exceptions to this.

We have obtained two other kinds of comparative information using our
method. The first is related to fusion and fission of genes. A
candidate gene fusion event occurs when gene i in G has matches to
genes j and j+1 in H. Since our definition of match implies
reciprocity, genes j and j+1 are a candidate gene fission event. We
again use the word "candidate" to stress the fact that each of these
events must be carefully analyzed to determine whether they
correspond to actual biological events.

A second kind of information is a generalization of fusion/fission
using the concept of runs. We can detect the following kind of event:
given two parallel runs A = ((g_i,h_j),(g_i+1,h_j+1),...,(g_k,h_l))
and B = ((g_p,h_q),(g_p+1,h_q+1),...,(g_r,h_s)), it is the case that
p = k+1 but q > l+1. In other words there is a consecutive series of
genes in G that is split in two in H. An analogous definition can be
given for two anti-parallel runs. We term such events as candidate run
fusion events (from the point of view of G) and candidate run fission
events (from the point of view of H).

We present results of using the methodology described above on the
publicly available genomes Xylella fastidiosa (XF), Escherichia coli
(EC) and Pseudomonas aeruginosa (PA). We also use the current version
of the Xanthomonas axonopodis pv citri (XC) genome, which is now being
finished by a consortium of labs in Brazil. A summary of the results
we obtained follows in the tables below.

Table 1: Matches, runs, and clusters (using k = 2)
--------------------------------------------------
organisms   matches   runs   clusters
-------------------------------------
XF vs. XC     5759     534      11
XF vs. PA     7750     544       7
XF vs. EC     5213     362       5
XC vs. PA    23300    1563      13
XC vs. EC    13154     908       9
PA vs. EC    23630    1410       7
------------------------------------

(The clusters reported only include those for which there are CRPs
such that the smaller gene in the pair is at least 20% in size of the
larger gene in the pair.)

Table 2: Hypothetical genes that are members of CRPs
----------------------------------------------------
organism    # of hypo. genes
----------------------------
EC                 7
PA                16
XC                19
XF                19
----------------------------
total             61
----------------------------

Of the 61 genes listed above, 12 appeared in more than one pairwise
genome comparison.


Table 3: Candidate gene fusion/fission events
------------------------------------------------------
G      H     fus in G (fis in H)   fis in G (fus in H)
------------------------------------------------------
XF vs. XC        125                     257
XF vs. PA        191                     277
XF vs. EC        145                     220
XC vs. PA       1088                     803
XC vs. EC        527                     489
PA vs. EC        914                     828
------------------------------------------------------

Table 4: Candidate run fission (in G) / fusion (in H) events.
T is the number of genes between the two series of consecutive
genes in G.
-------------------------------------------
G      H    #events   mean T  max T   min T
-------------------------------------------
XF vs. XC     53       611.2   2530     1
XF vs. PA     47       800.3   2343     1
XF vs. EC     21       746.4   2760     1
XC vs. PA    278      1365.3   3870     1
XC vs. EC     57       997.3   4259     1
PA vs. EC    211      1668.6   5557     1
-------------------------------------------

Table 5: Candidate run fusion (in G) / fission (in H) events.
T is the number of genes between the two series of consecutive
genes in H.
-------------------------------------------
G      H    #events   mean T  max T   min T
-------------------------------------------
XF vs. XC     80       829.8   3692    1
XF vs. PA     69      1637.1   4942    1
XF vs. EC     36       979.4   3571    1
XC vs. PA    505      1724.0   5347    1
XC vs. EC    166      1449.1   4246    1
PA vs. EC    214      1441.9   4140    1
-------------------------------------------

These results are currently being analyzed. In particular we expect to
determine more candidate related pairs by inspecting the run
fusion/fission events as well as runs that have isolated matches close
by.

Previous works related to the results presented here are referenced below.

[1] S. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of
   protein database search programs. Nucleic Acids Research,
   25:3389--3402, 1997.

[2] W. Fujibuchi, H. Ogata, H. Matsuda, and M. Kanehisa.
   Automatic detection of conserved gene clusters in multiple genomes by
   graph comparison and P-quasi grouping. Nucleic Acids Research,
   28:4029--4036, 2000.

[3] M. Y. Galperin and E. V. Koonin. Who's your neighbor? New
  computational approaches for functional genomics. Nature
  Biotechnology, 18:609--613, 2000.

[4] H. Ogata, W. Fujibuchi, S. Goto, and M. Kanehisa. A heuristic
  graph comparison algorithm and its application to detect
  functionally related enzyme clusters. Nucleic Acids Research,
  28:4021--4028, 2000.

[5] R. Overbeek, M. Fonstein, M. D'Souza, G. Pusch, and N. Maltsev.
  The use of gene clusters to infer functional coupling. PNAS,
  96:2896--2901, 1999.



22.

Computational Identification of Distal Regulatory Elements in the Human
Genome

David J. States, M.D., Ph.D, Washington University School of Medicine

The control of eukaryotic gene expression is mediated by the cooperative
action of multi-protein complexes binding to genomic DNA.  Identifying these
regulatory elements computationally is a challenge because they may be
located tens or even hundreds of kilobases from a gene.  A statistical
approach for assessing the significance of transcription factor binding site
clusters is described.  Clusters achieve high statistical significance,
correlate with evolutionary conservation of intergenic sequence and
biochemical markers of regulatory regions.  

Most CpG dinucleotides in vertebrate genomes are methylated.  Islands of
hypomethylated CpG containing DNA are hypothesized to arise through
protection of DNA from the action of methylases by sequence specific DNA
binding proteins.  Using human genomic sequences that are hypomethylated in
vivo, a fifth order Markov model is developed to recognize hypomethylated
like DNA sequences as a second approach to finding candidate regulatory
regions.  This Markov model is superior to traditional CpG island
definitions in identifying hypomethylated sequences.




23.

Computational Comparative Genomics

Clemens Suter-Crazzolara, Guenther Kurapkat & Deborah Riley
LION bioscience,
Waldhoferstr. 98
69123 Heidelberg
Germany
Contact: suter@lionbioscience.com

With the advent of world wide genome projects, scientists are confronted
with a rapid increase in the amount of novel sequence information. Both
the sizes and the number of biological databases grow exponentially.
However, the gap between data collection and interpretation is also
growing. For this reason, genome databases contain a wealth of
information which is accessible, but which may remain obscured. As a
result, across all sequenced species, nearly half of the potential genes
can not be assigned specific roles. Intelligent systems are needed to
bridge the widening gap between data collection and interpretation.
To interpret data from newly sequenced pro- and eukaryotic genomes,
comparative genomics is rapidly gaining importance, as genomes of
distant organisms may still encode proteins with high sequence
similarity. The order of genes (colinearity) in genomes is also be
conserved to some extend.

We have employed both these observations to create a versatile
computational analysis system (genomeSCOUT), which allows for rapid
identification and functional characterization of genes and proteins
through genome comparison. With a number of independent methods,
information about different levels of protein homology (concerning e.g.
paralogs, orthologs and clusters of orthologous groups, COGs) and gene
order is collected and stored in several value added databases. These
databases are then used for interactive comparison of genomes and
subsequent analysis. Genomes from in house sequencing projects or public
sources can efficiently be added to the system. The application is based
on the well established data integration system SRS (Etzold, T. et al.,
1996. Methods Enzymol. 266: 114-128). 

This ensures: 
1. Fast handling of large genomic data sets. The most complex queries 
give results instantly. 

2. Straightforward access to a multitude of biological
databases. Currently SRS allows the seamless integration of 400
biological databases. 

3. Unique, SRS-based linking functions between all
databases result in the accumulation of a wealth of biological
information. This ensures optimal data mining. 

4. Highly efficient collection of information on genes and proteins.

5. High flexibility allows effortless integration of additional 
public or proprietary databanks and applications such as CLUSTALW 
or BLAST.

6. Fully integrated and user friendly graphical representations of search
results. This application can be used for projects spanning from the
identification of drug targets or the optimization of microorganisms for
biotechnology, to the elucidation of pathogenic principles, to in depth
characterization and encyclopedic annotation of completed or partial
genomes. Currently, the software is extended to allow the analysis of
eukaryotic genomes (http://www.lionbioscience.com/)



24.

Structure and dynamics of transcriptional regulatory networks
in the yeast S. cerevisiae

Saeed Tavazoie, Princeton University

I will present computational methods for identifying regulatory
interactions within the transcriptional network of the yeast
S. cerevisiae.  We have used classification algorithms, such  
as partitional clustering, to discover intrinsic expression 
patterns from whole-genome DNA chip data.  A motif discovery  
algorithm based on Gibbs-sampling is then used to find significant
motifs which are common to each expression pattern.  Our approach 
identifies many known and putative novel cis=AD-regulatory elements. 
A major challenge now is to understand how these elements implement 
complex combinatorial regulatory programs.



25.

Comparative genome analysis and molecular evolution

Elizabeth Tillier, Centre for Clinical Genomics

     In many bacterial genomes, the leading and lagging strands have 
different skews in base composition; for example, an excess of guanosine 
compared to cytosine on the leading strand. Several factors contribute to 
these skews since they correlate with both the transcription direction of 
genes and with their replication direction. The molecular reasons for base 
composition skews attributable solely to replication direction are
not known, but are very strong in some genomes. In particular, we find that 
Chlamydia genes that have switched their orientation relative to the 
direction of replication, for example by inversion, acquire the skewed base 
composition of their new "host" strand. Replication-related skews reflect a 
directional evolutionary force that causes predictable changes in the base 
composition of switched genes resulting in increased DNA and amino acid 
sequence divergence.

     Additionally, comparisons between related genomes reveal that gene 
order does not appear to deteriorate randomly, but rather by a series of
imperfectly symmetrical translocations across an axis through the origin and 
termination of replication. Homologous recombination, transposable elements 
and illegitimate recombination could all have played a part in the genome 
rearrangements. Irrespective of the exact processes involved, the observation 
that almost all can be explained by a series of recombination events across 
the replication axis indicates that replication plays a central role in targeting 
the position of a translocation. These observations suggest that 
replication plays a major role in directing genome evolution.



26.

Optimal Spliced Alignment

Wei Zhu, Iowa State University

Supplementary cDNA or EST evidence is often decisive for discriminating
between alternative gene predictions derived from computational sequence
inspection by any of a number of requisite programs. Without additional
experimental effort, this approach must rely on the occurrence of
cognate
ESTs for the gene under consideration in available, generally
incomplete, EST collections for the given species.  In some cases,
particular exon assignments can be supported by sequence matching even
if the cDNA or EST is produced from non-cognate genomic DNA, including
different loci of a gene family or homologous loci from different
species.  However, marginally significant sequence matching alone can
also be misleading.  We designed a novel algorithm that produces the
optimal spliced alignment of a genomic DNA with a cDNA or EST based on
scoring for both sequence matching and intrinsic splice site strength.
By example, we demonstrate that this combined approach appears to
improve gene prediction accuracy compared with current methods that rely
only on either search by content and signal or on sequence similarity .A
similar algorithm is available for spliced alignment of genomic
DNA with protein sequences.

For the case that several ESTs give overlapping alignments, we developed
a post-processing procedure to assemble those sequences into a consensus
alignment. This has several advantages: (1)  more reliable gene
prediction; (2) reduced redundancy; (3) identification of alternative 
splicing. A JAVA-based GUI allows convenient viewing of the data and 
results. I will discuss examples of genome annotation of C. elegans and 
D. melanogaster.

REFERENCES
1. Usuka, J., Zhu, W., and Brendel, V. (2000) Optimal spliced alignment
of homologous cDNA to a genomic DNA template.  Bioinformatics 16,
203-211

2. Usuka, J. and Brendel, V. (2000) Gene Structure Predicted by Spliced
Alignment of Genomic DNA with Protein Sequences: Increased Accuracy by
Different Splice Site Scoring.  J. Mol. Biol. 297, 1075-1085.
Previous: Participation
Next: Registration
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on February 19, 2001.
DIMACS Workshop on Whole Genome Comparison

February 28 - March 2, 2001 DIMACS Center, Rutgers University, Piscataway, NJ

Abstracts:

February 28 - March 2, 2001
DIMACS Center, Rutgers University, Piscataway, NJ