DIMACS Workshop on Whole Genome Comparison

February 28 - March 2, 2001
DIMACS Center, Rutgers University, Piscataway, NJ

Dannie Durand, Carnegie Mellon University, durand@cmu.edu
Program Committee
Joseph H. Nadeau, Case Western Reserve University School of Medicine, jhn4@po.cwru.edu
Steven L. Salzberg, The Institute for Genomic Research, salzberg@tigr.org
David Sankoff, University of Montreal, sankoff@ere.umontreal.ca
Presented under the auspices of the Special Year on Computational Molecular Biology.



High-Performance Algorithm Engineering for Gene-Order Phylogenies

David A. Bader, Bernard M.E. Moret, Tandy Warnow, Stacia K. Wyman, and Mi Yan
University of New Mexico and University of Texas at Austin

email: dbader@eece.unm.edu
URL:   http://www.eece.unm.edu/~dbader


    Some organisms have a single chromosome or contain
single-chromosome organelles (mitochondria or chloroplasts), the
evolution of which is mostly independent of the evolution of the
nuclear genome.  Given a particular strand from a single chromosome
(whether linear or circular), we can infer the ordering of the genes
along with the directionality of the genes, thus representing each
chromosome by an ordering of oriented genes.  The evolutionary process
that operates on the chromosome may include inversions and
transpositions, which change the order in which genes occur in the
genome as well as their orientation; and also insertions, deletions,
or duplications, which change the number of times and the positions in
which a gene occurs.

    A natural optimization problem for phylogeny reconstruction from
such data attempts to reconstruct a tree with a minimum number of
permitted evolutionary events.  Versions of this problem are all known
or conjectured to be NP-hard.  Another approach is to estimate
leaf-to-leaf distances between all genomes and then use a standard
distance-based method (such as neighbor-joining) to construct the
tree.  Such approaches are fast and may prove valuable in
reconstructing the underlying tree, but cannot recover the ancestral
gene orders.

    Sankoff and his colleagues developed a technique, breakpoint
analysis, for reconstructing a phylogeny based on gene-order data and
implemented it in the code `BPAnalysis'.  Within a framework that
enumerates all trees, BPAnalysis uses an iterative heuristic to label
the internal nodes with signed gene orders.  The outer loop enumerates
all (2n-5)!! leaf-labelled trees on `n' leaves, an exponentially large
value.  The inner loop runs an unknown number of iterations (until
convergence), with each iteration solving an instance of the NP-hard
Traveling Salesperson problem (TSP) at each internal node.  Our
simulation studies suggested that the approach worked well for many
datasets, but that the BPAnalysis software was too slow.  For
instance, we estimated that BPAnalysis would require several centuries
of computation on a typical workstation in order to solve a collection
of chloroplast data from 12 species of Campanulaceae (plus one
outgroup), each genome consisting of 105 gene segments.

    We therefore decided to produce a new software suite, `GRAPPA'
(Genome Rearrangements Analysis under Parsimony and other Phylogenetic
Algorithms), based on the approach of Blanchette and Sankoff, but
easily extensible to other search strategies and using
high-performance algorithm engineering to make it fast enough to be
useful.  High-performance algorithm engineering is a combination of
low-level algorithmic improvements, data structures improvements, and
coding strategies that combine to eliminate bottlenecks in the code,
balance its computational tasks, and make it cache-sensitive.  We also
used massive parallelism to gain an additional speedup.  The code for
GRAPPA can be obtained from http://www.cs.unm.edu/~moret/GRAPPA/ and
is freely available under the GPL.

Algorithm Engineering.

    We improved bounding, both within the exact TSP solver and at the
level of tree enumeration---the latter by considering circular
orderings rather than trees.  We condensed genomes---that is, we
grouped subsequences of gene fragments that appear in all genomes into
a single `supergene,' thereby considerably reducing the size of
instances.  This condensing is done initially in a static pass and
then dynamically after each alteration of the set of candidate labels.
We scored trees incrementally in constant time.  We took advantage of
the special nature of the instances of the TSP generated in the
reduction to speed up the exact solution of these instances.  We paid
special attention to the initialization phase, in which an initial
label is assigned to each node of the current tree topology.

    Algorithm engineering suggests a refinement cycle in which the
behavior of the current implementation is studied in order to identify
problem areas, which are then eliminated through algorithmic or coding
changes.  Constant profiling is the key to such an approach, because
the identity of the principal `culprits' in time consumption changes
after each improvement, so that attention must shift to different
parts of the code during the process---including revisiting already
improved code for further improvements.

    The original BPAnalysis has a significant memory footprint and
poor locality: it requires over 60MB of space and 12MB for its working
set when running on our Campanulaceae dataset.  GRAPPA has a tiny
memory footprint (1.8MB on the same dataset) and mostly good locality
(nearly all of our storage is in arrays preallocated in the main
routine), which enables it to run almost completely in cache (the
working set size is 600KB).  Cache locality can be improved by
returning to a FORTRAN-style of programming, in which records
(structures/classes) are avoided in favor of separate arrays, in which
simple iterative loops that traverse an array linearly are preferred
over pointer dereferencing, in which code is replicated to process
each array separately, etc.


    GRAPPA is highly flexible and includes various TSP solvers.  In
order to estimate the raw speed of each method, we ran all methods on
real and synthetic datasets, letting them process thousands to
hundreds of thousands of trees until a fixed time bound was attained.
The high processing rate of our exact solver (we have observed rates
from 100 to 5,000 times faster than BPAnalysis) makes it possible to
solve problems with 10--12 genomes on a single processor within
minutes to hours.  We then parallelized the code and used LosLobos,
the 512-processor Linux supercluster at The University of New Mexico's
Albuquerque High-Performance Computing Center, to achieve a
million-fold speedup on the Campanulaceae dataset: a complete
breakpoint analysis (with inversion distances) for the 13 genomes in
that set ran in less than 1.5 hours!

Ongoing and Future Developments.

    Our current implementation also permits (in the same running time)
a search for the minimum inversion phylogeny. We are extending GRAPPA
to be able to handle a larger variety of events (deletions,
insertions, duplications, transpositions, etc.).  We have also
developed new algorithmic strategies for solving the median phylogeny,
based upon fixed-parameter approaches in which we bound the number of
events on each edge.  Our talk will present these new developments, as
well as describe in detail the algorithmic engineering used in
designing and improving GRAPPA.  We will also show how the techniques
we used in making the code efficient and in parallelizing it are
applicable to a wide variety of algorithmic approaches in phylogeny
and other areas of computational biology. Finally, we will give a
short overview of the applicability of high-performance computing to
computational biology.

2. Human and Mouse Gene Structure: Finding Genes by Cross-Species Sequence Comparison Serafim Batzoglou Whitehead Institute Abstract: A fundamental task in analyzing genomes is to identify the genes. This is relatively straightforward for organisms with compact genomes (such as bacteria, and yeast) because exons tend to be large and the introns are either non-existent, or short. The challenge is much greater for larger genomes (such as those of mammals) because the exonic "signal" is scattered in a vast sea of non-genic "noise". We explore a powerful new approach to gene recognition, by using cross-species comparison, that is, by simultaneously analyzing homologous loci from two related species. Cross-species sequence comparison can help highlit important functional elements such as exons, because such elements tend to be more strongly conserved by evolution than random genomic sequence. We focus on the ability to accurately identify coding exons by comparison of syntenic human and mouse genomic sequence. We first undertake a systematic comparison of the genomic structure of 117 orthologous gene pairs from human and mouse, to understand the extent of conservetion ofthe number, length, and sequence of exons and introns. We then use these results to develop algorithms for cross-species gene recognition, consisting of GLASS, a new alignment program to provide global alignments of large genomic regions by using a hierarchical approach, and ROSETTA, a program to identify coding exons in both species based on the coincidence of genomic structure.
3. The Central Role of DNA Structural Properties in Genomic Regulation Craig J. Benham Department of Biomathematical Sciences Mount Sinai School of Medicine New York, NY Standard methods to computationally predict the locations of regulatory regions within DNA sequences commonly search for consensus motifs. Yet several classes of DNA regulatory regions have no associated consensus sequence. And even in cases where a clear consensus sequence is known, its presence may be necessary, but not sufficient, for function. So searches based on such a consensus commonly have unacceptably high false positive rates. Further, string-based methods do not illuminate the mechanisms of regulation. In many cases other attributes, not readily discernible from sequence character strings, are also required for activity. This talk will consider one such effect - the extent to which the B-form helical duplex structure of DNA is destabilized by the imposed stresses that are stringently regulated in vivo. Both formally exact and more efficient but approximate statistical mechanical methods have been developed to analyze the stress-induced destabilization of duplex DNA sequences. This phenomenon is complicated to analyze because the imposed stress globally couples together the transition behaviors of all the base pairs experiencing it. When applied to the analysis of specific DNA sequences, the predictions of these methods agree in a quantitatively precise manner with experimental measurements of the locations and extents of local strand separations. This high degree of accuracy is achieved even though there are no free parameters to be fit. This quantitative precision justifies their use to predict the duplex destabilization properties of other DNA base sequences, on which experiments have not been performed. These methods have recently been extended to analyze long genomic sequences, including complete chromosomes, In this talk I will describe the computational method, and present some results found in the analysis of the complete E. coli chromosome. The sites of predicted duplex destabilization within this genome are closely associated with specific DNA transcriptional regulatory elements. The associations between predicted destabilized sites and two classes of putative promoters have been assessed. These are the set of experimentally characterized promoter, and the set of divergent ORFs. The associations that would be expected at random also have been determined. In both cases the association with actual promoters is much stronger than would be expected at random, often more than 40 standard deviations above the mean for random association. In a collaboration with Dr. Wes Hatfield (UC-Irvine) we are investigating the IHF-regulated ilvPG promoter. We have shown that it is activated by a mechanism involving the transmission of stress-induced destabilization. The theoretical analysis of all other known IHF-regulated genes finds that they all have destabilization properties that suggest that they also could be regulated by the same mechanism. Moreover, a complete analysis of the entire genome finds a total of 125 ORFs with the catenation of properties needed for this mechanism. These results are currently being experimentally tested using expression arrays. This is the first collaborative investigation of a global mechanism of regulation. This work shows that stress-induced strand separation may play central roles in the initiation of gene expression. Indeed, recent results suggest that it may be among the most archaic regulators of this essential biological process. Speculations will be presented regarding the reasons for this importance. The results obtained to date suggest that accurate new computational methods can be developed to search DNA sequences for specific classes of regulatory regions based on their characteristic destabilization properties. The progress will be described on integrating these destabilization methods into a promoter-prediction algorithm.
4. Title: Analysis of large-scale genome rearrangement and duplication in real sequence data: experience with Arabidopsis thaliana and the draft human sequence Speaker: Daniel G. Brown Whitehead Institute / MIT Center for Genome Research and Department of Computer Science, University of Waterloo Abstract: The identification of large-scale evolutionary events such as whole-genome duplication and genome rearrangement can be difficult long after they occur. This problem is particularly worrisome in the case of whole-genome duplication, as one copy of most duplicated genes become pseudogenes or diverge substantially. The problem becomes more confused when multiple rounds of duplication are a possibility, and yet more complicated as we analyze genomes for which gene annotation data is uncertain or incomplete, such as the draft human genome sequence, and are made still more complicated by tandem duplications, difficulties identifying which genes are actually paralogous, and by a host of other sources of error which are not a problem in theoretical models. We discuss our experiences studying genome rearrangement in 90% of the genome of Arabidopsis thaliana and in the draft sequence of the human genome. Using methods we have developed to resolve most of these difficulties, we have been able to identify multiple rounds of likely duplication in the Arabidopsis genome, and have approximately dated the time of these duplications. In the human genome, the uncertain annotation and difficulty in knowing the exact order of the draft contigs has made things much more complicated, and we have developed new methods to deal with this data set; here, we present our preliminary results looking for duplicated gene sets. Joint work with Todd Vision and Steve Tanksley (Arabidopsis); and with Tarjei Mikkelsen, Serafim Batzoglou, Simon Kasif and Eric Lander (human genome)
5. Title: "Large-scale sequence comparison by locality-sensitive hashing" Speaker: Jeremy Buhler, University of Washington Abstract: In annotating very long genomic DNA sequences, it is useful to compare these sequences to find high-scoring pairwise local alignments, which often represent conserved biological features. Comparisons may be performed either between two or more long sequences, such as orthologous stretches of human and mouse DNA, or within one sequence, as when finding repetitive elements in a chromosome. Typically, algorithms for discovering local alignments on a large scale rely on the presence of sufficiently long runs of matching nucleotides in any alignment of interest. We describe a new randomized algorithm, LSH-ALL-PAIRS, that lifts this restriction by directly finding ungapped alignments with up to a specified fraction of substitutions, regardless of whether they contain a long exact substring match. Our algorithm adapts the technique of locality-sensitive hashing, previously used to solve high-dimensional computational geometry problems, to compute local alignments. Our implementation of LSH-ALL-PAIRS scales to sequences tens of megabases in length while discovering similarities with as little as 60-70\% nucleotide identity. The algorithm can be effectively parallelized by iterating its core randomized trial simultaneously on multiple processors.
6. The Space of RNA Encodings of a Target Protein Barry Cohen, State University of New York at Stony Brook This talk describes work on the space of RNA encodings for a target protein. This question is of interest from two angles. First, it sheds light on evolutionary mechanisms at the molecular level, which have been at work for billions of years. We ask whether evolution through natural selection, the familiar force which over the course of many generations shapes species and adapts them to their environment, also adapts and shapes RNA molecules? This speaks to our curiosity about the forces which mold living things. Second, this work is of practical importance. To the extent that we identify characteristics of particular RNA molecules which nature selects for or against, we may modify those characteristics to achieve desired changes in an organism. The desired changes could have medical or agricultural or industrial significance. RNA and protein molecules are both linear sequences, or strings, over a small alphabet. Each set of three consecutive letters in an RNA sequence (called a codon) codes for a letter of protein. Each letter in the protein alphabet has, on average, three possible precursor codons. There are, therefore, exponentially many RNA sequences which code for a given target protein. This large set of RNA sequences are the protein's RNA ``space." These spaces are not only theoretical and virtual. Nature actively explores the space of possible RNA encodings for each protein through random mutations within a population over evolutionary time. Why, over the course of eons of evolution, has nature chosen one of these possible encodings? Has it, indeed, ``chosen," or is the encoding we see simply a snapshot of a particular moment in a random walk through this space, without functional significance? We conjecture that the shape and stability of an RNA molecule enhance or impede the chemical machinery which translates it into a protein, and so impact on the chances of the organism for survival and reproduction. Accordingly, we expect the stability of RNA molecules found in nature to differ from those of randomly composed RNA molecules from the same space. We present the following results: Extensive simulations showing that naturally occurring RNA coding sequences differ systematically from randomly composed sequences in the same space, providing evidence for the conjecture that they are shaped by natural selection. An efficient algorithm to find the most stable RNA sequence in a ``space": this dynamic programming algorithm provides an exact solution in $O(n^3)$ time and $O(n^2)$ space. A proof that the corresponding least-stable RNA sequence problem is NP-complete. But various heuristics show promise of giving reasonably good answers.
7. Computational, Spatial Analysis of Gene Duplication in the Mouse Genome Dannie Durand, Carnegie Mellon University Duplication of large genomic regions, ranging from chromosomal segments to the entire genome, is believed to have played a crucial role in developmental innovation in vertebrates. While many isolated examples of evolution through gene duplication in mammals exist, the large-scale events - duplication, rearrangement and gene loss - that led to the spatial organization of modern mammalian genomes are poorly understood. I describe a computational, genome-wide approach to finding duplicated genes in mouse and discuss methods for reconstructing the history of large-scale duplications.
8. Automating Comparative Map Construction Debra Goldberg, Center for Applied Mathematics, Cornell University, Susan McCouch, Department of Plant Breeding, Cornell University, Jon Kleinberg, Department of Computer Science, Cornell University Comparative maps are a powerful tool for aggregating genetic information about related organisms, for predicting the location of orthologous genes, for understanding chromosome evolution and inferring phylogenetic relationships and for examining hypotheses about the evolution of gene families and gene function in diverse organisms. Construction of any genetic map is laborious, but compiling comparative maps across multiple species requires a large investment of manual effort on the part of biologists. Additionally, biologists from different labs often use differing rules which are often not quantified or made precise in the literature, making it difficult to understand how the various comparative maps relate to one another. We have formalized the concept of a comparative map in a way which is broadly applicable. One of the primary goals of comparative mapping efforts is taken to be the identification of common ancestral linkage groups, and consider homeologous (syntenic) segments to be modern remnants of such common ancestral segments. Another goal is to reduce volumes of data into a form more easily used and understood. To achieve these goals, we offer a rigorous yet simple definition for the process of constructing comparative maps that is based on fundamental principles. The first fundamental principle is the comparative map should be accurate, in that the data should be explained well by the map. Second, it should represent a high-level global view of the relationships between the two genomes; in other words, there should be relatively few distinct segments in the labeling, so that a large volume of marker data can be distilled into a concise representation from which further hypotheses can be made at a global level. We call this the parsimony criterion. Consistent with providing a representation at a global scale of resolution, the labeling need not ``explain'' the presence of every marker. Some homologous markers are the result of microevolutionary events which are not indicative of common ancestral linkage groups, some are the results of errors or ambiguities in the data, and others indeed represent evidence of common ancestral linkage groups, but we do not assign too much weight to any single piece of evidence. From this we have developed robust, efficient algorithms which automate the identification of homeologous (syntenic) segments and construction of comparative maps. Our algorithms find an optimal balance of parsimony and accuracy using dynamic programming techniques. We begin with a simple linear model, but discover that it implicitly notices (and penalizes for) certain long-range dependencies which are biologically important. We solve this problem with a stack model, which explicitly handles these long-range dependencies in a biologically intuitive manner. We have incorporated this algorithm into a tool which is now available in a preliminary version. For input, it requires the positions of the markers of one species, as well as the location of homologs to each marker in the second species. Output is given both graphically and in text form. Only a single parameter is required, which carries a simple biological explanation. Our program allows comparative maps to be constructed in a few minutes. Results have been evaluated for diverse pairs of species, and closely approximate prior manual expert analyses.
9. Amino Acid Substitution Matrices for Compositionally Biased Bacterial Genomes Susan Abramski (1), Marek Kimmel (1) and George Weinstock (2) (1) Rice University, Houston, TX (2) University of Texas at Houston and Baylor College of Medicine, Houston, TX Introduction. Amino acid substitution matrices are tools used to determine the most likely way a protein has or might evolve and to determine if two proteins might be evolutionary relatives. Substitution matrices can reveal similarities that indicate a relationship between two proteins that were thought to be unconnected. Such relationships are particularly important in the construction of phylogenetic trees and in understanding the selection pressures that organisms have adapted to. The BLOSUM matrices (Hennikoff and Henikoff 1991) are among the most widely used matrices in protein comparison applications. BLOSUM matrices are presented as series of matrices, indexed by integer numbers, with each matrix of the series appropriate for a different evolutionary distance. Recently, a number of bacterial genomes were completely sequenced. This allows constructing substitution matrices for this group of organisms, and even for particular subgroups of bacteria. One of the characteristics of bacterial genomes is a frequent compositional bias of DNA. The purpose of this study was to explore how different are the substitution matrices based on bacterial genomes with different DNA compositional biases. The outcome may impact the efficiency with which homologous sequences in different bacterial species are found, as well as allow some insight into bacterial molecular evolution. Methods. The BLOSUM matrices in this experiment were designed to account for DNA compositional bias. If a genome contains more than 50 percent guanine (G) and cytosine (C), it is GCbiased (or GCrich). If it contains more than 50 percent adenine (A) and thymine (T), it is AT biased. Matrices were constructed in much the same way as the original Henikoff BLOSUM matrices, but instead of producing just one series intended for all organisms regardless of compositional bias, two series of matrices were constructed. One series was in tended to model the amino acid mutations that take place in GC rich organisms, the other series to model the mutations in AT rich organisms. A script was written to parse .fna files from Genbank and calculate the percentage of GC in each bacteria. The script was run on the following complete bacterial genomes contained in the NCBI ftp directory genbank/genomes/bacteria as of June 27, 2000: Aeropyrum pernix K1, Archaeoglobus fulgidus, Aquifex aeolicus, Borrelia burgdorferi, Bacillus subtilis, Chlamydia trachomatis, Chlamydia pneumoniae, Escherichia coli, Haemophilus influenzae Rd, Helicobacter pyori 26695, Mycoplasma genitalium, Methanococcus jannaschii, Mycoplasma pneumoniae, Methoanobacterium thermoautotrophicum, Mycobacterium tuberculosis H37Rv, Pyrococcus horikoshii, Rickettsia prowazekii strain Madrid E, Synechocystis PCC6803, and Treponema pallidum. The seven ATrich genomes used were Borrelia burgdorferi, Mycoplasma genitalium, Rickettsia prowazekki, Haemophilus influenzae, Helicobacter pylori, Methanococcus jannaschii, and Mycoplasma pneumoniae, and the three CG rich genomes were Mycobacterium tuberculosis, Escherichia coli, and Treponema pallidum. The asymmetry is caused by the paucity of CGrich genomes among those completely sequenced. More than 90 proteins that are highly conserved in bacteria across evolutionary distance were chosen as the basis for the matrices. These proteins are important for life functions in bacteria. Amino acid mutations in these proteins are more likely to be reflections of evolutionary pressures than mutations in less essential proteins. Most of the chosen proteins fell into the categories of ribosomal proteins, aminoacyl tRNA synthases, involvement in protein translation, and involvement in RNA synthesis. A script was written to search for the de sired proteins (protein search terms) in the .faa files of AT rich genomes and to extract those protein sequences. The script created output in the form of input for the Protomat package (Henikoff and Henikoff 1992). Blocks of multiple local alignments of proteins from AT rich genomes were made using the MotifJ component of the Protomat package. MotifJ was run on all .lis files in the protomat directory in mode 3, which is the noninteractive mode. The .blk files produced contained all of the blocks extracted from the protein sequences of AT rich genomes. These blocks were assembled into a blocks database called ATrich blocks. The BLOSUM package was run repeatedly with the ATrich blocks database with a minimum block strength of zero a maximum block strength of 9999, and the clustering percentage was varied among 30, 40, 50, 62, 70, 80, 90, 100, and n (no) clustering. The scale was set at 2 bits, or n=2. Results and Discussion. The BLOSUM matrices for the ATrich genomes are very close to the original BLOSUM matrices, constructed without regard to DNAcomposition bias. The matrices for the GCrich genomes differ in some entries, but the differences may not be statistically significant (only three GC rich genomes were used). The above result seemed unexpected, so additional analyses were carried out. Since bacterial genomes contain little noncoding DNA, the AT/GC biases are present in genes. One possibility was that the 90 most conserved genes that we included, do not exhibit these biases. Another possibility was that the AT/GC bias in these genes is confined to the silent residues in codons. To test these hypotheses, we compared the amino acid composition of the amino acid composition of the ATrich genomes to that of the GCrich genomes. The results indicated that proteins in these two groups of organisms have different amino acid compositions, consistent with the AT/GC bias. The outcome of the study is at the same time reassuring and surprising. It seems that the same rules of amino acid substitution are valid for apparently different genomes of eubacteria, a very diverse domain of life. However, it seems unexpected that almost identical rules produced genomes with different AT/GC biases. Further study of the problem seems justified. Acknowledgments. Tom McNeill helped a lot with writing the scripts in PERL. References Henikoff, S. and Henikoff, J.G. (1991) Nucleic Acids Res. 19: 6565. Henikoff, S. and Henikoff, J.G. (1992) Proc. Natl. Acad. Sci. USA 89: 10915.
10. Have functional nucleic acid sequences evolved characteristic base composition biases: a comparative study of genes and genomes Rob Knight and Erik Schultes Introduction Only a miniscule fraction of the possible nucleic acid sequences actually perform a catalytic or biological function. Here, we ask whether functional sequences are isotropically distributed over the space of possible nucleotide compositions, or whether functional sequences have characteristic nucleotide composition biases. We apply this analysis to a variety of biological and artificial sequences. Although differences in mutation spectra between different organisms greatly affect the composition of their genomes, functionally constrained sequences still fall in only a small region of the possible composition space. Here we focus on three specific questions: 1. How much can sequences with conserved function, such as rRNA and orthologous protein-coding genes, vary in composition under mutation pressure? 2. Organisms differ greatly in the composition of their coding regions, although there is a large core of orthologous proteins conserved across all three domains of life. Do these compositional differences arise because different genomes favor proteins from superfamilies that have particular amino acid compositions (e.g. olfactory proteins), or do genome-wide mutational processes affect all genes including the core set? 3. Do unrelated catalytic sequences share compositional characteristics? In particular, do artificially selected ribozymes have the same biases as natural ribozymes (rRNA, RNase P, etc.)? Methodology LSU and SSU rRNA sequences and structural alignments were downloaded from the rRNA database at http://rrna.uia.ac.be/. Codon usages for protein-coding sequences were downloaded from CUTG at http://www.kazusa.or.jp/codon/, based on GenBank Release 120.0. Aptamer and ribozyme sequences were compiled from the literature by the authors. Other sequences, including complete genomes and various classes of functional RNA molecules, were downloaded from NCBI http://www.ncbi.nlm.nih.gov/ or from other publicly available sources. Compositional data (fraction of U or T, C, A, and G) for each sequence were extracted by ISO C programs written by the authors. Where structural information was available, we extracted compositional data for the different types of feature (stem, loop, helix) individually. Similarly, for protein-coding regions, we tabulated data for the first, second, and third codon positions. We used the program MAGE, freely available from http://www.faseb.org/protein/kinemages/MageSoftware.html, for 3D data visualization. We plotted the composition of particular types of sequences in a tetrahedron, with U, C, A, and G as the four vertices, as previously described [Schultes, 1997 #2361; Schultes, 1999 #2385]. We used Microsoft Excel to quantitate patterns that we identified visually in the simplex. Preliminary Results 1. In a large sample of rRNA (1320 LSU sequences and 13 215 SSU sequences, varying in composition from 12% to 75% G+C), we find that base pairing constrains helical regions of RNA to lie parallel to Chargaff's Axis (where A=U and G=C), with characteristic offsets to account for non-Watson-Crick pairing, e.g. G-U interactions; however, loops and bulges are generally biased towards purines. These trends are replicated independently in the three domains of life, and in plastids, suggesting functional, rather than hereditary, constraints. However, in mitochondrial rRNA, Chargaffs Rule is violated: G and C are uncorrelated, as are A and U. This implies either extensive, uncharacterized base modification, or that base pairing is greatly reduced in mitochondrial (vs. nuclear) rRNA. We also find that the nucleotides annotated as loops and bulges differ significantly in composition. We plan to compare these features with analogous regions of artificially selected aptamers and ribozymes, and with other natural ribozymes. 2. From a sample of complete mitochondrial genomes from 133 species, we find that the compositions of coding regions, rRNA, tRNA, and unannotated regions are highly correlated. This supports the idea that nucleotide bias due to mutation greatly affects even sequences with conserved functions. Additionally, we show that even individual protein-coding genes, such as ATPase 8 and Cox3, vary greatly in sequence. Interestingly, the amino acid composition of these conserved sequences is largely predictable from the pattern of mutations. Interestingly, although bacterial and eukaryotic genomes follow the intra-strand Chargaffs Rule where G=C and A=T [Sueoka, 1992 #2268], mitochondrial genomes are not bound by this constraint. 3. We are currently compiling sequences of other classes of functional RNA for comparison with the LSU and SSU rRNA sequences, and hope to have this comparison complete by the time of the meeting. Additionally, we hope to repeat our mitochondrial analysis for all available complete nuclear genomes, and to repeat the rRNA analysis for several families of orthologous protein-coding genes.
11. Determination of vertebrate gene families Antje Krause (1*), Georgia Panopoulou (2), Steffen Hennig (2), and Martin Vingron (1,2) (1) Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany (2) Max-Planck Institut fuer Molekulare Genetik, Ihnestrasse 73, D-14195 Berlin, Germany (*) EMail: A.Krause@dkfz.de We present a clustering of invertebrate and vertebrate sequence data into well separated gene families as a prerequisite for the analysis of vertebrate evolution and functional annotation. Starting point for our work was a data set of about 14,000 EST sequences from amphioxus. The cephalochordate amphioxus is the last invertebrate chordate that separated from the vertebrate lineage and for which there is a living representative. Major duplication events during vertebrate evolution are assumed to have happened after the divergence of amphioxus. To prove this hypothesis one depends on well separated vertebrate gene families having only one orthologous representative in the invertebrates. Our clustering approach is based on the idea of Tatusov and Koonin, now widely known as the "two-way-blast" approach and realized in the COGs database. As a basis we took the set of predicted protein sequences of the three completely sequenced invertebrate organisms Saccharomyces cerevisiae (6,358 sequences from MIPS), Caenorhabditis elegans (19,099 sequences from wormpep20), and Drosophila melanogaster (13,675 sequences from FlyBase). Our clustering approach starts with an all-against-all search of the complete sets of invertebrate protein sequences. We use only the best hit of each sequence in both other invertebrate sequence sets for clustering, independently of the score or E-value. Sequence pairs having each other as best hit are supposed to be orthologous and hits in only one direction to be paralogs. Following this procedure with three different invertebrate organisms, the number of false positives can be extremely reduced and a quality for each cluster can be determined based on the structure and connectivity of the orthologous group. Different topologies can be observed in the resulting cluster set: 1. 1,418 orthologous groups are built of all three invertebrate organisms. 2. 2,048 groups contain only C.elegans and Drosophila, 136 groups only C.elegans and Yeast, and 248 groups only Drosophila and Yeast. 3. 411 groups contain multiple representatives of one organism. Families ending up in these clusters are e.g., HSP70, tubulin alpha, tubulin beta, multidrug resistance proteins, dynein heavy chain, myosin heavy chain, and actin. In a second step we added vertebrate, agnathe, and cephalochordate protein sequences from SWISS-PROT, PIR, and TrEMBL, and consensus sequences from the human UniGene data set, our inhouse mouse GeneNest EST clustering, and the amphioxus project. After searching these sequences against the invertebrate sequence set, again only best hits were used and a sequence added to an orthologous group only if all invertebrate orthologs of this group were matched. For 2,918 of all orthologous groups we were able to include at least one vertebrate protein sequence. For another 685 groups without a vertebrate protein match we found potentially related vertebrate EST consensus sequences. 931 of the amphioxus consensus sequences are successfully assigned to 645 of the orthologous groups. The Clusters of Orthologous and Paralogous SEquences (COPSE) are available for querying and browsing at: http://www.dkfz.de/tbi/services/copse/form
12. A Gossip-Based Approach to Syntenic Distance David Liben-Nowell, MIT Recently, there has been considerable interest in the definition and exploration of computational models that measure the genetic similarity between species. Many of these fit within the following family of models: one identifies a set of fundamental transformations that can alter a genome; the distance between two genomes G1 and G2 is then the length of the shortest sequence of these steps that transforms G1 into G2. Two of the most commonly-studied models within this paradigm are the reversal distance and transposition distance. Both are based on transformations that alter the linear order of genes within a chromosome, by reversing a contiguous segment of the chromosome or by moving a segment to another point in the genome. However, the order of genes within chromosomes is often unknown; furthermore, in multichromosomal species, it is not clear how to weight these intrachromosomal reorderings versus interchromosomal events that move genes from one chromosome to another. These types of considerations motivated Ferretti, Nadeau, and Sankoff [2] to define the syntenic distance model of genomic distance. This model ignores any information about gene order within chromosomes, focusing purely on the assignment of genes to chromosomes. As such, the only relevant transformations are those which result in interchromosomal movement of genes: fusions, where two chromosomes join into one; fissions, where one chromosome splits into two; and translocations, where two chromosomes exchange subsets of their genes. In this work, we describe two recent results on the syntenic distance. In joint work with Jon Kleinberg, we have shown that the syntenic diameter --- the syntenic distance between two maximally different species --- is exactly 2n-4. This result is based upon a surprising connection between syntenic distance and gossip problems in communication networks, in which we are interested in modeling the flow of information among people in a network. Further exploiting the connection between syntenic distance and gossip problems, we have given a more efficient algorithm to exactly compute the syntenic distance between two species. Computing this distance exactly is NP-Complete [1], but our algorithm is reasonably efficient in the interesting special case when the syntenic distance is small. Our algorithm requires O(2^{O(d log d)}) time when the syntenic distance is d, improving the best previous runningtime of O(2^{O(d^2)}), due to DasGupta et al [1]. THE SYNTENIC DIAMETER. The diameter problem for a genomic distance measure is todetermine the maximum distance between any pair of genomes. For rearrangement-based models, progress on the diameter problem has often led to insights into structural properties of the distance measure on general instances; unfortunately, these diameter questions have often turned out to be difficult to resolve. In recent work with Jon Kleinberg, we have shown that the syntenic diameter of the space of n-chromosome genomes is exactly 2n-4. Our proof is based on the relationship between syntenic diameter and the following gossip problem, popular among mathematicians in the 1970s. There are n people, each of whom initially knows a unique piece of information. The gossipers communicate by telephone calls, and whenever two speak, they share all of the information that they know. We aim to minimize the number of phone calls before each person knows all of the information. A number of researchers have given independent proofs that 2n-4 calls are necessary and sufficient to achieve this goal. (See [3].) For the syntenic diameter problem, a monotonicity property immediately identifies the genomes G1 and G2 that are maximally separated: every chromosome in G1 shares a gene with every chromosome of G2. Solving the syntenic diameter question reduces to determining the syntenic distance between these genomes. One can view the syntenic distance problem between genomes G1 and G2 as follows. Let the target of a chromosome C in G1 denote the set of chromosomes of G2 which share a gene with C. The goal for the syntenic distance problem is to exchange genes among the chromosomes of G1 so that the target of each chromosome becomes a unique single chromosome of G2. Intuitively, the operation of a phone call for a gossip problem is essentially a "reverse translocation": in a phone call, two gossipers take their information sets A and B and exchange information so that both know (A union B); in a translocation, two chromosomes with targets contained in (A union B) exchange genes so that they have targets A and B afterwards. This intuition can be formalized to prove that 2n-4 moves are necessary and sufficient to transform any n-chromosome genome G1 into any other n-chromosome genome G2. EXACT ALGORITHMS FOR SYNTENIC DISTANCE. As with many rearrangement-based genomic distance measures, exactly computing the syntenic distance has been shown to be NP-Complete [1]. We would still like to be able to exactly compute syntenic distances, especially in the interesting special case when the distance d is small. DasGupta et al [1] give an exact algorithm that requires O(2^{O(d^2)}) time to compute the syntenic distance. Based upon a generalization of the gossip problem above --- where gossiper i has a specified set of information S(i) that he wishes to learn --- we are able to give a faster exact algorithm to compute syntenic distance, requiring O(2^{O(d log d)}) time. We define a collection of these \emph{incomplete gossip} problems, and calculate the number of phone calls necessary to inform each person i of information S(i) via brute force. The tight connection with these gossip problems allows us to exactly compute syntenic distance using these values. Intuitively, the speed-up in our algorithm is derived from the following. The algorithm of DasGupta et al essentially enumerates all possible sequences of transformations of length d, and checks whether any of these sequences transform G1 into G2. The vast majority of the time spent in this algorithm is on translocations: for a translocation, we must select not only the input chromosomes, but also for each gene g in either input chromosome we must specify which output chromosome will contain g. In the gossip-based approach, we only need to select which people participate in each call. Once we have selected the participants, both learn whatever new information the other knows; there is no choice of different output sets. [1] B. DasGupta, T. Jiang, S. Kannan, M. Li, and Z. Sweedyk. On the complexity and approximation of syntenic distance. Discrete Applied Mathematics, 88(1-3):59-82, November 1998. [2] Vincent Ferretti, Joseph H. Nadeau, and David Sankoff. Original synteny. In 7th Annual Symposium on Combinatorial Pattern Matching, pages 159-167, 1996. [3] S. Hedetniemi, S. Hedetniemi, and A. Liestman. A survey of gossiping and broadcasting in communication networks. Networks 18:319-349, 1988.
13. DETECTING AND CHARACTERIZING FUNCTIONAL CHANGE IN PROTEINS: A GENOMIC PERSP ECTIVE David A. Liberles Department of Biochemistry and Stockholm Bioinformatics Center, Stockholm University, 10691 Stockholm, Sweden adaptive evolution/phenotype-genotype-structure correlation As biology enters the genomic era, new approaches to correlate genomic sequence, tertiary protein structure, and organismal phenotype are needed. A database (The Adaptive Evolution Database or TAED) of genes that appear to have undergone adaptive evolution or changes in function has been developed by comparing all chordate sequences found in Genbank. This analysis is based upon an estimation of the ratio of nonsynonymous to synonymous nucleotide substitution rates as a basis for detecting adaptivity or functional change. Cross-referencing of this database with other biological databases allows the big picture of how proteins evolve functionally and structurally and what this means for the organisms that contain them to emerge. For example at a structural level, an automatable approach to mapping mutations onto tertiary structures is combined with a phylogenetic analysis of sequence evolution. This approach is applied first to a test case, deoxyribonucleoside kinase to explain changes in specificity from a structural perspective and to relate them to the global picture of nucleoside metabolism and its evolution. This approach is then expanded to every other example in TAED with a homologue within 180 PAM units in the Protein Databank (PDB). Other case studies of adaptive evolution that emerge from TAED are also presented. The data presented here is a first step towards a comprehensive comparison of sequence-structure-function-phenotype analysis. The future development of more refined methods for gene family characterization and more automated methods for the end point analysis of data will augment this type of approach. The power of combining phylogenetic, biological, and structural approaches to the detection of adaptivity is demonstrated and will also become increasingly important as more metazoan genomes are sequenced.
14. GENE DUPLICATION AND THE ORIGIN OF EVOLUTIONARY NOVELTIES Michael Lynch, Department of Biology, University of Oregon, Eugene, OR Gene duplication has generally been viewed as a necessary source of material for the origin of evolutionary novelties, but it is unclear how often gene duplicates arise and how frequently they evolve new functions. Preliminary answers to these questions can now be acquired from the information inherent in completely sequenced genomes. Observations for several eukaryotic species suggest that duplicate genes arise at a very high rate, on average 0.01/gene/million years. Most duplicated genes in eukaryotes experience a brief period of relaxed selection early in their history, with a moderate fraction of them evolving in an effectively neutral manner during this period. However, the vast majority of eukaryotic gene duplicates are silenced within a few million years, with the few survivors subsequently experiencing strong purifying selection. The situation appears to be rather different in prokaryotes, where the death rate of duplicates is elevated and few changes at the protein level are tolerated. Although it has frequently been argued that the only mechanism for the permanent preservation of a duplicate gene is neofunctionalization, there is little evidence that even a moderate fraction of such genes evolve new functions. Instead, many duplicate genes appear to become preserved by subfunctionalization, with each member of the pair losing a key function of the ancestral gene. By releasing genes from pleiotropic constraints, subfunctionalization may open up new evolutionary pathways, thereby promoting the evolution of novel phenotypes. In addition, the random silencing of duplicate genes results in microchromosomal rearrangements, giving rise to interspecies genomic incompatibilities, and thereby promoting the passive origin of new species.
15. Strategy for the Multiple Whole Genome Alignment of four strains of E.coli using modified suffix arrays Bob Mau, Aaron Darling, Nicole Perna, and Fred Blattner Abstract We present a time-and-space efficient algorithm that aligns large sub-intervals of DNA sequence of closely related organism into segments we call backbone. Backbone represents sequence that can be putatively identified with the genome of the most recent common ancestor of the aligned organisms. The method is ideally suited for aligning a group of genomes possessing large regions of high sequence similarity, punctuated with numerous instances horizontally transferred sequence. Not surprisingly, this fits the profile of Enterobacteriaceae, the primary focus of the Blattner Lab's sequencing effort.Our methodology automatically identifies and places large relative inversions, translocations, and inverted translocations. A recent implemen- tation has successfully aligned four strains of E.coli: K-12 MG1655, K-12 W3110, O157:H7 EDL933(Enterohaemorraghic PEC), and CFT073(Uropathogenic EC). The prototype software finished in under 5 CPU minutes and used less than 7 Megs of memory. The critical software and algorithmic constructs will be described in some detail. Key is a novel representation of the genomic match coordinates that partitions maximal matches into disjoint groups called Canonical Maximal Exact Match(MEM) Equivalence Classes. We sketch how to finish this "rough draft" into a full fledged multiple alignment.Furthermore, we describe how some not-so-minor modifications will allow extension of the basic method to include more divergent genomes. 16. Comparing genomic DNA sequences: solved and unsolved problems Webb Miller, Pennsylvania State University Among the many uses of the forthcoming genomic sequence data from the mouse is elucidation of the human genomic sequence. The mouse sequence will be extremely valuable for completing the catalog of all human genes, and its potential for computational identification of other functional genomic regions, such as sites regulating gene expression, is unparalleled. A new World-Wide Web server, called PipMaker (http://bio.cse.psu.edu), produces accurate alignments with user-friendly visual overviews for any two long genomic DNA sequences. Moreover, one of the two sequences being compared can be in "working draft" form, and the same software can compare genomic sequences from other fairly similar species, such as C. elegans and C. briggsae. The following needs remain to be met. (1) Improved software that aligns two genomic sequences and has a rigorous statistical basis. (2) An industrial-strength gene prediction system that effectively combines genomic sequence comparisons, intrinsic sequence properties, and results from searching databases of proteins sequences and ESTs. (3) Reliable and automatic software for aligning three or more genomic sequences. (4) Better methods for displaying and browsing genomic sequence alignments. (5) Improved datasets and protocols for evaluating the correctness and performance of genomic alignment software.
17. Joseph Nadeau, Case Western Reserve University School of Medicine Genetic, expression and evolutionary mapping of a physiological pathway Anomalies in folate and homocysteine metabolism are associated with neural tube defects, cardiovascular disease, cancer, neurological deficits, and a variety of other birth defects and adult disorders. The known mutated genes in these pathways do not account for the frequency or variety of consequences of disrupted folate and homocysteine metabolism. We therefore are taking genomic approaches to understand the genetic control of folate and homocysteine metabolism. To identify mouse models for studying the involvement of these pathways in various birth defects and adult diseases, we documented significant variation in homocysteine levels among a panel of inbred strains. We then mapped most of the genes that are directly involved in folate and homocysteine metabolism, characterized the expression of these genes in organs and tissues that are at risk in folate and homocysteine metabolism, identified clusters of genes whose expression levels change in systematic ways across these organs and tissues, examined the evolutionary changes in the content of these metabolic pathways in selected organisms whose genomes have been completely sequenced, and described the patterns of DNA sequence variation for these genes in these organisms. Each of these approaches has raised interesting analytical and computional issues that are only partly resolved. This talk will therefore identify some new research problems delaing primarily with physiological pathways and will suggest some possible approaches to their resolution.
18. Computational Functional genomics Matteo Pelligrini, Protein Pathways Inc. The availability of 75 fully sequenced genomes provides us new insights into the function of many uncharacterized proteins. We implement four methods, Phylogenetic Profiles, Rosetta Stone, Gene Neighbor, and Operon reconstruction, to predict the function of proteins. These methods allow us to link together functionally related proteins by searching for conserved properties across genomes. The full set of links for a complete genome is captured in a protein functional network. These networks are used to assign protein functions and reconstruct metabolic pathways.
19. Title: Large-scale genome duplications in the model plant Arabidopsis and in the human genome Steven L. Salzberg, Ph.D. The Institute for Genomic Research Our group has developed an efficient algorithm for comparison of whole genomes and chromosomes, and using this system we have uncovered a number of previously unobserved features of microbial genomes and eukaryotic chromosomes. The MUMmer system (Delcher et al., 1999) can align DNA sequences of essentially any size, including entire eukaryotic chromosomes, in just a few minutes on a powerful, large-memory computer. We have used this system to align all microbial genomes against each other, and to align all the chromosomes of Arabidopsis thaliana, Drosophila melanogaster, and most recently of the human genome. Two surprising discoveries have emerged recently. The first is that the genomes of a number of bacteria can be aligned in both directions: that is, one of the genomes is globally similar at the DNA level to both the forward and the reverse strand of the other. When both alignments are displayed together, an X-shaped pattern is clearly visible. We have discovered X-alignments between the genomes of E. coli and V. cholerae, S. pyogenes and S. pneumoniae, M. tuberculosis and M. leprae, and others. The likely explanation for this phenomenon is much more frequent large-scale chromosome reversals than have previously been suspected. Second, our large-scale alignments of human and Arabidopsis have revealed that large-scale chromosome duplications appear to be very common evolutionary events. In Arabidopsis, the duplications cover approximately 60% of the genome in just 24 large, non-overlapping blocks (The Arabidopsis Genome Initiative, 2000). Our detailed analysis of the data reveals the strong likelihood that all blocks were duplicated simultaneously, lending support to the theory of a whole-genome duplication. Extensive gene loss followed this duplication, leaving only about 25-30% of the genes with paralogs in both copies of each duplicated segment. In the human genome, our analysis (Venter et al., 2001) reveals a large number of segmental duplications of varying ages. This talk will present an overview of this analysis with suggestions for further research necessary to untangle this highly complex picture. A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg (1999). Alignment of whole genomes. Nucleic Acids Research, 27:11, 2369-2376. The Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815. J.C. Venter et al. (2001). The sequence of the human genome. Science, in press.
20. David Sankoff, University of Montreal Conserved Clusters Versus Conserved Segments Two independent sets of recent observations on the sequences of small genomes pertain to the prevalence of short inversion as a gene order rearrangement process and to the lack of conservation of gene order within conserved gene clusters (in contrast to the familiar patterns of conserved segments in the comparison of higher eukaryotes). We propose a model of inversion where the key parameter is the length of the inverted fragment. We show that there is a qualitative difference in the pattern of evolution when the inversion length is small with respect to the cluster size and when it is large. This suggests an explanation of the lack of parallel gene order in conserved clusters and raises questions about the statistical validity of putative functionally selected gene clusters if these have only been tested against inappropriate null hypotheses.
21. Title: Detection of related genes in prokaryotes using syntenic regions Authors: Nalvo F. de Almeida Jr.(*) and Joao C. Setubal(**) (*) Department of Computer Science and Statistics Federal University of Mato Grosso do Sul, Brazil nalvo@dct.ufms.br (**) Institute of Computing, University of Campinas, Brazil setubal@ic.unicamp.br Extended abstract In this work we explore the synteny between prokaryotic genomes. One of our goals is to obtain clues for the function of hypothetical genes; another is to gain a better understanding of changes in gene organization between different species. Our method is based on a careful pairwise comparison of genes in whole genomes. We first build a binary similarity matrix M based on BLAST [1] e-values of genome G genes against genome H genes and vice versa (we use default BLAST parameter values, but do not use the low-complexity region filter). An entry m_{gh} in M is 1, for gene g in G and gene h in H, if and only if the e-values resulting from having g as query against h, and h as query against g, are both <= e-5. We refer to each m_{gh} != 0 as a _match_. We then look for physically (in each genome) consecutive matches (called _runs_) and for runs that are within k of each other, where k measures the maximum number of intervening genes in one genome, or in the other, or in both. Runs are found by a simple scanning of match lists. Note that there are two types of runs: parallel (both sets of genes are in increasing order in their respective genomes) and anti-parallel (one is increasing and the other is decreasing). Note also that a run is more general than an operon. In an operon, a series of consecutive genes is transcribed as a unit, and hence must all be on the same strand. Our concept of run allows consecutive genes in one or in the other genome to be in different strands. When we find runs of the same type that are within k and such that the flanking genes in each run have consistent strands we join them. This join operation adds the intervening genes in both genomes to the run. We use the term _clusters_ to designate joined runs. We designate a pair of genes (g,h), g in G and h in H, inside a cluster, such that nor g nor h belongs to a run, as a candidate related pair (CRP). Some of these pairs will be genes that are evolutionary related, but whose homology cannot be detected by the usual similarity methods. Some of the criteria that can be used to build the case that the genes g and h in a CRP are indeed related are: - Relative size. One would expect that functionally related genes have approximately the same size (the ratio of the smaller to the larger being no less than, say, 80%), but this is by no means a general rule. - Consistent strands. The joining of runs impose a constraint on the strands of genes belonging to the cluster. In a parallel run, the strands of each pair (g,h) belonging to the run must match. In an anti-parallel run, the strands must be of opposite sign (taking one strand as + and the other as -). We would expect the same consistency to occur in CRPs that are evolutionary related, but again there may exceptions to this. We have obtained two other kinds of comparative information using our method. The first is related to fusion and fission of genes. A candidate gene fusion event occurs when gene i in G has matches to genes j and j+1 in H. Since our definition of match implies reciprocity, genes j and j+1 are a candidate gene fission event. We again use the word "candidate" to stress the fact that each of these events must be carefully analyzed to determine whether they correspond to actual biological events. A second kind of information is a generalization of fusion/fission using the concept of runs. We can detect the following kind of event: given two parallel runs A = ((g_i,h_j),(g_i+1,h_j+1),...,(g_k,h_l)) and B = ((g_p,h_q),(g_p+1,h_q+1),...,(g_r,h_s)), it is the case that p = k+1 but q > l+1. In other words there is a consecutive series of genes in G that is split in two in H. An analogous definition can be given for two anti-parallel runs. We term such events as candidate run fusion events (from the point of view of G) and candidate run fission events (from the point of view of H). We present results of using the methodology described above on the publicly available genomes Xylella fastidiosa (XF), Escherichia coli (EC) and Pseudomonas aeruginosa (PA). We also use the current version of the Xanthomonas axonopodis pv citri (XC) genome, which is now being finished by a consortium of labs in Brazil. A summary of the results we obtained follows in the tables below. Table 1: Matches, runs, and clusters (using k = 2) -------------------------------------------------- organisms matches runs clusters ------------------------------------- XF vs. XC 5759 534 11 XF vs. PA 7750 544 7 XF vs. EC 5213 362 5 XC vs. PA 23300 1563 13 XC vs. EC 13154 908 9 PA vs. EC 23630 1410 7 ------------------------------------ (The clusters reported only include those for which there are CRPs such that the smaller gene in the pair is at least 20% in size of the larger gene in the pair.) Table 2: Hypothetical genes that are members of CRPs ---------------------------------------------------- organism # of hypo. genes ---------------------------- EC 7 PA 16 XC 19 XF 19 ---------------------------- total 61 ---------------------------- Of the 61 genes listed above, 12 appeared in more than one pairwise genome comparison. Table 3: Candidate gene fusion/fission events ------------------------------------------------------ G H fus in G (fis in H) fis in G (fus in H) ------------------------------------------------------ XF vs. XC 125 257 XF vs. PA 191 277 XF vs. EC 145 220 XC vs. PA 1088 803 XC vs. EC 527 489 PA vs. EC 914 828 ------------------------------------------------------ Table 4: Candidate run fission (in G) / fusion (in H) events. T is the number of genes between the two series of consecutive genes in G. ------------------------------------------- G H #events mean T max T min T ------------------------------------------- XF vs. XC 53 611.2 2530 1 XF vs. PA 47 800.3 2343 1 XF vs. EC 21 746.4 2760 1 XC vs. PA 278 1365.3 3870 1 XC vs. EC 57 997.3 4259 1 PA vs. EC 211 1668.6 5557 1 ------------------------------------------- Table 5: Candidate run fusion (in G) / fission (in H) events. T is the number of genes between the two series of consecutive genes in H. ------------------------------------------- G H #events mean T max T min T ------------------------------------------- XF vs. XC 80 829.8 3692 1 XF vs. PA 69 1637.1 4942 1 XF vs. EC 36 979.4 3571 1 XC vs. PA 505 1724.0 5347 1 XC vs. EC 166 1449.1 4246 1 PA vs. EC 214 1441.9 4140 1 ------------------------------------------- These results are currently being analyzed. In particular we expect to determine more candidate related pairs by inspecting the run fusion/fission events as well as runs that have isolated matches close by. Previous works related to the results presented here are referenced below. [1] S. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389--3402, 1997. [2] W. Fujibuchi, H. Ogata, H. Matsuda, and M. Kanehisa. Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping. Nucleic Acids Research, 28:4029--4036, 2000. [3] M. Y. Galperin and E. V. Koonin. Who's your neighbor? New computational approaches for functional genomics. Nature Biotechnology, 18:609--613, 2000. [4] H. Ogata, W. Fujibuchi, S. Goto, and M. Kanehisa. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Research, 28:4021--4028, 2000. [5] R. Overbeek, M. Fonstein, M. D'Souza, G. Pusch, and N. Maltsev. The use of gene clusters to infer functional coupling. PNAS, 96:2896--2901, 1999.
22. Computational Identification of Distal Regulatory Elements in the Human Genome David J. States, M.D., Ph.D, Washington University School of Medicine The control of eukaryotic gene expression is mediated by the cooperative action of multi-protein complexes binding to genomic DNA. Identifying these regulatory elements computationally is a challenge because they may be located tens or even hundreds of kilobases from a gene. A statistical approach for assessing the significance of transcription factor binding site clusters is described. Clusters achieve high statistical significance, correlate with evolutionary conservation of intergenic sequence and biochemical markers of regulatory regions. Most CpG dinucleotides in vertebrate genomes are methylated. Islands of hypomethylated CpG containing DNA are hypothesized to arise through protection of DNA from the action of methylases by sequence specific DNA binding proteins. Using human genomic sequences that are hypomethylated in vivo, a fifth order Markov model is developed to recognize hypomethylated like DNA sequences as a second approach to finding candidate regulatory regions. This Markov model is superior to traditional CpG island definitions in identifying hypomethylated sequences.
23. Computational Comparative Genomics Clemens Suter-Crazzolara, Guenther Kurapkat & Deborah Riley LION bioscience, Waldhoferstr. 98 69123 Heidelberg Germany Contact: suter@lionbioscience.com With the advent of world wide genome projects, scientists are confronted with a rapid increase in the amount of novel sequence information. Both the sizes and the number of biological databases grow exponentially. However, the gap between data collection and interpretation is also growing. For this reason, genome databases contain a wealth of information which is accessible, but which may remain obscured. As a result, across all sequenced species, nearly half of the potential genes can not be assigned specific roles. Intelligent systems are needed to bridge the widening gap between data collection and interpretation. To interpret data from newly sequenced pro- and eukaryotic genomes, comparative genomics is rapidly gaining importance, as genomes of distant organisms may still encode proteins with high sequence similarity. The order of genes (colinearity) in genomes is also be conserved to some extend. We have employed both these observations to create a versatile computational analysis system (genomeSCOUT), which allows for rapid identification and functional characterization of genes and proteins through genome comparison. With a number of independent methods, information about different levels of protein homology (concerning e.g. paralogs, orthologs and clusters of orthologous groups, COGs) and gene order is collected and stored in several value added databases. These databases are then used for interactive comparison of genomes and subsequent analysis. Genomes from in house sequencing projects or public sources can efficiently be added to the system. The application is based on the well established data integration system SRS (Etzold, T. et al., 1996. Methods Enzymol. 266: 114-128). This ensures: 1. Fast handling of large genomic data sets. The most complex queries give results instantly. 2. Straightforward access to a multitude of biological databases. Currently SRS allows the seamless integration of 400 biological databases. 3. Unique, SRS-based linking functions between all databases result in the accumulation of a wealth of biological information. This ensures optimal data mining. 4. Highly efficient collection of information on genes and proteins. 5. High flexibility allows effortless integration of additional public or proprietary databanks and applications such as CLUSTALW or BLAST. 6. Fully integrated and user friendly graphical representations of search results. This application can be used for projects spanning from the identification of drug targets or the optimization of microorganisms for biotechnology, to the elucidation of pathogenic principles, to in depth characterization and encyclopedic annotation of completed or partial genomes. Currently, the software is extended to allow the analysis of eukaryotic genomes (http://www.lionbioscience.com/)
24. Structure and dynamics of transcriptional regulatory networks in the yeast S. cerevisiae Saeed Tavazoie, Princeton University I will present computational methods for identifying regulatory interactions within the transcriptional network of the yeast S. cerevisiae. We have used classification algorithms, such as partitional clustering, to discover intrinsic expression patterns from whole-genome DNA chip data. A motif discovery algorithm based on Gibbs-sampling is then used to find significant motifs which are common to each expression pattern. Our approach identifies many known and putative novel cis=AD-regulatory elements. A major challenge now is to understand how these elements implement complex combinatorial regulatory programs.
25. Comparative genome analysis and molecular evolution Elizabeth Tillier, Centre for Clinical Genomics In many bacterial genomes, the leading and lagging strands have different skews in base composition; for example, an excess of guanosine compared to cytosine on the leading strand. Several factors contribute to these skews since they correlate with both the transcription direction of genes and with their replication direction. The molecular reasons for base composition skews attributable solely to replication direction are not known, but are very strong in some genomes. In particular, we find that Chlamydia genes that have switched their orientation relative to the direction of replication, for example by inversion, acquire the skewed base composition of their new "host" strand. Replication-related skews reflect a directional evolutionary force that causes predictable changes in the base composition of switched genes resulting in increased DNA and amino acid sequence divergence. Additionally, comparisons between related genomes reveal that gene order does not appear to deteriorate randomly, but rather by a series of imperfectly symmetrical translocations across an axis through the origin and termination of replication. Homologous recombination, transposable elements and illegitimate recombination could all have played a part in the genome rearrangements. Irrespective of the exact processes involved, the observation that almost all can be explained by a series of recombination events across the replication axis indicates that replication plays a central role in targeting the position of a translocation. These observations suggest that replication plays a major role in directing genome evolution.
26. Optimal Spliced Alignment Wei Zhu, Iowa State University Supplementary cDNA or EST evidence is often decisive for discriminating between alternative gene predictions derived from computational sequence inspection by any of a number of requisite programs. Without additional experimental effort, this approach must rely on the occurrence of cognate ESTs for the gene under consideration in available, generally incomplete, EST collections for the given species. In some cases, particular exon assignments can be supported by sequence matching even if the cDNA or EST is produced from non-cognate genomic DNA, including different loci of a gene family or homologous loci from different species. However, marginally significant sequence matching alone can also be misleading. We designed a novel algorithm that produces the optimal spliced alignment of a genomic DNA with a cDNA or EST based on scoring for both sequence matching and intrinsic splice site strength. By example, we demonstrate that this combined approach appears to improve gene prediction accuracy compared with current methods that rely only on either search by content and signal or on sequence similarity .A similar algorithm is available for spliced alignment of genomic DNA with protein sequences. For the case that several ESTs give overlapping alignments, we developed a post-processing procedure to assemble those sequences into a consensus alignment. This has several advantages: (1) more reliable gene prediction; (2) reduced redundancy; (3) identification of alternative splicing. A JAVA-based GUI allows convenient viewing of the data and results. I will discuss examples of genome annotation of C. elegans and D. melanogaster. REFERENCES 1. Usuka, J., Zhu, W., and Brendel, V. (2000) Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16, 203-211 2. Usuka, J. and Brendel, V. (2000) Gene Structure Predicted by Spliced Alignment of Genomic DNA with Protein Sequences: Increased Accuracy by Different Splice Site Scoring. J. Mol. Biol. 297, 1075-1085.

Previous: Participation
Next: Registration
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on February 19, 2001.