DIMACS Working Group Meeting on Mathematical and Computationa Aspects Related to the Study of The Tree of Life

DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life

March 11-14, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:: Melvin F. Janowitz, DIMACS, melj@dimacs.rutgers.edu; Francois-Joseph Lapointe, Universite de Montreal, lapoinf@biol.umontreal.ca; F. R. McMorris, Illinois Institute of Technology, mcmorris@iit.edu; Fred Roberts, DIMACS, froberts@dimacs.rutgers.edu

Presented under the auspices of the Special Focus on Computational Molecular Biology and the Special Focus on Computational and Mathematical Epidemiology.

Abstracts:

Olaf Bininda-Emonds, Technical University of Munich

Title: Phylogenetic Supertrees: Seeing the Data for the Trees

As compared to conventional phylogenetic analyses, the obvious difference in supertree construction is that trees, rather than primary character data, are combined. This approach offers several advantages; however, it has also caused supertree construction to be strongly criticized, often to the point where the suitability of supertrees for phylogenetic inference has been questioned. The primary concern is that supertree construction loses contact with the data. As such, doubt exists regarding both the validity of the relationships indicated on any particular supertree and the phylogenetic interpretation of the supertree itself. In this talk, I review the issues involved in combining trees rather than primary character data in the context of phylogeny inference. If supertree construction is to be accepted as a viable approach for constructing phylogenies, these issues, and their relative advantages and disadvantages, need to be discussed.

D.J. Cork, S. Lembark, Illinois Institute of Technology

Title: Strategies for Assembling the Tree of Life with W-curves of Long Genomic Sequences

A major shortcoming of current gene mapping, search, alignment and treeing processes is that they attempt to perform exact matches, or alignments. One reason for this is that the common method of encoding DNA sequences is as alphabetic strings. While this matches nature in one way, it makes any sort of fuzzy matching difficult. The basic rule of fuzzy matching is to reduce the degrees of freedom within the problem space in trade for approximate matching. The problem here is that linear systems only have one degree of freedom. The W-curve describes DNA as a three-dimensional curve. The multi-dimensional space allows for fuzzy matching and can be faster for locating genes within a chromosome or comparing cDNA to chromosomes. The cDNA may be made from classes of long genomic introns and various long repetitive sequences such as are found in long genomic duplications 1,2. In some instances, these sequences may be targets, increasing our fundamental and applied understanding of the area of genomic target design.

The Hidden Markov Model ("HMM") search and mapping approach uses a state model to estimate the probability that some sequences are from the same family. These require significant "training" to build the model and can have large variances in the outcome of tests depending on how well trained the original model was. The models are also exquisitely sensitive to what material the researcher trains them on -- with a correspondingly high error rate due to bad picks. Blast/Fasta is the other common model for DNA comparisons. It attempts to align the DNA sequences along their length using a recursive method similar to string searching. It essentially treats the DNA sequences as alphabetical strings and attempts to determine if they are alike. This approach also suffers from low speed and has problems dealing with gaps, repeats, and longer pieces of DNA (over about 2K base pairs). Neither HMM's nor Blast/Fasta are good for locating genes within a chromosome. Both are too slow and break down as the number of base pairs in the gene go above long genomic oligomers. One reason for their limitations is having to operate on base pairs in sequence, performing exact mappings of the bases. They also have problems with repeat sequences, which can vary in length without changing the function of a gross DNA pattern. The net result is that neither method makes for a fast, effective sieve in locating likely positions of a gene along a chromosome or comparing cDNA to a chromosome with many introns. These methods also discard knowledge from microbiology and genetics in making their comparison. A Blast comparison of two DNA strings performs mathematical operations on the alphabetic sequences. This approach was necessary when the models were originally developed: there simply was not enough data available to go further with them. The problem with this approach today, however, is that significant amounts of information are available which can be used to direct the comparison more effectively. The net result is that no good tools for locating long genomic sequences within a chromosome -- or set of them -- exist today. What is necessary at this point is an effective sieve for locating genes along a chromosome or comparing genes with multiple -- possibly large -- repeats in them, such as that surrounding Hba 1 and Hba 2. When Applying the W-curve Visualization Methodology2, genetic sequences are stored as a three-dimensional curve. This expanded problem space can be used in several ways. The first use was as a visualization tool: color-coded shapes in space simplify manual alignment of cDNA within chromosomes. More advanced, automated techniques are also possible using the W-curve as a basis for the comparison. Replacing knowledgeable biologist with comparison utilities that include some biological knowledge should allow us to make faster, more general comparisons. We generate the three-dimensional curve from DNA sequences using a simple walk. In the X-Y plane the four corners of a square are placed on the X and Y axes. Each corner is associated with one of the four DNA bases. From the point (0,0,0) we can draw a succession of points by simply incrementing the z-axis value and moving one half of the way towards the corner, which represents a particular base. This leads to a three-dimensional curve whose vertexes track the base sequences. Comparing W-curves is an exercise in three-dimensional curve matching -- a well developed area of mathematics. Fuzzy matching works well in this situation because the curves themselves are in three dimensions, which allows a wider range of comparisons than along a line. This gives us more flexibility in how the curves are compared. The curves themselves can be generated quickly, with an average gene taking under one tenth of a second. This allows us to compare sequences of genes from a complete chromosome to an individual gene at whatever rate the researcher can compare them visually. The next step in applying W-curves to DNA comparison is automating the entire process. This involves two separate processes: developing individual curve comparisons and automating the bulk process for applying them. One goal is to implement interchangeable classes of comparison operators whose outcomes can be tailored to specific needs. For example, comparing eucaryotic DNA may involve long gaps and long intron interruptions. Locating a gene along a prokaryotic or viral genome may involve analysis of a significant number of regulatory signals, causing significant breaks where filtering must be accomplished. Repeats may have to be handled along with genes of differing length. In all of these cases, the basic process demands comparing nearness in 3 D space. Another goal is to efficiently parallalize the execution of comparisons across a Beowulf cluster. This will require an efficient scheduling mechanism for finding genes within a chromosome and dispatching the comparisons within the cluster using PVM. Our top-level program will accept a list of chromosome specifications as URL's (e.g., file or directory or something accessible via BioPerl). These will be automatically broken up into genes as necessary and farmed out to the cluster nodes for comparison. The dispatch and comparison systems are where we insert additional knowledge from biology. For example, the section of code that breaks chromosomes up into genes should have some knowledge of how different species' genes are located on the chromosome. As our comparison code evolves it will be able to handle tradeoffs in speed or close matching and apply specifics of eucaryotic or prokaryotic genes to the comparison process (e.g., ignoring introns in eucaryotes). The intent is to simplify our addition of biological knowledge to the matching process. The W-curve can offer us a faster, more flexible way to compare DNA by making fuzzy comparisons in reasonable time. The comparison mechanisms also offer a good chance to apply biological knowledge to the comparison process. The issue at this point is developing a simple, flexible framework for dispatching the comparisons. After that the comparison details can be handled incrementally. A working version will provide researchers something lacking today: the ability to perform fast, bulk searches of chromosomal DNA to find 3-D patterns of genes. Similar 3-D patterns can be aligned, compared, and transformed into a phylogenetic tree. When looking at this phylogenetic tree one remembers that it is constructed from tertiary information content and not the primary information content of DNA. This 3D-information content can be visualized and topologies may be more easily correlated with biological data used to analyze functional enzyme motif binding to targeted long genomic sequences.

1Cork, D.J., Achieving Bioconsensus with the W-curve, in Bioconsensus, Vol. 61, F. Lapointe, F.R. McMorris, M. F Janowitz, F. Roberts, eds., DIMACS Series, American Mathematical Society, Providence, R.I., in press.

2Wu, D., Cork, D.J. et al., Computer Visualization of Long Genomic Sequences, IEEE special issue: Visualization in the Sciences, p. 308-315, IEEE, 1993.

Christopher J. Creevey, David Fitzpatrick, Melissa Pentony, Rhoda Kinsella, James O. McInerney and Mark Wilkinson, National University of Ireland, Maynooth

Title: Search for the Bacterial Phylogeny - A Supertree Approach

The problem of inferring bacterial phylogenies is aggravated by the apparent frequency of lateral gene transfer. While this means that although not all genes in an organism will have a unique shared history, it is unlikely that lateral gene transfer has been so rampant that a meaningful bacterial phylogeny cannot be inferred. We have used complete bacterial genomes in order to accurately reconstruct phylogenetic relationships between the bacteria. We have developed a method for using incomplete datasets and we have also developed a method for testing alternative tree topologies.

William H. E. Day, Port Maitland, Nova Scotia

Title: What do Mathematicians Think Biologists Want from Supertrees? An Axiomatic Perspective

Biologists understand evolutionary processes well enough to have a fair idea of what they want from supertree methods to estimate and synthesize evolutionary relationships. They recognize that it would be useful to characterise or design supertree methods in terms of their properties or axioms, yet the educational systems are such that biologists may not have acquired the mathematical skills necessary to undertake such axiomatic analyses. Mathematicians can help: they like to view such problems in terms of formal models and axioms, yet the educational systems are such that their familiarity with the biological underpinnings of supertree research may be sketchy and/or simplistic. If biologists and mathematicians wish to collaborate on supertree problems, they might begin with the premise that many relevant properties are so inadequately defined, and their interrelationships so poorly understood, that usually it is impossible to obtain interesting nontrivial formal results. To address this problem, I will describe a basic framework in which agreement, consensus and supertree problems can be formulated, and in which some of the more important supertree concepts might be given precise specifications. If biologists and mathematicians find this approach relevant, we might discuss later how to extend or refine it to meet the needs of individual researchers.

Richard Desper Title: Balanced Minimum Evolution: a fast algorithm for building large trees

The Balanced Minimum Evolution method is a fast new algorithm for building large phylogenetic trees, based on estimates of evolutionary distances. This algorithm builds a tree on n taxa in O(n^2 log n) running time. In contrast, the fastest widely-used distance method, Neighbor-Joining, runs in O(n^3) time. The BME method is based on assigning branch lengths as a function of average distances between subsets of taxa. While the Ordinary least-squares method also uses average distances, with each taxon weighed equally, the BME method gives subtrees equal weight instead. In fact, the BME method is a special type of weighted least-squares, where the variances of leaf-to-leaf distance estimates are assumed to be exponential as a function of the topological distances within the tree. The BME method also guarantees non-negative branch lengths, which are biologically meaningless. Also, the method can easily be shown to be consistent. Simulations have shown BME trees to be superior topologically to trees produced by the leading distance methods and their variants.

Oliver Eulenstein, Iowa State University

Title: A Comparative Study of Flip-Supertrees

Supertree methods assemble small phylogenies into larger trees, or supertrees. In spite of much recent interest in supertrees, there are few supertree methods available. Incompatibilities among source trees with different taxa sets complicates supertree construction. The flip problem is an error correcting approach to resolve incompatibilities, with the minimal number of elementary modifications in the input trees. Flip-supertrees could only be constructed and thus evaluated for a few (<20) taxa, since the flip problem is NP-complete. To evaluate the performance of the flip problem for large supertrees and to make flip-supertrees available to phylogeneticists we implemented a heuristic for flip-supertrees. We will present a comparative simulation study of the flip heuristic with other supertree methods. In addition we will show that sampling input trees using bicliques can improve the quality of resulting supertrees. We will present an empirical simulation study in which we evaluated our sampling strategy.

H.M. Hubey, Montclair State University

Title: Dimension Reduction via Dimensional Analysis and Multiplicative Neural Networks, and Unified Datamining and Datawarehousing

A complete and unified method combining Boolean Algebra, fuzzy logic, modified Karnaugh maps, neural network type training and nonlinear transformation to create a mathematical system which can be thought of as a multiplicative (logical-AND) neural network that can be customized to recognize various types of data clustering. The method can thus be used for: (1) displaying high dimensional data, especially very large datasets; (2) recognizing patterns and clusters, with the level of approximation controllable by the user; (3) approximating the patterns in data to various degrees; (4) preliminary analysis for determining the number of outputs of the novel neural network shown in this manuscript; (5) creating an unsupervised learning network (of the multiplicative or AND kind) that can be used to specialize itself to clustering large amounts of high-dimensional data, and finally; (6) reducing high dimensional data to basically three-dimensions for intuitive comprehension by wrapping the data on a torus. The method can easily be extended to include vector time series. The natural space for high dimensional data using the natural Hamming metric is a torus. The specifically constructed novel neural network can then be trained or fine-tuned using machine-learning procedures on the original data or the approximated/normalized data. Furthermore we can determine approximately the minimal dimensionality of the phenomena that the data represent.

Since the method uses a unique hashing algorithm (for storage and data manipulation), one that works with associative access, it can be used for other techniques such as nearest-neighbor methods. It lends itself to hardware acceleration and thus can be used for very large scale projects such as streaming data for "homeland defense" or for biological databases. The method is an ideal platform for integration of data warehousing, and datamining.

Francois-Joseph Lapointe, Claudine Levasseur, and Olivier Gauthier, Departement de sciences biologiques, Universite de Montreal

Title: Average Trees, Supertrees and Splitstrees

The average consensus procedure is a method that takes as input a profile of trees P defined on a common set of objects, and returns a consensus tree, with branch lengths, that is closest from P. The same procedure can be used, however, in a supertree setting to combine trees defined on different sets of objects. The computation of average supertrees is not as simple as the computation of average consensus trees. For one, the combination of path-length distance matrices representing trees with overlapping sets of objects often leads to incomplete distance matrices defined on the global set of objects. Moreover, the averaging of distance matrices representing such trees is greatly affected by the scaling of path-length distances. To investigate the effect of these factors on the accuracy of average supertrees, a series of simulations have been carried out to distinguish among cases involving the combination of homogeneous or heterogeneous trees. Our results show that average supertrees computed from path-length distances are not as accurate as those computed from topological distances when heterogeneous trees are combined. Thus, we will introduce two distinct approaches to further examine the behavior of average consensus in the supertree setting. A validation framework will be presented to detect weakly-supported nodes in average supertrees, and a split decomposition graph will be used to identify areas of incompatibilities when combining trees. These approaches can also be generalized to other supertree methods that ignore branch lengths.

Roderic D. M. Page, University of Glasgow

Title: Supertrees: Algorithms and Databases

Supertrees are touted as one possible tool for constructing the tree of life (ToL), or at least some large parts of it. For this to be feasible we need efficient algorithm for supertree construction. At present there are few such algorithms, and what we have (e.g., MincutSupertree) have serious flaws. A major challenge is the development of fast, robust methods for agglomerating trees. Such methods could be incorporated into phylogenetic databases as a tool enabling users to assemble their own ToL. Phylogenetic databases will play a key role in developing our knowledge of the ToL, whether the ToL is constructed using supertrees, supermatrices, or something else. However, existing databases lack integrity of both taxon and data names. This makes it virtually impossible to undertake large scale analyses of real data. For example, given a properly constructed database, generating "challenge" datasets along the lines of the graph DIMACS challenges would be easy and potentially very useful for the phylogeny algorithm community. At present, this is not possible.

Andrzej Polanski, Rice University

Title: Lengths of Branches in the Coalescence Tree and Frequencies of Single Nucleotide Polymorphisms under Evolution with Time-Varying Population Size

Statistical inference from random samples of chromosomes taken from population depends on probability distributions of times in the coalescence process. In the case of a population which has evolved with constant size over many generations, times between coalescence events are independent exponential random variables. In the case of evolution with time - varying population size, times in the coalescence tree are no longer independent. Joint probability density function of coalescence times involves a sequential dependence of times of events that happen later in the tree on those which happen earlier.

In the presentation we show expressions of marginal distributions for times in the coalescence process with time - varying population size, which follow from the form of their joint probability distribution. The presented method also allows computation of joint probability distribution for pairs, triples, etc. of coalescence times.

We show application of the expressions for the expected lengths of branches in the coalescence tree to computing sampling distributions of Single Nucleotide Polymorphisms frequencies in populations with time varying size.

For large genealogies (of size n > 50 ) the described method cannot be directly applied because expressions contain coefficients which become very large. We present a method to stabilize numerical computations based on transforming equations to the form with non - diverging coefficients and we give expressions, obtained with the use of methods of hypergeometric summation, to compute these coefficients.

Fred S. Roberts, DIMACS, Rutgers University

Title: The Tree of Life: Challenges for Discrete Mathematics and Theoretical Computer Science

The tree of life problem raises new challenges for mathematics and computer science just as it does for biological science. This talk will lay out some of the challenges for math and CS, with emphasis on discrete math and theoretical CS. In addition to phylogenetic tree reconstruction issues, the talk will deal with database issues, nomenclature, setting up a species bank, digitization of natural history collections, interoperability, and the many applications of research on the tree of life.

Li-San Wang, The University of Texas at Austin

Title: Genome Rearrangement Phylogeny

Evolution operates on whole genomes through mutations that change the order and strandedness of genes within the genomes. These ``genome rearrangement" events are examples of ``rare genomic changes,'' which have low frequency and high signal-to-noise ratio. Thus analyses of gene-order data present new opportunities for discoveries about deep evolutionary events, provided that sufficiently accurate methods can be developed to reconstruct evolutionary trees.

In this talk I will present our group's results in genome rearrangement phylogeny reconstruction by combining both distance-based and parsimony-based tree reconstruction methods. We first developed true evolutionary distance estimators for genome rearrangement evolution; in our simulation study we obtain highly accurate trees by using these new distance estimators, even when the amount of evolution in the dataset is high. Our new software GRAPPA uses these new distance-based methods together with lowerbound techniques in order to achieve huge speedup over the previous implementation by Blanchette and Sankoff. We analyzed the Campanulaceae chloroplast gene order dataset with 13 taxa; GRAPPA was able to discover 216 most parsimonious trees within 2 hours.

This is joint work with Robert Jansen and Tandy Warnow (University of Texas) and Bernard Moret (University of New Mexico).

Mark Wilkinson, Natural History Museum, London

Title: What do Biologists Want from Supertrees?

Although a variety of supertree methods have been proposed, our understanding of these methods is limited. In turn, this limits the potential for biologists that seek to construct supertrees to make informed choices among the available methods. I distinguish between supertree methods that offer a conservative synthesis of the relationships that are agreed upon or uncontradicted by all the input trees, and more liberal meta-analytic supertree methods which have the potential to resolve conflict. Based on collaborations with a number of workers, and focusing primarily on meta-analytic supertrees, I introduce a series of properties that might be considered desirable by biologists and highlight where it is known that particular methods do or do not satisfy them. For biologists, the primary aim of meta-analytic supertree construction is to produce accurate phylogenies and most of the properties relate to this prime objective. Secondary aims pertain to the practicality of supertree methods, particularly their speed. From a mathematical perspective the properties I discussed are more or less imprecisely formulated and their inter-relationships are unclear. Mathematicians, computer scientists and biologists must work together to improve our understanding of supertree methods and the potential for really useful supertree analyses.

DIMACS Homepage

Contacting the Center

Document last modified on March 10, 2003.

DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life

March 11-14, 2003 DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Abstracts:

March 11-14, 2003
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ