DIMACS Tutorial and Workshop on Bioconsensus II

October 2, 2001 (Tutorial)
October 3 - 5, 2001 (Workshop)
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Tutorial Organizers:
Fred McMorris, Illinois Institute of Technology, mcmorris@iit.edu
William H.E. Day, whday@istar.ca
Workshop Organizers:
Mel Janowitz, Rutgers University, melj@dimacs.rutgers.edu
Francois Lapointe, Universite de Montreal, lapoinf@ere.umontreal.ca
Fred McMorris, Illinois Institute of Technology, mcmorris@iit.edu
Boris Mirkin, University of London, mirkin@dcs.bbk.ac.uk
Fred Roberts, Rutgers University, froberts@dimacs.rutgers.edu
Presented under the auspices of the Special Year on Computational Molecular Biology.



Title: Exact and polynomial time algorithms for supertrees
David Bryant. Dept Mathematics AND Dept Computer Science,
McGill University.
Abstract: A supertree is just a consensus tree for trees with unequal
taxon sets. The problem is that many of the standard axioms and methods
designed for consensus trees do not extend directly to supertrees, and
when they do we end up with NP-hard consensus methods. In this talk I show
that when we bound the number of input trees, we can construct optimal
consensus supertrees in polynomial time. Without the bound the problems become
NP-hard. The key to the algorithms is a combinatorial connection between
bipartitions of the input trees and bipartitions of candidate supertrees.

2. The W-curve: Trying to reach genomic bioconsensus in numerically mapped chaos space Doug J. Cork, BCPS Dept., Illinois Institute of Technology, Chicago cork@iit.edu Abstract The standard W-curve was developed by Cork and coworkers in 1993 to represent a given DNA sequence, which is a word in the alphabet {A, C, G, T}. Each base is assigned a point in the plane as follows: A = dA = (-1, -1); C = dC + (-1,1); G = dG = (1, -1); T = dT = (1,1). Moving from the 5' end to the 3' end of a strand, we then have that a DNA sequence is simply a sequence x = x1,x2...xn, where xi l {dA,dC,dG,dT} for each i = 1,2,...,n. Now a family of iterated function systems F can be defined by Xi =F(Xi-1,xi) = k-1Xi-1+ k-1xi where xi is as defined above, X0 = (0,0), and k is a positive real number. The standard W-curve is obtained by plotting the points (Xi,i) in 3-space and connecting each point with the next by a straight-line segment. The result is a representation that allows the visualization of the sequence in order to identify both global and local patterns within the sequence. An example of this can be found at www.iit.edu/~houstep/project.html. This algorithm is especially useful for finding embedded patterns in long genomic sequences, and shows significant potential as a tool for clustering these patterns. Comparative genomic analysis at its most fundamental level involves alignment of strings of DNA. Many useful and powerful tools such as Blast and Clustal W are able to respectively, search for, and align similar strings of DNA from a variety of species. However, the interesting gene motif patterns cannot be properly visualized within the information contact embedded in simple strings. More problematic is the question of whether we will be able to crystallize long genomic sequences and analyze their true secondary nd tertiary structures. It is, of course, these putative motifs that are binding to the 3D structures of proteins and inducing replication and transcription events. The W-curve is a numerical mapping algorithm that allows one to geometrically visualize the information content of genomic motifs. Patterns of Alu, Lines, SINEs, and duplication sequences may be easily visualized with the W-curve. It is our hope that this pattern recognition algorithm will lead to visualization tools to track the evolutionary history of motif patterns. The combinatorics of DNA motifs through crossover-recombination events will be more easily followed as we continue to sequence more and more genomes. In my lab we are currently collaborating with mathematicians and computer scientists to develop and test tools for analyzing genomic sequences [1,2]. Pattern recognition algorithms such as the W-curve, should allow the investigator to visualize long genomic sequences in geometrical space. The primary level of information content in DNA string alignment programs loses important tertiary levels of DNA information content that may be more consistent with true DNA-protein motif structures. The biologist may track the evolutionary history of long genomic sequences as well as embedded symmetrical patterns within these long repeats. It is suggested that these patterns could have significant function at the level of cross over recombination events occurring during proposed lateral transfer of genes between species through evolution. In order to confirm such a hypothesis, the following is a short list of ongoing tasks in our lab: 1. Correlate the structure of the information content of the W-curve projections of many similar genomic sequences with their actual function. Consensus methods will be applied to this problem. 2. Establish axiomatic definitions of appropriate molecular metrics (or dissimilarity coefficients) between W-curves of similar genes sequenced and visualized from different biological species. 3. Use the measures developed to generate phylogenetic trees of genomic sequences. 4. Compare the topology of W-curve based trees with conventional distance and character based phylogenetic trees. 5. To compare the topologies of gene trees with species trees generated by W-curve tree building techniques vs. those generated with conventional phylogenetic approaches. 6. Establish a mathematical function, which will transform the information content of the genomic W-curve into a true tertiary motif class of double helix. References [1] D. Wu and D. J.Cork et al., "Computer Visualization of Long Genomic Sequences", Proceedings of the Conference on Visualization 1993, San Jose, California, October- 25-29, 1993, IEEE Press, pp. 308-315. [2]H. J. Jeffrey, "Chaos game visualization of sequences", Computer and Graphics, 16, no. 1, pp. 25-33,1992. [3] www.iit.edu/~houstep/project.html
3. CONFIDENCE INTERVALS ON CONSENSUS TREES Guy Cucumel1 and Francois-Joseph Lapointe2 1 Ecole des sciences de la gestion, Universite du Quebec a Montreal C.P. 8888, Succ. Centre-ville, Montr=E9al, Qc, H3C 3P8 Cucumel.Guy@UQAM.CA 2 Departement de sciences biologiques, Universite de Montreal C.P. 6128, Succ. Centre-ville, Montr=E9al, Qc, H3C 3J7 Francois-Joseph.Lapointe@Umontreal.CA A consensus function takes as input a profile of trees representing the relationships among overlapping or identical sets of objects and returns a tree that is in some sense closer to the entire profile. A single tree is obtained by most consensus functions, altough this solution may not always be representative of the input profile of trees. If a unique consensus tree is not suitable, is it possible to obtain two trees that represent in some sense the lower and a upper bounds of the consensus? In other words is it possible to construct a confidence interval on consensus trees? In the present paper, we will propose a method based on a bootstrap approach. New profiles of trees will be generated and two trees will be derived from them to define an interval on consensus trees.
4. Phylogenetic Analysis as Meta-Analysis Alan de Queiroz, University of Colorado at Boulder A meta-analysis is a quantitative analysis of a collection of results from previous studies. In the past three decades the use of meta-analyses has become common, especially in the social sciences. Hillis (1995, Syst. Biol. 44: 3-16) noted that quantitative studies of congruence among phylogenetic trees resulting from analyses of different data sets are meta-analyses. I expand on this notion by placing various kinds of phylogenetic methods within the framework of meta-analysis. This framework emphasizes three major categories of methods: "total evidence," which involves combining the raw data from different data sets and thus is not a meta-analysis; average consensus, weighted matrix representation and other methods of combining trees that take into account some measure of node support and are thus "effect-size meta-analyses"; and more typical consensus methods and unweighted matrix representation, which are "vote-counting meta-analyses." Approaches that combine the raw data use more information than "effect-size meta-analyses" which in turn use more information than "vote-counting meta-analyses." Thus, in phylogenetics as in other fields, the use of effect-size and vote-counting meta-analyses require some additional justification. Regardless of whether one uses "total evidence" or some form of meta-analysis, the analysis of multiple phylogenetic data sets presents a major problem not encountered in most social science studies. In social science studies, there is typically no single correct answer to the question of interest; rather, the effect of the main independent variable is contingent on other factors, and a major goal is to estimate how these other factors affect the results. For example, in an analysis of the effects of psychotherapy, one might want to know how the training of the psychotherapist influences the results. In contrast, the goal of most phylogenetic analyses is to obtain the best estimate of a single "truth", the true tree. As in the social sciences, different phylogenetic studies may represent somewhat different populations, but in phylogenetics, the primary goal is not to explore these differences, but to somehow correct for them to infer the true tree. Some other analyses of multiple studies have a similar structure. For example, meta-analyses to estimate the mass of a kind of particle in physics attempt to correct for certain kinds of measurement error in some of the individual studies with the goal of estimating the single true mass of the particle. In phylogenetics, the necessary corrections are much more complicated than in this example from physics, and there is great disagreement on how to proceed. In particular, it is often unclear what kinds of misleading properties the data exhibit. In this context, the method of strict consensus is appealing in part because it makes few assumptions about the nature of these potentially misleading properties. However, as many have pointed out, strict and other traditional consensus methods tend to be overly conservative (because they are vote-counting procedures). The meta-analyst Rubin has suggested that the future of meta-analysis lies in trying to extrapolate results to an ideal data set, e.g., one with infinite sample size. Studies of the stability of phylogenetic results with the addition of new data might be interpreted as attempts at such an extrapolation (although some of the original investigators might blanch at such an inductive exercise).
5. Oliver Eulenstein, Iowa State Flipping to Achieve Consensus There are now several thousand evolutionary trees each of which constitutes a small subtree of the tree of life. We present an approach that allows us to integrate subtrees into larger trees in an effort to assemble the supertree of life. A set of trees that intersect pairwise in their leaf sets is termed a "grove". ntersections are used to assemble supertrees from the trees in a grove. Topological inconsistency due to errors in phylogenetic estimation and evolutionary idiosyncracies complicate the task of estimating supertrees. o tackle this problem we solve the following optimization problem: iven grove and a distance measure D among trees; find a tree -- a upertree -- hat has the smallest distance D to the trees in the grove. We introduce ew measure for D, called flip-distance and show its performance using imulations. Complexity results for the problem are presented. We show hat the flip-distance preserves semistrict consensus of the trees in a rove -- a fundamental requirement for supertrees.
6. Alternatives to Consensus in Phylogenetic Inference Joe Felsenstein, Department of Genetics University of Washington, Seattle Abstract: When consensus methods are used to infer an overall phylogeny, they suffer from arbitrariness, and it becomes impossible to choose among many methods. What is needed is an overall criterion that can guide the choice. By contrast, Total Evidence approaches have used such a criterion, overall parsimony, but they suffer from inability to respond to different processes that may be going on in different parts of the data set, such as different loci. Using a statistical model of evolution that allows for heterogeneity of processes in different loci, we can discover how to make a an overall maximum likelihood estimate that responds appropriately to the heterogeneity. In the limit when rates of change are small, it leads to parsimony weighting schemes that improve on overall parsimony. Further into the limit one reaches overall parsimony. To plant our feet firmly on a statistical foundation, we may need to abjure mathematical enjoyment and most of the techniques of making consensus trees.
7. Junhyong Kim, Yale University Super-trees and consensus trees from distance embedding Evolutionary tree graphs, called phylogenies, are commonly used to delineate genealogical relationships between biological objects. Often two or more tree graphs are of interest, say as a confidence set, and it can be desirable to obtain a consensus representation, again in terms of a tree graph. Similarly, when assembling very large graphs only a collection of strictly smaller subgraphs may be available in which case it is desirable to assemble the subgraphs into a larger "super-tree." Consensus and super- trees can be constructed with combinatorial edit operations on the graphs themselves. Alternatively, an invertable function can be constructed that maps the tree graphs to a space with a well defined algebraic structure. For example, one common map involves so-called parsimony encoding which maps tree graphs to the vector space of binary state frequencies. In this paper, I present a different map to the space of additive distance square matrices, discuss its properties, and other possible extensions to the problem of consensus and super tree construction.
8. HOW MANY CONSENSUS TREES? Francois-Joseph Lapointe1 and Guy Cucumel2 1 Departement de sciences biologiques, Universite de Montreal C.P. 6128, Succ. Centre-ville, Montr=E9al, Qc, H3C 3J7 Francois-Joseph.Lapointe@Umontreal.CA 2 Ecole des sciences de la gestion, Universite du Quebec a Montreal C.P. 8888, Succ. Centre-ville, Montr=E9al, Qc, H3C 3P8 Cucumel.Guy@UQAM.CA A consensus function takes as input a profile of trees representing the relationships among overlapping or identical sets of objects and returns a tree that is in some sense closer to the entire profile. A single tree is obtained by most consensus functions, although this solution may not always be representative of the input profile of trees. If a unique consensus tree is not suitable, how many consensus trees are needed? In the present paper, we will evaluate different criteria for selecting the number of consensus trees required to accurately represent a profile of input trees. These criteria will be used to modify consensus functions so as to return multiple trees that are in some sense closer to the entire profile. Statistical tests will also be proposed to determine the relative significance of unique and multiple consensus trees. Finally, an application to phylogenetic trees using the average consensus will be presented to illustrate the procedure.
9. TOTAL EVIDENCE, CONSENSUS, GLOBAL CONGRUENCE AND SIMULATIONS Claudine Levasseur 1,2 and Francois-Joseph Lapointe1,3 1 Departement de sciences biologiques, Universite de Montreal C.P. 6128, Succ. Centre-ville, Montr=E9al, Qc, H3C 3J7 2 Claudine.Levasseur@Umontreal.CA 3 Francois-Joseph.Lapointe@Umontreal.CA The analysis of multiple data sets in phylogenetic is leaded by two major approaches: combining the data (total evidence) or combining the trees (consensus methods). Both have been widely praised and criticized over the years for their respective strengths and weaknesses. We know that they differ greatly on their premises and philosophical backgrounds but what about the phylogenetic accuracy? Are the results from those approaches as different as some claim? In this paper, simulated data will help us to assess whether the use of a consensus method that accounts for branch lengths (e.g., average= consensus) combined with validation of the total evidence tree can lead to congruent results. Furthermore, we hypothesize that phylogenetic accuracy can be increased when those approaches are used jointly in a so-called global congruence approach.
10. Boris Mirkin, University of London Clusters of Orthologous Proteins as Characters in Tree Building A cluster of orthologous groups of proteins (COG) is a set of closely related proteins from different organisms that bears a phylogenetic signal about the set of organisms under consideration. In the COG project at NCBI (http://www.ncbi.nlm.nih.gov/COG/), there have been currently distinguished 3166 COGs in a set of fully sequenced genomes represented by 26 clades. The issue addressed in this talk is whether the data on presence/absence of genomes in COGs (COG phylogenetic profiles) carry any phylogenetic signal at all. The following example may help to explain the problem. Among species x, i, and j, species x and j are believed to be closer relatives to each other than to i. The COG contents do not confirm this: there are 336 different COGs containing x and i, and only 176 containing x and j. Moreover, 163 (93%) of these 176 also contain i. This makes it extremely difficult for any method based on the phylogenetic profiles to correctly identify the relationships among these three species (Mirkin and Koonin, 2000). Two ways for overcoming the issue have been explored. The first involves removing those COGs that are not consistent with the true species tree and thus can be considered as bearing a signal of lateral transfers rather than inheritance. The second involves a nontraditional concept of character compatibility and employs removal of COGs involved in a divergence event from other divergence events. The results found with these approaches indeed better correspond to the current model of evolutionary history of the genomes, especially in deep branching.
11. Robert Powers, University of Louisville Consensus $n$-trees, weak independence, and veto power Let $f$ be a consensus function on a set of $n$-trees. Then $f$ satisfies weak independence if whenever two profiles "agree" on a 3 element subset $X$, then the corresponding consensus outputs are compatible on $X$. If $f$ satisfies weak independence and ternary Pareto, then $f$ is a weak dictatorship. This fact is an extension of a similar result given in the theory of social choice.
12. Thousands of candidates and a handful of voters - An application of voting theory from pharmaceutical development Tommy Ratliff, Wheaton College The goal of lead optimization in pharmaceutical development is to reduce a library of hundreds of thousands of compounds to a subset of likely candidates before any laboratory testing is done. There are five main criteria used, which are assigned different weights depending on the purpose of the pharmaceutical. These criteria can naturally be interpreted as a small number of voters in an election with hundreds of thousands of candidates. This motivates our discussion of whether or not more consistency is forced on the outcomes when there are many more candidates than voters.
13. Consensus List Colorings of Graphs and Physical Mapping of DNA Fred S. Roberts Rutgers University Piscataway, NJ, USA In graph coloring, one assigns a color to each vertex of a graph so that neighboring vertices get different colors. We shall talk about a consensus problem relating to graph coloring and discuss the applicability of the ideas to the DNA physical mapping problem. In many applications of graph coloring, one gathers data about the acceptable colors at each vertex. A list coloring is a graph coloring so that the color assigned to each vertex belongs to the list of acceptable colors associated with that vertex. We consider the situation where a list coloring cannot be found. If the data contained in the lists associated with each vertex are made available to individuals associated with the vertices, it is possible that the individuals can modify their lists through trades or exchanges until the group of individuals reaches a set of lists for which a list coloring exists. We describe several models under which such a consensus set of lists might be attained. In the physical mapping application, the lists consist of the sets of possible copies of a target DNA molecule from which a given clone was obtained.
14. Mark Wilkinson, The Natural History Museum of London and Joe Thorley, School of Biological Sciences, University of Bristol and Department of Zoology, The Natural History Musuem Reduced Consensus and Supertree Methods Abstract: Reduced cladistic consensus (RCC) methods were developed in the field of phylogenetics to avoid the insensitivity of the strict consensus and the ambiguity of the Adams consensus. These problems and how they are addressed by RCC methods will be reviewed. Subsequent work has shown there to be a family of RCC methods (strict, semi-strict, majority-rule) with potential in different domains of phylogenetics and perhaps some broader implications for consensus theory. The basic ideas behind RCC methods may also be of some use in the context of supertree construction. We describe our disatisfaction with the matrix representation with parsimony methods of supertree construction, our motivation (and limited success) in developing reduced supertree methods and some additional thoughts on alternative approaches to this important problem.
15. Li-San Wang, University of Texas at Austin Clustering and Consensus in Phylogenetic Analysis Conventional ways of processing multiple trees returned by phylogenetic inferences involve consensus tree methods over the whole set of trees. The single-consensus approach drops too much information, and is usually susceptible to small outliers (e.g. strict consensus). This motivates us to investigate clustering and visualization methods to fully utilize the whole dataset. However, the space of phylogenetic trees, a high dimensional space with very discrete coordinates and extra tree constraints, is interesting to investigate in its own right. There has been little study in this direction except the phylogenetic island method. In this talk we present our framework for the study of clustering of tree spaces and devise parameters that represent certain properties of a set of trees. We then propose the class of "bi-criterion" optimization problems, and show certain subclasses of this family can be solved optimally. We compare conventional clustering methods, our new method, and the phylogenetic island method by applying them to biological datasets; we then represent the results visually.
16. Consensus, combination, and supertrees: integrating diverse data to reconstruct evolutionary trees John J. Wiens Section of Amphibians and Reptiles, Carnegie Museum of Natural History, Pittsburgh, PA 15213-4080; e-mail: wiensj@carnegiemuseums.org A long-standing but ongoing debate in phylogenetic analysis concerns how diverse datasets, such as different genes or different types of data (molecules, morphology), should be integrated to estimate the phylogeny of a group of organisms. A key point in this debate is whether datasets that give conflicting estimates of phylogeny should be combined only using trees from the separate datasets (using consensus methods) or whether the character data themselves should be combined. I show that given a large phylogeny (i.e., with many species), both the best and worst conditions for both combined and consensus analysis may occur in different parts of the same tree. Thus, universal application of a single approach to an entire tree may guarantee the failure of that method in at least some parts of the tree. Methods for integrating diverse datasets for large phylogenies are discussed. The supertree method is a consensus approach that is becoming increasingly popular for integrating large datasets that do not include identical sets of taxa. I use simulations to show that one of the major motivations of this approach, the desire to avoid missing data cells, may be somewhat misguided. Highly incomplete taxa can be included and accurately placed in combined analyses, and missing data cells are not by themselves misleading.
Previous: Participation
Next: Registration
Workshop Index
DIMACS Homepage
Contacting the Center

Document last modified on September 28, 2001.