DIMACS Workshop on Reticulated Evolution

September 20 - 21, 2004
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:: Mel Janowitz, DIMACS, melj@dimacs.rutgers.edu; Randy Linder, University of Texas, rlinder@mail.utexas.edu; Bernard Moret, University of New Mexico, moret@cs.unm.edu

Presented under the auspices of the Special Focus on Computational Molecular Biology and the Special Focus on Computational and Mathematical Epidemiology.

Abstracts:

Jonna Coombs, Rutgers University

Title: Horizontal Gene Transfer in the Environment

Horizontal Gene Transfer (HGT) in the contemporary microbial world is a process that governs pathogenicity, the evolution of antibiotic resistance, and the ability of microbes to decontaminate anthropogenically-impacted environments. Microbiologists have traditionally relied upon two types of approaches, prospective or retrospective, to study HGT in the environment. In the prospective approach, HGT is measured in real time in the laboratory using simulated ecosystems (microcosms). In the retrospective approach, the past occurrence of HGT is examined through phylogenetic analysis of genes, or the search for signatures of gene transfer in fully sequenced microbial genomes. While computational tools are well-integrated into the retrospective approach to HGT in the environment, the prospective approach is currently lacking. Mathematical modeling that combines HGT with environmental parameters would greatly affect our ability to predict and manipulate HGT in the environment. Thus, computational tools in environmental modeling, statistical analysis, and HGT simulations are needed. New ideas and applications are vital to the understanding of HGT as it affects public and environmental health.

Keith Crandall, Brigham Young University
Steven M. Woolley, Brigham Young University and Washington University
David Posada, Universidad de Vigo, Vigo, Spain
Mark Clement, Brigham Young University

Title: Comparison of methods for estimating reticulate evolutionary relationships through computer simulation

We compared five methods of phylogenetic inference using simulated data with known histories. Two of the methods (Neighbor Joining and Maximum Parsimony) infer bifurcating phylogenies, while the other three (TCS, Splitstree, and Minimum Spanning Network) can infer reticulate, multifurcating network genealogies. We simulated preliminary data to determine which parameters of the simulation model are important and the relative performance of the chosen algorithms. The phylogenies were compared using a topological comparison (Robinson-Foulds), a branch length comparison (Branch Score), and a count of the number of unique trees inferred. We found that maximum parsimony was able to recover the true tree most frequently, while Splitstree and TCS were the second most accurate. Also, we found large differences in total number of each trees inferred by each method. In fact, on more divergent data sets, the number of trees inferred by Splitstree and Arlequin was prohibitively high (in some cases more than 1,000,000!).

J. Peter Gogarten and Olga Zhaxybayeva, University of Connecticut

Title: Probability mapping and bipartiton analysis to study genome histories

Maximum likelihood and posterior probability mapping are visualization techniques used to study the mosaic nature of prokaryotic genomes [1]. However, as originally developed this approach has two shortcomings: posterior probabilities tend to overestimate the support for individual tree topologies; and the approach is limited to the analysis of only four genomes at a time, giving rise to problems resulting from low taxon sampling. We have addressed these problems by mapping the more conservative bootstrap support values, and by analyzing datasets extended by addition of homologs from reference genomes [2]. Better taxon sampling combined with subtree analyses prevents the inconsistencies associated with four-taxon analyses, but retains the power of visual representation.

In many instances it is interesting to analyze more than four genomes simultaneously. We have extended the probability mapping approach to 5 genomes [3], and we have adapted bipartition analyses (so-called Lento plots) to the analyses of several genomes [3, 4]. The resulting plots allow to detect and summarize majority or plurality signals, and they allow to focus attention on those gene families that retain phylogenetic information that conflicts with the plurality signal.

Zhaxybayeva O, Gogarten JP. 2002. Bootstrap, Bayesian probability and maximum likelihood mapping: Exploring new tools for comparative genome analyses. BMC Genomics 3: 4
Zhaxybayeva O, Gogarten JP. 2003. An improved probability mapping approach to assess genome mosaicism. BMC Genomics 4: 37
Zhaxybayeva O, Hamel L, Raymond J, Gogarten J. 2004. Visualization of the phylogenetic content of five genomes using dekapentagonal maps. Genome Biology 5: R20
Zhaxybayeva O, Lapierre P, Gogarten JP. 2004. Genome mosaicism and organismal lineages. Trends Genet 20: 254-60

Dan Gusfield, University of California at Davis

Title: Phylogenetic Networks with Constrained and Unconstrained Recombination

A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. With the growth of genomic data, much of which does not fit ideal tree models, and the increasing appreciation of the genomic role of such phenomena as recombination, recurrent and back mutation, horizontal gene transfer, gene conversion, and mobile genetic elements, there is greater need to understand the algorithmics and combinatorics of phylogenetic networks.

Wang et al. studied the problem of constructing a phylogenetic network for a set of n binary sequences derived from a known ancestral sequence, when each site in the sequence can change state at most once in the network, and recombination between sequences is allowed. They showed that the problem of finding a phylogenetic network that minimizes the number of recombination events is NP-hard, but gave a polynomial-time algorithm (O(nm + n^4)-time, for n sequences of length m each) that was intended to determine whether the sequences could be derived on a phylogenetic network in which the recombination cycles are node disjoint. We call such a network a ``galled-tree". The paper by Wang et al. is seminal in defining this problem and asserting that it has an efficient solution. Unfortunately, the algorithm given there is incomplete and only constitutes a sufficient, but not a necessary, test for the existence of a galled-tree for the data.

In this talk we do several things. By more deeply analyzing the combinatorial constraints on galled-trees, we obtain an algorithm that runs in O(nm + n^3)-time and is guaranteed to be both a necessary and sufficient test for the existence of a galled-tree for the data. We show how to relax the assumption that we know the ancestral sequence. We show that when there is a galled-tree, the algorithm constructs a ''reduced" galled-tree that minimizes the number of mutations occurring on recombination cycles. We prove that the algorithm produces a reduced galled-tree that minimizes the number of recombinations needed for the data, over all possible phylogenetic networks for the data, even if multiple crossovers, or different ancestral sequences are allowed. Hence, when a galled-tree exists, the problem of minimizing the number of recombinations can be solved efficiently. The effect is that the galled-tree is a phylogenetic network that explains the input sequences with the ``littlest deviation" from a true tree model. We show that the reduced galled-tree is ``nearly-unique", but when it is not unique, the algorithm also obtains a count of the number of galled-trees that exist for the input data, and can create these in linear time for each one, starting from the canonical galled-tree. Finally, we consider phylogenetic networks where the recombination networks are not constrained in any way. We discuss new, efficiently computed lower bounds on the number of recombination events needed.

Joint work with Satish Eddhu, Chuck Langley, and Dean Hickerson

Jody Hey, Rutgers University

Title: The Challenge of Measuring Gene Flow Between Diverging Populations

Gene flow between populations causes genetic variation to be shared. However if two populations have recently separated, or if two new species have recently diverged, both new entities are expected to share variation that was present in the ancestral population, regardless of gene flow. Thus assessing the role of gene flow in the early stages of species formation is very difficult. A new analytical protocol and a new empirical protocol, that together should overcome this difficulty, are described.

In our procedure, haplotype data are collected for loci that include a region of unique DNA sequence linked to a simple STR region. The two different components offset each other's shortcomings. We have included this two-mutation-rate model into the program for fitting the isolation with gene flow model. Examples are described from chimpanzees and from closely related species of cichlid fish from Lake Malawi.

Daniel Huson, Center for Bioinformatics Tuebingen (ZBIT), Tuebingen University, Germany

Title: Phylogenetic Networks - Visualization Techniques for Incompatible Data and Models of Evolution

There is increasing interest in the development and application of methods that produce "phylogenetic networks" from bio-molecular data. The term "phylogenetic network" is used in at least two different ways. For example, splits graphs as produced by methods such as split decomposition, neighbor-net, or the Z-supernetwork method, are a type of phylogenetic network that are primarily used to display phylogenetic data in such a way that different and incompatible signals are not forced onto a tree but are rather displayed simultaneously. In contrast to this, e.g. ancestor-recombination graphs and hybridization graphs are phylogenetic networks that represent putative recombination or hybridization szenarios. In this talk we will give a survey of methods that produce splits graphs and then will discuss some of the links between splits graphs and graphs of the second type.

Randy Linder, University of Texas and Bernard M.E. Moret, University of New Mexico

Title: Network (Reticulated) Evolution: Biology, Models, and Algorithms

This tutorial will present the state of knowledge in biology regarding reticulate evolution, review the proposed models for evolutionary networks, relate classical tree reconstruction to network reconstruction, introduce recent computational work in the latter, and identify some of the main challenges in data collection, modeling, and algorithm design that will have to be overcome in order to produce accurate simulation and reconstruction software. The tutorial will also address other evolutionary processes that can give the appearance of reticulation even though their underlying nature is tree-like.

Attendees will learn the extent to which reticulation impacts evolution and, in particular, what effect reticulation has on diseases, crops, and the environment; they will appreciate the main challenges involved in modeling and reconstructing network evolution and be brought up-to-date on the state-of-the-art in computational methods for detecting reticulation and recombination, assessing reconstructed networks, and attempting computational reconstructions.

BIOSKETCHES:

C. Randal Linder is Associate Professor of Integrative Biology in the School of Biological Sciences at the University of Texas. His research, funded under 5 separate NSF grants, focuses on two areas in biology: phylogenetics, with special emphasis on reticulate evolution, and the evolutionary significance of oil composition in seeds. His lab is developing benchmark biological systems that can be used to validate methods for reconstructing reticulate evolution. Dr. Linder has published papers on a wide variety of topics in evolutionary biology, has extensive experience lecturing on these topics, and has won the U. of Texas Teaching Excellence award.

Bernard M.E. Moret is Professor of Computer Science and of Electrical and Computer Engineering at the University of New Mexico. His research, funded under 6 separate NSF grants, a Sloan Foundation award, and an IBM grant, is aimed at developing new, high-performance methods for phylogenetic reconstruction. He is leading a group of 15 institutions (including UC Berkeley, UC San Diego, UT Austin, and Yale) on a very large, 5-year NSF project to develop a computational infrastructure to support attempts to reconstruct the Tree of Life. Over the last three years, he has published over 30 refereed papers in computational biology. He has taught at UNM for 23 years, developed two dozen new senior and graduate classes and seminars in algorithms and experimentation, won every department, college, and university teaching award, authored two textbooks, and given invited lecture series in Europe on algorithm engineering, phylogenetic reconstruction, and high-performance computing.

The two lecturers are frequent collaborators; in particular, they are jointly funded by the NSF to study network reconstruction.

Vladimir Makarenkov, Universite du Quebec a Montreal

Title: Detecting Horizontal Gene Transfers Using Discrepancies in Species and Gene Classifications

Two models for detection of horizontal gene transfers have been recently considered by Makarenkov et al. (2004) and Boc et al. (2004). Both models use a distance approach and are based on the reconciliation of the topologies of the gene and species phylogenetic trees built for the same set of organisms. The first model assumes partial gene transfer; it is based on the computation and optimization of the minimum path-length distances in a directed network. In this model, the phylogenetic tree is transformed into a connected and directed graph in which a pair of species can be linked through several paths. The second model assumes complete transfer; the species phylogenetic tree is gradually transformed into the gene phylogenetic tree by adding to it a horizontal gene transfer in each step. During this transformation, only a tree topology is taken into account and modified.

Boris Mirkin and T. Fenner, School of Computer Science and Information Systems, Birkbeck University of London, UK
E. Koonin and Y. Wolf with National Center for Biotechnology Information, NLM, NIH, Bethesda, Ma.

Title: Directed Scenarios of Gene Histories over an Evolutionary Tree as Virtual Reticulation Events

An evolutionary tree T over a set of organisms I is a rooted binary tree whose leaves are one-to-one labeled by elements of I. Given a phyletic profile of a family of genes that are supposedly descended from a common origin node on the tree, the problem is to build a most plausible scenario of its history over the tree including events of its emergence, inheritance along the tree, loss and horizontal transfer across the tree. The phyletic profiles of genes are captured in the Clusters of Orthologous Group (COGs) of proteins (http://www.ncbi.nlm.nih.gov/cog). Two types of information that can be extracted from the ~5,000 available COGs will be discussed here: (a) the profile itself, that is, the subset C of organisms in I in which the given COG is present, and (b) the distance matrix d_C between species in C determined by comparison of the protein sequences in the analyzed COG.

Mirkin, Fenner, Galperin and Koonin (2003) have developed a method for parsimoniously mapping a profile C onto the set of nodes of T to assign three types of events, gain, inheritance, and loss of a gene, in the most parsimonious way, that is by minimising the inconsistency function pG+L, where G is the number of gains, L is the number of losses, and p the gain penalty weight to be chosen externally. The events of gene emergence and horizontal transfer cannot be distinguished in this setting, thus, they both are referred to as gains. In the referred paper, the value of p was chosen based on the analysis of the contents of the root of the tree (the Last Universal Common Ancestor) according to the parsimonious scenarios found at different p values. It appeared that, among all solutions found at p ranging from 0.1 to 10, the root contents at p=1 best corresponded to a plausible minimal set of genes. However, with p=1, too many COGs have many globally optimal scenarios leading to major uncertainties in the reconstructions.

In this work, given a COG, we employ not only its profile C but the within-COG distance matrix d_C as well to further address this problem. To match d_C with a model distance matrix, we put timing into the consideration.

First, the evolutionary tree is considered timed, that is, each node is assigned with its time point on the scale from 0 to 100 consistent with the tree topology in such a way that all leaves are timed at 0 and the root at 100. Second, a gene history scenario is considered directed so that the oldest gain node is defined as the emergence node and each other gain is considered the target of a horizontal gene transfer from a source, that is, an older node among those containing the gene in question. Given a directed scenario over a timed tree, a model path distance matrix between leaves in C can be uniquely derived following the tree topology and source-target transfer directions. In fact, every directed scenario can be considered as an instruction for changing the tree topology by joining each of its target nodes to the edge leading to its source, which explains the title of the talk.

Supplementing the principle of maximum parsimony with the principle of maximum correlation, we arrive at the problem of building such a directed scenario that maximises correlation between the empirical matrix d_C and that model based. We will discuss theoretical, algorithmic and computational developments with regard to this problem. In particular, of 4872 gene evolutionary histories built according to the principle of maximum correlation, 3975 correspond to p=1, 602 to p=2, and 295 to p=3.

Bernard M.E. Moret, University of New Mexico

Title: Reconstructing Reticulate Evolution

This tutorial will review the current state of research in reconstruction and assessment methods for reticulate evolution -- that is, evolution that does not always follow simple branching patterns, but may involve events such as recombination, lateral transfer, and hybridization, which induce a more general directed acyclic graph structure. An earlier tutorial will have reviewed what biologists know about reticulate evolution; this tutorial will focus on computational methods and the models on which they are based, comparing tree-based approaches with extensions that have been proposed for networks. Two areas have seen a fair amount of work to date: how to detect the presence of reticulation events in a dataset and how to compare two evolutionary networks; we will review both in some detail. Because work in the area is just starting, a large part of the presentation will be devoted to some of the main challenges that will have to be overcome in order to produce accurate simulation and reconstruction software. The tutorial will also address other evolutionary processes that can give the appearance of reticulation even though their underlying nature is tree-like (such as the gene tree/species tree problem).

Luay K. Nakhleh, Rice University

Title: Phylogenetic Networks: Modeling, Reconstructibility, and Accuracy

Phylogenetic networks model the evolutionary history of sets of organisms when events such as hybrid speciation and horizontal gene transfer occur. In spite of their widely acknowledged importance in evolutionary biology, phylogenetic networks have so far been studied mostly for specific datasets.

We present a general definition of phylogenetic networks in terms of directed acyclic graphs (DAGs) and a set of conditions. Further, we distinguish between model networks and reconstructible ones and characterize the effect of extinction and taxon sampling on the reconstructibility of the network.

Simulation studies are a standard technique for assessing the performance of phylogenetic methods. A main step in such studies entails quantifying the topological error between the model and inferred phylogenies. We discuss three ways of characterizing phylogenetic networks: based on the set of splits they induced, the set of trees they induce, and the set of tripartitions they induce. Using these characterizations, we describe three different measures of topological accuracy, and analyze their properties.

This is joint work with Randy Linder, Bernard Moret, and Tandy Warnow.

John R Wakeley, Harvard University

Title: Estimating pedigrees from genomic sequence data: Issues and ideas.

The issues surrounding the inference of bi-parental genealogies from large-scale genomic polymorphism studies will be introduced and discussed. Issues include: (1) the null structure of pedigrees, (2) the ancestral recombination graph, and (3) the idea of "perfect" data.

Tandy Warnow, The University of Texas at Austin

Title: Reconstructing Phylogenetic Networks: a Separate-Analysis Approach

In a 1997 seminal paper, Wayne Maddison observed that when a network underlies the evolutionary history of the dataset, each site evolves down a tree contained within the network. This suggests a "separate analysis" approach to inferring networks: first identify recombination-free segments of gene sequences, then construct estimates of trees for each such segment, and finally, combine the trees into a network on the full dataset. Maddison's original paper showed how to do this for the case where the network has one reticulation and the two gene trees are correctly estimated, leaving open the cases where the network contains more than one reticulation, or the gene trees have some error.

In this talk I will present our progress on extending Maddison's original approach. I will show how a network with one reticulation can be estimated from inaccurate estimates of its two gene trees, and how a galled-tree network with more than one reticultion can be estimated from two correct gene trees.

This is joint work with Luay Nakhleh, Randy Linder, and Katherine St. John.

Contacting the Center
Document last modified on September 10, 2004.