DIMACS Working Group on Reticulated Evolution

September 22, 2004
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Mel Janowitz, DIMACS, melj@dimacs.rutgers.edu
Randy Linder, University of Texas, rlinder@mail.utexas.edu
Bernard Moret, University of New Mexico, moret@cs.unm.edu
Presented under the auspices of the Special Focus on Computational Molecular Biology and the Special Focus on Computational and Mathematical Epidemiology.


Jonna Coombs, Rutgers University

Title: Horizontal Gene Transfer in the Environment

Horizontal Gene Transfer (HGT) in the contemporary microbial world is a process that governs pathogenicity, the evolution of antibiotic resistance, and the ability of microbes to decontaminate anthropogenically impacted environments. Microbiologists have traditionally relied upon two types of approaches, prospective and retrospective, to study HGT in the environment. In the prospective approach, HGT is measured in real time in the laboratory using simulated ecosystems (microcosms). In the retrospective approach, the past occurrence of HGT is examined through phylogenetic analysis of genes, or through the search for signatures of gene transfer in fully sequenced microbial genomes. While computational tools are well integrated into the retrospective approach to HGT in the environment, they are currently lacking in the prospective approach. Mathematical modeling that combines HGT with environmental parameters would greatly improve our ability to predict and manipulate HGT in the environment. Thus, computational tools in environmental modeling, statistical analysis, and HGT simulations are needed. New ideas and applications are vital to the understanding of HGT as it affects public and environmental health.

Samuel Handelman, Columbia University

Title: Computational Evaluation of Interspecific Gene Transfer between Fully Sequenced Genomes

We report an approach intended to evaluate the occurrence of interspecific gene transfer during eubacterial evolution. We report the results of this analysis on 5,438 stringently curated ortholog groups that we call Clusters of Reciprocal Sequence Homologs (CRSHs), drawn from 110 completely sequenced eubacterial genomes. Our approach employs a maximum likelihood method to calibrate the inherent sequence conservation level at individual positions in CRSH multiple sequence alignments. This calibration enables us to use the mean level of interspecific sequence similarity among all CRSHs to infer a phylogeny web, which is a relation graph lacking tree properties. The observed statistical variability in CRSH divergence levels between genomes is consistent with the hypothesis that gene-transfer events occur, but with decreasing frequency between more remotely related bacterial species. We are currently investigating possible relationships between gene transfer events and genomic organization.

Boris Mirkin and T. Fenner, School of Computer Science and Information Systems, Birkbeck University of London, UK
E. Koonin and Y. Wolf, National Center for Biotechnology Information, NLM, NIH, Bethesda, MD

Title: Directed Scenarios of Gene Histories over an Evolutionary Tree as Virtual Reticulation Events

An evolutionary tree T over a set of organisms I is a rooted binary tree whose leaves are labeled one-to-one by the elements of I. Given a phyletic profile of a family of genes that are supposedly descended from a common origin node on the tree, the problem is to build the most plausible scenario of the family's history over the tree, including the events of its emergence, inheritance along the tree, loss, and horizontal transfer across the tree. The phyletic profiles of genes are captured in the Clusters of Orthologous Groups (COGs) of proteins (http://www.ncbi.nlm.nih.gov/cog). Two types of information that can be extracted from the ~5,000 available COGs will be discussed here: (a) the profile itself, that is, the subset C of organisms in I in which the given COG is present, and (b) the distance matrix d_C between species in C, determined by comparison of the protein sequences in the analyzed COG.

Mirkin, Fenner, Galperin and Koonin (2003) have developed a method for mapping a profile C onto the set of nodes of T that assigns three types of events (gain, inheritance, and loss of a gene) in the most parsimonious way, that is, by minimising the inconsistency function pG+L, where G is the number of gains, L is the number of losses, and p is the gain penalty weight, which is chosen externally. The events of gene emergence and horizontal transfer cannot be distinguished in this setting; thus, both are referred to as gains. In that paper, the value of p was chosen by analysing the contents of the root of the tree (the Last Universal Common Ancestor) in the parsimonious scenarios found at different p values. Among all solutions found for p ranging from 0.1 to 10, the root contents at p=1 corresponded best to a plausible minimal set of genes. However, at p=1, too many COGs have multiple globally optimal scenarios, leading to major uncertainties in the reconstructions.
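
To make the inconsistency function pG+L concrete, the following is a minimal sketch of how its minimum could be computed for one profile. It uses a generic two-state, Sankoff-style dynamic program, which is an assumption here and not necessarily the algorithm of the cited paper. The function name `min_inconsistency`, the `children` map, and the toy four-leaf tree are all hypothetical.

```python
import math

def min_inconsistency(children, root, present, p):
    """Minimum of p*G + L over all gain/loss scenarios for one profile.

    children: node -> list of child nodes (leaves have no entry)
    present:  set of leaves in which the gene is observed (the profile C)
    p:        gain penalty weight; each loss costs 1
    Generic two-state dynamic program (an assumption, not the paper's code).
    """
    def cost(node):
        # Returns (best cost if gene absent at node, best cost if present).
        kids = children.get(node, [])
        if not kids:
            # A leaf's state is fixed by the profile.
            return (math.inf, 0.0) if node in present else (0.0, math.inf)
        absent_cost = present_cost = 0.0
        for kid in kids:
            kid_absent, kid_present = cost(kid)
            # Parent lacks the gene: a child that has it counts as a gain.
            absent_cost += min(kid_absent, kid_present + p)
            # Parent has the gene: a child that lacks it counts as a loss.
            present_cost += min(kid_absent + 1.0, kid_present)
        return absent_cost, present_cost

    root_absent, root_present = cost(root)
    # A gene present at the root still needs one gain (its emergence).
    return min(root_absent, root_present + p)

# Hypothetical toy tree ((A,B),(C,D)) with the gene observed in A and C.
children = {"R": ["X", "Y"], "X": ["A", "B"], "Y": ["C", "D"]}
score_p1 = min_inconsistency(children, "R", {"A", "C"}, p=1)  # two gains
score_p3 = min_inconsistency(children, "R", {"A", "C"}, p=3)  # one gain, two losses
```

On this toy profile the two candidate scenarios cost 2p and p+2, so they tie at p=2, which is a small illustration of how several globally optimal scenarios can arise for a single COG.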

In this work, given a COG, we employ not only its profile C but also the within-COG distance matrix d_C to further address this problem. To match d_C with a model distance matrix, we take timing into consideration.

First, the evolutionary tree is considered timed, that is, each node is assigned a time point on a scale from 0 to 100, consistent with the tree topology, such that all leaves are timed at 0 and the root at 100. Second, a gene history scenario is considered directed, so that the oldest gain node is defined as the emergence node and every other gain is considered the target of a horizontal gene transfer from a source, that is, an older node among those containing the gene in question. Given a directed scenario over a timed tree, a model path distance matrix between the leaves in C can be uniquely derived by following the tree topology and the source-target transfer directions. In fact, every directed scenario can be considered an instruction for changing the tree topology by joining each of its target nodes to the edge leading to its source, which explains the title of the talk.
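
The derivation of a model path distance matrix from a directed scenario can be sketched as follows. This is an illustrative simplification, not the authors' procedure: each transfer target is re-attached directly at its source node, whereas the abstract joins it to the edge leading to the source without stating the exact attachment point. The names `model_distances`, `parent`, `time`, and `transfers`, as well as the toy tree and its time points, are hypothetical.

```python
def model_distances(parent, time, leaves, transfers=None):
    """Path distances between leaves on a timed tree.

    parent:    child -> parent map of the species tree (root has no entry)
    time:      node -> time point (leaves at 0, root at 100)
    transfers: target -> source map of horizontal transfers; each target is
               re-attached directly at its source (a simplifying assumption)
    """
    gene_parent = dict(parent)
    for target, source in (transfers or {}).items():
        if time[source] <= time[target]:
            raise ValueError("a transfer source must be older than its target")
        gene_parent[target] = source  # rewire the topology for this gene

    def ancestors(node):
        # Chain from the node up to the root of the rewired tree.
        chain = [node]
        while chain[-1] in gene_parent:
            chain.append(gene_parent[chain[-1]])
        return chain

    dist = {}
    for i, a in enumerate(leaves):
        chain_a = ancestors(a)
        for b in leaves[i + 1:]:
            chain_b = set(ancestors(b))
            # First shared ancestor on the way up from a.
            meet = next(n for n in chain_a if n in chain_b)
            dist[(a, b)] = (time[meet] - time[a]) + (time[meet] - time[b])
    return dist

# Hypothetical timed tree ((A,B)X,(C,D)Y)R.
parent = {"A": "X", "B": "X", "C": "Y", "D": "Y", "X": "R", "Y": "R"}
time = {"A": 0, "B": 0, "C": 0, "D": 0, "X": 60, "Y": 40, "R": 100}
leaves = ["A", "B", "C", "D"]

vertical = model_distances(parent, time, leaves)
with_transfer = model_distances(parent, time, leaves, transfers={"Y": "X"})
```

In the toy example, a transfer from X to Y shortens the model distance between A and C from 200 to 120, while the distance between C and D is unchanged.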

Supplementing the principle of maximum parsimony with the principle of maximum correlation, we arrive at the problem of building a directed scenario that maximises the correlation between the empirical matrix d_C and the model-based matrix. We will discuss theoretical, algorithmic and computational developments with regard to this problem. In particular, of the 4872 gene evolutionary histories built according to the principle of maximum correlation, 3975 correspond to p=1, 602 to p=2, and 295 to p=3.
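
As a rough illustration of the maximum-correlation criterion, the sketch below scores candidate scenarios by the Pearson correlation between the empirical distances and each candidate's model distances, taken over leaf pairs. The use of Pearson correlation, the function names `matrix_correlation` and `best_scenario`, and all numbers are assumptions for illustration; the abstract does not specify the correlation measure or the search procedure.

```python
import math

def matrix_correlation(empirical, model):
    """Pearson correlation over the leaf pairs shared by two distance maps.

    Both arguments map (leaf, leaf) pairs to distances.
    """
    pairs = sorted(set(empirical) & set(model))
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two shared leaf pairs")
    xs = [empirical[k] for k in pairs]
    ys = [model[k] for k in pairs]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant distances")
    return sxy / math.sqrt(sxx * syy)

def best_scenario(empirical, candidates):
    """Name of the candidate whose model distances correlate best with d_C."""
    return max(candidates,
               key=lambda name: matrix_correlation(empirical, candidates[name]))

# Made-up empirical distances and two hypothetical candidate scenarios.
d_C = {("A", "C"): 0.50, ("A", "D"): 0.55, ("C", "D"): 0.30}
candidates = {
    "vertical": {("A", "C"): 200, ("A", "D"): 200, ("C", "D"): 80},
    "reversed": {("A", "C"): 80, ("A", "D"): 80, ("C", "D"): 200},
}
chosen = best_scenario(d_C, candidates)
```

Because the correlation is invariant to positive rescaling, the empirical sequence distances and the 0-to-100 time scale of the tree do not need to be brought onto a common unit before comparison.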

Robbie Young, Monterey Bay Aquarium Research Institute and
David Draper, University of California at Santa Cruz

Title: Bayesian Analyses of Hybrid Populations

Genetic covariances can be informative about underlying evolutionary processes operating in hybrid populations, and as such, have received considerable attention in the population genetics literature. We develop a Bayesian, Monte Carlo methodology to first test for significant disequilibria in a hybrid population and then build upon this methodology by fitting an immigration model to the data. We incorporate data from both parental species and the hybrid population. These algorithms provide methodological stepping-stones for the incorporation and analysis of more complex evolutionary models, including models of selection and assortative mating. We discuss methodology for testing alternative evolutionary models and for assessing model adequacy.
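
A Bayesian Monte Carlo test for disequilibrium can be illustrated with a deliberately simple sketch. Everything specific here is an assumption and not taken from the abstract: two biallelic loci, four joint haplotype counts, a uniform Dirichlet prior, and the disequilibrium coefficient D = p_AB - p_A p_B as the quantity of interest. The function name `posterior_disequilibrium` and the counts are hypothetical.

```python
import random

def posterior_disequilibrium(counts, draws=5000, seed=0):
    """Posterior summary of D = p_AB - p_A * p_B by Monte Carlo sampling.

    counts: observed counts of the haplotypes (AB, Ab, aB, ab)
    Assumes a Dirichlet(1, 1, 1, 1) prior, so the posterior is
    Dirichlet(counts + 1); samples are drawn via normalised gamma variates.
    Returns (2.5% quantile, median, 97.5% quantile) of the sampled D values.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        gammas = [rng.gammavariate(c + 1.0, 1.0) for c in counts]
        total = sum(gammas)
        p_AB, p_Ab, p_aB, _p_ab = (g / total for g in gammas)
        p_A = p_AB + p_Ab  # marginal frequency of allele A
        p_B = p_AB + p_aB  # marginal frequency of allele B
        samples.append(p_AB - p_A * p_B)
    samples.sort()
    return (samples[int(0.025 * draws)],
            samples[draws // 2],
            samples[int(0.975 * draws)])

# Made-up counts: strong association versus none.
associated = posterior_disequilibrium((40, 10, 10, 40))
balanced = posterior_disequilibrium((25, 25, 25, 25))
```

Under these assumptions, a 95% credible interval that excludes zero would be read as evidence of disequilibrium. Fitting an immigration model, as described in the abstract, would require additional parameters and data from the parental species that this sketch does not attempt.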

Document last modified on September 2, 2004.