DIMACS Center, CoRE Building, Rutgers University

**Organizers:**

**Allen Rodrigo**, University of Auckland, a.rodrigo@auckland.ac.nz

**Mike Steel**, University of Canterbury, M.Steel@math.canterbury.ac.nz

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens, June 19 - 20, 2006.

DIMACS Working Group on Phylogenetic Trees and Rapidly Evolving Diseases II, June 23, 2006.

DIMACS Working Group on Phylogenetic Trees and Rapidly Evolving Diseases I, September 7 - 8, 2004.

Title: The evolution of infectious diseases: larger datasets require better methods

Molecular evolutionary studies can provide insights into the spread of infectious diseases and inform infection control measures. However, for highly recombinogenic organisms such as N. gonorrhoeae, phylogenetic analysis is difficult. Therefore, we performed a population genetic analysis of gonococcal isolates obtained over a 15-year interval in Baltimore, MD, where gonorrhea is highly prevalent. Categorical and quantitative analyses of genetic differentiation revealed temporal structuring of the gonococcal population. The historical demography of N. gonorrhoeae reconstructed from sequence data showed a correlation with trends in the number of reported cases of gonorrhea and may reflect the influence of social and demographic factors and the impact of antimicrobial resistance on the molecular epidemiology of gonorrhea in Baltimore over the past two decades.

Title: Inferring phylogeography and population history from viral sequences

I will present some preliminary results on modeling phylogeography via an isolation-by-distance approach. Both simulations and analyses of real data will be presented, including a recently published data set of FIV in cougars (Puma concolor) that shows strong spatial correlation of genotypes.

Title: A Markov Chain Monte Carlo Expectation Maximization algorithm for statistical analysis of DNA sequence evolution with neighbour-dependent substitution rates

The evolution of DNA sequences can be described by discrete-state, continuous-time Markov processes on a phylogenetic tree. We consider neighbour-dependent evolutionary models where the instantaneous rate of substitution at a site depends on the states of the neighbouring sites. Neighbour-dependent substitution models are analytically intractable and must be analysed using either approximate or simulation-based methods. We describe statistical inference of neighbour-dependent models using a Markov Chain Monte Carlo Expectation Maximization (MCMC-EM) algorithm. In the MCMC-EM algorithm, the high-dimensional integrals required in the EM algorithm are estimated using MCMC sampling. The MCMC sampler requires simulation of sample paths from a continuous-time Markov process, conditional on the beginning and ending states and the paths of the neighbouring sites. An exact path sampling algorithm is developed for this purpose.
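
To make the idea of sampling a path conditional on its endpoints concrete, here is a minimal sketch. It is not the algorithm from the talk: it assumes a single site evolving independently under a Jukes-Cantor model (no neighbour dependence) and uses plain rejection sampling, which is exact but can be slow when the required end state is unlikely. The function names and the `rate` parameter are hypothetical.

```python
import random

STATES = "ACGT"

def simulate_path(start, t_total, rate, rng):
    """Forward-simulate a Jukes-Cantor path on [0, t_total].

    Returns a list of (time, state) events beginning with (0.0, start).
    `rate` is the total rate of leaving the current state (an assumption
    made for this sketch; all three alternative bases are equally likely).
    """
    path = [(0.0, start)]
    t, state = 0.0, start
    while True:
        t += rng.expovariate(rate)  # exponential waiting time to next substitution
        if t >= t_total:
            return path
        state = rng.choice([s for s in STATES if s != state])
        path.append((t, state))

def sample_conditioned_path(start, end, t_total, rate, rng, max_tries=100_000):
    """Rejection sampler: return the first forward path that ends in `end`."""
    for _ in range(max_tries):
        path = simulate_path(start, t_total, rate, rng)
        if path[-1][1] == end:
            return path
    raise RuntimeError("no path accepted; the end state may be too unlikely")

if __name__ == "__main__":
    rng = random.Random(1)
    for time, state in sample_conditioned_path("A", "G", 1.0, 1.0, rng):
        print(f"{time:.3f} {state}")
```

Accepted paths are draws from the process conditioned on both endpoints, because rejection only discards paths with the wrong final state. Handling neighbouring sites, as in the abstract, would require rates that change whenever a neighbour's path changes state.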

Title: Linking dynamic and evolutionary models of persistent infection

A large body of mathematical theory has been developed to characterize persistent viral infections within vertebrate hosts. Most of the theory can be classified as either "dynamical models" that predict the population dynamic interaction between virus and host cells or "population genetic models" that predict gene sequence evolution of the pathogen. These two bodies of theory can be linked by considering the demography of the viral population. Gene sequence evolution is usually modeled as a mutation-limited process in which the rate of evolution is proportional to the mutation rate per replication cycle and the number of replication cycles (pathogen generations) per unit time. The latter is clearly dependent on dynamical parameters such as the clearance rate of free virus or the death rate of infected cells. Here, I review analytical methods that explicitly link dynamical and population genetic theories. These methods are extended to consider the evolutionary consequences of internal host structure, that is, the tendency for a virus to infect different cell types within multiple compartments (e.g. tissue types). Each sort of structure can substantially affect the rate of viral evolution.
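
The mutation-limited view described above can be written compactly. The symbols are assumptions introduced here for illustration and do not appear in the abstract: $k$ is the rate of sequence evolution, $\mu$ the mutation rate per replication cycle, and $g$ the number of replication cycles (pathogen generations) per unit time.

```latex
k \propto \mu \, g
```

Because $g$ depends on dynamical parameters such as the clearance rate of free virus or the death rate of infected cells, a change in those parameters shifts $k$ even when $\mu$ is held fixed.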

Title: Intrahost sequence diversity dynamics of HIV virus population

Quantifying the dynamics of intrahost HIV sequence evolution is essential to understand the interaction between the virus population and the immune system. Previous studies that have looked at divergence and diversity over time and the relationship between sequence evolution and immune system function have come to different conclusions. In this study we have unified all those results into a comprehensive model. Reanalyzing env sequence data, we developed a sequence evolution model where each virus sequence variant is represented by the distance from the founder strain. Virus sequence evolution is dictated by the probability that a successful mutation becomes fixed and the fitness of the new mutant, i.e. the total number of replicated viruses per virus per unit time. The model suggests that the saturation of divergence and decrease of diversity in the later stage of an infection are attributable to a decrease in the probability that successful mutations become fixed as a function of the distance from the founder strain. This prediction is confirmed by estimating the evolutionary rate from a maximum likelihood (ML) tree. Importantly, the evolutionary rate was stable or increased when the level of CD4+ T-cells was fairly stable, while it decreased when the CD4+ T-cell counts went down. Thus, the decline of the evolutionary rate correlates with that of CD4+ counts, and appears to be connected to the fading selection pressure of the host immune functions during disease progression. This implies a dynamic, rather than static, interaction between HIV-1 sequence evolution and host immune functions.

Title: New tools for molecular epidemiology and HIV classification at the HIV database

I will present some new tools, and developments of existing tools, mainly for classification of HIV sequences at the HIV database. Accurate, or at least systematic, classification of HIV (and any other pathogen) sequences is important because it is used for tracking the pandemic, designing potential vaccines, and understanding the evolution of the pathogen. Equally important is quality control of deposited sequences, and the ability to remove problematic sequences which otherwise might muddle or destroy analyses. I will show some of these tools and their implications for analyses of HIV sequences.

Title: Evolution of Foot-and-Mouth Disease Virus: An Analysis of Recombination and Selection at the Amino Acid Level

Foot-and-mouth disease virus (FMDV) is a debilitating, highly contagious pathogen of economically important livestock. FMDV (Picornaviridae: Aphthovirus) is presently enzootic in all continents except Australia and North America and includes seven distinct serotypes: the Euroasiatic serotypes Asia1, A, C, and O and the South African Territories (SAT) serotypes SAT1, SAT2, and SAT3. The effect of natural selection at the amino acid level for three serotypes (A, C, and O) was assessed using two different methods. First, the fixation rates of nonsynonymous (dN) amino acid replacements were compared to those of synonymous (dS) silent substitutions (ω = dN/dS) across lineages on a site-by-site basis using the CODEML program in the PAML package (Anisimova et al., 2001; Yang, 1995; Yang et al., 1995; Yang et al., 2000). Omega measures the selective pressure at the amino acid level, with ω = 1 signifying neutral evolution, ω < 1 purifying selection, and ω > 1 positive selection. Second, selection was evaluated by analyzing 31 quantitative structural and biochemical properties using the model of McClellan and McCracken (2001) as implemented in TreeSAAP ver3.2 (Woolley et al., 2003). TreeSAAP infers nonsynonymous changes along a phylogeny using maximum likelihood ancestral state reconstruction. Nonsynonymous changes are then examined for intensity of change relative to 31 amino acid properties. Finally, the program maps positively selected changes back to the amino acid position.
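
The interpretation of ω can be illustrated with a small sketch. This is not how CODEML works: CODEML estimates ω by maximum likelihood under codon models and tests for selection statistically. The function below simply labels a ratio computed from given dN and dS values; the function name and the tolerance `tol` around 1 are hypothetical choices made for illustration.

```python
def classify_omega(dn, ds, tol=0.05):
    """Return (omega, label) for given dN and dS estimates.

    `tol` is an arbitrary illustrative band around omega = 1; a real
    analysis would use a statistical test rather than a fixed cutoff.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio dN/dS")
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        label = "neutral"
    elif omega < 1.0:
        label = "purifying"
    else:
        label = "positive"
    return omega, label

if __name__ == "__main__":
    for dn, ds in [(0.2, 0.4), (0.3, 0.3), (0.9, 0.3)]:
        omega, label = classify_omega(dn, ds)
        print(f"dN={dn} dS={ds} omega={omega:.2f} -> {label}")
```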

Title: Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation

The nucleotide substitution process often differs among sites in both coding and non-coding regions, possibly due to the influence of selection. In particular, some sites may evolve at much faster, or slower, rates (e.g., first and second versus third positions in codons). We propose a Bayesian MCMC method that allows very complex substitution models to be implemented, including models in which the substitution rate varies continuously across sites. The novel aspect of our technique is the use of uniformization of the Markov substitution process to integrate over substitution events along the branches of a phylogeny without specifying the particular transitions that occurred on a branch. An advantage of this formulation of the transition probabilities is that it allows efficient augmentation of the data in an MCMC analysis by treating the substitution events as random variables in the chain, eliminating the need to numerically calculate the transition probabilities in complex substitution models by matrix exponentiation. The method is evaluated by using simulated data to examine the accuracy of inferred site-specific rates and branch lengths under a simple substitution model with gamma-distributed rate variation. The performance is compared to that of existing methods implemented in the program PAML. The method can be readily implemented for use with much more complex substitution models and has the potential to greatly simplify estimation of site-specific rates under such models.
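
The identity behind uniformization can be sketched as follows. This is only an illustration of the general technique, not the authors' sampler: the number of candidate events on a branch is treated as Poisson with rate μt, where μ is at least the largest exit rate, and each event applies the stochastic matrix R = I + Q/μ. Summing over the number of events gives the transition probabilities without a matrix exponential. The Jukes-Cantor rate matrix and all function names are assumptions chosen so the result can be checked against a known closed form.

```python
import math

def jukes_cantor_q(alpha):
    """4x4 rate matrix with off-diagonal rate alpha and rows summing to zero."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probs_uniformized(q, t, tol=1e-12, max_terms=1000):
    """P(t) = sum_k Poisson(k; mu*t) * R^k with R = I + Q/mu."""
    n = len(q)
    mu = max(-q[i][i] for i in range(n))  # uniformization rate
    if mu <= 0.0:
        raise ValueError("rate matrix has no substitutions")
    r = [[(1.0 if i == j else 0.0) + q[i][j] / mu for j in range(n)]
         for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # R^0
    weight = math.exp(-mu * t)            # Poisson probability of 0 events
    p = [[weight * power[i][j] for j in range(n)] for i in range(n)]
    covered = weight                      # Poisson mass accumulated so far
    for k in range(1, max_terms + 1):
        if 1.0 - covered <= tol:
            break
        power = mat_mul(power, r)
        weight *= mu * t / k              # Poisson probability of k events
        covered += weight
        for i in range(n):
            for j in range(n):
                p[i][j] += weight * power[i][j]
    return p

def jukes_cantor_closed_form(alpha, t):
    e = math.exp(-4.0 * alpha * t)
    return [[0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e for j in range(4)]
            for i in range(4)]

if __name__ == "__main__":
    approx = transition_probs_uniformized(jukes_cantor_q(0.3), 0.5)
    exact = jukes_cantor_closed_form(0.3, 0.5)
    print(max(abs(approx[i][j] - exact[i][j]) for i in range(4) for j in range(4)))
```

In a data-augmentation scheme of the kind the abstract describes, the number of events and the states at those events would be sampled as latent variables rather than summed over; the sum is shown here only because it is easy to verify.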

Title: Characterizing Changes in the Rate of Substitution in Influenza A and Dengue

Rapidly evolving pathogens such as RNA viruses, together with high-throughput sequencing technology, present the opportunity to examine a specific evolutionary process in situ. This promises to allow refinement of models of sequence evolution as well as rigorous, quantitative testing of hypotheses. Here, we present TREBLE 2.0, a method for inferring the rate of nucleotide substitution along each branch under a 'broken' molecular clock. We then apply the method to two viruses: influenza A and Dengue. The results reveal two drastically different patterns of evolution, characterized by the rate of nucleotide substitution. Connections with the transmission dynamics of these viruses are explored.

Title: Inferring Speciation Times Under an Episodic Molecular Clock

A recently developed Markov chain Monte Carlo algorithm for Bayesian estimation of species divergence times is extended to allow variable evolutionary rates among lineages. The method can use heterogeneous data from multiple gene loci and accommodate multiple fossil calibrations, with flexible statistical distributions used to describe fossil uncertainties. The prior for divergence times without fossil calibrations is specified by use of a birth-death process with species sampling. The prior for lineage-specific substitution rates is specified using either a model with autocorrelated rates among adjacent lineages (based on a geometric Brownian motion model of rate drift), or a model with independent rates among lineages specified by a common log-normal probability distribution. We develop an infinite-sites theory for predicting the asymptotic uncertainty of divergence time estimates (i.e., for sequences of infinite length). Simulations are used to study the influence of among-lineage rate variation, and the number of loci sampled, on the uncertainty of divergence time estimates. We apply our new algorithms to empirical data sets that are known to contain among-lineage rate variation and compare the results with those obtained in previous Bayesian and likelihood analyses.
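
The two rate priors named above can be contrasted with a short simulation sketch. The parametrization is an assumption made for illustration and may differ from the one used in the talk: independent rates are drawn from a log-normal with a chosen mean, and autocorrelated rates follow a geometric Brownian motion in which the log rate of a branch is normal around the log rate of its parent with variance `nu * t`. The tree encoding (a list of parent indices in which parents precede children) and all names are hypothetical.

```python
import math
import random

def independent_rates(n_branches, mean_rate, sigma, rng):
    """Independent log-normal rates with expected value mean_rate."""
    mu = math.log(mean_rate) - 0.5 * sigma ** 2  # mean correction on the log scale
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]

def autocorrelated_rates(parents, lengths, root_rate, nu, rng):
    """Rates drifting along the tree by geometric Brownian motion.

    parents[i] is the index of branch i's parent branch, or None if the
    branch descends directly from the root; parents must precede children.
    The -0.5 * var term keeps the expected child rate equal to the parent rate.
    """
    rates = []
    for parent, t in zip(parents, lengths):
        ancestral = root_rate if parent is None else rates[parent]
        var = nu * t
        log_rate = rng.gauss(math.log(ancestral) - 0.5 * var, math.sqrt(var))
        rates.append(math.exp(log_rate))
    return rates

if __name__ == "__main__":
    rng = random.Random(42)
    parents = [None, None, 0, 0]       # two root branches; branch 0 has two children
    lengths = [0.10, 0.20, 0.05, 0.05]
    print(independent_rates(4, 1.0, 0.5, rng))
    print(autocorrelated_rates(parents, lengths, 1.0, 2.0, rng))
```

Under the autocorrelated model, closely related branches tend to have similar rates, whereas under the independent model a branch's rate carries no information about its neighbours.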

Title: Inferring changes in evolutionary processes with measurably evolving populations

In this talk, I discuss new methods that explore models which permit changes in evolutionary parameters. These models are applied to samples from Measurably Evolving Populations, and allow for (1) changing rates/models of evolution, either as functions of external covariates or as discrete events, and (2) changing demographic models. I also discuss where this research is taking us.

Title: Novel uses of discrete mathematics in molecular phylogenetics

Concepts from discrete mathematics often provide useful tools for modeling and analyzing confounding evolutionary phenomena, such as recombination, lineage sorting, horizontal gene transfer, and the resolution of deep, short divergences in a tree. In this talk we describe some of the ways in which techniques from combinatorics can provide useful tests and algorithms for studying such phenomena. Discrete mathematics is also relevant to the development of approaches for tree reconstruction based on new types of genome data.

Title: Phylogenetic Mapping of Recombination Hot-Spots via a Spatially Smoothed Change-Point Process

We present a Bayesian framework for inferring spatial preferences of recombination from multiple putative recombinant nucleotide sequences. Detection of homologous recombination with phylogenetic models of evolution has been an active area of research for the last 15 years. However, only recently have attempts been made to summarize information from several instances of recombination. We propose a Bayesian hierarchical model that allows for simultaneous inference of recombination break-point locations and spatial variation in recombination frequency. The dual multiple change-point model for phylogenetic recombination detection resides at the lowest level of our hierarchy under the umbrella of a common prior on break-point locations. The hierarchical prior allows for information about spatial preferences of recombination to be shared among individual datasets. To overcome the sparseness of break-point data, dictated by the modest number of available recombinant sequences, we a priori impose a biologically relevant correlation structure on recombination location log-odds via a Gaussian Markov random field hyper-prior. To examine the capabilities of our model to recover individual break-points and spatial variation in recombination frequency, we simulate recombination from a predefined distribution of break-point locations. We then proceed with the analysis of 42 HIV gag recombinants and identify a recombination hot-spot in the Capsid gene. RNA stem-loop elements, located in this region, support the hypothesized involvement of local secondary structure in promoting recombination. (joint work with Vladimir N. Minin)

Title: Evolutionary dependence among sequence sites in viruses

Probabilistic models of nucleotide evolution typically ignore the phenotypic consequences of sequence changes. We discuss our attempts to incorporate some aspects of phenotype into models of sequence evolution. The models that we are investigating were designed for interspecific analyses. However, crude population genetic interpretations can be assigned to these models. We are trying to apply our approaches to viral evolution, but we still have substantial obstacles to overcome.

Document last modified on June 20, 2006.