DIMACS Tutorial and Workshop on Bioconsensus II
October 2, 2001 (Tutorial)
October 3 - 5, 2001 (Workshop)
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ
- Tutorial Organizers:
- Fred McMorris, Illinois Institute of Technology, mcmorris@iit.edu
- William H.E. Day, whday@istar.ca
- Workshop Organizers:
- Mel Janowitz, Rutgers University, melj@dimacs.rutgers.edu
- Francois Lapointe, Universite de Montreal, lapoinf@ere.umontreal.ca
- Fred McMorris, Illinois Institute of Technology, mcmorris@iit.edu
- Boris Mirkin, University of London, mirkin@dcs.bbk.ac.uk
- Fred Roberts, Rutgers University, froberts@dimacs.rutgers.edu
Presented under the auspices of the Special Year on Computational Molecular Biology.
Abstracts:
1.
Title: Exact and polynomial time algorithms for supertrees
David Bryant. Dept Mathematics AND Dept Computer Science,
McGill University.
Abstract: A supertree is just a consensus tree for trees with unequal
taxon sets. The problem is that many of the standard axioms and methods
designed for consensus trees do not extend directly to supertrees, and
when they do we end up with NP-hard consensus methods. In this talk I show
that when we bound the number of input trees, we can construct optimal
consensus supertrees in polynomial time. Without the bound the problems become
NP-hard. The key to the algorithms is a combinatorial connection between
bipartitions of the input trees and bipartitions of candidate supertrees.
2.
The W-curve: Trying to reach genomic bioconsensus in
numerically mapped chaos space
Doug J. Cork, BCPS Dept., Illinois Institute of Technology, Chicago
cork@iit.edu
Abstract
The standard W-curve was developed by Cork and coworkers in 1993 to
represent a given DNA sequence, which is a word in the alphabet {A, C, G,
T}. Each base is assigned a point in the plane as follows: A = dA = (-1,
-1); C = dC + (-1,1); G = dG = (1, -1); T = dT = (1,1). Moving
from the 5' end to the 3' end of a strand, we then have that a DNA sequence
is simply a sequence
x = x1,x2...xn,
where xi l {dA,dC,dG,dT} for each i = 1,2,...,n. Now a family of iterated
function systems F can be defined by
Xi =F(Xi-1,xi) = k-1Xi-1+ k-1xi
where xi is as defined above, X0 = (0,0), and k is a positive real
number. The standard W-curve is obtained by plotting the points (Xi,i)
in 3-space and connecting each point with the next by a straight-line segment.
The result is a representation that allows the visualization of the sequence
in order to identify both global and local patterns within the sequence.
An example of this can be found at www.iit.edu/~houstep/project.html.
This algorithm is especially useful for finding embedded patterns in long
genomic sequences, and shows significant potential as a tool for clustering
these patterns.
Comparative genomic analysis at its most fundamental level involves
alignment of strings of DNA. Many useful and powerful tools such as
Blast and Clustal W are able to respectively, search for, and align
similar strings of DNA from a variety of species. However, the interesting
gene motif patterns cannot be properly visualized within the information
contact embedded in simple strings. More problematic is the question of
whether we will be able to crystallize long genomic sequences and analyze
their true secondary nd tertiary structures. It is, of course, these putative
motifs that are binding to the 3D structures of proteins and inducing
replication and transcription events. The W-curve is a numerical mapping
algorithm that allows one to geometrically visualize the information content
of genomic motifs. Patterns of Alu, Lines, SINEs, and duplication sequences
may be easily visualized with the W-curve. It is our hope that this pattern
recognition algorithm will lead to visualization tools to track the
evolutionary history of motif patterns. The combinatorics of DNA motifs
through crossover-recombination events will be more easily followed as we
continue to sequence more and more genomes. In my lab we are currently
collaborating with mathematicians and computer scientists to develop and test
tools for analyzing genomic sequences [1,2]. Pattern recognition algorithms
such as the W-curve, should allow the investigator to visualize long genomic
sequences in geometrical space. The primary level of information content in
DNA string alignment programs loses important tertiary levels of DNA
information content that may be more consistent with true DNA-protein motif
structures. The biologist may track the evolutionary history of long genomic
sequences as well as embedded symmetrical patterns within these long repeats.
It is suggested that these patterns could have significant function at the
level of cross over recombination events occurring during proposed lateral
transfer of genes between species through evolution. In order to confirm
such a hypothesis, the following is a short list of ongoing tasks in our lab:
1. Correlate the structure of the information content of the W-curve
projections of many similar genomic sequences with their actual function.
Consensus methods will be applied to this problem.
2. Establish axiomatic definitions of appropriate molecular metrics (or
dissimilarity coefficients) between W-curves of similar genes sequenced
and visualized from different biological species.
3. Use the measures developed to generate phylogenetic trees of genomic
sequences.
4. Compare the topology of W-curve based trees with conventional distance
and character based phylogenetic trees.
5. To compare the topologies of gene trees with species trees generated
by W-curve tree building techniques vs. those generated with conventional
phylogenetic approaches.
6. Establish a mathematical function, which will transform the information
content of the genomic W-curve into a true tertiary motif class of double
helix.
References
[1] D. Wu and D. J.Cork et al., "Computer Visualization of Long Genomic
Sequences", Proceedings of the Conference on Visualization 1993, San Jose,
California, October- 25-29, 1993, IEEE Press, pp. 308-315.
[2]H. J. Jeffrey, "Chaos game visualization of sequences", Computer
and Graphics, 16, no. 1, pp. 25-33,1992.
[3] www.iit.edu/~houstep/project.html
3.
CONFIDENCE INTERVALS ON CONSENSUS TREES
Guy Cucumel1 and Francois-Joseph Lapointe2
1 Ecole des sciences de la gestion, Universite du Quebec a Montreal
C.P. 8888, Succ. Centre-ville, Montr=E9al, Qc, H3C 3P8
Cucumel.Guy@UQAM.CA
2 Departement de sciences biologiques, Universite de Montreal
C.P. 6128, Succ. Centre-ville, Montr=E9al, Qc, H3C 3J7
Francois-Joseph.Lapointe@Umontreal.CA
A consensus function takes as input a profile of trees representing the
relationships among overlapping or identical sets of objects and returns
a tree that is in some sense closer to the entire profile. A single tree
is obtained by most consensus functions, altough this solution may not
always be representative of the input profile of trees. If a unique
consensus tree is not suitable, is it possible to obtain two trees
that represent in some sense the lower and a upper bounds of the
consensus? In other words is it possible to construct a confidence
interval on consensus trees?
In the present paper, we will propose a method based on a bootstrap
approach. New profiles of trees will be generated and two trees will
be derived from them to define an interval on consensus trees.
4.
Phylogenetic Analysis as Meta-Analysis
Alan de Queiroz, University of Colorado at Boulder
A meta-analysis is a quantitative analysis of a collection of results
from previous studies. In the past three decades the use of
meta-analyses has become common, especially in the social sciences.
Hillis (1995, Syst. Biol. 44: 3-16) noted that quantitative studies
of congruence among phylogenetic trees resulting from analyses of
different data sets are meta-analyses. I expand on this notion by
placing various kinds of phylogenetic methods within the framework of
meta-analysis. This framework emphasizes three major categories of
methods: "total evidence," which involves combining the raw data from
different data sets and thus is not a meta-analysis; average
consensus, weighted matrix representation and other methods of
combining trees that take into account some measure of node support
and are thus "effect-size meta-analyses"; and more typical consensus
methods and unweighted matrix representation, which are
"vote-counting meta-analyses." Approaches that combine the raw data
use more information than "effect-size meta-analyses" which in turn
use more information than "vote-counting meta-analyses." Thus, in
phylogenetics as in other fields, the use of effect-size and
vote-counting meta-analyses require some additional justification.
Regardless of whether one uses "total evidence" or some form of
meta-analysis, the analysis of multiple phylogenetic data sets
presents a major problem not encountered in most social science
studies. In social science studies, there is typically no single
correct answer to the question of interest; rather, the effect of the
main independent variable is contingent on other factors, and a major
goal is to estimate how these other factors affect the results. For
example, in an analysis of the effects of psychotherapy, one might
want to know how the training of the psychotherapist influences the
results. In contrast, the goal of most phylogenetic analyses is to
obtain the best estimate of a single "truth", the true tree. As in
the social sciences, different phylogenetic studies may represent
somewhat different populations, but in phylogenetics, the primary
goal is not to explore these differences, but to somehow correct for
them to infer the true tree. Some other analyses of multiple studies
have a similar structure. For example, meta-analyses to estimate the
mass of a kind of particle in physics attempt to correct for certain
kinds of measurement error in some of the individual studies with
the goal of estimating the single true mass of the particle. In
phylogenetics, the necessary corrections are much more complicated
than in this example from physics, and there is great disagreement on
how to proceed. In particular, it is often unclear what kinds of
misleading properties the data exhibit. In this context, the method
of strict consensus is appealing in part because it makes few
assumptions about the nature of these potentially misleading
properties. However, as many have pointed out, strict and other
traditional consensus methods tend to be overly conservative (because
they are vote-counting procedures).
The meta-analyst Rubin has suggested that the future of meta-analysis
lies in trying to extrapolate results to an ideal data set, e.g., one
with infinite sample size. Studies of the stability of phylogenetic
results with the addition of new data might be interpreted as
attempts at such an extrapolation (although some of the original
investigators might blanch at such an inductive exercise).
5.
Oliver Eulenstein, Iowa State
Flipping to Achieve Consensus
There are now several thousand evolutionary trees each of which
constitutes a small subtree of the tree of life. We present an
approach that allows us to integrate subtrees into larger trees in
an effort to assemble the supertree of life.
A set of trees that intersect pairwise in their leaf sets is termed a
"grove". ntersections are used to assemble supertrees from the trees in
a grove.
Topological inconsistency due to errors in phylogenetic estimation and
evolutionary idiosyncracies complicate the task of estimating
supertrees. o tackle this problem we solve the following optimization
problem: iven grove and a distance measure D among trees; find a tree
-- a upertree -- hat has the smallest distance D to the trees in the grove.
We introduce ew measure for D, called flip-distance and show its
performance using imulations. Complexity results for the problem are
presented. We show hat the flip-distance preserves semistrict consensus
of the trees in a rove -- a fundamental requirement for supertrees.
6.
Alternatives to Consensus in Phylogenetic Inference
Joe Felsenstein, Department of Genetics
University of Washington, Seattle
Abstract:
When consensus methods are used to infer an overall phylogeny, they suffer
from arbitrariness, and it becomes impossible to choose among many methods.
What is needed is an overall criterion that can guide the choice. By
contrast, Total Evidence approaches have used such a criterion, overall
parsimony, but they suffer from inability to respond to different processes
that may be going on in different parts of the data set, such as different
loci. Using a statistical model of evolution that allows for heterogeneity
of processes in different loci, we can discover how to make a an overall
maximum likelihood estimate that responds appropriately to the heterogeneity.
In the limit when rates of change are small, it leads to parsimony
weighting schemes that improve on overall parsimony. Further into the
limit one reaches overall parsimony. To plant our feet firmly on a statistical
foundation, we may need to abjure mathematical enjoyment and most of the
techniques of making consensus trees.
7.
Junhyong Kim, Yale University
Super-trees and consensus trees from distance embedding
Evolutionary tree graphs, called phylogenies, are commonly used to delineate
genealogical relationships between biological objects. Often two or more
tree graphs are of interest, say as a confidence set, and it can be
desirable to obtain a consensus representation, again in terms of a tree
graph. Similarly, when assembling very large graphs only a collection of
strictly smaller subgraphs may be available in which case it is desirable to
assemble the subgraphs into a larger "super-tree." Consensus and super-
trees can be constructed with combinatorial edit operations on the graphs
themselves. Alternatively, an invertable function can be constructed that
maps the tree graphs to a space with a well defined algebraic structure. For
example, one common map involves so-called parsimony encoding which maps
tree graphs to the vector space of binary state frequencies. In this paper,
I present a different map to the space of additive distance square matrices,
discuss its properties, and other possible extensions to the problem of
consensus and super tree construction.
8.
HOW MANY CONSENSUS TREES?
Francois-Joseph Lapointe1 and Guy Cucumel2
1 Departement de sciences biologiques, Universite de Montreal
C.P. 6128, Succ. Centre-ville, Montr=E9al, Qc, H3C 3J7
Francois-Joseph.Lapointe@Umontreal.CA
2 Ecole des sciences de la gestion, Universite du Quebec a Montreal
C.P. 8888, Succ. Centre-ville, Montr=E9al, Qc, H3C 3P8
Cucumel.Guy@UQAM.CA
A consensus function takes as input a profile of trees representing the
relationships among overlapping or identical sets of objects and returns
a tree that is in some sense closer to the entire profile. A single tree
is obtained by most consensus functions, although this solution may not
always be representative of the input profile of trees. If a unique
consensus tree is not suitable, how many consensus trees are needed?
In the present paper, we will evaluate different criteria for selecting
the number of consensus trees required to accurately represent a profile
of input trees. These criteria will be used to modify consensus functions
so as to return multiple trees that are in some sense closer to the entire
profile. Statistical tests will also be proposed to determine the relative
significance of unique and multiple consensus trees. Finally, an application
to phylogenetic trees using the average consensus will be presented to
illustrate the procedure.
9.
TOTAL EVIDENCE, CONSENSUS, GLOBAL CONGRUENCE AND SIMULATIONS
Claudine Levasseur 1,2 and Francois-Joseph Lapointe1,3
1 Departement de sciences biologiques, Universite de Montreal
C.P. 6128, Succ. Centre-ville, Montr=E9al, Qc, H3C 3J7
2 Claudine.Levasseur@Umontreal.CA
3 Francois-Joseph.Lapointe@Umontreal.CA
The analysis of multiple data sets in phylogenetic is leaded by two major
approaches: combining the data (total evidence) or combining the trees
(consensus methods). Both have been widely praised and criticized over
the years for their respective strengths and weaknesses. We know that
they differ greatly on their premises and philosophical backgrounds but
what about the phylogenetic accuracy? Are the results from those approaches
as different as some claim?
In this paper, simulated data will help us to assess whether the use of
a consensus method that accounts for branch lengths (e.g., average=
consensus) combined with validation of the total evidence tree can lead
to congruent results. Furthermore, we hypothesize that phylogenetic accuracy
can be increased when those approaches are used jointly in a so-called global
congruence approach.
10.
Boris Mirkin, University of London
Clusters of Orthologous Proteins as Characters in Tree Building
A cluster of orthologous groups of proteins (COG) is a set of closely
related proteins from different organisms that bears a phylogenetic
signal about the set of organisms under consideration. In the COG
project at NCBI (http://www.ncbi.nlm.nih.gov/COG/), there have been
currently distinguished 3166 COGs in a set of fully sequenced genomes
represented by 26 clades. The issue addressed in this talk is whether
the data on presence/absence of genomes in COGs (COG phylogenetic
profiles) carry any phylogenetic signal at all. The following example
may help to explain the problem. Among species x, i, and j, species x
and j are believed to be closer relatives to each other than to i. The
COG contents do not confirm this: there are 336 different COGs
containing x and i, and only 176 containing x and j. Moreover, 163
(93%) of these 176 also contain i. This makes it extremely difficult
for any method based on the phylogenetic profiles to correctly
identify the relationships among these three species (Mirkin and
Koonin, 2000).
Two ways for overcoming the issue have been explored. The first involves
removing those COGs that are not consistent with the true species tree
and thus can be considered as bearing a signal of lateral transfers
rather than inheritance. The second involves a nontraditional concept of
character compatibility and employs removal of COGs involved in a
divergence event from other divergence events. The results found with
these approaches indeed better correspond to the current model of
evolutionary history of the genomes, especially in deep branching.
11.
Robert Powers, University of Louisville
Consensus $n$-trees, weak independence, and veto power
Let $f$ be a consensus function on a set of $n$-trees. Then $f$
satisfies weak independence if whenever two profiles "agree" on a 3
element subset $X$, then the corresponding consensus outputs are
compatible on $X$. If $f$ satisfies weak independence and ternary
Pareto, then $f$ is a weak dictatorship. This fact is an extension of a
similar result given in the theory of social choice.
12.
Thousands of candidates and a handful of voters -
An application of voting theory from pharmaceutical development
Tommy Ratliff, Wheaton College
The goal of lead optimization in pharmaceutical development is to
reduce a library of hundreds of thousands of compounds to a subset of likely
candidates before any laboratory testing is done. There are five main
criteria used, which are assigned different weights depending on the purpose
of the pharmaceutical. These criteria can naturally be interpreted as a
small number of voters in an election with hundreds of thousands of
candidates. This motivates our discussion of whether or not more consistency
is forced on the outcomes when there are many more candidates than voters.
13.
Consensus List Colorings of Graphs and Physical Mapping of DNA
Fred S. Roberts
Rutgers University
Piscataway, NJ, USA
In graph coloring, one assigns a color to each vertex of a graph
so that neighboring vertices get different colors. We shall talk about a
consensus problem relating to graph coloring and discuss the
applicability of the ideas to the DNA physical mapping problem. In
many applications of graph coloring, one gathers data about the
acceptable colors at each vertex. A list coloring is a graph
coloring so that the color assigned to each vertex belongs to the list
of acceptable colors associated with that vertex. We consider the
situation where a list coloring cannot be found. If the data
contained in the lists associated with each vertex are made available
to individuals associated with the vertices, it is possible that the
individuals can modify their lists through trades or exchanges until
the group of individuals reaches a set of lists for which a list
coloring exists. We describe several models under which such a
consensus set of lists might be attained. In the physical mapping
application, the lists consist of the sets of possible copies of a
target DNA molecule from which a given clone was obtained.
14.
Mark Wilkinson, The Natural History Museum of London and
Joe Thorley, School of Biological Sciences, University of Bristol and
Department of Zoology, The Natural History Musuem
Reduced Consensus and Supertree Methods
Abstract:
Reduced cladistic consensus (RCC) methods were developed in the field of
phylogenetics to avoid the insensitivity of the strict consensus and the
ambiguity of the Adams consensus. These problems and how they are
addressed by RCC methods will be reviewed. Subsequent work has shown there
to be a family of RCC methods (strict, semi-strict, majority-rule) with
potential in different domains of phylogenetics and perhaps some broader
implications for consensus theory. The basic ideas behind RCC methods may
also be of some use in the context of supertree construction. We describe
our disatisfaction with the matrix representation with parsimony methods of
supertree construction, our motivation (and limited success) in developing
reduced supertree methods and some additional thoughts on alternative
approaches to this important problem.
15.
Li-San Wang, University of Texas at Austin
Clustering and Consensus in Phylogenetic Analysis
Conventional ways of processing multiple trees returned by phylogenetic
inferences involve consensus tree methods over the whole set of trees.
The single-consensus approach drops too much information, and is usually
susceptible to small outliers (e.g. strict consensus). This motivates us
to investigate clustering and visualization methods to fully utilize the
whole dataset. However, the space of phylogenetic trees, a high
dimensional space with very discrete coordinates and extra tree
constraints, is interesting to investigate in its own right.
There has been little study in this direction except the phylogenetic
island method. In this talk we present our framework for the study of
clustering of tree spaces and devise parameters that represent certain
properties of a set of trees. We then propose the class of "bi-criterion"
optimization problems, and show certain subclasses of this family can be
solved optimally. We compare conventional clustering methods, our new
method, and the phylogenetic island method by applying them
to biological datasets; we then represent the results visually.
16.
Consensus, combination, and supertrees: integrating diverse data to
reconstruct evolutionary trees
John J. Wiens
Section of Amphibians and Reptiles, Carnegie Museum of Natural History,
Pittsburgh, PA 15213-4080; e-mail: wiensj@carnegiemuseums.org
A long-standing but ongoing debate in phylogenetic analysis
concerns how diverse datasets, such as different genes or different types
of data (molecules, morphology), should be integrated to estimate the
phylogeny of a group of organisms. A key point in this debate is whether
datasets that give conflicting estimates of phylogeny should be combined
only using trees from the separate datasets (using consensus methods) or
whether the character data themselves should be combined. I show that
given a large phylogeny (i.e., with many species), both the best and worst
conditions for both combined and consensus analysis may occur in different
parts of the same tree. Thus, universal application of a single approach
to an entire tree may guarantee the failure of that method in at least some
parts of the tree. Methods for integrating diverse datasets for large
phylogenies are discussed.
The supertree method is a consensus approach that is becoming
increasingly popular for integrating large datasets that do not include
identical sets of taxa. I use simulations to show that one of the major
motivations of this approach, the desire to avoid missing data cells, may
be somewhat misguided. Highly incomplete taxa can be included and
accurately placed in combined analyses, and missing data cells are not by
themselves misleading.
Previous: Participation
Next: Registration
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on September 28, 2001.