DIMACS Workshop on Bioconsensus

October 25 - 26, 2000
DIMACS Center, CoRE Building, Busch Campus, Rutgers University, Piscataway, NJ

Mel Janowitz, Rutgers University, melj@dimacs.rutgers.edu
Francois-Joseph LaPointe, University of Montreal, lapoinf@ere.umontreal.ca
Fred McMorris, Illinois Institute of Technology, mcmorris@iit.edu
Boris Mirkin, University of London, mirkin@dcs.bbk.ac.uk
Fred Roberts, Rutgers University, froberts@dimacs.rutgers.edu
Presented under the auspices of the Special Focus on Computational Molecular Biology.
ABSTRACTS




Title: Combining trees to combining data.

Bernard R. Baum
Agriculture and Agri-Food Canada, Research Branch
Eastern Cereal and Oilseed Research Center
Ottawa, Ontario, Canada 
 
Data from different sources for the same organisms are 
increasingly obtained for phylogenetic studies. Phylogenies are 
often estimated from each data set separately and then combined after the 
degree of congruence between the cladograms has been assessed. Combinations are 
made by two different approaches: (1) total evidence and (2) consensus. A 
third way, originally presented in 1990 and published in 1992, is here 
revisited and justified. The third way consists of first estimating 
phylogenetic relationships for each data set separately, and treating each 
cladogram as a character tree which is translated into binary coded 
factors. This is followed by adjoining the binary coded matrices and 
subjecting them to a cladistic analysis. Arguments against the use of total 
evidence and consensus approaches are discussed.
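
The binary coding step described above can be illustrated with a minimal sketch. It assumes that each cladogram is supplied as a list of its non-trivial clades over the same set of taxa; the taxa, the trees, and the function names are hypothetical, and the final cladistic analysis of the adjoined matrix is not shown.

```python
def clades_to_matrix(taxa, clades):
    """One binary column per clade: 1 if the taxon belongs to the clade, else 0."""
    return {t: [1 if t in c else 0 for c in clades] for t in taxa}


def adjoin(matrices, taxa):
    """Concatenate the per-tree binary matrices row by row."""
    return {t: [bit for m in matrices for bit in m[t]] for t in taxa}


taxa = ["A", "B", "C", "D"]
# Hypothetical cladograms, one per data set, given as non-trivial clades.
tree1 = [{"A", "B"}, {"A", "B", "C"}]
tree2 = [{"A", "B"}, {"C", "D"}]

combined = adjoin([clades_to_matrix(taxa, t) for t in (tree1, tree2)], taxa)
for taxon in taxa:
    print(taxon, combined[taxon])
```

Each row of the combined matrix is one taxon and each column is one clade from one of the source trees, so the matrix can be passed to any program that accepts binary characters.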



Title: A New Tool To Help Generate Genomic Consensus--The W-Curve

Douglas J. Cork
Department of Biological, Chemical and Physical Sciences
Illinois Institute of Technology
 
Finding a consensus DNA string from a set of multiple aligned DNA strings
may be accomplished by using the rule of plurality at each base position.
The result may be ambiguous or not concise, and information may be lost. Molecular 
biologists often compensate for this by looking at each sequence, base by 
base, and correcting for misinformation which may appear when all sequences 
are aligned and consensus is sought. When the biologist is asked how this was 
done, the reply is often based on bench experience with the gene or encoded 
protein.
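
As a minimal sketch of the plurality rule mentioned above, the following assumes equal-length, already aligned strings. The tie marker "N" and the example sequences are assumptions made for illustration; tied columns show where the rule becomes ambiguous.

```python
from collections import Counter


def plurality_consensus(aligned, tie_symbol="N"):
    """Most frequent symbol per column; tied columns get tie_symbol."""
    consensus = []
    for column in zip(*aligned):
        counts = Counter(column).most_common()
        top, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        consensus.append(tie_symbol if tied else top)
    return "".join(consensus)


# Hypothetical aligned DNA strings.
seqs = ["ACGTA", "ACGTT", "ATGCA", "ACCCT"]
print(plurality_consensus(seqs))
```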
 
Ideally, the molecular biologist would wish to have many 3-D protein
structures of the expressed gene, where site-specific mutagenesis and
crystallization of each mutant protein can be correlated with the
corresponding mutated and aligned DNA strings.
 
This is not the case with the vast majority of procaryotic and eucaryotic
genes and proteins. Because of this, we suggest utilizing a 3-D numerical
mapping algorithm, called a W-Curve, to generate 3-D informational topology
of the DNA strings. The construction of W-Curves, which are
based on a chaos game representation of the DNA string projected onto each
successive nucleotide position, will be explained. Each locally or globally
similar DNA string generates a W-Curve, and the curves can then be aligned
by difference visualization scanning and gnuplot analysis. A consensus is
ultimately developed. Local subsequences are sometimes eliminated in order
to align more global parts of the DNA strings. For example, base positions
42 through 60 may not have similar W-Curve topologies. However, positions
1 through 41 and 61 through 150 appear to have similar repeating patterns,
symmetries or
asymmetries in their W-Curves.
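
The abstract does not give the W-Curve construction in detail. The following is only a generic chaos game sketch, under the assumption that each base pulls the current point halfway toward a fixed corner of a square and that the nucleotide position serves as the third coordinate. The corner assignment, the function names, and the example strings are hypothetical and may differ from the authors' algorithm.

```python
import math

# Assumed corner assignment; the authors' choice may differ.
CORNERS = {"A": (-1.0, -1.0), "C": (-1.0, 1.0), "G": (1.0, 1.0), "T": (1.0, -1.0)}


def cgr_curve(seq):
    """Return (x, y, position) points of a chaos-game walk along seq."""
    x, y = 0.0, 0.0
    points = []
    for position, base in enumerate(seq, start=1):
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the corner
        points.append((x, y, position))
    return points


def pointwise_difference(curve_a, curve_b):
    """Distance in the (x, y) plane at each shared position."""
    return [math.hypot(ax - bx, ay - by)
            for (ax, ay, _), (bx, by, _) in zip(curve_a, curve_b)]


a = cgr_curve("ACGTACGT")
b = cgr_curve("ACGTTCGT")  # differs from the first string at position 5
diffs = pointwise_difference(a, b)
print(diffs)
```

In this sketch the difference is zero up to the substitution, jumps at the substituted position, and then halves at each later position, which suggests how a local dissimilarity might stand out when two such curves are overlaid.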

W-Curves can be considered as an aid in consensus formation in the following
way: Initial conventional homology search for strings (via BLAST) and
multiple string alignment (via dynamic programming methods such as
CLUSTALW) are conducted with the DNA strings. The same CLUSTALW multiple
string alignments are visualized with W-Curves in place of the strings. If
local or global dissimilarities in W-Curve topologies are found, parts of the
sequences may be truncated until similar W-Curves can be visualized as a
consensus.
 
Finally, the aligned W-Curves are examined with a nearest neighbor distance
matrix, resulting in the formation of a phylogenetic tree. Examples of long
genomic sequences analyzed with this approach will be shown. The algorithm can be
downloaded from the following website: http://www.iit.edu/~cork. Click on
computer visualization of long genomic sequences.



Title:  Phylogenetic consensus as a vote-counting procedure:  Too 
conservative and too simple, but possibly still indispensable

Alan de Queiroz
EPO Biology and University Museum, University of Colorado

A consideration of phylogenetic consensus as a subset of the general 
analytical approach of vote-counting highlights shortcomings of consensus 
methods in the context of analyzing diverse data sets.  It is widely 
recognized that vote-counting methods tend to be conservative, and this is 
an obvious problem with consensus methods.  As with other vote-counting 
methods, the conservatism of consensus arises in part through the 
unavoidable loss of information entailed in summarizing data as "yes" or 
"no" votes (presence/absence of clades).  Less widely recognized is the 
fact that some vote-counting methods can converge on a zero probability of 
rejecting a false null hypothesis as more data sets are added, which is 
clearly an undesirable property.  An analogous situation in phylogenetics 
is the increasing probability of an unresolved tree as more data sets are 
included in a strict or semistrict consensus analysis.  Thus, consensus 
methods, like other vote-counting procedures, can be not only extremely 
conservative but inappropriately conservative.  Finally, vote-counting 
procedures are generally applied only to relatively simple 
problems.  However, with even moderate numbers of taxa, phylogenetic 
estimation is an extremely complex problem, and the relatively simple 
common consensus methods are not designed to deal with this complexity.
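
The loss of resolution under strict consensus can be seen in a small sketch. It assumes that each tree is represented as a set of clades and that the strict consensus keeps only the clades found in every tree; the four trees on taxa A to E are hypothetical.

```python
def clade(*taxa):
    return frozenset(taxa)


def strict_consensus(trees):
    """Clades present in every input tree (each tree is a set of clades)."""
    return set.intersection(*trees)


# Hypothetical trees, one per data set.
trees = [
    {clade("A", "B"), clade("A", "B", "C"), clade("A", "B", "C", "D")},
    {clade("A", "B"), clade("A", "B", "C"), clade("D", "E")},
    {clade("A", "B"), clade("C", "D"), clade("C", "D", "E")},
    {clade("B", "C"), clade("A", "B", "C"), clade("D", "E")},
]

# Number of clades retained as data sets are added one at a time.
retained = [len(strict_consensus(trees[:k])) for k in range(1, len(trees) + 1)]
print(retained)
```

Because the consensus is an intersection, adding a tree can never restore a clade, so the count of retained clades can only stay the same or fall.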

In phylogenetics, consensus as applied to the problem of diverse data sets 
has fallen out of favor, and it seems likely that the excessive 
conservatism and simplicity of consensus are partly responsible for this 
trend.  However, consensus embodies the important property of corroboration 
of hypotheses from independent sources (which may also be a property of 
vote-counting procedures in other contexts). This characteristic of 
consensus may explain why consensus "thinking" remains prevalent in 
phylogenetic studies, even while formal consensus analyses have become 
uncommon.


Title: Toward a Unified Theory of Relational Representations of Closed Set
Systems and Phylogenetic Trees
 
Richard Cramer-Benjamin, Gary D. Crown and Melvin F. Janowitz
 
Previous work has shown how to place classical consensus theory and
consensus theory on hierarchical trees into a common framework. The
approach represents appropriate closed set systems in terms of
suitably defined binary relations. Recent work of Bandelt and Dress,
and more recent work by Dress, Huber and Moulton, shows how to provide
a similar relational representation for phylogenetic trees. Some
background will be presented for both approaches. An attempt will be
made to formulate a model that includes both of them. Such a model
would hopefully allow consensus theory in the social sciences to apply
directly to phylogenetic trees and beyond.

 
Title: Structural Domain Parsing: The Role of Consensus Reasoning

Casimir A. Kulikowski, CS Department, Rutgers University
Ilya Muchnik, CS Department and DIMACS, Rutgers University
HwaSeob J. Yun, CS Department, Rutgers University
Gaetano Montelione, Molecular Biology and
   Biochemistry Department and CABM, Rutgers University
 
There has been much work in the past five years on the development of
automatic methods for defining protein sequence domains, which are
fragments of sequences with a high proportion of conservative
positions, helpful in evolutionary and functional analyses of
proteins.  In contrast, methods for finding hints about possible
structural domains from sequence data have largely consisted of
correlation analyses between known structural domains and domains
constructed from sequence data.  Effective techniques for detecting
signals of structural domains from sequence data would be extremely
valuable for protein analysis, gene finding, and drug design.

We have developed just such a set of methods for detecting structural
domains based on HMMs built from a subset of sequence-continuous Dali
Domain Dictionary (DDD) domains. In the process of testing our predictions
against an independent set of SCOP domains, we have carried out both HMM
and BLAST matches in a preliminary study with good results.
To obtain the greatest reliability in such structural domain parsing,
however, it is necessary to develop machine learning procedures based on
consensus results from different domain knowledge sources (such as DDD and
SCOP) and inference methods (such as HMMs and BLAST search matches).
Two main concepts underpin our consensus reasoning: 1) independent detection
results from the different sources and homology searches on entire protein
sequence data must be carried out at a sufficiently large number of
confidence levels so as to produce a large enough set of
multiple, plausible candidate structural domains that could be present
within the full sequence; and 2) consensus construction needs to be
iteratively refined so as to construct a covering set of probable
candidate domains for the sequence.
 

Title: The Evolution of Consensus in Phylogenetics: Where do we go from here?
 
Francois-Joseph Lapointe (1,3), Claudine Levasseur (1,4) and Guy Cucumel (2,5)
 
  (1) Departement de sciences biologiques, Universite de Montreal
  (2) Ecole des sciences de la gestion, Universite du Quebec a Montreal
  (3) lapoinf@ere.umontreal.ca
  (4) levassec@magellan.umontreal.ca
  (5) cucumel.guy@uqam.ca
 
In the beginning, there were trees. Then trees started to grow and multiply.
They became so large and numerous that it became impossible to deal with a
single tree at a time. Consensus methods were developed to compare those
trees, prune them or regraft one into another. Then consensus methods
started to multiply...

In this paper, we will look at recent developments in the field of consensus
and their applications to phylogenetic studies. Topology-based consensus
techniques will be compared to new methods that take into account branch
lengths. The debate on character congruence versus taxonomic congruence in
phylogenetics will be revisited, using consensus techniques for weighted
trees. The validation of consensus trees will also be discussed, in addition
to generalized consensus methods for reticulograms. Applications of consensus
methods will be presented and recommendations will be provided about the uses
and misuses of bioconsensus techniques.
 

Title: Compatibility analysis as a consensus method

F.R. McMorris
Chair, Department of Applied Mathematics
Illinois Institute of Technology 
 
One of the goals of compatibility analysis is to take a given
collection of taxonomic characters defined on a set of entities and produce
a (large) subset of the characters consisting of characters that are all
mutually 'consistent' on some tree, or other type of discrete structure.
It will be noted that compatibility analysis methods can be viewed formally
as consensus methods. After doing this, and giving some background on
compatibility, the talk will conclude with very recent results on
"consensus with constraints" as it applies to stratigraphic constraints in
compatibility methods.
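
A minimal sketch of the compatibility idea for binary characters follows. It assumes the standard pairwise test, under which two binary characters are compatible unless all four state combinations occur among the entities, and it finds a largest mutually compatible subset by exhaustive search. The character data and function names are hypothetical, the search is exponential and suited only to small examples, and stratigraphic constraints are not modelled.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two binary characters are compatible unless all four state pairs occur."""
    return len(set(zip(c1, c2))) < 4


def largest_compatible_subset(chars):
    """Exhaustive search for a largest set of pairwise compatible characters."""
    names = list(chars)
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            if all(compatible(chars[a], chars[b])
                   for a, b in combinations(subset, 2)):
                return list(subset)
    return []


# Hypothetical characters scored on five entities.
chars = {
    "c1": (1, 1, 0, 0, 0),
    "c2": (1, 1, 1, 0, 0),
    "c3": (0, 1, 1, 0, 0),
    "c4": (0, 0, 0, 1, 1),
}
print(largest_compatible_subset(chars))
```

Here c1 and c3 conflict, so no tree can carry all four characters, and the search returns a three-character subset.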

 
Title: Combining character and distance analysis for building inter-genome trees
       with linear binary hierarchies
 
B. Mirkin, Birkbeck College, London, UK
E. Koonin, NCBI, NIH, Bethesda, USA
 
The theory of linear binary hierarchies was applied to comparative analysis of
the collection of completely sequenced genomes of bacteria, archaea and yeast.
The presence/absence of proteins in the Clusters of Orthologous Groups of
proteins (COGs) was used as the criterion for building trees by using the
linear binary hierarchy approach. This method combines features of the two
major tree-building approaches, parsimony and distance analysis. According to
the parsimony principle, the most important characters linked to each
divergence event are identified; the tree-building process itself, however, is
based on a distance approach, namely maximization of the distances between the
centroids of the clusters that are being split. The use of the distances
provides for relatively fast computations.

The linear hierarchy approach introduces several features that have not been
previously used in tree analysis. In particular, each divergence event is
evaluated by its contribution to the total data scatter, which allows the
splitting process to be terminated when the contributions are comparable with
the noise in data. Another feature is a new inter-genome similarity measure
that takes into account not only the overall co-occurrence of genomes in COGs
(as, for instance, the Jaccard coefficient does), but also the representation of
individual COGs in the genome collection. The resulting tree, which is strongly supported
by bootstrap analysis, has an unexpected topology in that the first 
bifurcation separates free-living bacteria (and related, moderately degraded 
parasites) from archaea, eukaryotes and highly degraded parasitic bacteria. 
The approach allows automatic delineation of the set of COGs that make the 
principal contribution to each bifurcation, and their biological relevance 
can be subsequently explored.
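
For reference, the Jaccard coefficient mentioned above as a point of comparison can be sketched as follows, assuming each genome is represented by the set of COGs in which it occurs. The COG identifiers are hypothetical, and the new similarity measure proposed in the abstract is not specified there, so it is not reproduced.

```python
def jaccard(cogs_a, cogs_b):
    """Shared COGs divided by COGs present in either genome."""
    union = cogs_a | cogs_b
    return len(cogs_a & cogs_b) / len(union) if union else 0.0


# Hypothetical presence sets for two genomes.
genome1 = {"COG0001", "COG0002", "COG0003"}
genome2 = {"COG0002", "COG0003", "COG0004"}
print(jaccard(genome1, genome2))
```

This coefficient weights every COG equally, which is the limitation the abstract's measure is said to address.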
 

Title: Medians and Means as Consensus Methods for Molecular Sequences
 
Fred S. Roberts, Rutgers University
 
In molecular biology, we are often given a variety of possible
molecular sequences, obtained by different subjective or objective
methods or different investigators or under different criteria, and
are asked to obtain a single sequence that is in some sense a
consensus of these different alternatives. We consider a method
proposed by Waterman [1989] based on a certain model of the notion of
"pattern" in a molecular sequence and show that some well known
consensus methods such as the median and mean procedures of Kemeny and
Snell and others are special cases of the Waterman method and that in
turn Waterman's method with its preferred choice of parameters is in
fact the median. We characterize the parameters in Waterman's method
for which we obtain these consensus procedures, explore the axiomatic
basis for Waterman's method, and mention some open questions. (This
is joint work with Boris Mirkin.)
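
The difference between the median and mean procedures can be illustrated with a minimal sketch. It assumes Hamming distance between equal-length sequences and a small, explicitly given candidate set; the median minimizes the sum of distances to the profile and the mean minimizes the sum of squared distances. The sequences and function names are hypothetical, and Waterman's parametrized method itself is not reproduced.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def consensus(profile, candidates, power):
    """Candidates minimizing the sum of distance**power to the profile.

    power=1 gives the median procedure, power=2 the mean procedure.
    """
    def score(candidate):
        return sum(hamming(candidate, s) ** power for s in profile)

    best = min(score(c) for c in candidates)
    return [c for c in candidates if score(c) == best]


# Hypothetical profile and candidate consensus sequences.
profile = ["AAAA", "AAAA", "TTTT"]
candidates = ["AAAA", "AATT", "TTTT"]
print("median:", consensus(profile, candidates, 1))
print("mean:  ", consensus(profile, candidates, 2))
```

In this example the median sides with the majority sequence, while the mean prefers the intermediate candidate because squaring penalizes the one large distance more heavily.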
Document last modified on October 23, 2000.