DIMACS Workshop on Bioconsensus

October 25 - 26, 2000
DIMACS Center, CoRE Building, Busch Campus, Rutgers University, Piscataway, NJ

Mel Janowitz, Rutgers University, melj@dimacs.rutgers.edu
Francois-Joseph LaPointe, University of Montreal, lapoinf@ere.umontreal.ca
Fred McMorris, Illinois Institute of Technology, mcmorris@iit.edu
Boris Mirkin, University of London, mirkin@dcs.bbk.ac.uk
Fred Roberts, Rutgers University, froberts@dimacs.rutgers.edu
Presented under the auspices of the Special Focus on Computational Molecular Biology.


Title: Combining trees to combining data.

Bernard R. Baum
Agriculture and Agri-Food Canada, Research Branch
Eastern Cereal and Oilseed Research Center
Ottawa, Ontario, Canada

Data from different sources for the same organisms are increasingly
obtained for phylogenetic studies. Phylogenies are often estimated from
each data set separately and then combined after the degree of congruence
between the cladograms has been assessed. Combination follows one of two
approaches: (1) total evidence and (2) consensus. A third way, originally
presented in 1990 and published in 1992, is here revisited and justified.
It consists of first estimating phylogenetic relationships for each data
set separately, then treating each cladogram as a character tree that is
translated into binary coded factors. The binary coded matrices are then
adjoined and subjected to a cladistic analysis. Arguments against the use
of total evidence and consensus approaches are discussed.
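
The coding step described in this abstract can be illustrated with a minimal sketch. It is not the author's code: it assumes every cladogram covers the same taxa, reads "binary coded factors" as one 0/1 character per clade, and uses hypothetical taxon names and trees. The final cladistic analysis of the adjoined matrix is not shown.

```python
def clades_to_binary_characters(taxa, clades):
    """Code one cladogram as a binary matrix: one 0/1 column per clade."""
    return {taxon: [1 if taxon in clade else 0 for clade in clades]
            for taxon in taxa}


def adjoin_matrices(taxa, matrices):
    """Concatenate the per-tree matrices column-wise into one matrix."""
    combined = {taxon: [] for taxon in taxa}
    for matrix in matrices:
        for taxon in taxa:
            combined[taxon].extend(matrix[taxon])
    return combined


# Hypothetical example: two cladograms on the same four taxa,
# each given as a list of its (non-trivial) clades.
taxa = ["A", "B", "C", "D"]
tree_from_data_set_1 = [{"A", "B"}, {"A", "B", "C"}]
tree_from_data_set_2 = [{"A", "B"}, {"C", "D"}]

combined = adjoin_matrices(taxa, [
    clades_to_binary_characters(taxa, tree_from_data_set_1),
    clades_to_binary_characters(taxa, tree_from_data_set_2),
])

for taxon in taxa:
    print(taxon, combined[taxon])
```

The adjoined matrix has one row per taxon and one column per clade of every source tree; it would then be passed to whatever cladistic (for example, parsimony) program is in use.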

Title: A New Tool To Help Generate Genomic Consensus--The W-Curve

Douglas J. Cork
Department of Biological, Chemical and Physical Sciences
Illinois Institute of Technology

Finding a consensus DNA string from a set of multiple aligned DNA strings may be accomplished by using the rule of plurality at each base position. The result may be ambiguous and not concise, and information may be lost. Molecular biologists often compensate for this by looking at each sequence, base by base, and correcting for misinformation that may appear when all sequences are aligned and consensus is sought. When the biologist is asked how this was done, the reply is often based on bench experience with the gene or encoded protein. Ideally, the molecular biologist would wish to have many 3-D protein structures of the expressed gene, where site-specific mutagenesis and crystallization of each mutant protein can be correlated with the corresponding mutated and aligned DNA strings. This is not the case with the vast majority of procaryotic and eucaryotic genes and proteins. Because of this, we suggest utilizing a 3-D numerical mapping algorithm, called a W-Curve, to generate a 3-D informational topology of the DNA strings. The construction of W-Curves, which is based on a chaos game representation of the DNA string projected onto each successive nucleotide position, will be explained. Similar local and global DNA strings generate W-Curves, which can then be aligned by difference visualization scanning and gnuplot analysis. Consensus is ultimately developed. Local subsequences are sometimes eliminated in order to align more global parts of the DNA strings. For example, base positions 42 through 60 may not have similar W-Curve topologies, whereas positions 1 through 41 and 61 through 150 appear to have similar repeating patterns, symmetries or asymmetries in their W-Curves.

W-Curves can be considered as an aid in consensus formation in the following way. An initial conventional homology search for strings (via BLAST) and multiple string alignment (via dynamic linear programming methods such as CLUSTALW) are conducted with the DNA strings. The same CLUSTALW multiple string alignments are visualized with W-Curves in place of the strings. If local or global dissimilarities in W-Curve topologies are found, parts of the sequences may be truncated until similar W-Curves can be visualized as a consensus. Finally, the aligned W-Curves are examined with a nearest neighbor distance matrix, resulting in the formation of a phylogenetic tree. Example long genomic sequences using this approach will be shown. The algorithm can be downloaded from the following website: http://www.iit.edu/~cork. Click on computer visualization of long genomic sequences.
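
The baseline that this abstract starts from, the rule of plurality at each base position, can be sketched as follows. This is an illustration only and not the W-Curve algorithm: the function name, the example alignment, and the choice to mark tied columns with "N" are assumptions, and gap characters would be counted like any other symbol.

```python
from collections import Counter


def plurality_consensus(aligned, tie_symbol="N"):
    """Return the most frequent symbol per column; mark ties as ambiguous."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(column).most_common()
        top_symbol, top_count = counts[0]
        # A tie for first place is exactly the ambiguity noted in the abstract.
        tied = len(counts) > 1 and counts[1][1] == top_count
        consensus.append(tie_symbol if tied else top_symbol)
    return "".join(consensus)


# Hypothetical aligned DNA strings.
aligned = ["ACGTA", "ACGAA", "TCGAC", "ACTTA"]
print(plurality_consensus(aligned))  # prints "ACGNA"
```

The fourth column has two T and two A, so no plurality winner exists there; this is one concrete way in which the column-wise rule can be ambiguous or lose information.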

Title: Phylogenetic consensus as a vote-counting procedure: Too conservative and too simple, but possibly still indispensable

Alan de Queiroz
EPO Biology and University Museum, University of Colorado

A consideration of phylogenetic consensus as a subset of the general analytical approach of vote-counting highlights shortcomings of consensus methods in the context of analyzing diverse data sets. It is widely recognized that vote-counting methods tend to be conservative, and this is an obvious problem with consensus methods. As with other vote-counting methods, the conservatism of consensus arises in part through the unavoidable loss of information entailed in summarizing data as "yes" or "no" votes (presence/absence of clades). Less widely recognized is the fact that some vote-counting methods can converge on a zero probability of rejecting a false null hypothesis as more data sets are added, which is clearly an undesirable property. An analogous situation in phylogenetics is the increasing probability of an unresolved tree as more data sets are included in a strict or semistrict consensus analysis. Thus, consensus methods, like other vote-counting procedures, can be not only extremely conservative but inappropriately conservative. Finally, vote-counting procedures are generally applied only to relatively simple problems. However, with even moderate numbers of taxa, phylogenetic estimation is an extremely complex problem, and common consensus methods, being relatively simple, are not designed to deal with this complexity. In phylogenetics, consensus as applied to the problem of diverse data sets has fallen out of favor, and it seems likely that the excessive conservatism and simplicity of consensus are partly responsible for this trend. However, consensus embodies the important property of corroboration of hypotheses from independent sources (which may also be a property of vote-counting procedures in other contexts). This characteristic of consensus may explain why consensus "thinking" remains prevalent in phylogenetic studies, even while formal consensus analyses have become uncommon.
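
The loss of resolution under strict consensus can be seen in a small sketch. It assumes each tree is represented simply as the set of its non-trivial clades and uses hypothetical single-letter taxa; the strict consensus is then the set of clades present in every tree.

```python
def clade_set(*clades):
    """Build a tree's clade set; each clade is a string of one-letter taxa."""
    return {frozenset(clade) for clade in clades}


def strict_consensus(trees):
    """Clades that receive a 'yes' vote from every tree."""
    return set.intersection(*trees)


# Hypothetical trees on taxa A-E, one per data set.
trees = [
    clade_set("AB", "ABC", "DE"),
    clade_set("AB", "DE", "ABDE"),
    clade_set("AC", "ABC", "DE"),
    clade_set("AB", "CD", "ABE"),
]

# Number of clades surviving as data sets are added one at a time.
resolved = [len(strict_consensus(trees[:k])) for k in range(1, len(trees) + 1)]
print(resolved)  # prints [3, 2, 1, 0]
```

In this made-up example each added tree can only remove clades from the intersection, so the consensus ends up completely unresolved even though most pairs of trees share at least one clade.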

Title: Toward a Unified Theory of Relational Representations of Closed Set Systems and Phylogenetic Trees

Richard Cramer-Benjamin, Gary D. Crown and Melvin F. Janowitz

Previous work has shown how to place classical consensus theory and consensus theory on hierarchical trees into a common framework. The approach represents appropriate closed set systems in terms of suitably defined binary relations. Work of Bandelt and Dress, and more recent work by Dress, Huber and Moulton, shows how to provide a similar relational representation for phylogenetic trees. Some background will be presented for both approaches. An attempt will be made to formulate a model that includes both of them. Such a model would hopefully allow consensus theory in the social sciences to apply directly to phylogenetic trees and beyond.

Title: Structural Domain Parsing: The Role of Consensus Reasoning

Casimir A. Kulikowski, CS Department, Rutgers University
Ilya Muchnik, CS Department and DIMACS, Rutgers University
HwaSeob J. Yun, CS Department, Rutgers University
Gaetano Montelione, Molecular Biology and Biochemistry Department and CABM, Rutgers University

There has been much work in the past five years on the development of automatic methods for defining protein sequence domains, which are fragments of sequences with a high proportion of conservative positions, helpful in evolutionary and functional analyses of proteins. In contrast, methods for finding hints about possible structural domains from sequence data have largely consisted of correlation analyses between known structural domains and domains constructed from sequence data. Effective techniques for detecting signals of structural domains from sequence data would be extremely valuable for protein analysis, gene finding, and drug design. We have developed such a set of methods for detecting structural domains, based on HMMs built from a subset of sequence-continuous Dali Domain Dictionary (DDD) domains. In testing our predictions against an independent set of SCOP domains, we have carried out both HMM and BLAST matches in a preliminary study with good results. To obtain the greatest reliability in such structural domain parsing, however, it is necessary to develop machine learning procedures based on consensus results from different domain knowledge sources (such as DDD and SCOP) and inference methods (such as HMMs and BLAST search matches).

Two main concepts underpin our consensus reasoning: 1) independent detection results from the different sources and homology searches on entire protein sequence data must be carried out at a sufficiently large number of confidence levels so as to produce a large enough set of multiple, plausible candidate structural domains that could be present within the full sequence; and 2) consensus construction needs to be iteratively refined so as to construct a covering set of probable candidate domains for the sequence.

Title: The Evolution of Consensus in Phylogenetics: Where do we go from here?

Francois-Joseph Lapointe (1,3), Claudine Levasseur (1,4) and Guy Cucumel (2,5)
(1) Departement de sciences biologiques, Universite de Montreal
(2) Ecole des sciences de la gestion, Universite du Quebec a Montreal
(3) lapoinf@ere.umontreal.ca
(4) levassec@magellan.umontreal.ca
(5) cucumel.guy@uqam.ca

In the beginning, there were trees. Then trees started to grow and multiply. They became so large and numerous that it was impossible to deal with a single tree at a time. Consensus methods were developed to compare those trees, prune them or regraft one into another. Then consensus methods started to multiply... In this paper, we will look at recent developments in the field of consensus and their applications to phylogenetic studies. Topology-based consensus techniques will be compared to new methods that take into account branch lengths. The debate on character congruence versus taxonomic congruence in phylogenetics will be revisited, using consensus techniques for weighted trees. The validation of consensus trees will also be discussed, in addition to generalized consensus methods for reticulograms. Applications of consensus methods will be presented and recommendations will be provided about the uses and misuses of bioconsensus techniques.

Title: Compatibility analysis as a consensus method

F.R. McMorris
Chair, Department of Applied Mathematics
Illinois Institute of Technology

One of the goals of compatibility analysis is to take a given collection of taxonomic characters defined on a set of entities and produce a (large) subset of the characters consisting of characters that are all mutually 'consistent' on some tree, or other type of discrete structure. It will be noted that compatibility analysis methods can be viewed formally as consensus methods. After this is done, and some background on compatibility is given, the talk will conclude with very recent results on "consensus with constraints" as it applies to stratigraphic constraints in compatibility methods.
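
For the special case of two-state characters, the goal stated in this abstract can be sketched in a few lines. The sketch assumes binary (0/1) characters, for which two characters are compatible exactly when the four state combinations 00, 01, 10 and 11 do not all occur, and for which pairwise compatibility is known to be sufficient for joint compatibility. The character names and data are hypothetical, and the exhaustive search is only practical for a handful of characters.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Binary characters are compatible unless all four state pairs occur."""
    return len(set(zip(char_a, char_b))) < 4


def largest_compatible_subset(characters):
    """Exhaustively find a largest set of pairwise compatible characters."""
    names = list(characters)
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            if all(compatible(characters[a], characters[b])
                   for a, b in combinations(subset, 2)):
                return set(subset)
    return set()


# Hypothetical characters scored on five entities (columns A, B, C, D, E).
characters = {
    "c1": [1, 1, 0, 0, 0],
    "c2": [1, 1, 1, 0, 0],
    "c3": [0, 1, 1, 0, 0],
    "c4": [0, 0, 0, 1, 1],
}

print(sorted(largest_compatible_subset(characters)))  # prints ['c1', 'c2', 'c4']
```

Here c1 and c3 conflict, so no set of four characters works; the search returns the first set of three it finds, and {c2, c3, c4} would be an equally large alternative.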

Title: Combining character and distance analysis for building inter-genome trees with linear binary hierarchies

B. Mirkin, Birkbeck College, London, UK
E. Koonin, NCBI, NIH, Bethesda, USA

The theory of linear binary hierarchies was applied to comparative analysis of the collection of completely sequenced genomes of bacteria, archaea and yeast. The presence/absence of proteins in the Clusters of Orthologous Groups of proteins (COGs) was used as the criterion for building trees with the linear binary hierarchy approach. This method combines features of the two major tree-building approaches, parsimony and distance analysis. According to the parsimony principle, the most important characters linked to each divergence event are identified; the tree-building process itself, however, is based on a distance approach, namely maximization of the distances between the centroids of the clusters that are being split. The use of distances provides for relatively fast computations. The linear hierarchy approach introduces several features that have not been previously used in tree analysis. In particular, each divergence event is evaluated by its contribution to the total data scatter, which allows the splitting process to be terminated when the contributions are comparable with the noise in the data. Another feature is a new inter-genome similarity measure that takes into account not only the overall co-occurrence of genomes in COGs (as does, for instance, the Jaccard coefficient), but also the representation of individual COGs in the genome collection. The resulting tree, which is strongly supported by bootstrap analysis, has an unexpected topology in that the first bifurcation separates free-living bacteria (and related, moderately degraded parasites) from archaea, eukaryotes and highly degraded parasitic bacteria. The approach allows automatic delineation of the set of COGs that make the principal contribution to each bifurcation, and their biological relevance can be subsequently explored.
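
The Jaccard coefficient that the abstract mentions as a point of comparison can be sketched as below. This is only the standard coefficient on presence/absence data, not the authors' new similarity measure, whose definition is not given here. The genome names and COG identifiers are placeholders.

```python
def jaccard(cogs_a, cogs_b):
    """Shared COGs divided by COGs present in at least one of the genomes."""
    union = cogs_a | cogs_b
    if not union:
        return 0.0  # convention chosen here for two empty genomes
    return len(cogs_a & cogs_b) / len(union)


# Hypothetical presence lists: the set of COGs represented in each genome.
genome_cogs = {
    "genome_1": {"COG0001", "COG0002", "COG0003"},
    "genome_2": {"COG0002", "COG0003", "COG0004"},
    "genome_3": {"COG0005"},
}

print(jaccard(genome_cogs["genome_1"], genome_cogs["genome_2"]))  # prints 0.5
print(jaccard(genome_cogs["genome_1"], genome_cogs["genome_3"]))  # prints 0.0
```

Every COG counts equally in this coefficient, regardless of how many genomes in the collection contain it; the abstract's measure is described as additionally taking that representation into account.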

Title: Medians and Means as Consensus Methods for Molecular Sequences

Fred S. Roberts, Rutgers University

In molecular biology, we are often given a variety of possible molecular sequences, obtained by different subjective or objective methods, by different investigators, or under different criteria, and are asked to obtain a single sequence that is in some sense a consensus of these alternatives. We consider a method proposed by Waterman [1989], based on a certain model of the notion of "pattern" in a molecular sequence. We show that some well-known consensus methods, such as the median and mean procedures of Kemeny and Snell and others, are special cases of the Waterman method and that, in turn, Waterman's method with its preferred choice of parameters is in fact the median. We characterize the parameters in Waterman's method for which we obtain these consensus procedures, explore the axiomatic basis for Waterman's method, and mention some open questions. (This is joint work with Boris Mirkin.)
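
The median and mean procedures can be illustrated with a minimal sketch: the median picks the candidates minimizing the sum of distances to the given sequences, and the mean picks those minimizing the sum of squared distances. Hamming distance, the restriction to an explicit candidate list, and the example sequences are assumptions made for illustration; Waterman's parameterized method is not reproduced.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def consensus(profile, candidates, power):
    """Candidates minimizing the sum of distance**power to the profile."""
    scores = {c: sum(hamming(c, s) ** power for s in profile)
              for c in candidates}
    best = min(scores.values())
    return sorted(c for c, score in scores.items() if score == best)


def median(profile, candidates):
    return consensus(profile, candidates, power=1)


def mean(profile, candidates):
    return consensus(profile, candidates, power=2)


# Hypothetical profile: three identical sequences and one outlier.
profile = ["AAAA", "AAAA", "AAAA", "CCCC"]
candidates = ["AAAA", "AAAC", "AACC", "CCCC"]

print(median(profile, candidates))  # prints ['AAAA']
print(mean(profile, candidates))    # prints ['AAAC']
```

The two procedures disagree here because squaring the distances penalizes being far from the outlier more heavily, which pulls the mean one step toward it.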
Document last modified on October 23, 2000.