Combining trees to combining data.
Bernard R. Baum
Agriculture and Agri-Food Canada, Research Branch, Eastern Cereal and 
Oilseed Research Center, 960 Carling Avenue, Ottawa, Ontario, Canada K1A 0C6

Abstract: Data from different sources for the same organisms are 
increasingly obtained for phylogenetic studies. Phylogenies are often 
estimated from each data set separately and then combined after the 
degree of congruence between the cladograms has been assessed. Combinations 
are made by two different approaches: (1) total evidence and (2) consensus. 
A third way, originally presented in 1990 and published in 1992, is here 
revisited and justified. The third way consists of first estimating 
phylogenetic relationships for each data set separately, and treating each 
cladogram as a character tree which is translated into additive binary 
coded factors. This is followed by adjoining the binary coded matrices and 
subjecting them to a cladistic analysis. This third way might define a new 
kind of consensus. Arguments against the use of total evidence and 
consensus approaches are discussed.

Introduction
         
The central tenet in comparative biology, especially in evolution 
and systematics, is homology. Observed similarities may have been acquired 
by convergence of evolutionary units (EU), or retained after divergence of 
EUs. When two EUs are compared and are found to be similar, their organs or 
their gene sequences may be similar due to convergence (analogous) or to 
retention (homologous) at least in part.
         
Patterson (1988) decoupled the concept of homology for both 
morphology and molecular biology. He laid down three tests of homology: 
similarity, conjunction, and congruence. Similarity is the traditional 
method of comparative biology, and is an abstract concept of a one to one 
correspondence, loosely speaking, when two organisms are compared. 
Conjunction (Patterson 1988) is a situation when two homologues are found 
together in one organism. Congruence occurs when two or more organisms 
possess the same homologue which also circumscribes a monophyletic group, 
and in this sense it is also equivalent to synapomorphy. When any one of 
the three tests fails, the relation is not one of homology.
         
In molecular biology when testing for homology, in addition to the 
above one needs to take into account whether a gene duplication has 
occurred before or after a speciation event. Briefly, a gene duplication 
prior to speciation may result in two paralogous genes after speciation 
provided that the two copies of the gene have diverged. The extent of 
evolutionary divergence of gene copies is a result of many possibilities 
depending on the nature of the gene. In multigene families paralogues are 
known to evolve in concert (Arnheim 1983) or to retain varying degrees of 
haplome identity (for instance Baum & al. 1998, Baum & Bailey 2000), in 
other words the degree of gene conversion varies. A gene duplication after 
speciation results in orthologous genes.
         
In classical biology, including morphology especially, 
phylogenetic inferences are based on character trees yielding organism 
trees. In molecular biology the inferences were often based on aligned 
genes or a portion of genes, yielding gene trees that may not be congruent 
with organism trees. When inferences for the same organisms are made from 
different data sets the resulting trees are often different. The trees may 
be different when two character suites of the same kind, e.g. morphology or 
anatomy, are compared, or, as is often the case, when different kinds of 
data sets, e.g. morphology and immunology, are compared. The trees 
are often different when different molecular data sets are compared or when 
morphological data sets are compared with molecular ones. With the 
increasing use of molecular methods it is to be expected that data from 
different molecules, or from genes or portions of them, will generate 
different hypotheses of phylogenetic relationships, i.e. different gene 
trees, for the same taxa. An example is the chloroplast DNA of species of 
tomato (Palmer & Zamir 1982) compared with their mitochondrial DNA sequence 
divergence (McLean & Hanson 1986). In the wheat group (Triticeae), two 
different arrays of the 5S rRNA genes, the ITS sequences, the chloroplast 
genome and isozyme data, all generated by various authors, were examined by 
Kellogg & al. (1996). They attempted to explain the incongruence between 
the ITS and the chloroplast DNA gene trees and between the two 5S rDNA 
arrays and the chloroplast gene trees, other differences, and the 
congruence between the 5S rDNA short array gene tree and the ITS gene tree. 
There are many examples of 
differences, some slight and some incongruent, between two molecular data 
sets, such as the ITS and NDHF trees in the tribe Episcieae (Smith 2000) 
and discrepancies between the chloroplast DNA and the ITS trees in Heuchera 
L. (Soltis & Kuzoff 1995). In my work on the 5S rRNA genes in the 
Triticeae I was able to detect more than two arrays of the multigene 
family. In this case every array, i.e. every set of orthologous sequences, 
needs to be analyzed separately first, and the resulting gene trees then 
need to be combined. 
It would certainly be inappropriate to combine paralogous sequences with 
orthologous sequences. This would result in "misbehaved" trees or 
nonsensical ones (Fig. 1).
         
These different or conflicting hypotheses of phylogenetic 
relationships of the same organisms need to be resolved. Conflicting 
hypotheses can be viewed from two different angles: character congruence, 
i.e. "total evidence" (for instance Kluge 1989), and taxonomic congruence, 
i.e. consensus (for instance Nelson 1979). In the 
first approach, i.e. total evidence, all the data available are taken 
together by adjoining the data matrices to maximize character congruence, 
whereas in the second approach the resulting evolutionary hypotheses, i.e. 
the trees obtained, are combined into a consensus using one of the 
different kinds of consensus methods (see Shao & Sokal 1986).
         
The total evidence approach does not lend itself to the 
combination of different kinds of data in the same analysis (Swofford & 
Olsen 1990). For instance, character data cannot be combined with distance 
data. In this case the approach often taken is to seek a consensus 
cladogram obtained from the different data sets (Jones & al. 1993). If two 
or more data sets can be combined, then one data set might overwhelm the 
other, such as in combining a morphological data matrix with a DNA data 
matrix, with the unavoidable result that "the sheer number of possible 
characters at the DNA base-pair level of study necessarily favors the 
molecular approach" (Kluge 1983). Notwithstanding, the analysis of single 
data sets or combined data sets often yields multiple equally likely 
solutions. These are subjected to some sort of consensus analysis as a 
rule. Alternatively 
each data set is analyzed separately and a consensus of trees is sought as 
in the case of the data that cannot be combined.
         
Increasingly, investigations use more than one gene to infer 
organismal trees. It is indeed mostly desirable to use as many genes as 
reasonably possible to infer organismal phylogenies as opposed to gene 
trees (Nei 1987) since DNA sequence data from many different loci that have 
evolved independently enable one to infer species trees more accurately 
(Pamilo & Nei 1988). Most studies use the "total evidence" approach to 
achieve this end, i.e. the analysis of each data set separately, then the 
direct adjoining of the data matrices, often followed by the computation 
of a consensus tree. When the data are of different kinds, such as 
distance data and character data, a consensus between the different 
topologies is computed instead.
         
Consensus methods are of limited value for combining data sets. 
Consensus, when used in this sense, results in loss of information, of 
descriptive power and of resolution (Baum 1992), and when there is large 
disagreement among trees, consensus trees become unresolved. Although 
currently used to resolve conflicting topologies, consensus methods were 
designed to express the degree of agreement between classifications or to 
measure the degree of congruence among them. This degree of agreement is 
represented by the consensus index.

The method

I have suggested a different approach to achieve the same goal as 
"total evidence". The method combines different trees by regarding each as 
a character tree (Baum 1992) and in this sense they are topological units 
or elements of the n-tree (Bobisud & Bobisud 1972, Margush & McMorris 
1981). Thus, the method neither uses a consensus of the topologies obtained 
from the different data sets nor adjoins the raw or original data 
matrices, although consensus is still permissible among trees obtained from 
the same data for other purposes. In the method the character trees drawn 
from the different cladograms are assembled into a data matrix which is 
subsequently subjected to a phylogenetic [cladistic] analysis.

The steps of the method are summarized as follows.

Step 1: Generate or obtain trees from different data sets, i.e. different 
genes or different data sources. If a data set yields a number of 
topologies, find a consensus tree;

Step 2: Select the trees to be combined;

Step 3: Root the trees. Some of the selected trees may have been obtained 
without a root, and all trees need to be rooted with the same organism. If 
some trees were rooted differently, then re-root them;

Step 4: Code each tree by additive binary coding (Farris & al. 1970). Each 
coded tree is now regarded as a character tree, called a taxon tree by 
Brooks (1981);

Step 5: Combine character trees, i.e. the binary coded trees into one data 
matrix by adjoining;

Step 6: Carry out a cladistic analysis on the binary coded matrix.
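
Steps 3 to 5 can be sketched in a few lines of Python. This is only a 
minimal illustration under stated assumptions, not the original 
implementation: the trees are written as nested tuples of taxon names, they 
are assumed to be already rooted with the same organism (here the 
hypothetical taxon "Out"), and the function names clusters, binary_code 
and adjoin are invented for this example. Each internal node below the 
root becomes one binary factor, and a taxon absent from a tree is scored 
"?" for all factors of that tree. Step 6 is not shown; the resulting 
matrix would be passed to a cladistic (parsimony) program.

```python
def clusters(tree):
    """Return (set of all taxa, list of clusters of the internal nodes below the root)."""
    def walk(node, found):
        if isinstance(node, str):              # a leaf is a taxon name
            return frozenset([node])
        members = frozenset().union(*(walk(child, found) for child in node))
        found.append(members)                  # one cluster per internal node
        return members

    found = []
    leaves = walk(tree, found)
    # The root cluster contains every taxon and carries no grouping information.
    return leaves, [c for c in found if c != leaves]


def binary_code(tree, all_taxa):
    """Code one rooted tree as binary factors: one column per internal node."""
    leaves, groups = clusters(tree)
    rows = {}
    for taxon in all_taxa:
        if taxon not in leaves:
            rows[taxon] = "?" * len(groups)    # taxon not sampled in this tree
        else:
            rows[taxon] = "".join("1" if taxon in g else "0" for g in groups)
    return rows


def adjoin(trees):
    """Adjoin the binary coded trees into one taxon-by-factor matrix."""
    all_taxa = sorted(set().union(*(clusters(t)[0] for t in trees)))
    coded = [binary_code(t, all_taxa) for t in trees]
    return {taxon: "".join(c[taxon] for c in coded) for taxon in all_taxa}


# Two hypothetical gene trees, both rooted with "Out"; D is missing from the
# first tree and C is missing from the second.
tree_a = ((("A", "B"), "C"), "Out")
tree_b = (("A", ("B", "D")), "Out")

matrix = adjoin([tree_a, tree_b])
for taxon, row in matrix.items():
    print(f"{taxon:<4}{row}")
```

In this sketch the first two columns come from the first tree and the last 
two from the second, so the origin of every binary factor remains 
traceable in the combined matrix.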

Discussion

The degree of similarity between each of the trees from the separate data 
sets prior to combination and the tree obtained from the combined data sets 
may be evaluated using one of the consensus methods. If the trees prior to 
combination are very different from each other, then the use of the strict 
consensus method would yield an unresolved tree (Rohlf 1982).
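
The loss of resolution under strict consensus can be illustrated with a 
short Python sketch. It assumes, for illustration only, that rooted trees 
are written as nested tuples of taxon names and that all trees contain the 
same taxa; the function names and the two example trees are hypothetical. 
A strict consensus retains only those clusters that are present in every 
input tree.

```python
def clusters(tree):
    """Return the set of clusters defined by the internal nodes below the root."""
    found = set()

    def walk(node):
        if isinstance(node, str):              # a leaf is a taxon name
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    leaves = walk(tree)
    found.discard(leaves)                      # the root cluster is uninformative
    return found


def strict_consensus_clusters(trees):
    """Clusters shared by all trees, i.e. the content of the strict consensus tree."""
    shared = clusters(trees[0])
    for tree in trees[1:]:
        shared &= clusters(tree)
    return shared


# Two hypothetical, strongly conflicting trees for the same five organisms.
tree_1 = (((("A", "B"), "C"), "D"), "Out")
tree_2 = (((("C", "D"), "A"), "B"), "Out")

shared = strict_consensus_clusters([tree_1, tree_2])
print(shared)
```

Only the cluster of all four ingroup organisms survives, so the 
relationships among A, B, C and D are left entirely unresolved in the 
strict consensus of these two trees.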
         
Using the method described above to combine data achieves the 
following desirable properties: 1) the number of trees to be combined is 
unlimited; 2) the information about each kind of data for the same 
organisms is retained; 3) the combination of distance data with character 
data is made possible; 4) the differential weighting of the different data 
in the form of "character trees" is made possible as in conventional 
analyses; 5) missing data are accommodated simply by coding the missing 
binary factors conventionally as missing data (see the example of two data 
sets with unequal representation of organisms in Fig. 2); and 6) 
computations are more economical than with large matrices of adjoined raw 
data as in the "total evidence" approach.
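
Properties 4 and 5 can be made concrete with a small Python sketch. It 
scores a candidate tree against a matrix of binary factors with Fitch 
parsimony, treating "?" as compatible with either state and multiplying 
the length of each factor by a weight. The matrix, the weights, the two 
candidate trees and the function names are all hypothetical and serve only 
as illustration; the sketch assumes strictly bifurcating candidate trees 
written as nested tuples, and it evaluates given trees rather than 
searching for the shortest one, which a cladistic program would do.

```python
def fitch_length(tree, column):
    """Minimum number of state changes of one binary factor on a rooted binary tree.

    column maps each taxon to "0", "1" or "?" (missing).
    """
    changes = 0

    def walk(node):
        nonlocal changes
        if isinstance(node, str):              # a leaf is a taxon name
            state = column[node]
            return {"0", "1"} if state == "?" else {state}
        left, right = (walk(child) for child in node)   # binary nodes only
        common = left & right
        if common:
            return common
        changes += 1                           # the two subtrees disagree
        return left | right

    walk(tree)
    return changes


def weighted_length(tree, matrix, weights):
    """Sum of the per-factor lengths, each multiplied by its weight."""
    total = 0
    for i, weight in enumerate(weights):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        total += weight * fitch_length(tree, column)
    return total


# Hypothetical combined matrix: factors 1-2 from one character tree (D not
# sampled), factors 3-4 from another (C not sampled).
matrix = {"A": "1101", "B": "1111", "C": "01??", "D": "??11", "Out": "0000"}

candidate_x = (((("A", "B"), "D"), "C"), "Out")
candidate_y = (((("B", "D"), "A"), "C"), "Out")

equal = [1, 1, 1, 1]
favor_first_tree = [2, 2, 1, 1]    # weight the first character tree twice

print(weighted_length(candidate_x, matrix, equal),
      weighted_length(candidate_y, matrix, equal))
print(weighted_length(candidate_x, matrix, favor_first_tree),
      weighted_length(candidate_y, matrix, favor_first_tree))
```

In this example the missing entries for D leave the first two factors 
compatible with either placement of D, so the second candidate tree is the 
shorter one under both weighting schemes.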
         
Although the method of combining data was first presented in an 
address at the Fourth International Congress of Systematic and Evolutionary 
Biology, 1990, it was published later (Baum 1992) and independently by 
Ragan (1992a) under "matrix representation with parsimony". The method of 
combining data discussed above (Baum 1992, Ragan 1992a) has been 
related to Brooks' (1990) parsimony analysis (BPA) and to "assumption zero" 
of component analysis (Nelson & Ladiges 1991). The application of the 
method summarized here is entirely different, i.e. the combining of data 
and especially large DNA sequence matrices from different genes and from 
different arrays of multigene families, whereas BPA and component analysis 
were designed specifically to deal with co-evolutionary and biogeographic 
data. Moreover, as alluded to above, the method allows the inclusion of 
non-binary-coded data not deriving from trees (Ragan 1992b). 
Mathematically, the method summarized here (Baum 1992 and Ragan 1992a) 
defines a new consensus tree as well, and its properties have not yet been 
investigated.
         
Williams (1994) offered a different interpretation of the 
characters, or matrix elements, defined as topological units. If the 
'matrix element', i.e. a component, is understood as a node and the 
cladogram understood as the specified relationship, "then cladistic 
analysis, in its most general form, may be understood as the effort to 
synthesize character data and cladograms in an identical fashion so their 
combination discover nodes", i.e. homology, with the greatest support. 
Williams further stated that "if all data pertinent to cladogram 
construction are of a general form, then the distinction between taxonomic 
congruence and character congruence may fade, and the requirement for 
differing sets of techniques for character analysis ("total evidence") and 
cladogram analysis ("taxonomic congruence") may be irrelevant." Williams 
then argued on this basis against Rodrigo's (1993) assertion that my method 
"does not combine data, if data are seen, in each case, to express a 
relationship."
         
Consensus methods were not designed to combine data. Their use is 
to express the degree of agreement or to measure congruence among 
classifications, and to provide a consensus index (Day & McMorris 1985). 
The consensus index is still important to assess the degree of difference 
between cladograms generated from the different genes or other different 
data sets of the same organisms, pairwise.
         
In conclusion, it is desirable to conduct phylogenetic analyses 
from varying sources of data together in order to obtain organism trees 
based on as much information as possible and thus more stable and robust 
trees. For this, data need to be combined from various sources while 
retaining the information of each; we (Baum 1992, Ragan 1992a) provided an 
attempt to combine data in this fashion. With molecular data additional 
care needs to be taken to ensure that, in general, only orthologous 
sequences are taken for analysis, and that in the case of multigene 
families the different arrays, i.e. the paralogues, are combined using the 
method above, i.e. as character trees. Consensus methods developed so far 
are not appropriate for combining data of different kinds for the same 
organisms.

Literature cited

Arnheim, N. 1983. Concerted evolution of multigene families. Pp. 38-61 in: 
Nei, M. & Koehn, R.K. eds. Evolution of genes and proteins. Sinauer 
Associates, Boston.

Baum, B.R. 1992. Combining trees as a way of combining data sets for 
phylogenetic inference, and the desirability of combining gene trees. Taxon 
41: 3-10.

Baum, B.R. & Appels, R. 1992. Evolutionary change at the 5S DNA loci of 
species in the Triticeae. Pl. Syst. Evol. 183: 195-208.

Baum, B.R. & Bailey, L.G. 2000. The 5S rRNA gene diversity in Kengyilia 
rigidula (Keng & S.L. Chen) J.L. Yang, Yen & Baum (Poaceae: Triticeae): 
possible contribution of the H genome to the origin of Kengyilia. Genome 
43: 79-85.

Baum, B.R., Johnson, D.A. & Bailey, L.G. 1998. Analysis of 5S rDNA units 
in the Triticeae: the potential to assign sequence units to haplomes. Pp. 
85-96 in: Jaradat, A.A. ed. Triticeae III. Science Publishers Inc., New 
Hampshire, USA.

Bobisud, H.M. & Bobisud, L.E. 1972. A metric for classification. Taxon 21: 
607-613.

Brooks, D.R. 1981. Hennig's parasitological method: a proposed solution. 
Syst. Zool. 30: 229-249.

Brooks, D.R. 1990. Parsimony analysis in historical biogeography and 
coevolution: methodological and theoretical update. Syst. Zool. 39: 14-30.

Day, W.H. & McMorris, F.R. 1985. A formalization of consensus index 
methods. Bull. Math. Biol. 47: 215-229.

Farris, J.S., Kluge, A.G. & Eckardt, M.J. 1970. A numerical approach to 
phylogenetic systematics. Syst. Zool. 19: 172-191.

Jones, T.R., Kluge, A.G. & Wolf, A.J. 1993. When theories and methodologies 
clash: a phylogenetic reanalysis of the North American ambystomatid 
salamanders. Syst. Biol. 42: 92-102.

Kellogg, E.A., Appels, R. & Mason-Gamer, R.J. 1996. When genes tell 
different stories: the diploid genera of Triticeae (Gramineae). Syst. Bot. 
21: 321-347.

Kluge, A. 1983. Cladistics and the classification of the great apes. Pp. 
151-177 in: Ciochon, R.L. & Corruccini, R.S. eds. New interpretations of ape 
and human ancestry. Plenum Press, New York.

Kluge, A.G. 1989. A concern for evidence and a phylogenetic hypothesis of 
relationships among Epicrates (Boidae, Serpentes). Syst. Zool. 38: 7-25.

Margush, T. & McMorris, F.R. 1981. Consensus n-trees. Bull. Math. Biol. 43: 
239-244.

McLean, P.E. & Hanson, M.R. 1986. Mitochondrial DNA sequence divergence 
among Lycopersicon and related Solanum species. Genetics 112: 649-667.

Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, 
New York.

Nelson, G. 1979. Cladistic analysis and synthesis: principles and 
definitions with a historical note on Adanson's Familles des plantes 
(1763-1764). Syst. Zool. 28: 1-21.

Nelson, G. & Ladiges, P.Y. 1991. Three-area statements: standard 
assumptions for biogeographic analysis. Syst. Zool. 40: 470-485.

Palmer, J.D. & Zamir, D. 1982. Chloroplast DNA evolution and phylogenetic 
relationships in Lycopersicon. Proc. Natl. Acad. Sci. U.S.A. 79: 5006-5010.

Pamilo, P. & Nei, M. 1988. Relationships between gene trees and species 
trees. Mol. Biol. Evol. 5: 568-583.

Patterson, C. 1988. Homology in classical and molecular biology. Mol. Biol. 
Evol. 5: 603-625.

Ragan, M.A. 1992a. Phylogenetic inference based on matrix representation of 
trees. Mol. Phylogen. Evol. 1: 53-58.

Ragan, M.A. 1992b. Matrix representation in reconstructing phylogenetic 
relationships among the eukaryotes. BioSystems 28: 47-55.

Rodrigo, A.G. 1993. A comment on Baum's method for combining phylogenetic 
trees. Taxon 42: 631-636.

Rohlf, F.J. 1982. Consensus indices for comparing classifications. Math. 
Biosci. 59: 131-144.

Shao, K.-T. & Sokal, R.R. 1986. Significance tests of consensus indices. 
Syst. Zool. 35: 582-590.

Smith, J.F. 2000. Phylogenetic resolution within the tribe Episcieae 
(Gesneriaceae): congruence of ITS and NDHF sequences from parsimony and 
maximum-likelihood analyses. Amer. J. Bot. 87: 883-897.

Soltis, D.E. & Kuzoff, R.K. 1995. Discordance between nuclear and 
chloroplast phylogenies in the Heuchera group (Saxifragaceae). Evolution 
49: 727-742.

Swofford, D.L. & Olsen, G.J. 1990. Phylogeny reconstruction. Pp 411-501 in: 
Hillis, D.M. & Moritz, C. eds. Molecular systematics. Sinauer Associates, 
Sunderland.

Williams, D.M. 1994. Combining trees and combining data. Taxon 43: 449-453.

Captions to figures

Fig. 1. Tree obtained from the analysis of the two 5S rDNA arrays together. 
Paralogous sequences of the same organisms appear on different branches. 
Copied from Baum and Appels (1992) by permission.

Fig. 2. Example of a combined data matrix with unequal representation of 
organisms in the two data sets. Unequal representation is coded as missing 
data. Copied from Baum (1992) by permission.