DIMACS TR: 96-19

The number of nucleotide sites needed to accurately reconstruct large evolutionary trees

Authors: Mike Steel, Laszlo A. Szekely, Peter L. Erdos


Biologists seek to reconstruct evolutionary trees for increasing number of species, $n$, from aligned genetic sequences. How fast the sequence length $N$ must grow, as a function of $n$, in order to accurately recover the underlying tree with probability $1-\epsilon$, if the sequences evolve according to simple stochastic models of nucleotide substitution? We show that for a certain model, a reconstruction method exists for which the sequence length $N$ can grow surprisingly slowly with $n$ (sublinearly for a wide range of parameters, and even as a power of $\log n$ in a narrow range, which roughly meets the lower bound from information theory). By contrast a more traditional technique (maximum compatibility) provably requires $N$ to grow faster than linearly in $n$. Our approach is based on a new, and computationally efficient approach for reconstructing phylogenetic trees from aligned DNA sequences.

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1996/96-19.ps.gz
DIMACS Home Page