DIMACS Theory of Computing Seminar


On the Approximability of Numberical Taxonomy (Fitting Distances by Tree Metrics)


Martin Farach
Rutgers University


CoRE Building, Room 431
Busch Campus, Rutgers University


4:30 PM
Thursday, October 12, 1995


One of the most common methods for clustering numeric data involves fitting the data to a tree metric, which is defined by a weighted tree spanning the points of the metric, the distance between two points being the sum of the weights of the edges of the path between them. Not surprisingly, this problem, the so-called {\em Numerical Taxonomy} problem, has received a great deal of attention. Fitting distances by trees is an important problem in many areas. For example, evolutionary biologists often compare the DNA sequences of pairs of species, and thus get an estimate of the evolutionary time which has elapsed since the species separated by a speciation event. A table of pairwise distances is thus constructed. The problem is then to reconstruct the underlying evolutionary tree. Dozens of heuristics for this problem appear in the literature every year.

We consider the problem of fitting an n by n distance matrix D by a tree metric T. Let p be the distance to the closest tree metric, that is, p is the minimum over T of the maximum difference between T and D. First we present a quadratic time algorithm for finding an additive tree T such that whose distance from D is at most 3p. Second we show that it is NP-hard to find a tree T such that is distance (9/8)p from D. This paper presents the first algorithm for this problem with a performance guarantee.

Joint work with Richa Agarwala, Vineet Bafna, Babu Narayanan, Mike Paterson and Mikkel Thorup.

Document last modified on October 4, 1995