DIMACS TR: 99-56

An Approach of Information Theory to Multiple Sequence Comparison

Authors: Weiwu Fang, Fred S. Roberts and Zhengrong Ma


Multiple sequence comparison is a basic problem in molecular biology and other sciences. In this paper, we introduce the concept of a complete information set and some principles for measuring the discrepancy among multiple sequences. Based on these, we present a new measurement method that satisfies the principles. We show that this method can effectively distinguish different random sequences or DNA sequences; for example, it distinguishes DNA sequences of length 8000 by comparing strings of 6-8 symbols, and protein sequences of length 8000 by comparing strings of 3-4 symbols. It can also measure slight changes in a sequence, e.g., the insertion or deletion of a single symbol. We apply the method to the study of molecular evolution; the results show a hierarchical relationship among the cytochrome C protein sequences of different species, much like that in taxonomy, and these results are consistent with previous studies.
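
The abstract does not state the measure itself, so the following is only a rough sketch of the general idea of comparing several sequences through the frequencies of their short substrings. It assumes an equal-weight generalized Jensen-Shannon divergence over overlapping length-k substrings as a stand-in; this is not claimed to be the measure defined in the report, and the names kmer_distribution, entropy and js_divergence are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_distribution(seq, k):
    """Relative frequencies of all overlapping length-k substrings of seq."""
    if k < 1 or len(seq) < k:
        raise ValueError("sequence must contain at least one length-k substring")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(dist):
    """Shannon entropy, in bits, of a distribution given as {item: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def js_divergence(sequences, k):
    """Equal-weight Jensen-Shannon divergence among several sequences.

    ASSUMPTION: a stand-in discrepancy measure chosen for illustration,
    not the measure defined in the report.
    """
    dists = [kmer_distribution(s, k) for s in sequences]
    m = len(dists)
    # Mixture distribution: the average of the per-sequence distributions.
    mixture = Counter()
    for d in dists:
        for kmer, p in d.items():
            mixture[kmer] += p / m
    # Entropy of the mixture minus the mean entropy of the parts.
    return entropy(mixture) - sum(entropy(d) for d in dists) / m
```

Under this stand-in, identical sequences give a divergence of zero, two sequences with no length-k substring in common give one bit, and inserting a single symbol into one copy of a sequence gives a small positive value.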

Key Words: multiple sequence comparison, entropy, DNA, measurement

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1999/99-56.ps.gz