DIMACS TR: 2004-55
Optimal Protein Encoding
Authors: Logan Everett and Endre Boros
ABSTRACT
A common task in Bioinformatics is the search of sequence databases
for matching sequences. In protein sequence databases, searching is hindered by
both the increased amount of data and the complexity of sequence similarity
metrics. Protein similarity is not simply a matter of character matching, but
rather is determined by a matrix of scores assigned to every match and mismatch
(Henikoff and Henikoff, 1992). One strategy to increase search speed is to map
sequences into a binary space where the Hamming distance between strings is
comparable to the similarities of the original sequences (Halperin, Buhler,Karp,
Krauthgamer and Westover, 2003). Within binary Hamming spaces, statistically
proven sampling methods can be used for fast, reliably sensitive searches
(Buhler, 2002). However, mapping the protein alphabet to a binary Hamming space
often comes with a certain level of distortion. We have employed Linear
Programming techniques to model and study the nature of these mapping schemes.
Specifically, we have found the theoretically minimum distortion achievable for
several biological scoring matrices, as well as corresponding optimal encoding
weights. We have also analyzed the use of these encoding weights to generate
pseudo-random binary encodings that approach the theoretically minimum
distortions.
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2004/2004-55.pdf
DIMACS Home Page