DIMACS TR: 2004-55

Optimal Protein Encoding

Authors: Logan Everett and Endre Boros

A common task in Bioinformatics is the search of sequence databases for matching sequences. In protein sequence databases, searching is hindered by both the increased amount of data and the complexity of sequence similarity metrics. Protein similarity is not simply a matter of character matching, but rather is determined by a matrix of scores assigned to every match and mismatch (Henikoff and Henikoff, 1992). One strategy to increase search speed is to map sequences into a binary space where the Hamming distance between strings is comparable to the similarities of the original sequences (Halperin, Buhler,Karp, Krauthgamer and Westover, 2003). Within binary Hamming spaces, statistically proven sampling methods can be used for fast, reliably sensitive searches (Buhler, 2002). However, mapping the protein alphabet to a binary Hamming space often comes with a certain level of distortion. We have employed Linear Programming techniques to model and study the nature of these mapping schemes. Specifically, we have found the theoretically minimum distortion achievable for several biological scoring matrices, as well as corresponding optimal encoding weights. We have also analyzed the use of these encoding weights to generate pseudo-random binary encodings that approach the theoretically minimum distortions.

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2004/2004-55.pdf
DIMACS Home Page