DIMACS Seminar on Math and CS in Biology


The Statistics of Local Sequence Similarities and The Choice of Protein Alignment Scoring Systems


Stephen Altschul
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health


CoRE Building, Room 431
Busch Campus, Rutgers University


11:00 AM
Monday, February 12, 1996


One simple form of protein sequence comparison aligns only segments of the sequences being compared, and employs a "substitution matrix" to specify a score for each aligned pair of amino acids [1,2]. Within the past six years, a powerful statistical theory [3,4] has emerged for local alignments lacking gaps [5]. Its main features [6] are: first, that any substitution matrix is implicitly (if not explicitly) tailored to locating alignments with a specific frequency of aligned residue pairs; second, that alignment scores may be scaled so that they are expressed as bits of information; finally, that the information needed to distinguish an alignment from chance is directly proportional to the log of the search space size. An extension of the theory yields the ability to assess the significance of a collection of high-scoring segment pairs [7]. For alignments in which gaps are allowed, the distribution of alignment scores has not been established analytically. However, computational experiments strongly suggest that the same basic theory covers this broader class of alignments. While the relevant statistical parameters cannot be calculated from first principles, they may be estimated by random simulation [8,9] or database search [10,11]. How best to choose gap costs is an important open question, currently amenable only to empirical study [12,13].
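
The three features listed above for ungapped local alignments can be illustrated with a minimal Python sketch. It is not taken from the talk: it assumes the commonly stated Karlin-Altschul forms, namely that the scale parameter lambda solves sum_ij p_i p_j exp(lambda s_ij) = 1, that the implied target frequencies are q_ij = p_i p_j exp(lambda s_ij), and that a raw score S normalizes to S' = (lambda S - ln K) / ln 2 bits with an expected chance count E = m n 2^(-S'). The two-letter alphabet, its scores, and all function names are hypothetical, chosen only to keep the arithmetic easy to check.

```python
import math

# Hypothetical toy system (an assumption, not from the abstract):
# two equally frequent letters, match = +1, mismatch = -2.
TOY_FREQS = {"A": 0.5, "B": 0.5}
TOY_SCORES = {("A", "A"): 1, ("A", "B"): -2, ("B", "A"): -2, ("B", "B"): 1}


def solve_lambda(freqs, scores, tol=1e-12):
    """Find the positive root of sum_ij p_i p_j exp(lambda * s_ij) = 1 by bisection."""
    def f(lam):
        return sum(freqs[a] * freqs[b] * math.exp(lam * s)
                   for (a, b), s in scores.items()) - 1.0

    expected = sum(freqs[a] * freqs[b] * s for (a, b), s in scores.items())
    if expected >= 0 or max(scores.values()) <= 0:
        raise ValueError("need a negative expected score and a positive score")

    # f is negative just above 0 (its slope there is the expected score)
    # and eventually positive, so bracket the root and bisect.
    lo, hi = 1e-6, 1.0
    while f(hi) <= 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def target_frequencies(freqs, scores, lam):
    """Aligned-pair frequencies q_ij = p_i p_j exp(lambda * s_ij) implied by the scores."""
    return {(a, b): freqs[a] * freqs[b] * math.exp(lam * s)
            for (a, b), s in scores.items()}


def bit_score(raw_score, lam, K):
    """Normalize a raw score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_hits(bits, m, n):
    """Expected number of chance alignments scoring at least `bits`: E = m * n * 2^(-bits)."""
    return m * n * 2.0 ** (-bits)


def bits_needed(m, n, e_value=1.0):
    """Bits required for an expected chance count of `e_value` in an m-by-n search space."""
    return math.log2(m * n / e_value)


if __name__ == "__main__":
    lam = solve_lambda(TOY_FREQS, TOY_SCORES)
    q = target_frequencies(TOY_FREQS, TOY_SCORES, lam)
    print("lambda:", lam)
    print("target frequencies sum to:", sum(q.values()))
    print("bits needed, 1024 x 1024 search space:", bits_needed(1024, 1024))
```

In this toy system the implied target frequencies favor matches far more than the background frequencies do, which is the sense in which a scoring system is tailored to a particular class of alignments. The bits_needed function shows the third feature: the required score grows with the logarithm of the search space size m n. The parameter K must be supplied by the caller; the sketch does not compute it.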

[1] Smith, T.F. & Waterman, M.S. (1981) J Mol Biol 147:195-197.
[2] Pearson, W.R. & Lipman, D.J. (1988) Proc Natl Acad Sci USA 85:2444-2448.
[3] Karlin, S. & Altschul, S.F. (1990) Proc Natl Acad Sci USA 87:2264-2268.
[4] Dembo, A., Karlin, S. & Zeitouni, O. (1994) Ann Prob 22:2022-2039.
[5] Altschul, S.F. et al. (1990) J Mol Biol 215:403-410.
[6] Altschul, S.F. (1991) J Mol Biol 219:555-565.
[7] Karlin, S. & Altschul, S.F. (1993) Proc Natl Acad Sci USA 90:5873-5877.
[8] Waterman, M.S. & Vingron, M. (1994) Proc Natl Acad Sci USA 91:4625-4628.
[9] Altschul, S.F. & Gish, W. (1996) Meth Enzymol 266:460-480.
[10] Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) CABIOS 4:67-71.
[11] Mott, R. (1992) Bull Math Biol 54:59-75.
[12] Pearson, W.R. (1995) Prot Sci 4:1145-1160.
[13] Vogt, G., Etzold, T. & Argos, P. (1995) J Mol Biol 249:816-831.

Document last modified on February 6, 1996