DIMACS Series in
Discrete Mathematics and Theoretical Computer Science

VOLUME EIGHT
TITLE: "MATHEMATICAL METHODS OF ANALYSIS OF BIOPOLYMER SEQUENCES"
EDITORS: Simon Gindikin
Published by the American Mathematical Society


Ordering Information

This volume may be obtained from the AMS or through bookstores in your area.

To order through AMS contact the AMS Customer Services Department, P.O. Box 6248, Providence, Rhode Island 02940-6248 USA. For Visa, Mastercard, Discover, and American Express orders call 1-800-321-4AMS.

You may also visit the AMS Bookstore and order directly from there. DIMACS does not distribute or sell these books.



PREFACE


This collection was prepared mainly by participants of the seminar on mathematical methods in the analysis of biopolymer sequences, which worked for several years at the Laboratory of Molecular Biology and Bioorganic Chemistry (now the Institute of Physical and Chemical Problems in Biology) at Moscow State University. I think that it would of interest to American and European specialists in this area to become acquainted with the work of their colleagues form Russia. It si not my intention here to discuss the peculiarities of doing science in Russia, with the extremely limited access to computers, banks of sequences, scientific information, etc. I hope that western scientists would never find themselves in the situation when this experience might become useful. Instead, I would like to say a few words about the seminar and the papers in this collection.

The participants of the seminar could be divided roughly into three categories. First, there were mathematicians who work successfully in theoretical mathematics, but are also interested in biology and have some experience in it. This group included Dmitrii Fuchs, Andrei Leontovich, Nicolai Vasiliev. I must mention that the laboratory has the tradition, coming from I.M. Gelfand, of the participation of mathematicians in biological research, and not as apprentices helping with computations and mathematical models, but as competent team members who understand raw biological data.

The second group consisted of several younger mathematicians who already started working professionally in mathematical and computational aspects of molecular biology (among them Leonid Brodsky, Mikhail Gelfand, Pavel Pevzner).

The third group included young biologists with classical biological education, who worked successfully in experimental biology, and believed in the computer as a tool in biological research. Here I should mention Konstantin Chumakov, Alexander Gorbalenya, Eugene Koonin. I was surprised to see how resolute they were to spend at the computer time they might have used for experiments, how easily they discussed the reelevant mathematics, wrote necessary computer programs, worked together with mathematicians in designing and implementing new algorithms reflecting biological aspects of the problem. The idea that unified all participants was that the problem of reading and deciphering biological texts requires a very serious, in certain respect rather unusual mathematics. The term "mathematical biology" should be used here very cautiously, especially since it will inevitably bring parallels with the term "mathematical physics". However, even with this caution in mind, the phantom of this new "mathematics of life" could be seen at the horizon, even if only occasionally.

When a mathematician wants (or is forced) to work in applications, the highest reward for him is a possiblity to keep abreast with modern mathematics, when no outside considerations have to be used to justify his activity. We were under impression that this was the case with analysis of biolopolymer sequences. Our biologists friends were absolutely confident that the problems arising here could not resolved without advancing some new mathematical methods and techniques.

The seminar worked for several years. I hope that this collection reflects its activity, possibly not completely, but in a sufficiently representative way.

The first group of papers in this collection is devoted to purely mathematical problems related to biolopolymer sequences. These problems are mostly related to discrete mathematics or to mathematical statistics. I would like to mention specifically the role of Andrei Leontovich, who spent a lot of time thinking about the relevant aspects of mathematical statistics. He formulated several problems that are undoubtedly interesting and new in their theoretical aspect; their solutions would immediately lead to important biological applications. Leotovich's paper in the collection contains a proff that the so called Daihoff matrix, used traditionally in the evaluation of the similarity of sequence fragments is, in some sense, optimal for this purpose. The paper of V. Piterbarg answers certain questions raised by Leontovich about the Bernoulli sequences related to the choice of an optimal window for the analysis of similarity of fragments.

The next part contains two papers. A large survey by M. Gelfand, that includes also an extensive biobliography, adresses a very important problem of separating those regions in DNA sequences that might be responsible for certain biological functions of a gene. It is my impression that the paper by E. Koonin on certain general principles of comparative analysis of biological sequences might be interesting not only to biologists, but to mathematicians as well. I think that its author represents a generation of biologists who have worked extensively with computers, and at the same time are quite experienced in this experimental work.

The next four papers describe new methods for solution of important problems in sequence analysis. In the paper by Davydov et al. the problem of the structual and functional classification of proteins using physical-chemical profiles is discussed. A new approach to the prediction of protein-encoding genes is described in the paper by M. Gelfand. P. Pevzner studies physical mapping for biopolymer sequences and discusses the corresponding mathematical problems. M. Roytberg suggests a construction of a near-optimal aligning, that leads to the fast aligning algorithm.

The last part of the collection consists of papers describing the corresponding software. Work at the seminar went in parallel with the design of a new software package for the analysis of biological sequences. L. Brodsky, who was the main driving force behind this project (nameed GeneBee), involved into this activity several mathematicians, biologists, and programmers. Unable to evaluate this project professionally, I must only say that the corresponding algorithms use several original mathematical ideas, and that the project itself is very popular among the biologists of the (as they now say) the former Soviet Union.

I hope that the publication of the present book will help further integration of Russian scientists into the world community of their colleagues actively working in the development and applications of mathematical methods in molecular biology.

All papers in this collection are in their final form, and their English versions will not be submitted for publication elsewhere.


TABLE OF CONTENTS




Forward								ix

Preface								xi

I. Some Problems of Mathematical Statistics

On the Optimality of the Dayhoff Matrix for Computing the	
   Similarity Score Between Fragments of Biological Sequences
	A.M. Leontovich						1

On the Distribution of the Maximum Similarity Score for 
   Fragments of Two Random Sequences
	V.I. Piterbarg						11

II. Surveys and Methodology

Computer Functional Analysis of Nucleotide Sequences:  Problems
   and Approaches
	M.S. Gelfand						19

Comparative Analysis of Biopolymer Sequences:  Reflections on
   the Validity of the Methodology and the Underlying General 
   Principles
	Eugene V. Koonin					63

III.  Methods and Algorithms

A New Method for Recognition of Structural and Functional Motifs
   in Protein Sequences Based on the Principal Component Analysis 
   of Profiles of Physico-Chemical Properties
	D. R. Davydov, I Erlikh, E.E. Klotz, and V.G. Fridman	75

Prediction of Protein-Coding Regions in DNA of Higher Eukaryotes
	M.S. Gelfand						87

DNA Physical Mapping, Flows in Networks and Minimum Cycles
   Mean in Graphs
	Pavel Pevzner						99

Fast Algorithm for Optimal Aligning of Symbol Sequences
	M.A. Roytberg						113

IV. Software

GeneBee: The Program Package for Biopolymer Structure Analysis
	L.I. Brodsky, A.V. Vassilyev, Ya. L. Kalaydzidis, 
          Yu. S. Osipov, R.L. Tatuzov, and S. I. Feranchuk	127

Data Compression and the Use of Vector Operations for the Fast
   Search Problems in Analysis of Biological Sequences
	R. L. Tatuzov and T.A. Mamikonova			141

Organization and Storage of Information in Molecular Genetics 
   Data Banks
	R.L. Tatuzov and T.A. Mamikonova			145


Index Index of Volumes
DIMACS Homepage
Contacting the Center
Document last modified on October 28, 1998.