DIMACS TR: 2001-01

Alignment scores in a regularized support vector classification method for fold recognition of remote protein families

Authors: Vadim Mottl, Sergey Dvoenko, Oleg Seredin, Casimir Kulikowski, Ilya Muchnik


One of the fundamental principles of molecular biology holds that the primary structure of a protein, i.e. the sequence of amino acid residues forming its polypeptide chain, carries an essential part of the information needed to establish its spatial structure unambiguously. Although each protein has its own spatial structure, the fold pattern typically remains basically the same within large groups of evolutionarily related proteins, so that the number of essentially different spatial structures is much smaller than the number of known proteins. Since spatial structures are classified in one way or another, estimating the spatial structure of a given protein reduces to assigning it to one of a finite set of classes, i.e. the problem falls within the scope of pattern recognition.

The traditional methodology of pattern recognition presupposes that the object whose class membership is to be recognized is represented by a vector of numerical features and is treated as a point in the corresponding linear vector space. However, the diversity of amino acid properties that may play an important part in forming the spatial structure of a protein is so rich that the choice of suitable numerical features becomes a problem in its own right, and indeed the key one. We consider an alternative, featureless approach to recognizing the spatial structure of proteins. We propose to judge the membership of a protein in one of the classes of spatial structures directly from measurements of the proximity of its amino acid chain to those of other proteins whose spatial structure is known.
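As an illustration of what a featureless representation can look like, the sketch below scores a query chain against a few reference chains with a plain Needleman-Wunsch global alignment and assigns the class of the best-scoring reference. The match, mismatch, and gap values, the toy sequences, and the nearest-reference decision are all assumptions made for this example; the abstract does not state which alignment procedure or scoring scheme the report uses, and its actual decision rule is the support vector method described in the next paragraph.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not the report's scheme.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# Hypothetical reference chains with known fold labels (made-up toy data).
references = [
    ("ACDEFGHIK", "fold_A"),
    ("ACDEFGHLK", "fold_A"),
    ("WWYYPPTTS", "fold_B"),
]

query = "ACDEFGHIR"

# The query is described only by its scores against the references,
# not by hand-chosen numerical features.
score_vector = [alignment_score(query, seq) for seq, _ in references]

# Simplest possible decision: take the class of the closest reference.
best_index = max(range(len(references)), key=lambda i: score_vector[i])
predicted_fold = references[best_index][1]
print(score_vector, predicted_fold)
```

The vector of scores plays the role that a feature vector plays in the traditional setting, which is why no explicit choice of amino acid properties is needed.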

To infer the decision rule from a training sample of proteins of known structure, we apply the traditional support vector method of machine learning, treating amino acid chains as elements of a Hilbert space, i.e. a linear space with an inner product, the role of which is played by pairwise alignment scores. The inevitable difficulty of small training samples is overcome by a special regularization technique that makes use of available a priori information about the sought decision rule.
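The idea of training a support vector classifier directly from pairwise similarities can be sketched as follows. This is a minimal two-class example that solves the standard soft-margin dual (without a bias term) by coordinate ascent on a precomputed similarity matrix. The matrix `K_toy`, the labels, the parameter `C`, and the function names are hypothetical and chosen for illustration. The sketch assumes the matrix is symmetric and positive semidefinite, which raw alignment scores do not guarantee, and it uses only the generic box constraint `C` as regularization; the report's special regularization technique is not described in the abstract and is not reproduced here.

```python
def train_kernel_svm(K, y, C=1.0, epochs=200):
    """Coordinate ascent on the soft-margin SVM dual (no bias term).

    K is a precomputed symmetric similarity matrix standing in for inner
    products; y holds labels in {+1, -1}. Assumes K[i][i] > 0 and K is PSD.
    """
    n = len(y)
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # Current decision value for training item i.
            f_i = sum(alpha[j] * y[j] * K[i][j] for j in range(n))
            # Newton step on the i-th dual coordinate, clipped to [0, C].
            step = (1.0 - y[i] * f_i) / K[i][i]
            alpha[i] = min(C, max(0.0, alpha[i] + step))
    return alpha


def decision_value(alpha, y, k_row):
    """Decision value for an item given its similarities to the training items."""
    return sum(a * yi * k for a, yi, k in zip(alpha, y, k_row))


# Made-up similarity matrix for four training chains (two per class).
K_toy = [
    [4, 3, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 4, 3],
    [1, 0, 3, 4],
]
y_toy = [1, 1, -1, -1]

alpha = train_kernel_svm(K_toy, y_toy, C=1.0)

# A new chain is classified from its similarities to the training chains alone.
k_new = [3, 3, 1, 0]
label_new = 1 if decision_value(alpha, y_toy, k_new) > 0 else -1
print(alpha, label_new)
```

Note that neither training nor classification ever needs coordinates of the chains themselves, only their pairwise similarities, which is what makes the inner-product view of alignment scores attractive.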

The proposed approach to fold class recognition is illustrated by the results of processing a collection of 396 mutually distant protein domains from 51 fold classes chosen from the SCOP database.

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2001/2001-01.ps.gz
