DIMACS Working Group Meeting on Informatics of Protein Classification

December 15, 2000
DIMACS Center, Rutgers University, PIscataway, NJ

Casimir Kulikowski, Rutgers University, kulikows@cs.rutgers.edu
Guy Montelione, Rutgers University, guy@cabm.rutgers.edu
Ilya Muchnik, Rutgers University, muchnik@dimacs.rutgers.edu
Presented under the auspices of the Special Year on Computational Molecular Biology.

Co-sponsored by The New Jersey Commission on Science and Technology Initiative on Structural Genomics and Bioinformatics and by Rutgers Bioinformatics Initiative.



Building the protein family invariant: correlation matrix

Boris Galitsky & Sergey Shelepin
iAskWeb, Inc.
681 Main Str.
Waltham, MA

We build the invariant of protein family with respect to specific amino
acids, sequence, length, residue variability and quality of data set
representation. This invariant is based on revealing of correlation
between residues, reflecting the specific diversity and evolution of a
protein family. Our method of computation of correlation allows to
discover the following properties of protein families, which remained
unclear using the traditional estimation of residue correlation:

 1.        For any position, if it is correlated with the specific set of
highly correlated positions, than it is likely correlated with the 
majority of other positions. Vice versa, if it is not correlated with 
this set, it is unlikely correlated with the rest of positions.
 2.        Correlation strength weakly depends on the residue
 3.        Correlation strength does not depend on the distance between
 positions in the sequence.
 4.        Protein correlation matrix has the well-structured rectangle
block texture.

  Based on the built invariant, one can distinguish an arbitrary protein
family from the set of sequences of the other nature (not actual
  Each protein family has specific geometric properties of the
correlation matrix, which are used to relate a set of sequences to a
protein family without knowing the specific residues. Besides,
correlation matrices can be used as a framework for protein families
classification, complementary to the ones based on the specific folding
or motifs.

2. Analyis of Genomes and Transcriptomes in terms of the Occurrence of Protein Parts and Features Mark Gerstein Molecular Biophysics & Biochemistry Department Yale University New Haven, CT 06520 My talk will focus on analyzing genomes and gene-expression data in terms of the finite list of protein "parts". Depending on context, a part could be a structural fold or sequence superfamily. I will touch on the following topics: * How one can compare different genomes in terms occurrence of various parts in them. And how this idea can be extended to compare the representation of parts in the genome versus the transcriptome. In particular, this allows one to see what protein features are enriched in highly expressed proteins. * How one can analyze the relationship between where a part is located and its transcriptome occurrence -- i.e. between a protein's subcellular localization and its level of gene expression. We extend this work to develope a formal Bayesian system for predicting subcellular localization, partially based on gene expression data. * To what degree is protein function and protein-protein interactions related to similarities in the level of gene expression. Based on developing a statistical significance formalism, I will argue that while there is a definite relationship for certain classes of protein functions and protein-protein interactions, the relationship is not general and global. The absence of correlation is principally due to the inconsistent way protein function is defined. REFERENCES http://bioinfo.mbb.yale.edu M Gerstein & R Jansen (2000). "The current excitement in bioinformatics, analysis of whole-genome expression data: How does it relate to protein structure and function?" Curr. Opin. Struc. Biol. (in press). A Drawid, R Jansen & M Gerstein (2000). "Gene Expression Levels are Correlated with Protein Subcellular Localization," Trends in Genetics 16: 426-430. A Drawid & M Gerstein (2000). "A Bayesian System Integrating Expression Data with Sequence Patterns for Localizing Proteins: Comprehensive Application to the Yeast Genome," J. Mol. Biol. 301:1059-75 R Jansen & M Gerstein (2000). "Analysis of the Yeast Transcriptome with Broad Structural and Functional Categories: Characterizing Highly Expressed Proteins," Nuc. Acids Res. 28:1481-1488 M Gerstein (1998). "Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census," Proteins 33: 518-534.
3. The classification of Immunoglobulin-like proteins A. Kister and I. Gelfand, Department of Mathematics, Rutgers University. Most of the traditional tools for protein classification and structural prediction are based on alignment of a sequence in question with sequences of known protein families. These alignment methods require one to know a significant fraction of the residues in the query sequence. We have suggested a principally different approach for classification and structural prediction of proteins based on the premise that it is sufficient to know the residues at just a few key positions in the sequence to make an assignment of that sequence to its proper protein fold and family. Our recent investigations showed that proteins with no significant homology that share same type of immunoglobulin fold all contain a small set of residues at specific secondary structural positions. This set of residues constitutes the folding pattern of the immunoglobulin fold. Residues at any one of these key positions are chemically related across all Ig sequences and play approximately the same structural role in all Ig-folded proteins. Energy calculations support the conclusion about the decisive role of these residues in structure stability and showed that in great majority of cases presence of key residues is a necessary and sufficient condition for assigning a sequence into Ig fold.
4. Parsing of Structural Protein Domains from Sequence Data Casimir A. Kulikowski Computer Science Department There has been much work in the past five years on the development of automatic methods for defining protein sequence domains, which are fragments of sequences with a high proportion of conservative positions, helpful in evolutionary and functional analyses of proteins. In contrast, methods for finding hints about possible structural domains from sequence data have largely consisted of correlation analyses between known structural domains and domains constructed from sequence data. Effective techniques for detecting signals of structural domains from sequence data would be extremely valuable for protein analysis, gene finding, structural genomics, and drug design. We have developed just such a set of methods for detecting structural domains based on HMMs built from a subset of sequence-continuous Dali Domain Dictionary (DDD) domains. In the process of testing our predictions against an independent set of Scop domains we have carried out both HMM and Blast matches in a preliminary study with good results. To obtain the greatest reliability in such structural domain parsing, however, it is necessary to develop machine learning procedures based on consensus results from different domain knowledge sources (such as DDD and Scop) and inference methods (such as HMMs and Blast search matches). Two main concepts underpin our consensus reasoning: 1) independent detection results from the different sources and homology searches on entire protein sequence data must be carried out at a sufficiently large number of confidence levels so as to produce a large enough set of multiple, plausible candidate structural domains that could be present within the full sequence; and 2) consensus construction needs to be iteratively refined so as to construct a covering set of probable candidate domains for the sequence.
5. The Fold Rush - A useful lesson from protein classification Michal Linial, The Hebrew University The number of proteins whose structure was solved at high resolution lags way behind the number of proteins sequenced. It is a major obstacle in studying protein structure to predict which proteins belong to new, currently unknown superfamilies or folds. In an attempt to address this problem we mapped all proteins with solved structures onto a graph of all known protein sequences provided by ProtoMap. We wish to sort proteins according to their likeliness to belong to new superfamilies. We hypothesize that proteins within neighboring clusters tend to share common structural superfamilies or folds. If true, the likelihood to find new superfamilies (and folds) increases in clusters that are distal from other solved structures within the graph. On the basis of this hypothesis, we define an order relation between unsolved proteins according to their "distance" from solved structures in the graph. Based on that order relation we sort about 48,000 proteins (from the most likely to belong to new superfamilies to the most unlikely). Our list may be partitioned to three parts: the first contains ~35,000 proteins sharing a cluster with a known structure; the second contains ~6,500 proteins in clusters neighboring clusters containing known structures; the third contains the rest of the proteins - 6096, in 1274 clusters. Over 97% of third part proteins have no significant pairwise sequence similarity to any solved protein (E score worse than 0.1). We test the quality of the order relation using datasets of solved structures that were not considered when the order was defined. The tests show that our order is significantly (P-value ~ 10-5) better than a random order. More interestingly, even when we ignore the first part, or when we consider only the third part, the order performs better then random (P-values: 0.002 and 0.15 respectively). For most test sets, an order derived from PSI-BLAST results performed worse than our method (P-values: 0.008 and 0.21). We tested an order derived from the refinement of our order using PSI-BLAST results. The last order usually performed better than either of the first two. Herein we present a method for selecting target proteins to be used in the Structural Genomics Project.\
6. Combinatorial Model for Revealing Shelled Cores in Biomolecular Structures Borin Mirkin, Birbeck College, London The primary aim is to introduce a combinatorial model for aggregate representation of complex biological structures such as protein folds. This involves: - establishing a framework for the analysis of systems of inter-related elements using the concept of the tightness of a subset induced by a monotonic entity-to-subset linkage function; - incorporating restrictions imposed by such real-world structure-affecting features as amino acid hydrophobicity and protein domain structure; - constructing efficient algorithms for finding layered shell cores of tightness functions; - investigating the relationship between our model and other order-based structures such as hereditary mappings and convex geometries; - specifying appropriate details of models for targeted biological applications. Joint work with T. Fenner, G. Loizou, and I. Muchnik.
7. Structure-based Functional Genomics Gaetano T. Montelione Center for Advanced Biotechnology Medicine and Dept. Molecular Biology and Biochemistry, Rutgers University, Piscataway, NJ 08854 Genome sequencing projects have already determined nearly complete genome sequences of several organisms, including human. The products of these genes are widely recognized as the next generation of therapeutics and targets for the development of pharmaceuticals. While identification of these genes is proceeding quickly, elucidation of their three-dimensional (3D) structures and biochemical functions lags far behind. In some cases, knowledge of 3D structures of proteins can provide important insights into evolutionary relationships that are not easily recognized by sequence alignment comparisons. Thus, structure determination by NMR or X-ray crystallography can sometimes provide key information regarding protein fold class, locations and clustering of conserved residues, and surface electrostatic field distributions that connect a protein sequence with potential biochemical functions. The resulting limited set of putative biochemical functions can then be tested by appropriate biochemical assays. The goal of this work is to develop a "high-throughput" process for structural analysis of novel gene products on a genomic scale and to apply this in the analysis of novel gene products identified in the genome sequencing projects. Montelione, G. T.; Anderson, S. Nature Struct. Biol. 1999, 6: 11 - 12. Structural Genomics: Keystone for a human proteome project. Moseley, H. N. B.; Montelione, G. T. Curr. Opin. Struct. Biol. 1999, 9: 635 - 641. Automated analysis of NMR assignments and structures for proteins.
8. Alignment scores as kernel functions in a regularized support vector classification method for fold recognition of remote protein families Vadim Mottl Tula State University, Russia One of fundamental principles of molecular biology says that the primary structure of a protein, i.e. sequence of amino acid residues forming its polypeptide chain, carries an essential amount of information for unambiguous establishing its spatial structure. Despite the fact that each protein has its own spatial structure, it is a typical phenomenon that the fold pattern remains basically the same within large groups of evolutionarily allied proteins, so that the "number" of essentially different spatial structures is much less than that of known proteins. Since spatial structures are classified in that or other manner, the estimation of the spatial structure of a given protein reduces to its allocating over a finite set of classes, i.e. the problem falls into the competence area of pattern recognition. The traditional methodology of pattern recognition presupposes that the object whose class-membership is to be recognized is represented by vector of some numerical features and is considered as a point in the respective linear vector space. However, the actual diversity of amino acid properties that may play an important part in forming the spatial structure of a protein is so immensely rich, that the choice of suitable numerical properties makes a special problem which is the key one here. We consider here an alternative featureless approach to recognition of spatial structure of proteins. It is proposed to judge about the membership of a protein in one of the classes of spatial structures immediately on the basis of measuring the proximity of its amino acid chain to those of some other proteins whose spatial structure is known. To infer the decision rule of recognition from a training sample of proteins of known structure, we apply the traditional support vector method of machine learning on the basis of treating amino acid chains as elements of a Hilbert space, i.e. linear space with inner product, in what role the pairwise alignment scores are used. The inevitable difficulty of the small size of training samples is overcome by a special regularization technique that makes use of some available a priori information on the sought-for decision rule. The proposed approach to the problem of fold class recognition illustrated by results of processing a collection of 396 mutually distant protein domains of 51 fold classes chosen from the SCOP database.
9. Accuracy of the pair-wise protein sequence alignment. From the observations to a new approach. M.A. Roytberg (Institute of mathematical problems in biology, Pushchino, Russia) in collaboration with S.R.Sunyaev (Institute of Molecular Biology, Moscow, Russia and EMBL, Heidelberg, Germany), A.V.Finkelstein (Institute of Protein Research, Pushchino, Russia Pushchino, Russia), G.Bogopolsky, P.Vlasov, N.Oleinikova. The pair-wise alignment of protein sequences is a key step in the most of the computational methods for prediction of protein function and homology-based modelling of 3D-structures. An alignment algorithm is supposed to reconstruct a “biologically correct“ alignment, which reflects the chain of evolutionary events (substitutions, deletions and insertions). In this study we started with the question of how well “biologically correct” alignments can be reproduced by Smith-Waterman algorithm, currently the most sensitive one. Although a true “biologically correct” alignment is always unknown, accurate structural or multiple alignments may serve as a reasonable approximation. Our analysis was focused on the differences between golden standard and algorithmically produced alignments and revealed some specific features of currently used algorithms, which are responsible for the loss of accuracy. Based on the analysis we propose a novel approach to align two protein sequences, allowing us to achieve a higher performance compared to conventional alignment techniques. We started with more than 600 pair-wise alignments extracted from structural and multiple alignments provided in curated databases BAliBase (Thompson,Plewniak, Poch. Bioinformatics, 1999, 15, 87-88) and tested the obtained results on the over 10000 pair-wise alignments from SMART (Schultz,J., Copley,R.R.,Doerks,T., Ponting, C.P. and Bork, P. (2000) SMART: A Web-based tool for the study of genetically mobile domains Nucleic Acids Res 28, 231-234) . For each pair of protein sequences the golden standard alignments have been compared to the alignments constructed by Smith-Waterman algorithm with standard settings. Percentage of amino acid pairs correctly aligned by an algorithm with respect to the golden standard alignment was used as a measure of the algorithmic alignment accuracy. The large scale analysis shows that the accuracy of protein sequence alignment can be improved for protein pairs at 10-40% sequence identity (conventional methods achieve ~90% of accuracy for more similar pairs, whereas for the pairs of less related sequences the likelihood to reconstruct correct alignment is negligible). Usage of the uniform gap penalties for all protein pairs accounts for ~7% of the accuracy loss. However, the algorithm of specific tuning of gap penalties is still to be developed. Exploration of golden standard alignments revealed a considerable number of ungapped segments of negative or very low positive score (in terms of used substitution matrices). This can be explained by some features of the models used for construction of substitution matrices. Pattern of amino acid substitutions strongly depends on local structural context and local evolutionary rate, which can substantially vary along the protein sequence. Therefore, unique substitution matrix cannot adequately score substitutions in all segments of the alignment. Standard algorithms tend to lose a majority of low scoring ungapped segments of correct alignments completely. However, detailed analysis of these “weak” segments shows that many of them contain a highly positively scoring core. A novel approach allowing us to restore correct alignment in segments with positively scoring core has been developed. The method relies on the substitution matrix only in the regions of significant sequence similarity. Excessive tests confirmed that the method consistently outperforms standard alignment techniques for protein pairs at the similarity range of 10-40% of sequence identity.. The higher accuracy is not due to compromise for an algorithm running time. Moreover, the approach allows considerable reduction in the computation time needed.
10. Multiple Sequence Alignment Based on the Quasi-concave Function Optimization Over a Lower Semi lattice Leonid Shvartser, Ness A.T.Ltd-TSG, Tel-Aviv, Israel Multiple sequence alignment is usually considered as an optimization problem, which has a statistical and a structural component. It is known that in the problem of protein sequence alignment a processed sample is too small and not representative in the statistical sense though this information can be sufficient if an appropriate structural model is used. In order to utilize this information a new structural description of the pairwise alignment results union has been developed. It is shown that if the structure is restored then Multiple Sequence Alignment is achieved. Introduced structure represents the set of local maximums of quasi-concave set function on a lower semi lattice, which in turn is a union of the set-theoretical intervals. This union is a set of the consistent subsets of diagonals, introduced by B. Morgenstern, A. Dress, and T. Werner (1996). Algorithm for local maximums search on proposed structure has been developed. It consists of an alternation of the Forward and Backward passes. The Backward pass in this algorithm is a rigorous while the Forward pass is based on heuristics. Multiple alignment of 5 protein sequences are used as an illustration of the proposed algorithm. This work was done in cooperation with Ilya Muchnik, DIMACS, Rutgers University.

Previous: Participation
Next: Registration
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on December 13, 2000