DIMACS Working Group Meeting on Informatics of Protein Classification
December 15, 2000
DIMACS Center, Rutgers University, PIscataway, NJ
Presented under the auspices of the Special Year on Computational Molecular Biology.
- Casimir Kulikowski, Rutgers University, email@example.com
- Guy Montelione, Rutgers University, firstname.lastname@example.org
- Ilya Muchnik, Rutgers University, email@example.com
Co-sponsored by The New Jersey Commission on Science and Technology Initiative on Structural Genomics and Bioinformatics and by Rutgers Bioinformatics Initiative.
Building the protein family invariant: correlation matrix
Boris Galitsky & Sergey Shelepin
681 Main Str.
We build the invariant of protein family with respect to specific amino
acids, sequence, length, residue variability and quality of data set
representation. This invariant is based on revealing of correlation
between residues, reflecting the specific diversity and evolution of a
protein family. Our method of computation of correlation allows to
discover the following properties of protein families, which remained
unclear using the traditional estimation of residue correlation:
1. For any position, if it is correlated with the specific set of
highly correlated positions, than it is likely correlated with the
majority of other positions. Vice versa, if it is not correlated with
this set, it is unlikely correlated with the rest of positions.
2. Correlation strength weakly depends on the residue
3. Correlation strength does not depend on the distance between
positions in the sequence.
4. Protein correlation matrix has the well-structured rectangle
Based on the built invariant, one can distinguish an arbitrary protein
family from the set of sequences of the other nature (not actual
Each protein family has specific geometric properties of the
correlation matrix, which are used to relate a set of sequences to a
protein family without knowing the specific residues. Besides,
correlation matrices can be used as a framework for protein families
classification, complementary to the ones based on the specific folding
Analyis of Genomes and Transcriptomes in terms of the Occurrence of
Protein Parts and Features
Molecular Biophysics & Biochemistry Department
New Haven, CT 06520
My talk will focus on analyzing genomes and gene-expression data in
terms of the finite list of protein "parts". Depending on context, a
part could be a structural fold or sequence superfamily. I will touch
on the following topics:
* How one can compare different genomes in terms occurrence of various
parts in them. And how this idea can be extended to compare the
representation of parts in the genome versus the transcriptome. In
particular, this allows one to see what protein features are enriched
in highly expressed proteins.
* How one can analyze the relationship between where a part is located
and its transcriptome occurrence -- i.e. between a protein's
subcellular localization and its level of gene expression. We extend
this work to develope a formal Bayesian system for predicting
subcellular localization, partially based on gene expression data.
* To what degree is protein function and protein-protein interactions
related to similarities in the level of gene expression. Based on
developing a statistical significance formalism, I will argue that
while there is a definite relationship for certain classes of protein
functions and protein-protein interactions, the relationship is not
general and global. The absence of correlation is principally due to
the inconsistent way protein function is defined.
M Gerstein & R Jansen (2000). "The current excitement in
bioinformatics, analysis of whole-genome expression data: How does it
relate to protein structure and function?"
Curr. Opin. Struc. Biol. (in press).
A Drawid, R Jansen & M Gerstein (2000). "Gene Expression Levels are
Correlated with Protein Subcellular Localization," Trends in Genetics
A Drawid & M Gerstein (2000). "A Bayesian System Integrating
Expression Data with Sequence Patterns for Localizing Proteins:
Comprehensive Application to the Yeast Genome,"
J. Mol. Biol. 301:1059-75
R Jansen & M Gerstein (2000). "Analysis of the Yeast Transcriptome
with Broad Structural and Functional Categories: Characterizing Highly
Expressed Proteins," Nuc. Acids Res. 28:1481-1488
M Gerstein (1998). "Patterns of Protein-Fold Usage in Eight Microbial
Genomes: A Comprehensive Structural Census," Proteins 33: 518-534.
The classification of Immunoglobulin-like proteins
A. Kister and I. Gelfand,
Department of Mathematics, Rutgers University.
Most of the traditional tools for protein classification and structural
prediction are based on alignment of a sequence in question with sequences
of known protein families. These alignment methods require one to know a
significant fraction of the residues in the query sequence. We have
suggested a principally different approach for classification and
structural prediction of proteins based on the premise that it is
sufficient to know the residues at just a few key positions in the
sequence to make an assignment of that sequence to its proper protein fold
Our recent investigations showed that proteins with no significant
homology that share same type of immunoglobulin fold all contain a small
set of residues at specific secondary structural positions. This set of
residues constitutes the folding pattern of the immunoglobulin fold.
Residues at any one of these key positions are chemically related across
all Ig sequences and play approximately the same structural role in all
Ig-folded proteins. Energy calculations support the conclusion about the
decisive role of these residues in structure stability and showed that in
great majority of cases presence of key residues is a necessary and
sufficient condition for assigning a sequence into Ig fold.
Parsing of Structural Protein Domains from Sequence Data
Casimir A. Kulikowski
Computer Science Department
There has been much work in the past five years on the development of
automatic methods for defining protein sequence domains, which
are fragments of sequences with a high proportion of conservative positions,
helpful in evolutionary and functional analyses of proteins.
In contrast, methods for finding hints about possible structural domains
from sequence data have largely consisted of correlation analyses between
known structural domains and domains constructed from sequence data.
Effective techniques for detecting signals of structural domains from
sequence data would be extremely valuable for protein analysis, gene
finding, structural genomics, and drug design.
We have developed just such a set of methods for detecting structural
domains based on HMMs built from a subset of sequence-continuous Dali
Domain Dictionary (DDD) domains. In the process of testing our predictions
against an independent set of Scop domains we have carried out both HMM
and Blast matches in a preliminary study with good results.
To obtain the greatest reliability in such structural domain parsing,
however, it is necessary to develop machine learning procedures based on
consensus results from different domain knowledge sources (such as DDD and
Scop) and inference methods (such as HMMs and Blast search matches).
Two main concepts underpin our consensus reasoning: 1) independent detection
results from the different sources and homology searches on entire protein
sequence data must be carried out at a sufficiently large number of
confidence levels so as to produce a large enough set of
multiple, plausible candidate structural domains that could be present
within the full sequence; and 2) consensus construction needs to be
iteratively refined so as to construct a covering set of probable
candidate domains for the sequence.
The Fold Rush - A useful lesson from protein classification
Michal Linial, The Hebrew University
The number of proteins whose structure was solved at high resolution lags
way behind the number of proteins sequenced. It is a major obstacle in
studying protein structure to predict which proteins belong to new,
currently unknown superfamilies or folds. In an attempt to address this
problem we mapped all proteins with solved structures onto a graph of all
known protein sequences provided by ProtoMap. We wish to sort proteins
according to their likeliness to belong to new superfamilies. We
hypothesize that proteins within neighboring clusters tend to share common
structural superfamilies or folds. If true, the likelihood to find new
superfamilies (and folds) increases in clusters that are distal from other
solved structures within the graph. On the basis of this hypothesis, we
define an order relation between unsolved proteins according to their
"distance" from solved structures in the graph. Based on that order
relation we sort about 48,000 proteins (from the most likely to belong to
new superfamilies to the most unlikely). Our list may be partitioned to
three parts: the first contains ~35,000 proteins sharing a cluster with a
known structure; the second contains ~6,500 proteins in clusters
neighboring clusters containing known structures; the third contains the
rest of the proteins - 6096, in 1274 clusters. Over 97% of third part
proteins have no significant pairwise sequence similarity to any solved
protein (E score worse than 0.1).
We test the quality of the order relation using datasets of solved
structures that were not considered when the order was defined. The tests
show that our order is significantly (P-value ~ 10-5) better than a random
order. More interestingly, even when we ignore the first part, or when we
consider only the third part, the order performs better then random
(P-values: 0.002 and 0.15 respectively). For most test sets, an order
derived from PSI-BLAST results performed worse than our method (P-values:
0.008 and 0.21). We tested an order derived from the refinement of our
order using PSI-BLAST results. The last order usually performed better than
either of the first two.
Herein we present a method for selecting target proteins to be used in the
Structural Genomics Project.\
Combinatorial Model for Revealing Shelled Cores
in Biomolecular Structures
Borin Mirkin, Birbeck College, London
The primary aim is to introduce a combinatorial model for aggregate
representation of complex biological structures such as protein folds.
- establishing a framework for the analysis of systems of inter-related
elements using the concept of the tightness of a subset induced by a
monotonic entity-to-subset linkage
- incorporating restrictions imposed by such real-world
structure-affecting features as amino acid hydrophobicity
- constructing efficient algorithms for finding layered shell cores of
- investigating the relationship between our model and other order-based
structures such as hereditary mappings and convex
- specifying appropriate details of models for targeted biological
applications. Joint work with T. Fenner, G. Loizou, and I.
Structure-based Functional Genomics
Gaetano T. Montelione
Center for Advanced Biotechnology Medicine and
Dept. Molecular Biology and Biochemistry,
Rutgers University, Piscataway, NJ 08854
Genome sequencing projects have already determined nearly complete genome
sequences of several organisms, including human. The products of these
genes are widely recognized as the next generation of therapeutics and
targets for the development of pharmaceuticals. While identification of
these genes is proceeding quickly, elucidation of their three-dimensional
(3D) structures and biochemical functions lags far behind. In some cases,
knowledge of 3D structures of proteins can provide important insights into
evolutionary relationships that are not easily recognized by sequence
alignment comparisons. Thus, structure determination by NMR or X-ray
crystallography can sometimes provide key information regarding protein
fold class, locations and clustering of conserved residues, and surface
electrostatic field distributions that connect a protein sequence with
potential biochemical functions. The resulting limited set of putative
biochemical functions can then be tested by appropriate biochemical
assays. The goal of this work is to develop a "high-throughput" process
for structural analysis of novel gene products on a genomic scale and to
apply this in the analysis of novel gene products identified in the genome
Montelione, G. T.; Anderson, S. Nature Struct. Biol. 1999, 6: 11 - 12.
Structural Genomics: Keystone for a human proteome project.
Moseley, H. N. B.; Montelione, G. T. Curr. Opin. Struct. Biol. 1999, 9:
635 - 641. Automated analysis of NMR assignments and structures for
Alignment scores as kernel functions in a regularized support vector
classification method for fold recognition of remote protein families
Tula State University, Russia
One of fundamental principles of molecular biology says that the primary
structure of a protein, i.e. sequence of amino acid residues forming its
polypeptide chain, carries an essential amount of information for unambiguous
establishing its spatial structure. Despite the fact that each protein has its
own spatial structure, it is a typical phenomenon that the fold pattern remains
basically the same within large groups of evolutionarily allied proteins, so
that the "number" of essentially different spatial structures is much less than
that of known proteins. Since spatial structures are classified in that or
other manner, the estimation of the spatial structure of a given protein
reduces to its allocating over a finite set of classes, i.e. the problem
falls into the competence area of pattern recognition.
The traditional methodology of pattern recognition presupposes that the object
whose class-membership is to be recognized is represented by vector of some
numerical features and is considered as a point in the respective linear vector
space. However, the actual diversity of amino acid properties that may play an
important part in forming the spatial structure of a protein is so immensely
rich, that the choice of suitable numerical properties makes a special problem
which is the key one here. We consider here an alternative featureless approach
to recognition of spatial structure of proteins. It is proposed to judge about
the membership of a protein in one of the classes of spatial structures
immediately on the basis of measuring the proximity of its amino acid chain to
those of some other proteins whose spatial structure is known.
To infer the decision rule of recognition from a training sample of proteins of
known structure, we apply the traditional support vector method of machine
learning on the basis of treating amino acid chains as elements of a Hilbert
space, i.e. linear space with inner product, in what role the pairwise
alignment scores are used. The inevitable difficulty of the small size of
training samples is overcome by a special regularization technique that makes
use of some available a priori information on the sought-for decision rule.
The proposed approach to the problem of fold class recognition illustrated by
results of processing a collection of 396 mutually distant protein domains of
51 fold classes chosen from the SCOP database.
Accuracy of the pair-wise protein sequence alignment.
From the observations to a new approach.
M.A. Roytberg (Institute of mathematical problems in biology, Pushchino,
in collaboration with S.R.Sunyaev (Institute of Molecular Biology,
Moscow, Russia and EMBL, Heidelberg, Germany), A.V.Finkelstein
(Institute of Protein Research, Pushchino, Russia Pushchino, Russia),
G.Bogopolsky, P.Vlasov, N.Oleinikova.
The pair-wise alignment of protein sequences is a key step in the most
of the computational methods for prediction of protein function and
homology-based modelling of 3D-structures. An alignment algorithm is
supposed to reconstruct a “biologically correct“ alignment, which
reflects the chain of evolutionary events (substitutions, deletions and
insertions). In this study we started with the question of how well
“biologically correct” alignments can be reproduced by Smith-Waterman
algorithm, currently the most sensitive one. Although a true
“biologically correct” alignment is always unknown, accurate structural
or multiple alignments may serve as a reasonable approximation. Our
analysis was focused on the differences between golden standard and
algorithmically produced alignments and revealed some specific features
of currently used algorithms, which are responsible for the loss of
accuracy. Based on the analysis we propose a novel approach to align two
protein sequences, allowing us to achieve a higher performance compared
to conventional alignment techniques.
We started with more than 600 pair-wise alignments extracted from
structural and multiple alignments provided in curated databases
BAliBase (Thompson,Plewniak, Poch. Bioinformatics, 1999, 15, 87-88) and
tested the obtained results on the over 10000 pair-wise alignments from
SMART (Schultz,J., Copley,R.R.,Doerks,T., Ponting, C.P. and Bork, P.
(2000) SMART: A Web-based tool for the study of genetically mobile
domains Nucleic Acids Res 28, 231-234) .
For each pair of protein sequences the golden standard alignments have
been compared to the alignments constructed by Smith-Waterman algorithm
with standard settings. Percentage of amino acid pairs correctly aligned
by an algorithm with respect to the golden standard alignment was used
as a measure of the algorithmic alignment accuracy. The large scale
analysis shows that the accuracy of protein sequence alignment can be
improved for protein pairs at 10-40% sequence identity (conventional
methods achieve ~90% of accuracy for more similar pairs, whereas for the
pairs of less related sequences the likelihood to reconstruct correct
alignment is negligible). Usage of the uniform gap penalties for all
protein pairs accounts for ~7% of the accuracy loss. However, the
algorithm of specific tuning of gap penalties is still to be developed.
Exploration of golden standard alignments revealed a considerable number
of ungapped segments of negative or very low positive score (in terms of
used substitution matrices). This can be explained by some features of
the models used for construction of substitution matrices. Pattern of
amino acid substitutions strongly depends on local structural context
and local evolutionary rate, which can substantially vary along the
protein sequence. Therefore, unique substitution matrix cannot
adequately score substitutions in all segments of the alignment.
Standard algorithms tend to lose a majority of low scoring ungapped
segments of correct alignments completely. However, detailed analysis of
these “weak” segments shows that many of them contain a highly
positively scoring core.
A novel approach allowing us to restore correct alignment in segments
with positively scoring core has been developed. The method relies on
the substitution matrix only in the regions of significant sequence
similarity. Excessive tests confirmed that the method consistently
outperforms standard alignment techniques for protein pairs at the
similarity range of 10-40% of sequence identity.. The higher accuracy is
not due to compromise for an algorithm running time. Moreover, the
approach allows considerable reduction in the computation time needed.
Multiple Sequence Alignment
Based on the Quasi-concave Function Optimization
Over a Lower Semi lattice
Ness A.T.Ltd-TSG, Tel-Aviv, Israel
Multiple sequence alignment is usually considered as an optimization
problem, which has a statistical and a structural component. It is known
that in the problem of protein sequence alignment a processed sample is too
small and not representative in the statistical sense though this
information can be sufficient if an appropriate structural model is used. In
order to utilize this information a new structural description of the
pairwise alignment results union has been developed. It is shown that if the
structure is restored then Multiple Sequence Alignment is achieved.
Introduced structure represents the set of local maximums of quasi-concave
set function on a lower semi lattice, which in turn is a union of the
set-theoretical intervals. This union is a set of the consistent subsets of
diagonals, introduced by B. Morgenstern, A. Dress, and T. Werner (1996).
Algorithm for local maximums search on proposed structure has been
developed. It consists of an alternation of the Forward and Backward passes.
The Backward pass in this algorithm is a rigorous while the Forward pass is
based on heuristics. Multiple alignment of 5 protein sequences are used as
an illustration of the proposed algorithm. This work was done in cooperation
with Ilya Muchnik, DIMACS, Rutgers University.
Contacting the Center
Document last modified on December 13, 2000