DIMACS Workshop on Protein Domains: Identification, Classification and Evolution

February 27-28, 2003
DIMACS Center, CoRE Building, Rutgers University

Stephen Bryant, National Institutes of Health, bryant@ncbi.nlm.nih.gov
Teresa Przytycka, National Institutes of Health, przytyck@ncbi.nlm.nih.gov
Presented under the auspices of the Special Focus on Computational Molecular Biology.


Stephen Altschul, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

Title: Assessing the accuracy of database search methods, and improving the performance of PSI-BLAST

A variety of measures have been proposed for assessing the accuracy of sequence database search methods. One measure that has gained wide use is the ROC score, derived from a graph of false vs. true positives as the alignment cutoff score varies. An interesting question is when the ROC scores of two different methods can be said to differ significantly. Recent analytic results concerning bootstrap resampling applied to ROC scores provide one possible answer to this question.
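As a rough illustration of the ROC comparison described above, the sketch below computes a truncated ROC score for a ranked hit list and compares two methods with a paired percentile bootstrap over queries. The truncation at n false positives, the function names, and the empirical (rather than analytic) bootstrap are assumptions made for this example; they are not taken from the talk.

```python
import random


def roc_n(ranked_labels, n):
    """Truncated ROC score for one ranked hit list (best score first).

    ranked_labels: booleans, True for a true positive, False for a false one.
    Assumption: if the list holds fewer than n false positives, the missing
    ones are treated as ranking below every true positive.
    """
    total_true = sum(ranked_labels)
    if total_true == 0 or n <= 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


def bootstrap_mean_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap over queries for the mean ROC difference (A minus B).

    Returns the observed mean difference and a 95% percentile interval.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    observed = sum(diffs) / k
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(k)] for _ in range(k)]
        means.append(sum(sample) / k)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return observed, (low, high)
```

Under this sketch, two methods would be called significantly different when the interval returned by `bootstrap_mean_diff` excludes zero.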

We have used ROC analysis to assess a large number of possible refinements of the original, 1997 version of PSI-BLAST. Several modifications lead to significant or near-significant improvements in program accuracy. The most important among these is the incorporation of sequence-composition based statistics, which substantially suppress the corruption of protein profiles by false positive alignments.

Alex Bateman, Wellcome Trust Sanger Institute

Title: The language of proteins

The Pfam database of protein families has treated proteins and domains independently of each other in the past. However, we know that most proteins are part of large complexes and the domains within proteins interact to compose their cellular function.

In this seminar I will discuss the two ways we are integrating and using this higher order organisation to improve detection of protein domains and our understanding of the interactions between domains.

1) Methods used in voice recognition have been applied to protein domain detection. Profile-HMMs have been applied successfully and are used in many of the domain databases such as SMART and Pfam.

In the speech recognition field it has been shown that models that take into account the proximity of words during word prediction are more accurate. This higher-order model is called the language model.

We have shown that the techniques used can be applied to finding protein domains. In essence the context of domain combinations is used to detect more distant similarities. This approach can be used to include other kinds of contextual information such as the species distribution of protein domains.

2) We are investigating the atomic details of domain-domain interactions using proteins of known structure. This work will provide the basis for allowing Pfam users to investigate protein-protein interactions. Some of the results of this work will be presented.
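The context idea in point 1) can be sketched with a simple bigram model over domain architectures: a candidate domain gets a bonus (in bits) when it often follows the preceding domain and a penalty when it rarely does. The smoothing scheme, the function names, and the idea of adding this term to a profile-HMM score are assumptions made for illustration; this is not the Pfam implementation.

```python
import math
from collections import Counter

START = "<start>"  # hypothetical marker for the N-terminal end of a protein


def train_context_model(architectures, alpha=1.0):
    """Build a bigram context scorer from known domain architectures.

    architectures: lists of domain names in N- to C-terminal order.
    Returns context_bits(prev, cur) = log2 P(cur | prev) / P(cur),
    with add-alpha smoothing (an assumption of this sketch).
    """
    vocab = {d for arch in architectures for d in arch}
    unigram = Counter(d for arch in architectures for d in arch)
    bigram = Counter()
    prev_total = Counter()
    for arch in architectures:
        padded = [START] + list(arch)
        for prev, cur in zip(padded, padded[1:]):
            bigram[(prev, cur)] += 1
            prev_total[prev] += 1
    n_tokens = sum(unigram.values())
    v = len(vocab)

    def context_bits(prev, cur):
        p_cond = (bigram[(prev, cur)] + alpha) / (prev_total[prev] + alpha * v)
        p_marg = (unigram[cur] + alpha) / (n_tokens + alpha * v)
        return math.log2(p_cond / p_marg)

    return context_bits


# Toy training set; the domain names are labels only.
toy_architectures = [["SH3", "SH2", "Kinase"]] * 3 + [["PH", "Kinase"], ["SH3"]]
context_bits = train_context_model(toy_architectures)
```

A weak profile hit to `Kinase` immediately after an `SH2` domain would then receive a positive context term, while an `SH3` hit in the same position would be penalised.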

Bonnie Berger, Mathematics Department and Lab. for Computer Science, MIT

Title: Discovery of Sequence-Structure Patterns across Diverse Proteins

We have developed a new program, Trilogy, for the automated discovery of sequence-structure patterns in proteins. Trilogy identifies several thousand high-scoring patterns that occur across protein families, which include both previously identified and novel motifs. We hope to employ these sequence-structure patterns in predicting protein structure from sequence, annotating newly determined protein structures, and identifying novel motifs of potential functional or structural significance. We expect that automated approaches such as the Trilogy algorithm will become increasingly important as the structural genomics initiatives begin to produce protein structures in high-throughput fashion.

(This is joint work with Phil Bradley and Peter S. Kim.)

Stephen H. Bryant, National Center for Biotechnology Information, NIH

Title: Structure-Based Alignments of Conserved Domains

Multiple alignments for NCBI's conserved domain database are based on explicit identification of a core structure inferred to be present in all family members. This alignment model is intended to support mapping of structure and family-specific annotation to sequences identified in RPS-BLAST database searches. Under this model core structure is defined as a series of chain-continuous sites, apparent in alignment displays as ungapped blocks. Candidate core sites are initially identified from the intersection of pairwise sequence and/or structure alignments involving a reference protein. The core structure is then refined iteratively subject to a number of constraints: a) When multiple structures are known, core sites must superimpose. b) Sequence segments aligned with core sites must exhibit significant RPS-BLAST and/or threading scores. c) Mapping of core sites from sequence to structure must not imply any physically impossible models. d) The core structure must include functional sites identified from complexes, bound ligands, or other experimental data. e) Fragmentary sequences that do not include a complete core structure must be excluded. These constraints are imposed by a combination of automated alignment algorithms and manual curation. In the talk I will present examples of core structure alignments and an analysis of their effectiveness in database searching.

Christine Vogel, Sarah Teichmann and Cyrus Chothia, MRC Laboratory of Molecular Biology

Title: Changes in Protein Repertoires that Underlay Increases in Biological Complexity: the Immunoglobulin Superfamily in Drosophila melanogaster and Caenorhabditis elegans

During the course of evolution life has become more complex. What are the changes in protein repertoires that are responsible for this?

The nematode C. elegans has 1,000 cells and a simple physiology. Drosophila has many more cells, perhaps 1,000,000, and a more complex physiology and mode of development. These differences do not have a simple relation to the number of genes in the two organisms: C. elegans has 19,518 and Drosophila has 13,639.

Members of the immunoglobulin superfamily (IgSF) play a major role in the development and structure of the nervous system and significant roles in the extracellular matrix and muscle. Using sequences from the genome projects and hidden Markov models, we have determined the repertoire of IgSF proteins in the two organisms.

Some 35% of these had been experimentally characterised in previous work. The other 65% are experimentally uncharacterised. The domain structure of each protein was also determined.

In this talk, the IgSF repertoires in C. elegans and Drosophila will be compared and contrasted, and the implications of the results for the evolution of complexity will be discussed.

Doron Betel and Christopher Hogue, Mount Sinai Hospital

Title: Protein Domain Research and Visualization Tools for the Biomolecular Interaction Network Database

We are building BIND tools that display protein domains and support their computational analysis. We have built a cluster implementation of RPS-BLAST together with the CDD and InterPro databases (http://seqhound.mshri.on.ca) that serves domain information to help scientists create script-based analyses for their queries. This infrastructure also drives new visualization tools in BIND for scientists who are interested in seeing the domain architecture of protein complexes found in our database. Further, we have been analyzing protein interaction datasets to look for statistically significant co-occurrences of protein domains within protein complexes. Preliminary results will be shown and discussed. Finally, we will demonstrate our ProteoGlyph system, a line-based graphics system for visualizing specific protein domains, upon which sequence annotation may be simultaneously presented. We discuss generating the required shape diversity and determining the distinguishability of symbols, as well as the problem of assigning such symbols to best indicate protein structure and function to the viewer.

Liisa Holm, EMBL, European Bioinformatics Institute

Title: Recurrent domains and domain space

Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. We optimize domain boundaries based on the principle that the 'best', concise description of all-on-all alignments in terms of domains uses large and completely covered units with high frequency. This operational definition is applicable to structures and sequences, and yields results in reasonable agreement with manual domain definitions. Domain cutting disentangles the protein similarity graph, enabling the generation of global overviews of protein space as well as the identification of homologous domain families by hierarchical clustering.

Alexander E. Kister, Rutgers University

Title: Analysis of common structural and sequence features of beta sandwich and beta barrel proteins.

The sandwich-like proteins (SP) and the barrel-like proteins (BP) are two very different groups of proteins, currently comprising 69 SP superfamilies in 38 protein folds and 67 BP superfamilies in 37 protein folds, respectively. The goal of this work is to define the structural and sequence features common to these proteins. We will discuss the following main questions: (i) Common structural features of SP: a new rule that determines the supersecondary structures of SP. Analysis of the arrangements of strands within the main sandwich sheets revealed a rigorously defined constraint on the supersecondary substructure that holds true for 94% of known SP structures; (ii) Sequence determinants of SP. As sequence similarity among these non-homologous proteins is usually not detectable even with the most powerful sequence-comparison algorithms, we employ a structure-based approach to sequence alignment. Analysis revealed eight hydrophobic positions conserved across all SP; (iii) Application of the sequence determinants to protein classification and structure prediction. A novel algorithm for protein classification, which does not require homology of a query sequence with known proteins, is suggested; (iv) Analysis of the arrangements of strands in BP. Comparison of the combinatorics of strands in barrels with different numbers of strands (n*=4, 5, 6, 8 and 10) revealed the main regularities in supersecondary structures; (v) Common and distinctive structural features in SP and BP. Analysis of the common geometrical core in these proteins.

Joint work with Alexei V. Finkelstein, Institute of Protein Research, Russian Academy of Sciences, Russia, and Israel M. Gelfand, Department of Mathematics, Rutgers University, USA.


Aron Marchler-Bauer, NCBI, National Library of Medicine, NIH

Title: A Conserved Domain Database

A set of pre-calculated position-specific score matrices (PSSMs) is used to rapidly detect and annotate the position of conserved domain subsequences in proteins using RPS-BLAST, a variant of the PSI-BLAST algorithm. Its sensitivity may be limited by the search heuristics and the scope of the alignment models used for PSSM calculation. Our goal is to build a system for the accurate annotation of functional domains and associated conserved sites in proteins. I will describe efforts undertaken to increase coverage, sensitivity, and specificity of the annotation system. Very diverse domain families cannot be represented well with single alignments and search models. The redundancy created by splitting models into sub-families may be inherent to all comprehensive domain model collections. We attempt to make that redundancy transparent by embedding related domain models in hierarchical trees, which may describe the natural history of a domain family.

CDD is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

Marc A. Marti-Renom, UCSF

Title: Identification of Structural Domains in Proteins

Protein structure domains can be defined as compact modules that occur frequently in protein structures. It is sometimes possible to identify such domains by inspection but this task is very difficult to apply to a large number of protein structures. Several of the existing automated methods frequently result in conflicting assignments. We present a new computational algorithm (PAR-DOM) that utilizes the recurrence of protein structural motifs among the increasingly large number of experimentally determined structures. PAR-DOM uses the DBAli database (Marti-Renom et al., 2001) of structural alignments of all-against-all comparisons of structures deposited in PDB (>10^10 superimpositions), generated by MAMMOTH (Ortiz et al., 2002). The PAR-DOM program clusters residues in space by analyzing their co-occurrence in many structural superimpositions. The clustering is done by the Markov Clustering Algorithm, previously used in clustering protein sequences into families (Enright et al., 2002). On a limited benchmark (Islam et al., 1995), PAR-DOM achieved higher accuracy than four previously published algorithms. This program can also be used to identify recurrent fragments of domains in proteins.
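The clustering step described above can be illustrated with a bare-bones Markov clustering loop (alternating expansion and inflation) applied to a residue co-occurrence matrix. This is a minimal pure-Python sketch under assumed parameters (inflation 2.0, unit self-loops, fixed iteration count); it is not the PAR-DOM code, and the toy matrix is invented for the example.

```python
def _normalise_columns(m):
    """Rescale every column of a square matrix to sum to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def _matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes of a weighted graph by Markov clustering.

    weights: symmetric matrix, e.g. how often two residues are superimposed
    together. Returns a sorted list of clusters (sorted node indices).
    """
    n = len(weights)
    # Add unit self-loops (a common convention, assumed here).
    m = [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = _normalise_columns(m)
    for _ in range(iterations):
        m = _matmul(m, m)  # expansion: flow along two-step walks
        m = _normalise_columns([[x ** inflation for x in row] for row in m])
    # Rows that keep non-zero flow are attractors; their columns are members.
    clusters = {frozenset(j for j in range(n) if m[i][j] > eps)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Toy co-occurrence counts: two tight residue groups joined by one weak link.
toy_cooccurrence = [
    [0, 9, 9, 0, 0, 0],
    [9, 0, 9, 0, 0, 0],
    [9, 9, 0, 1, 0, 0],
    [0, 0, 1, 0, 9, 9],
    [0, 0, 0, 9, 0, 9],
    [0, 0, 0, 9, 9, 0],
]
```

Calling `markov_cluster(toy_cooccurrence)` separates the two tightly linked groups; a higher inflation value would give finer clusters.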


Ruth Nussinov, National Cancer Institute and Medical School, Tel Aviv University, Israel

Title: Protein Interactions: Binding and Folding

The availability of a computer-vision based, amino acid sequence order-independent structural comparison technique enabled the construction of a dataset of protein-protein interfaces and its comparison to a dataset of protein chains. The protein-protein interfaces are derived from two-chain complexes. Despite the absence of chain connectivity, the global features of the architectures, hydrophobicity and compact hydrophobic units in protein-protein interfaces resemble those of protein cores. The general similarity in the forces governing protein folding and protein-protein binding is consistent with hierarchical protein folding. This has led us to develop the building block folding model. According to this model, fluctuating, relatively stable local building blocks are formed first. Through conformational selection, they hierarchically assemble into higher order protein structures, hydrophobic folding units, domains and entire folds. This model leads us to develop a scheme which in principle may reduce the computational complexity of protein folding. Preliminary results will be presented.

Christine Orengo, University College London

Title: The Evolution of Structure and Function in CATH Protein Superfamilies

We have used GRATH, a graph-based structure comparison algorithm, to map the similarities between the different folds observed in the CATH domain structure database. Statistical analysis of the distributions of the fold similarities has allowed us to assess the significance of any similarity. We have therefore examined whether it is best to represent folds as discrete entities or whether, in fact, a more accurate model would be a continuum wherein folds overlap via common motifs. To do this we have introduced a new statistical measure of fold similarity, termed gregariousness. For a particular fold, gregariousness measures how many other folds have a significant structural overlap with that fold, typically comprising 40% or more of the larger structure. Gregarious folds often contain commonly occurring super-secondary structural motifs, such as β-meanders, Greek keys, α-β plait motifs or α-hairpins, which match similar motifs in other folds. Apart from one example, all of the most gregarious folds, those matching 20% of the other folds in the database, are α-β proteins. They also occur in highly populated architectural regions of fold space, adopting sandwich-like arrangements containing two or more layers of α-helices and β-strands.
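To make the gregariousness measure concrete, the sketch below counts, for each fold, how many other folds share a significant overlap covering at least 40% of the larger structure. The input layout, the fold labels, and the function name are hypothetical; the significance flag stands in for whatever statistical test is actually applied to the GRATH scores.

```python
from collections import defaultdict


def gregariousness(pairwise, min_fraction=0.4):
    """Count significant structural partners per fold.

    pairwise: dict mapping (fold_a, fold_b) to
              (overlap as a fraction of the larger structure, is_significant).
    Returns a dict mapping each fold to its number of partner folds.
    """
    partners = defaultdict(set)
    folds = set()
    for (a, b), (fraction, significant) in pairwise.items():
        folds.update((a, b))
        if significant and fraction >= min_fraction:
            partners[a].add(b)
            partners[b].add(a)
    return {fold: len(partners[fold]) for fold in sorted(folds)}


# Invented comparison results for four placeholder folds.
toy_pairs = {
    ("A", "B"): (0.50, True),
    ("A", "C"): (0.45, True),
    ("B", "C"): (0.30, True),   # significant but below the 40% overlap
    ("A", "D"): (0.60, False),  # large overlap but not significant
}
```

Dividing each count by the number of other folds gives the fraction of the database that a fold matches, which is how a figure such as 20% would be read in this sketch.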

Although the structural data is still relatively sparse, with fewer than 50,000 known structural domains, powerful sequence comparison methods allow the detection of distant sequence relatives for many structural families. We have developed a classification of protein structures (CATH) which now contains nearly 1500 homologous superfamilies. Using Hidden Markov models we can recruit nearly 400,000 gene sequences into these families from GenBank. This has enriched the functional information within each superfamily and enabled us to study the variation in function and evolutionary mechanisms within these families. The available structural data in each superfamily often provides insights on how embellishments in domain structure and variation in domain partners can modify protein function.

Joint work with Andrew Harrison, Annabel Todd, Ian Sillitoe.

Teresa Przytycka, Johns Hopkins University

Title: Recursive Domains in Proteins

Numerous studies have analyzed folding patterns in protein domains of known structure in order to gain insight into the underlying protein folding process. Are such patterns a haphazard assortment or are they like sentences in a language, which can be generated by an underlying grammar? Specifically, can a small number of intuitively sensible rules generate a large class of folds, including feasible new folds? The application of elementary rules to generate structure from basic building blocks is an intrinsically hierarchical process. In this talk, we propose four simple folding rules and explore the extent to which they can generate the known all-β folds, using tools from graph theory. As a control, an exhaustive set of β-sandwiches was tested and found to be largely incompatible with such a grammar. The existence of a protein grammar has potential implications for both the mechanism of folding and the evolution of domains.

George D. Rose, Johns Hopkins University

Title: From the Hierarchic Organization of Domains to Hierarchic Folding in 20 years

Two decades ago, we found that protein domains are organized as a structural hierarchy. In this work, a domain is defined as a contiguous, compact and physically separable segment of the polypeptide chain. Protein domains form a hierarchy because each domain is contained within the next larger domain, like a series of nested boxes.

Though controversial at the time, the hierarchic architecture of proteins is now an accepted fact. Our early approach used analytic methods to identify domains in X-ray structures, but it was later realized that a simple procedure can approximate these results. To divide a protein into separable domains, display the structure with the first N/2 residues in red and the remaining N/2 residues in blue. Then repeat this process, iteratively. In each successive stage of the hierarchy, it is apparent at a glance that the red and blue regions do not intermingle.
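The halving procedure above is easy to express recursively. The sketch below builds the nested red/blue segments and, as a numeric stand-in for the visual check, reports what fraction of spatial contacts cross the two halves. The minimum segment size, the 8 Å contact cutoff, and the contact-based measure are assumptions of this example and not part of the original analysis.

```python
import math


def halving_hierarchy(first, last, min_size=2):
    """Recursively split the residue range [first, last] into two halves.

    Returns a nested tuple (first, last, children); splitting stops when a
    half would contain fewer than min_size residues.
    """
    length = last - first + 1
    if length < 2 * min_size:
        return (first, last, [])
    mid = first + length // 2 - 1  # last residue of the 'red' half
    return (first, last, [halving_hierarchy(first, mid, min_size),
                          halving_hierarchy(mid + 1, last, min_size)])


def cross_contact_fraction(coords, cutoff=8.0):
    """Fraction of residue-residue contacts that link the two halves.

    coords: one (x, y, z) point per residue of a segment, in chain order.
    Values near 0 mean the red and blue halves barely intermingle.
    """
    half = len(coords) // 2
    cross = 0
    total = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total += 1
                if i < half <= j:
                    cross += 1
    return cross / total if total else 0.0
```

For example, `halving_hierarchy(1, 8)` yields the segments 1-4 and 5-8, each split again into pairs of residues.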

The top-down, hierarchic organization of folded proteins is an experimental fact, and no hypothesis is needed to extract this result from known structures. The existence of hierarchic architecture suggests a bottom-up folding mechanism, in which chain segments form local structures of marginal stability, which then interact to produce intermediates of ever-increasing complexity. In this process, multiple folding routes co-exist, and the stabilities of the intermediates and their combinatorial associations will determine the dominant pathways.

This folding hypothesis - called folding by hierarchic condensation - has motivated our research during ensuing years. Recent results will be presented.

Sarah Teichmann, MRC Laboratory of Molecular Biology

Title: Evolution of Multi-Domain Proteins

Two thirds of all prokaryote proteins, and eighty percent of eukaryote proteins, are multi-domain proteins. The composition and interaction of the domains within a multi-domain protein determine its function. Using structural assignments to the proteins in completely sequenced genomes, we have insight into the domain architectures of a large fraction of all multi-domain proteins. Thus we can investigate the patterns of pairwise domain combinations, as well as the existence of evolutionary units larger than individual protein domains.

Structural assignments provide us with the sequential arrangement of domains along a polypeptide chain. In order to fully understand the structure and function of a multi-domain protein, we also need to know the geometry of the domains relative to each other in three dimensions. By studying multi-domain proteins of known three-dimensional structure, we can gain insight into the conservation of domain geometry, and the prediction of the structures of domain assemblies.

Yuri Wolf, NIH

Title: Birth and death of protein domains and the power law behavior.

Power law distributions appear in numerous biological, physical and other contexts, which appear to be fundamentally different. In biology, power laws have been claimed to describe the distributions of the connections of enzymes and metabolites in metabolic networks, the number of interaction partners of a given protein, the number of members in paralogous families, and other quantities. We propose a simple model of evolution of the domain composition of proteomes with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from non-coding or non-globular sequences or acquisition via horizontal gene transfer) - a birth, death and innovation model (BDIM). It is proved that power law asymptotics appear if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order. It is further proved that any power asymptotic with the degree not equal to -1 can appear only if the hypothesis of independence of the duplication/deletion rates on the size of a domain family is rejected. We apply the BDIM formalism to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show an excellent fit between these empirical data and a particular form of the model, the second-order balanced linear BDIM.
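The balance condition can be checked numerically with a small equilibrium calculation. The sketch below assumes a linear parametrisation in which a family of size i grows at rate lam*(i + a) and shrinks at rate delta*(i + b); the equilibrium frequencies then follow from the detailed-balance relation f(i) * birth(i) = f(i+1) * death(i+1). The parameter names and the truncation at a maximum family size are assumptions made for this illustration.

```python
import math


def linear_bdim_equilibrium(lam, delta, a, b, max_size):
    """Equilibrium frequencies f(1)..f(max_size) of family sizes.

    Assumed rates: birth(i) = lam * (i + a), death(i) = delta * (i + b).
    Uses f(i + 1) = f(i) * birth(i) / death(i + 1), then normalises.
    """
    weights = [1.0]
    for i in range(1, max_size):
        birth = lam * (i + a)
        death = delta * (i + 1 + b)
        weights.append(weights[-1] * birth / death)
    total = sum(weights)
    return [w / total for w in weights]


def log_log_slope(freqs, i1, i2):
    """Slope of log f(i) against log i between family sizes i1 and i2."""
    return ((math.log(freqs[i2 - 1]) - math.log(freqs[i1 - 1]))
            / (math.log(i2) - math.log(i1)))


# Balanced (lam == delta) with size-independent per-member rates (a == b == 0).
simple = linear_bdim_equilibrium(1.0, 1.0, 0.0, 0.0, 2000)
# Balanced, but per-member rates depend on family size (a != b).
second_order = linear_bdim_equilibrium(1.0, 1.0, 1.0, 2.0, 2000)
# Unbalanced: deletion outpaces duplication.
unbalanced = linear_bdim_equilibrium(0.9, 1.0, 0.0, 0.0, 2000)
```

In this parametrisation the balanced cases give a straight line on a log-log plot, with slope -1 when a equals b and approaching a - b - 1 otherwise, while the unbalanced case falls off exponentially rather than as a power law.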

Golan Yona, Cornell University

Title: The domain structure of proteins: prediction and organization

Automatic detection of protein domains from sequence is a challenging problem. We describe a novel method for detecting the domain structure of proteins from sequence information alone and methods for organizing the protein space based on the domain structure of proteins.

The prediction of domains is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, based on the multiple alignment as well as on intron/exon data and predicted contact maps. The measures are optimized using principles of information theory and are combined into a single predictor using a neural network. The output is post-processed using a probabilistic model to predict the most likely transitions between domains of all possible hypotheses, considering the likelihood of the data given the model, and prior knowledge about domain distributions.

The method was assessed using the domain definitions in SCOP and CATH for proteins of known structures and was compared to several other existing methods. Our method achieves both high accuracy (80%) and high coverage (70%), significantly better than the best methods available, even the manual and semi-manual ones, while being fully automatic. Our method can also be used to verify domain partitions based on structural data.

Document last modified on February 13, 2003.