DIMACS Workshop on Sequence, Structure and Systems Approaches to Predict Protein Function

May 3 - 5, 2006
DIMACS Center, CoRE Building, Rutgers University

Anna Panchenko, NIH, panch@mail.nih.gov
Teresa Przytycka, NIH, przytyck@mail.nih.gov
Mona Singh, Princeton University, mona@CS.Princeton.EDU
Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology.

This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).


Saikat Chakrabarti, Christopher J. Lanczycki, and Stephen H. Bryant

Title: Analysis and Prediction of Functionally Important Sites

The rapidly increasing pace of data accumulation and the volume of sequence and structural information available for proteins make it a daunting task to determine their functional importance experimentally. In recent years, the ever-expanding protein databases have yielded an opportunity to analyze functional diversity among many protein families. Computational methods can prove very useful in understanding and characterizing the biochemical and evolutionary information contained in this wealth of data, particularly within the functionally important sites. In this study, we perform a detailed survey of compositional and evolutionary constraints at the molecular and biological function levels for a large set of known functionally important sites extracted from a wide range of protein families. We compare the degree of conservation across different functional categories and provide detailed statistical insight to help decipher the varying evolutionary constraints on functionally important sites. The compositional and evolutionary information at functionally important sites has been compiled into a library of functional templates, which is used to predict functionally important sites in other families. Our benchmarking studies show good sensitivity/specificity profiles for the prediction of functional sites. This prediction method is based solely on information derived from homologous sequences, and no structural information is required. Therefore, this method could be extremely useful for large-scale functional annotation.

Praveen F. Cherukuri, Boston University and NCBI; Aron Marchler-Bauer, Lewis Y. Geer, and Stephen H. Bryant, National Center for Biotechnology Information

Title: Domain clusters hint at size of ancient conserved protein domain universe

Protein domains, the building blocks of all globular proteins, are units of compact three-dimensional structure as well as of molecular evolution. The growth of sequence databases has elevated the need for computational annotation of proteins, and the detection of conserved domains is one of the first steps towards assigning molecular function. Not surprisingly a variety of mostly domain-based protein annotation resources have emerged, including Pfam, SMART and COGs. NCBI's Conserved Domain Database (CDD) imports and curates these domain models. The incorporation of several different source databases has created redundancy, however, ranging from simple duplication to complex hierarchical parent-child relationships, which may be caused by differing levels of representation in its source databases. In addition, a fairly large subset of domains in CDD describes lineage-specific protein families with very narrow taxonomic coverage. Here we use a taxonomic filter approach to detect and suppress such lineage-specific domain models. We cluster the remaining domains into super-families of related models in order to simplify domain architecture analysis and display. The resulting set of clusters provides a rough estimate of the number of "ancient" conserved domain families in cellular organisms, and we attempt to utilize 3D structure similarity - where available - to make that estimate more accurate.

Judith Cohn, Karin Verspoor, Susan Mniszewski, Cliff Joslyn, Los Alamos National Lab

Title: A Categorization Approach to Automated Ontological Function Annotation

Automated Function Prediction (AFP) methods increasingly use knowledge discovery algorithms to map sequence, structure, literature, and other information about proteins whose functions are unknown into functional ontologies, typically (a portion of) the Gene Ontology (GO). While there are a growing number of methods within this paradigm, the general problem of assessing the accuracy of such prediction algorithms has not been seriously addressed. We present an application for function prediction from protein sequences using the POSet Ontology Categorizer (POSOC) to produce new annotations by analyzing collections of GO nodes derived from annotations of protein BLAST neighborhoods. We then present hierarchical precision and hierarchical recall as new evaluation metrics for assessing the accuracy of any predictions in hierarchical ontologies, and discuss results on a test set of protein sequences. We show that our method provides substantially improved hierarchical precision (measure of predictions made which are correct) when applied to the nearest BLAST neighbors of target proteins, as compared with simply imputing that neighborhood's annotations to the target. Moreover, when our method is applied to a broader BLAST neighborhood, hierarchical precision is enhanced even further. In all cases, such increased hierarchical precision performance is purchased at a modest expense of hierarchical recall (measure of all annotations which get predicted at all).
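
As a rough illustration of what hierarchical precision and recall can look like, the sketch below uses one common ancestor-overlap formulation: each predicted and each true term is expanded to include all of its ancestors in the ontology, and precision and recall are computed on the expanded sets. This formulation is an assumption made for illustration; the abstract does not give the authors' exact definitions, and the toy ontology and term names are hypothetical.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors.

    `parents` maps each term to a list of its direct parents (a DAG).
    """
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def hierarchical_precision_recall(predicted, true, parents):
    """Ancestor-overlap precision and recall (an assumed formulation)."""
    pred_up = set().union(*(ancestors(t, parents) for t in predicted))
    true_up = set().union(*(ancestors(t, parents) for t in true))
    overlap = pred_up & true_up
    precision = len(overlap) / len(pred_up) if pred_up else 0.0
    recall = len(overlap) / len(true_up) if true_up else 0.0
    return precision, recall


# Hypothetical toy ontology: child -> direct parents.
TOY_PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "kinase": ["catalysis"],
}

if __name__ == "__main__":
    hp, hr = hierarchical_precision_recall(["dna_binding"], ["binding"], TOY_PARENTS)
    print(f"hierarchical precision = {hp:.3f}, hierarchical recall = {hr:.3f}")
```

Under this formulation, a prediction that is more specific than the true annotation loses some precision but keeps full recall, while a prediction that is too general keeps full precision but loses recall.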

Iddo Friedberg and Adam Godzik, Burnham Institute for Medical Research

Title: Structural Genomics: Delving into Protein Function Space

Structural genomics is a broad initiative of various projects and centers aiming to provide a complete coverage of protein structure space. As it is not feasible to experimentally determine the structures of all proteins even in a single genome, it is generally agreed that the only viable strategy to achieve such coverage is to carefully select specific proteins ("targets"), determine their structure experimentally, and then use comparative modeling techniques to model the rest. However, the details of the selection strategy are a matter of debate. Here we argue that any strategy should take into account that determining structure is not an end by itself, but rather a means for understanding the atomic-level implementation of the biochemical function of a protein. What if the protein being modeled has a different function from its template? In that case, there is no knowledge gained regarding the functional mechanism of the modeled protein. We therefore propose that structural genomics refine the structure-driven approach in target selection by adopting some function-based criteria. We propose to target functionally divergent subfamilies within a given fold group because each requires a structural characterization of its functionality. We have developed a classification system for that purpose and used it to propose a list of additional targets within three functionally rich folds: the TIM barrel, immunoglobulin, and flavodoxin-like folds. We show that a function-driven target selection approach in structural genomics is feasible, and that each of the three folds surveyed has only 50-75% of its functional contents structurally characterized. We call upon structural genomics centers to consider this approach and upon computational biologists to further develop function-based targeting methods for structural genomics efforts.

Maricel G. Kann, Sergey L. Sheetlin, Yonil Park, Stephen H. Bryant and John L. Spouge*, National Center for Biotechnology Information

Title: Accurate Statistics for Global Alignment of Protein Domains

The precise computational annotation of protein function is one of the fundamental problems after completion of the sequencing of several genomes. Most commonly used approaches are based on the local alignment of new sequences to existing database proteins or profiles. Because proteins are composed of structural and functional units called domains, a gene can be annotated from domain databases by aligning domains to the gene's protein sequence. Ideally, protein subsequences should be aligned to complete domains, in a "semi-global alignment". Local alignment, which aligns subsequences to only pieces of domains, dominates annotation applications, however, mainly because it has an accurate P-value for evaluating its alignments. Here, we introduce an accurate P-value approximation relevant to semi-global alignment and many other biological applications. In searching domain databases, the P-value retrieved relevant domains better than current local alignment methods.
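
To make the distinction between local and semi-global alignment concrete, the sketch below scores a semi-global alignment in which the complete domain must be aligned, while the protein may contribute any subsequence (unaligned protein residues before and after the domain are free). The match/mismatch scores and the linear gap penalty are illustrative assumptions; practical tools use substitution matrices or profiles, and this sketch does not implement the P-value approximation described in the abstract.

```python
def semi_global_score(domain, protein, match=2, mismatch=-1, gap=-2):
    """Best score for aligning the whole domain to some protein subsequence.

    Row i covers domain[:i]; column j covers protein[:j]. Row 0 is all
    zeros, so the alignment may start anywhere in the protein, and the
    answer is the maximum of the last row, so it may end anywhere too.
    Gaps inside the aligned region cost `gap` each (assumed linear).
    """
    n = len(protein)
    prev = [0] * (n + 1)
    for i in range(1, len(domain) + 1):
        # Column 0: the first i domain residues aligned against nothing.
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align domain[i-1] with protein[j-1]
                prev[j] + gap,      # domain residue opposite a gap
                curr[j - 1] + gap,  # protein residue opposite a gap
            )
        prev = curr
    return max(prev)


if __name__ == "__main__":
    print(semi_global_score("ACD", "GGACDGG"))
```

By contrast, a local alignment would also allow the ends of the domain to be dropped without penalty, which is why it can report matches to only pieces of a domain.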

Grigory Kolesov and Leonid Mirny, MIT

Title: Prediction of specificity determining residues: discriminating functional and phylogenetic variability

Specific recognition is essential for most cellular functions and is frequently affected in diseases. In recent years, a number of bioinformatics methods to predict specificity determining residues (SDRs) have been developed. These methods use multiple sequence alignments and look for positions exhibiting specific patterns of conservation and variability. To detect SDRs, one needs to distinguish the evolutionary signature of SDRs from signatures arising from the phylogenetic structure of a protein family, the protein structure, and the active site. Here we present methods for finding specificity determining residues and coordinately evolving residues in protein domains, using a Monte Carlo method to estimate phylogenetic interference at each position of the protein and, hence, reveal the specific evolutionary signature of SDRs. We apply these methods to study interactions between different proteins in bacterial two-component systems.

Raja Loganantharaj and John Clifford

Title: Towards Finding an Interesting Subset of Genes from an Expression Data Sets

The objective of this investigation is to narrow down a large number of genes from a DNA microarray experiment to a small subset that is helpful in generating new hypotheses or in answering interesting questions regarding causality. Suppose a DNA microarray test is carried out to study causal relationships among k different factors on N genes. There will be a total of 2^k possible combinations of the factors, and a test has to be carried out for every such combination, which becomes difficult for large values of k. Suppose an experiment was conducted to study the cancer suppressive mechanism of a class of chemicals called retinoids in the well-established 2-stage mouse skin chemical carcinogenesis model. In this model, skin tumors can be readily induced by the sequential application of a carcinogen, referred to as the initiation stage, followed by repetitive treatment with a noncarcinogenic tumor promoter, referred to as the promotion stage. The initiation stage, accomplished by a single application of the carcinogen dimethylbenzanthracene (DMBA) to the skin, results in a small subset of keratinocytes (skin cells) carrying a mutation in a critical gene(s). The promotion stage requires repeated (twice weekly) application of tumor promoting agents such as 12-O-tetradecanoylphorbol-13-acetate (TPA), which cause the initiated cells to proliferate, eventually producing tumors. All-trans retinoic acid (ATRA), one of the primary biologically active retinoids, has been shown to be a highly efficient suppressor of tumor initiation and promotion in this model. In this experiment, data were obtained from a single time point in the 2-stage protocol with k=2, so gene expression values are observed over a total of 4 (2^2) conditions: untreated control skin, skin treated with TPA alone, skin treated with ATRA alone, and skin co-treated with both ATRA and TPA.
The genes that are either up or down regulated by either TPA or ATRA are first extracted by filtering, and we call this set S0. These genes are then clustered based on their expression patterns over all treatment conditions. Further, the annotated genes in S0 are clustered based on their functions. Combining the results of clustering based on expression patterns and functions provides a subset of genes that has a very high probability of being co-expressed. The question is then posed: Can we discover the genes responsible for the suppression of skin cancer caused by ATRA from these results? Unfortunately, this is a difficult problem, since the gene expression data provided by DNA microarrays at a single time point in the 2-stage protocol provide only a snapshot of a stable expression pattern without revealing the intermediate processes. By combining information from signaling and regulatory pathways, we can narrow the genes in S0 down to a very small subset of interesting genes that are likely to be the cause of the effect. This is an ongoing exploratory investigation, and we will discuss the challenges of this investigation and our findings at the workshop.
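
The experimental design and the first filtering step described in this abstract can be sketched as follows: enumerate the 2^k present/absent combinations of k factors, then keep the genes that respond to at least one single treatment to form S0. The function names, the log2-ratio representation, the threshold of 1.0, and the example values are all hypothetical choices made for illustration; the abstract does not state how up or down regulation was called.

```python
from itertools import product


def treatment_conditions(factors):
    """Enumerate all 2^k present/absent combinations of the given factors."""
    return [
        dict(zip(factors, combo))
        for combo in product((False, True), repeat=len(factors))
    ]


def responsive_genes(log_ratios, threshold=1.0):
    """Genes up or down regulated under at least one single treatment.

    `log_ratios` maps gene -> {treatment: log2(treated / control)}.
    The threshold is an assumed cutoff, not a value from the abstract.
    """
    return {
        gene
        for gene, ratios in log_ratios.items()
        if any(abs(value) >= threshold for value in ratios.values())
    }


# Hypothetical example values, for illustration only.
EXAMPLE_LOG_RATIOS = {
    "geneA": {"TPA": 1.8, "ATRA": 0.1},
    "geneB": {"TPA": 0.2, "ATRA": -0.3},
    "geneC": {"TPA": -0.1, "ATRA": -2.2},
}

if __name__ == "__main__":
    for condition in treatment_conditions(["TPA", "ATRA"]):
        print(condition)
    print(sorted(responsive_genes(EXAMPLE_LOG_RATIOS)))
```

With k=2 the enumeration yields the four conditions named in the abstract (control, TPA alone, ATRA alone, and co-treatment); the number of conditions doubles with each additional factor.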

Robert D. Sedgewick, Saeed Tavazoie, and Dannie Durand

Title: Evolution of transcription factors following whole genome duplication

Approximately 100 million years ago, a whole genome duplication (WGD) occurred in an ancestor of Saccharomyces cerevisiae. The majority of duplicated genes were lost, but 554 pairs of genes were retained in duplicate. Transcription factors were preferentially retained. We study the fate of these duplicated transcription factor pairs, using high-throughput functional genomics data. We model the system as a graph, with the genes as its nodes and the high-throughput data as its links. Evolutionary insights can be gained from the statistics of various graph motifs associated with the duplicated transcription factors. These motif statistics are evaluated against null hypotheses based on paralogs and on random graphs.

Todd Taylor and Iosif Vaisman, George Mason University

Title: Discrimination and Classification of Thermophilic and Mesophilic Proteins

There has been considerable interest in the physical basis for the increased thermostability of thermophilic proteins with respect to their mesophilic counterparts. We have systematically studied several large sets of protein structures in order to determine which sequence and structural properties have the most power to discriminate thermophilic and hyperthermophilic proteins from their mesophilic orthologs and to classify proteins as mesophilic, thermophilic, or hyperthermophilic. Some of the quantities we test have been previously reported to be good thermophile/mesophile discriminators (e.g. surface area to volume ratio and the percent composition of charged minus polar residues); others are Delaunay tessellation derived (e.g. mean simplex tetrahedrality, median circumsphere radius, Delaunay four-body threading potential score, and contact graph diameter). We conclude that it is possible to accurately discriminate mesophilic from (hyper)thermophilic orthologs and, to a lesser extent, to classify proteins as M/T/H. Purely geometric indices are generally only fair discriminators, inferior to purely sequence-based ones. The best discriminatory indices, like the Delaunay tessellation derived 4-body sequence-structure compatibility score, contain both sequence and structure information.

Jinfeng Zhang & Jun S. Liu, Harvard University

Title: Side-chain Conformational Entropy of Proteins - New Discoveries of an Old Story

Side-chains of amino acid residues encode the information that governs a protein's three-dimensional fold. However, the roles that side-chain entropy plays in protein folding and interaction are still poorly understood. In this study, we developed an accurate and efficient Monte Carlo method for estimating the absolute side-chain conformational entropy of a given protein backbone structure. We found that significant side-chain conformational entropy remains for both buried and exposed residues when proteins fold to compact native states. We also found that this quantity can discriminate between native structures and computer generated models. Consistent with this finding, we observed that, for the same protein and at a similar level of compactness, the side-chain conformational entropies of the X-ray structures are generally higher than those of artificially generated decoy structures, for both monomeric proteins and protein complexes, and also higher than those of the NMR structures. We show that incorporating side-chain entropy in the free energy function can improve the discrimination of native structures from artificial decoys. In addition to the traditional view that side-chain entropy simply opposes protein folding, our findings suggest that side-chain entropy may play another role in favouring native proteins and protein complexes among alternative compact structures.

Document last modified on May 2, 2006