DIMACS Workshop on Machine Learning Techniques in Bioinformatics

July 11 - 12, 2006
DIMACS Center, CoRE Building, Rutgers University

Dechang Chen, Uniformed Services University of the Health Sciences, dchen@usuhs.mil
Xue-Wen Chen, University of Kansas, xwchen@ku.edu
Sorin Draghici, Wayne State University, sod@cs.wayne.edu
Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology.

This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).


Philip E. Bourne, University of California San Diego

Title: Predicting Protein Structure Flexibility from Sequence

The belief that the three-dimensional structure of a protein is encoded in the one-dimensional sequence has received significant attention from the structure prediction community, and significant progress has been made and documented by the CASP experiment [1]. However, "structure" does not imply a single state, but a spectrum of states ranging from total disorder to a very rigid arrangement of all atoms. Biological function is often dependent on a specific state, which we refer to as functional flexibility. The question then becomes: if the structure is encoded in the sequence and can be detected to some degree, can functional flexibility be predicted from sequence? Disorder predictors [2] provide an indication that the answer is yes, at least for the extreme case, but what about intermediate states? Using the Gaussian Network Model to model short- and long-range movements, together with a novel normalization procedure, we have defined functional flexibility for a range of proteins for which three-dimensional structures exist and which are known to have flexible regions. Using these as input to a support vector machine, we have predicted functional flexibility from one-dimensional sequence. Results are surprisingly good and are currently being tested experimentally [3]. Even partial success in predicting flexibility has the potential to impact protein engineering and design and to address questions concerning the evolution of designable protein folds.

 [1] Moult et al. (2005). Proteins 61, Suppl 7, 3-7.
 [2] Oldfield et al. (2005). Biochemistry 44(6), 1989-2000.
 [3] Gu et al. (2006). PLoS Computational Biology, submitted.
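The Gaussian Network Model mentioned above derives per-residue flexibility from nothing more than the C-alpha contact map: the mean-square fluctuation of each residue is proportional to the corresponding diagonal element of the pseudo-inverse of the Kirchhoff (connectivity) matrix. A minimal sketch (the 7 Å cutoff and the toy coordinates are illustrative choices, not the authors' settings):

```python
import numpy as np

def gnm_fluctuations(coords, cutoff=7.0):
    """Mean-square fluctuation profile from the Gaussian Network Model.

    coords: (N, 3) array of C-alpha coordinates.
    Returns a length-N vector proportional to per-residue flexibility.
    """
    # Kirchhoff matrix: -1 for residue pairs within the cutoff distance
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    kirchhoff = -(d < cutoff).astype(float)
    np.fill_diagonal(kirchhoff, 0.0)
    np.fill_diagonal(kirchhoff, -kirchhoff.sum(axis=1))
    # Fluctuations are the diagonal of the pseudo-inverse of the Kirchhoff matrix
    return np.diag(np.linalg.pinv(kirchhoff))

# Toy chain: an extended 10-residue backbone with 3.8 A C-alpha spacing
xyz = np.column_stack([np.arange(10) * 3.8, np.zeros(10), np.zeros(10)])
flex = gnm_fluctuations(xyz)
assert flex[0] > flex[5]  # chain ends are least constrained, so they fluctuate most
```

Profiles like `flex`, suitably normalized, are the kind of per-residue flexibility signal that can serve as training labels for a sequence-based predictor.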

Liran Carmel, NIH/NLM/NCBI

Title: An expectation-maximization algorithm for inferring the evolution of eukaryotic gene structure

Recently, several evolutionary models, each incomplete in a different way, yielded very different predictions about the evolution of introns. We formulated a more general and realistic model of intron gain and loss that takes into account gene-specific rates, branch-specific rates, invariant sites, and rate variability of both gain and loss, which is gamma-distributed across sites. We developed an expectation-maximization algorithm to estimate the parameters of the model. We also extended the intron-exon dataset to include 18 eukaryotic species with complete genomes. Using this combination of a realistic model and a large dataset, we refute both extreme views, introns-early and introns-late. Instead, we observe a diverse kaleidoscope of events, with different parts of the phylogenetic tree exhibiting dramatically different patterns of intron gain and loss.
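The building block of gain/loss models of this kind is a two-state continuous-time Markov chain (intron absent/present) whose transition probabilities over a branch depend on the gain rate, the loss rate, and the branch length. A minimal sketch of that building block (the rates and branch length below are illustrative, not estimates from the paper):

```python
from math import exp

def intron_transition(gain, loss, t):
    """Transition probability matrix of a two-state CTMC over a branch of
    length t: state 0 = intron absent, state 1 = intron present.
    P[a][b] is the probability of ending in state b given start state a."""
    total = gain + loss
    decay = 1.0 - exp(-total * t)
    p01 = (gain / total) * decay   # absent -> present (intron gain)
    p10 = (loss / total) * decay   # present -> absent (intron loss)
    return [[1 - p01, p01], [p10, 1 - p10]]

# Illustrative branch where loss outpaces gain
P = intron_transition(gain=0.2, loss=0.8, t=1.0)
assert P[1][0] > P[0][1]  # losing an intron is more likely than gaining one
```

Multiplying such matrices along the branches of a phylogeny (with gene-, branch-, and site-specific rates, as in the abstract) gives the likelihood that the EM algorithm maximizes.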

James Howse, Los Alamos National Laboratory

Title: A machine learning approach for predicting the EC numbers of proteins

We present a principled machine learning methodology for predicting the EC numbers of proteins. The first step of our approach is to construct a reference predictor whose results model those achieved by a human expert. If one agrees that the reference predictor is a reasonable model for a human expert, then any other predictor must outperform the reference by a statistically significant margin in order to be considered successful. In addition to providing a benchmark for comparison, constructing a reference predictor also defines the feature space that should be used when designing other predictors. The second step is to estimate the smallest possible prediction error (the Bayes error) that can be achieved using the chosen feature space. Note that for most feature spaces the Bayes error is NOT zero, and our technique for estimating it does NOT produce a predictor that can be used on data. The Bayes error estimate provides a benchmark for comparing different feature spaces, as well as for comparing the performance of a predictor in a given feature space to the best possible predictor for that feature space. The third step is to design a predictor whose generalization performance (performance on future data) and computational performance (training and testing time and memory) are well characterized.

As an example, we design a system which predicts the first EC number of a protein. We frame this problem as a multi-class classification problem. Our predictors are a form of classifier that we call simple classifiers; these have many of the desirable properties of support vector machines. We examine the effect on prediction performance of using sequence information alone versus both sequence and structure information when designing the predictors. We also compare the performance of our predictors to both the reference predictor and the Bayes error.
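The abstract does not say which Bayes-error estimator the authors use; one classical choice that matches its description (a bound on the best achievable error, without producing a usable predictor) is the Cover-Hart nearest-neighbor inequality, which for two classes asymptotically brackets the Bayes error between half the leave-one-out 1-NN error and the 1-NN error itself. A sketch of that idea:

```python
def nn_bayes_bounds(X, y):
    """Bracket the Bayes error via the Cover-Hart inequality (two classes):
    asymptotically, R_NN / 2 <= R_Bayes <= R_NN, where R_NN is the
    leave-one-out 1-nearest-neighbor error rate."""
    n = len(X)
    errors = 0
    for i in range(n):
        # nearest neighbor of point i among the remaining points
        j = min((k for k in range(n) if k != i),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(X[i], X[k])))
        errors += y[j] != y[i]
    r_nn = errors / n
    return r_nn / 2, r_nn

# Two well-separated toy clusters: the 1-NN error, and hence both bounds, are zero
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]
assert nn_bayes_bounds(X, y) == (0.0, 0.0)
```

Note that the bracket says nothing about how to build a classifier; it only benchmarks what any classifier in this feature space could hope to achieve.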

Dustin T. Holloway, Mark Kon, and Charles DeLisi, Boston University

Title: Machine Learning and data combination for regulatory pathway prediction


High-throughput technologies, including array-based chromatin immunoprecipitation, have rapidly increased our knowledge of transcriptional maps: the identity and location of regulatory binding sites within genomes. Still, the full identification of sites, even in lower eukaryotes, remains largely incomplete. Here we develop a supervised learning approach to site identification, using support vector machines (SVMs) to combine 26 different data types.


A comparison with the standard approach to site identification using position-specific scoring matrices (PSSMs), for a set of 104 Saccharomyces cerevisiae regulators, indicates that our SVM-based target classification is more sensitive (73% vs. 20%) and has double the precision (positive predictive values of 90% and 45%, respectively).
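For readers comparing these figures, sensitivity and positive predictive value follow directly from the confusion-matrix counts. A minimal sketch (the counts below are hypothetical numbers chosen to reproduce percentages like those quoted, not the paper's data):

```python
def sensitivity_and_ppv(tp, fp, fn):
    """Sensitivity (recall) and positive predictive value (precision)."""
    sensitivity = tp / (tp + fn)   # fraction of true sites recovered
    ppv = tp / (tp + fp)           # fraction of predictions that are correct
    return sensitivity, ppv

# Hypothetical counts illustrating the kind of comparison reported above
sens, ppv = sensitivity_and_ppv(tp=73, fp=8, fn=27)
assert round(sens, 2) == 0.73 and round(ppv, 2) == 0.90
```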

We have applied our SVM classifier for each transcriptional regulator to all promoters in the yeast genome, obtaining thousands of new targets, which are currently being analyzed and refined to limit the risk of classifier overfitting. For illustration, we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. For both transcription factors the SVM predictions match well with the known biology of control mechanisms, and possible new roles for these factors are suggested, such as a function for Rap1 in regulating fermentative growth.

We also examine the promoter melting-temperature curves for the targets of YJR060W, and show that targets of this TF have potentially unique physical properties which distinguish them from other genes. The SVM output automatically provides the means to rank dataset features and thereby identify important biological elements. We use this property to rank classifying k-mers, thereby reconstructing known binding sites for several TFs, and to rank expression experiments, determining the conditions under which Fhl1, a factor responsible for expression of ribosomal protein genes, is active. After identifying these conditions, we can see that targets of Fhl1 are differentially expressed in them, compared to average and negative-set genes.


SVM-based classifiers provide a robust framework for the analysis of regulatory networks. Processing of classifier outputs can provide high-quality predictions and biological insight into the functions of particular transcription factors. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors for which binding data are available.

Sungchul Ji, Rutgers University

Title: How to Avoid Misinterpreting Microarray Data.

DNA arrays have been successfully used to measure three distinct quantities in intact cells: i) mRNA levels (denoted TL, transcript levels), ii) rates of transcription (TR, transcription rates) in conjunction with nuclear run-on methods [1], and iii) DNA levels. Interpreting TR or DNA data so measured is non-problematic and relatively straightforward. There are two ways of interpreting mRNA levels, to be called first-order and second-order for convenience. The first-order interpretation of mRNA levels is non-problematic, just as is the interpretation of TR or DNA data. But the second-order interpretation of mRNA levels in terms of quantities other than the mRNA levels themselves, such as rates of gene expression (namely, TR), is not as straightforward as has widely been assumed and requires careful theoretical and computational analysis. For example, when mRNA levels (TL) are found to decrease as a result of some cellular perturbation, it has been common practice in the field to interpret this as down-regulation of the rate of expression of the genes (TR) encoding those mRNAs. That is, it has been commonly assumed that TL and TR are linearly correlated, so that TL and TR always change in parallel. However, direct measurements of both TL and TR [2,3] indicate that this assumption is not always valid. An analysis of the TL and TR data on 5,184 genes of budding yeast, measured at 6 time points after shifting from glucose to galactose [2], indicates that i) TL and TR increase or decrease together only 52% of the time, ii) TL decreases while TR increases 30% of the time, and iii) TL increases while TR decreases 10% of the time. These observations can be readily accounted for if one assumes that TL is regulated not only by TR but also by TD (i.e., transcript degradation rates). A mathematical equation will be presented that relates TL to TR and TD.
This equation has been used to calculate the TD values from TL and TR data on glycolytic and oxidative phosphorylation genes for the first time. The results demonstrate that transcript degradation plays a key regulatory role in RNA metabolism in budding yeast during glucose-galactose shift, thus establishing the concept of "degradational control" in analogy to the well-known "transcriptional control".
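One simple kinetic relation consistent with this description (a sketch of the standard mRNA balance equation, not necessarily the exact formulation to be presented) treats TL as the balance between synthesis at rate TR and first-order degradation with rate constant TD:

```latex
\frac{d\,TL}{dt} \;=\; TR \;-\; TD \cdot TL
\qquad\Longrightarrow\qquad
TD \;=\; \frac{TR - d\,TL/dt}{TL}
```

Under such a relation, TL and TR move in parallel only when TD stays constant; a rising TD can make TL fall even while TR rises, which matches the discordant cases enumerated above.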

   [1] Hirayoshi, K. & Lis, J. T. (1999). Nuclear Run-On Assays: Assessing Transcription by Measuring Density of Engaged RNA Polymerases. Methods in Enzymology 304: 351-362.
   [2] Garcia-Martinez, J., Aranda, A. & Perez-Ortin, J. E. (2004). Genomic Run-On Evaluates Transcription Rates for All Yeast Genes and Identifies Gene Regulatory Mechanisms. Mol. Cell 15: 303-313.
   [3] Fan, J., Yang, X., Wang, W., Wood, W. H., Becker, K. G. & Gorospe, M. (2002). Global Analysis of Stress-Regulated mRNA Turnover by Using cDNA Arrays. Proc. Natl. Acad. Sci. 99(16): 10611-10616.

Christina Leslie, Columbia University

Title: Learning the cis regulatory code by predictive modeling of gene regulation

A cell's gene regulatory network refers to the coordinated switching on and off of genes by regulatory proteins that bind to non-coding DNA. Studying the behavior of gene regulatory networks using high-throughput genomic data, such as gene expression data from microarrays and DNA sequence data, has become one of the central problems in computational biology. Most work in this area has focused on learning structure from data (for example, finding clusters of potentially co-regulated genes, or building a graph of putative regulatory "edges" between genes) and has been used to generate qualitative hypotheses about regulatory networks.

Instead of adopting the structure-learning viewpoint, our focus is to build predictive models of gene regulation, i.e., models that allow us to make accurate quantitative predictions on new or held-out experiments (test data). In our approach, we learn a prediction function for the regulatory response of genes, using a boosting algorithm to enable feature selection from a high-dimensional search space while avoiding overfitting. In particular, we learn motifs representing putative regulatory elements whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. We combine this information into a global predictive model for gene regulation. We apply our algorithm to two systems in yeast: the environmental stress response, and oxygen sensing and regulation. Our results show that we can make accurate predictions about which genes will be up- or down-regulated in held-out (test) microarray data, discover true regulatory elements, and suggest interpretable biological hypotheses about regulatory mechanisms.

Zhenqiu Liu, University of Maryland

Title: Genome-wide Tagging SNPs with Entropy-Based Methods

The number of single nucleotide polymorphisms (SNPs) in the human genome is estimated to be 3-6 million. It is widely hoped that the study of SNPs will provide a means of elucidating the genetic component of complex diseases and variable drug responses. High-throughput technologies such as oligonucleotide arrays have produced enormous amounts of SNP data and pose a great challenge for genome-wide disease association studies. In this talk, we present a new entropy-based Monte Carlo method for optimally selecting minimal informative subsets of SNPs, also known as "tagging SNPs", that is efficient for genome-wide selection. The proposed method is based on a multilocus LD measure and a global optimization technique, and it can deal with large numbers of SNPs without a block assumption. We compare our method with other techniques in the literature; it is much faster and performs well on the datasets tested.
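The entropy-based objective can be illustrated with a toy version of the selection problem: pick the subset of SNPs whose joint genotype patterns carry the most Shannon entropy, so the tags retain as much haplotype information as possible. A minimal sketch, substituting a greedy loop for the Monte Carlo optimization the abstract describes (the haplotypes are made up):

```python
from collections import Counter
from math import log2

def joint_entropy(haplotypes, snp_subset):
    """Shannon entropy (bits) of multilocus patterns restricted to a SNP subset."""
    patterns = Counter(tuple(h[i] for i in snp_subset) for h in haplotypes)
    n = len(haplotypes)
    return -sum((c / n) * log2(c / n) for c in patterns.values())

def greedy_tag_snps(haplotypes, k):
    """Greedily pick k SNPs maximizing joint entropy -- a simple stand-in
    for the Monte Carlo global optimization described in the abstract."""
    n_snps = len(haplotypes[0])
    chosen = []
    for _ in range(k):
        best = max((s for s in range(n_snps) if s not in chosen),
                   key=lambda s: joint_entropy(haplotypes, chosen + [s]))
        chosen.append(best)
    return sorted(chosen)

# Toy haplotypes over 4 SNPs; SNP 2 duplicates SNP 0, so it adds no information
haps = [(0, 0, 0, 1), (0, 1, 0, 0), (1, 0, 1, 1), (1, 1, 1, 0)]
tags = greedy_tag_snps(haps, 2)
assert joint_entropy(haps, tags) == 2.0  # two tags distinguish all four haplotypes
```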

Raja Loganantharaj, University of Louisiana at Lafayette

Title: Comparing the Performance of Several Popular Machine Learning Algorithms on Classifying the TATA Box from Putative TATA Boxes

To understand a regulatory mechanism, it is important to detect all the binding sites in a promoter region. A TATA box, a transcription factor binding site, occurs in many promoters, and its profile can be obtained from the TRANSFAC database or from a set of annotated sequences. From a profile, a position-specific weight matrix (PSWM) is constructed for the binding site. A TATA box is said to be detected in the promoter region of a DNA sequence when the total weight of a subsequence under the PSWM exceeds a certain threshold. Unfortunately, many substrings of a DNA sequence fit the profile of a TATA box; such substrings are called putative TATA boxes. It has been estimated that a large number of putative TATA boxes occur in the promoter sequences of many genomes.
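The PSWM detection step described above amounts to sliding a window along the sequence, summing per-position weights, and reporting every window above the threshold. A minimal sketch (the matrix values and threshold below are invented for illustration, not taken from TRANSFAC):

```python
# Hypothetical PSWM for a 6-position TATA-box-like motif: log-odds weight
# of each base at each position (made-up values, not a TRANSFAC profile).
PSWM = {
    'A': [-2.0,  1.5, -2.0,  1.5,  1.5,  0.5],
    'T': [ 1.5, -2.0,  1.5, -2.0, -0.5,  0.5],
    'G': [-2.0, -2.0, -2.0, -2.0, -2.0, -1.0],
    'C': [-2.0, -2.0, -2.0, -2.0, -2.0, -1.0],
}

def scan(sequence, pswm, threshold):
    """Return (position, score) for every window whose total PSWM weight
    exceeds the threshold -- these are the putative motif hits."""
    width = len(next(iter(pswm.values())))
    hits = []
    for i in range(len(sequence) - width + 1):
        score = sum(pswm[base][j] for j, base in enumerate(sequence[i:i + width]))
        if score > threshold:
            hits.append((i, score))
    return hits

hits = scan("GGGTATAAAGGG", PSWM, threshold=5.0)
assert hits == [(3, 8.0)]  # only the TATAAA window clears the threshold
```

Every hit returned by such a scan is only a putative TATA box; distinguishing the real one is the classification problem the talk addresses.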

Identification of the real TATA box among putative TATA boxes will improve the accuracy of determining the transcription start site (TSS) and hence the detection of a promoter. In this talk, I will describe my investigation of the effectiveness of several popular machine learning algorithms, namely naïve Bayes, artificial neural networks, decision trees, and support vector machines, in discriminating the TATA box from putative TATA boxes. To compare their effectiveness, we use the following metrics: prediction accuracy, true and false positive rates, and Kappa value.

Our previous work on discriminating the TATA box from putative TATA boxes in plant genomes revealed that the neighborhood around a TATA box carries the information required to distinguish it from putative TATA boxes. Our empirical results on plant genomes show that naïve Bayes outperforms other, more sophisticated machine learning algorithms in predicting the TATA box from putative TATA boxes. This may be due to the very short motifs surrounding a real TATA box; further investigations are being conducted for verification.

The prediction accuracy and the true positive rate of naïve Bayes increase with the length of the flanking strings. A comparison of Kappa values, a measure of performance over a random prediction algorithm, indicates that naïve Bayes performs better than all the other algorithms. A similar study is underway comparing the effectiveness of these algorithms in classifying TATA boxes from putative TATA boxes in the mouse genome; the results of this investigation will also be presented.

Feng Luo, Clemson University

Title: Modular Organization of Protein Interaction Network

Accumulating evidence suggests that biological systems are composed of interacting functional modules. Identification of these modules is essential for understanding the structure, function, and evolution of biological systems. These biological modules are reflected in distinct topological sub-networks within biological networks. We extend the degree concept from single vertices to sub-graphs and propose a formal definition of a module in a network. Two algorithms are developed to identify modules within biological networks. By applying our algorithms to the yeast core protein interaction network from the Database of Interacting Proteins (DIP), statistically significant protein modules are identified. Furthermore, both algorithms allow the construction of an interconnecting web of modules, offering insight into the high-level relationships among modules.

Chandan Reddy, Yao-Chung Weng and Hsiao-Dong Chiang, Cornell University

Title: Motif Refinement by Improving Information Content Scores using Neighborhood Search

The goal of motif finding is to detect novel, over-represented, unknown signals in a set of sequences (e.g., transcription factor binding sites in a genome). The most widely used algorithms for finding motifs obtain a generative probabilistic representation of the over-represented signals and try to discover profiles that maximize the information content score. Although these profiles form a very powerful representation of the signals, the major difficulty arises from the fact that the best motif corresponds to the global maximum of a non-convex continuous function. Algorithms like Expectation Maximization (EM) and Gibbs sampling tend to be very sensitive to initial guesses and are known to converge quickly to the nearest local maximum. To improve the quality of the results, EM is used with multiple random starts, or with global methods that might yield promising initial guesses (such as projection algorithms). Global methods do not necessarily give initial guesses in the convergence region of the best local maximum, but rather suggest that a promising solution is in the neighborhood region. In this work, we describe a novel optimization framework that searches the neighborhood regions of the initial alignments in a systematic manner to explore nearby profiles. Our results show that the popular EM algorithm often converges to sub-optimal solutions which can be significantly improved by the proposed neighborhood search. In experiments on both synthetic and real datasets, our method demonstrates significant improvements in the information content scores of the probabilistic models. The proposed method also offers the flexibility to use different local solvers and global methods that work well for specific datasets.
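The information content score being maximized here is the relative entropy of the motif profile against the background base distribution, summed over columns. A minimal sketch of the objective (the uniform 0.25 background is an illustrative default):

```python
from math import log2

def information_content(profile, background=0.25):
    """Information content (in bits) of a probabilistic motif profile.

    profile: list of columns, each a dict mapping base -> probability.
    This relative-entropy score is the objective that EM- and
    Gibbs-sampling-style motif finders try to maximize.
    """
    return sum(p * log2(p / background)
               for column in profile
               for p in column.values() if p > 0)

# A perfectly conserved column carries 2 bits; a uniform column carries 0
sharp = [{'A': 1.0}, {'T': 1.0}]
flat = [{'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}]
assert information_content(sharp) == 4.0
assert information_content(flat) == 0.0
```

Because this score is a non-convex function of the motif alignment, local moves around an initial alignment, as in the neighborhood search above, can climb to a better profile than a single EM run.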

Aik Choon Tan, Daniel Q. Naiman, Lei Xu, Raimond L. Winslow and Donald Geman, Johns Hopkins University

Title: Simple decision rules for classifying human cancers from gene expression profiles

Various studies have shown that cancer tissue samples can be successfully detected and classified by their gene expression patterns using machine learning approaches. One of the challenges in applying these techniques to gene expression data is extracting accurate, readily interpretable rules that provide biological insight into how classification is performed. Current methods generate classifiers that are accurate but difficult to interpret; this is the trade-off between credibility and comprehensibility. Here, we introduce a new classifier to address these problems. It is referred to as k-TSP (k-Top Scoring Pairs) and is based on the concept of 'relative expression reversals'. This method generates simple and accurate decision rules that involve only a small number of gene-to-gene expression comparisons, thereby facilitating follow-up studies. We have compared our approach to other machine learning techniques for class prediction in 19 binary and multi-class gene expression datasets involving human cancers. The k-TSP classifier performs as efficiently as Prediction Analysis of Microarrays and support vector machines, and outperforms other learning methods (decision trees, k-nearest neighbour, and naïve Bayes). Our approach is easy to interpret, as the classifier involves only a small number of informative genes. For these reasons, we consider the k-TSP method a useful tool for cancer classification from microarray gene expression data. Furthermore, we show that our method can be applied to cross-platform analysis: the classifier built from a marker gene pair, which simply compares relative expression values, achieves high accuracy, sensitivity, and specificity on independent datasets generated using various array platforms.
Our findings suggest a new model for the discovery of marker genes from accumulated microarray data and demonstrate how the great wealth of microarray data can be exploited to increase the power of statistical analysis.
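The 'relative expression reversal' idea can be sketched concretely: a gene pair scores highly when the ordering of its two expression values flips between classes, and the single top-scoring pair already yields a rank-based decision rule. A minimal sketch on made-up data (this follows the published TSP scoring idea, but the data and code are illustrative):

```python
def tsp_score(expr, labels, i, j):
    """Score of gene pair (i, j): |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.
    A high score means the ordering of the two genes' expression values
    reverses between classes -- a 'relative expression reversal'."""
    probs = []
    for c in (0, 1):
        samples = [x for x, y in zip(expr, labels) if y == c]
        probs.append(sum(x[i] < x[j] for x in samples) / len(samples))
    return abs(probs[0] - probs[1])

def top_scoring_pair(expr, labels):
    """Exhaustively search all gene pairs for the highest TSP score."""
    n_genes = len(expr[0])
    return max(((i, j) for i in range(n_genes) for j in range(i + 1, n_genes)),
               key=lambda p: tsp_score(expr, labels, *p))

# Toy data: gene 0 < gene 1 in class 0, reversed in class 1; gene 2 is noise
expr = [(1, 5, 3), (2, 6, 1), (7, 4, 2), (8, 3, 3)]
labels = [0, 0, 1, 1]
assert top_scoring_pair(expr, labels) == (0, 1)
```

The resulting rule ("predict class 0 when gene 0's expression is below gene 1's") depends only on the within-sample ranking of two genes, which is what makes it robust across array platforms.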

Anne Zhang, The University of Kansas

Title: Cancer Tissue Classification with Data-Dependent Kernels

Microarray techniques enable the measurement of gene expression profiles at genome scale and provide an unprecedented opportunity to characterize cells at the molecular level. Classification of tissue samples according to their gene expression profiles can be a valuable diagnostic tool for diseases such as cancer. Gene expression data sets are typically characterized by high dimensionality (a large number of genes) and small sample size, which makes the classification task quite challenging. We introduce a data-dependent kernel for cancer classification with microarray data. This kernel function is engineered so that the class separability of the training data is maximized. To overcome the problem of overfitting, a data resampling scheme is introduced for kernel optimization. The effectiveness of this adaptive kernel for cancer classification is illustrated with a k-nearest neighbor (KNN) classifier. In our experiments, the data-dependent kernel leads to a significant increase in the accuracy of the KNN classifier. The kernel-based KNN scheme is also shown to be competitive with, if not better than, more sophisticated classifiers such as support vector machines (SVMs) and uncorrelated linear discriminant analysis (ULDA) in classifying gene expression data.
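One common construction of a data-dependent kernel, which may or may not match the authors' exact formulation, is a conformal transformation of a base kernel: k~(x, y) = q(x) q(y) k(x, y), where q is a positive factor built from the training points. Optimizing q's coefficients is what maximizes class separability; in this sketch the coefficients are simply given, and all names and values are illustrative:

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """Base RBF kernel matrix between row-sample arrays X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_kernel(X, Y, train, alpha, gamma=1.0):
    """Data-dependent kernel k~(x, y) = q(x) q(y) k(x, y), with the scaling
    factor q(x) = 1 + sum_m alpha_m exp(-gamma |x - x_m|^2) expanded over
    the training points x_m.  Fitting alpha to maximize class separability
    (with resampling to curb overfitting) is the optimization the abstract
    describes; here alpha is fixed by hand."""
    qx = 1.0 + rbf(X, train, gamma) @ alpha
    qy = 1.0 + rbf(Y, train, gamma) @ alpha
    return qx[:, None] * qy[None, :] * rbf(X, Y, gamma)

train = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha = np.array([0.5, 0.5])
K = conformal_kernel(train, train, train, alpha)
assert K.shape == (2, 2) and np.all(np.diag(K) >= 1.0)  # q(x) >= 1 everywhere
```

A kernel-based KNN then measures distance in the induced feature space, d(x, y)^2 = k~(x, x) + k~(y, y) - 2 k~(x, y), so reshaping the kernel reshapes the neighborhoods the classifier uses.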

Document last modified on May 12, 2006.