DIMACS Workshop on Machine Learning Approaches for Understanding Gene Regulation

August 15 - 17, 2005
DIMACS Center, CoRE Building, Rutgers University

Christina Leslie, Columbia University, cleslie@cs.columbia.edu
Chris Wiggins, Columbia University, chris.wiggins @ columbia.edu
Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology.

This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).

Harmen Bussemaker, Department of Biological Sciences, Columbia University

Title: Condition-specific regulation of mRNA stability in yeast

Steady-state transcript abundances are determined by the balance of transcription and mRNA turnover. While cis-regulatory elements in DNA that control the transcription rate have been intensively studied both experimentally and computationally, the control of mRNA stability has received far less attention. Using a computational approach that requires only genomic sequence and steady-state mRNA expression data as input, we discovered weight matrices that model the RNA-binding specificity of several distinct mRNA stability factors. These include two members of the Pumilio-homology domain family of RNA-binding proteins, Puf3p and Puf4p. We provide both computational and experimental evidence that the regulation of mRNA stability by these CREs is dynamic and responds to a variety of environmental
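As a rough illustration of how a weight matrix can model binding specificity, the sketch below scores every window of an RNA sequence against a small matrix and reports the best-scoring site. The matrix values, the uniform background frequency, and the function names are all invented for this example; they are not the matrices or the discovery procedure described in the talk.

```python
import math

# Hypothetical 4-position weight matrix: per-position base probabilities.
# These numbers are made up for illustration, not taken from the talk.
PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "U": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "U": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "U": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "U": 0.05},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def log_odds_score(site):
    """Sum of per-position log2(p / background) for one candidate site."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, site))


def best_site(sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(PWM)
    windows = range(len(sequence) - width + 1)
    return max((log_odds_score(sequence[i:i + width]), i) for i in windows)
```

Calling `best_site` on a 3' UTR sequence returns the log-odds score and start offset of the strongest match; positive scores indicate windows that look more like the motif than like background.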

Mark Craven, Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison

Title: Modeling Overlapping Sequence Elements and other Challenges in Uncovering Regulatory Networks in Bacteria

I will discuss my group's work on developing and applying machine learning methods for the task of elucidating transcription-regulation networks in bacterial genomes. In particular, I will focus on several algorithmic contributions we have made, including methods for (i) modeling and predicting arbitrarily overlapping elements in sequence data, (ii) learning to represent the hidden states and roles of key variables in regulatory networks, and (iii) learning models that represent kinetic rate constants in terms of sequence features.

Mark Gerstein, Yale University

Title: Understanding protein function on a genome-scale using networks

My talk will be concerned with topics in proteomics, in particular predicting protein function on a genomic scale. We approach this through the prediction and analysis of biological networks -- both of protein-protein interactions and transcription-factor-target relationships. I will describe how these networks can be determined through Bayesian integration of many genomic features and how they can be analyzed in terms of various simple topological statistics.

Alex Hartemink, Department of Computer Science, Duke University

Title: Quality, quantity, and diversity of high-throughput data: methodological ramifications and biological results

The availability of high-throughput DNA sequence data spurred the growth of a new field, now known as computational biology. Because DNA sequence data are discrete and have only occasional errors, the field took shape with a decidedly discrete and deterministic outlook. But with the availability of high-throughput mRNA and protein expression data, which are continuous and extremely noisy, more continuous and probabilistic methods have been developed to address these new challenges. In large measure, these methods derive from---and contribute to---the fields of statistical and machine learning.

I have been interested in two different kinds of learning tasks in this domain: network inference (unsupervised) and classification (supervised). This talk will present a collection of results along both of these dimensions. I will omit some of the methodological details, as they are available in our papers over the years, and instead present the methodological highlights and results, as part of an effort to summarize the whole endeavor, as well as reflect on what the next steps will be.

David Kulp, Department of Computer Science, University of Massachusetts

Title: Regulatory Network Dependencies From Quantitative Trait Loci

Jansen and Nap introduced the concept of "genetical genomics," in which putative disease-associated loci are identified through linkage analysis of gene expression data treated as quantitative phenotypes. Since then, work by Brem et al., Schadt et al., Bystrykh et al., and others has shown that QTL analysis of large-scale gene expression data can implicate trans-acting loci as putative upstream modulators of gene expression in regulatory pathways. We expand on this idea by proposing a simple Gaussian linear model of gene expression that explicitly incorporates additive genetic effects. Our model can be seen as an extension of classic QTL interval mapping to incorporate the expression level of putative modulators. Alternatively, our model represents the addition of a genotype state to the regulatory module of a Bayesian network of gene expression.
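One way to picture such a model is as a regression of a target gene's expression on both a marker genotype and the expression of a putative modulator. The sketch below assumes the form target = b0 + b_g * genotype + b_m * modulator + Gaussian noise, for which ordinary least squares gives the maximum-likelihood coefficients. The exact parameterization used in the talk is not given in the abstract, so the model form and all names here are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_expression_model(genotype, modulator, target):
    """Least-squares fit of target = b0 + b_g * genotype + b_m * modulator.

    genotype: 0/1 marker state per strain (assumed additive coding).
    modulator, target: expression levels per strain.
    Returns [b0, b_g, b_m].
    """
    rows = [[1.0, g, m] for g, m in zip(genotype, modulator)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(3)]
    return solve_linear_system(xtx, xty)
```

Setting b_m to zero recovers a plain single-marker genotype regression, which is the sense in which a model of this shape extends a standard QTL test with the modulator's expression level.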

Using a large data set of gene expression and genotype data from a set of 30 recombinant inbred mice, we generated bootstrapped simulated datasets, introducing a new target gene according to various complex models of control (e.g., exponential, discrete control, and multiple regulators), to determine how well our model could detect true relations amid the confounding noise of tens of thousands of gene expression measurements and to compare its performance with conventional QTL interval mapping. We show that our method is highly robust and surprisingly effective compared to standard genetic mapping. Next, we applied our model to the gene expression (6000+ ORFs) and genotype (2957 markers) data from 115 crosses between two homozygous yeast strains. Brem et al. had identified by manual investigation the likely gene regulators in six highly pleiotropic loci. We show how our method automatically identified the correct gene. Moreover, we used our model to identify putative targets of known regulators and found highly significant enrichment for known transcription factor binding sites.

Christina Leslie, Center for Computational Learning Systems, Columbia University

Title: Discovering regulatory element motifs by predictive modeling of gene regulation

Studying the behavior of gene regulatory networks by learning from high-throughput genomic data has become one of the central problems in computational biology. Most work in this area has focused on learning structure from data -- e.g. finding clusters of potentially co-regulated genes, or building a graph of putative regulatory "edges" between genes -- and has been successful at generating qualitative hypotheses about regulatory networks.

Instead of adopting the structure learning viewpoint, our focus is to build predictive models of gene regulation, i.e., models that allow us to make accurate quantitative predictions on new or held-out experiments (test data). In our approach, we learn a prediction function for the regulatory response of genes, using a boosting algorithm to enable feature selection from a high-dimensional search space while avoiding overfitting. In particular, we generate motifs representing putative regulatory elements whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression, and we combine this information into a global predictive model for gene regulation. In experiments for the yeast environmental stress response, our method is able to make accurate predictions about which genes will be up- or down-regulated in held-out (test) microarray experiments, retrieve known regulatory elements, and suggest interpretable biological hypotheses about regulatory mechanisms.
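The general idea of boosting over motif-and-regulator features can be sketched with plain discrete AdaBoost, where each binary feature stands for "motif present in the promoter AND regulator active in the experiment" and each label marks a gene as up-regulated (+1) or down-regulated (-1). The abstract does not specify the boosting variant or the feature encoding, so AdaBoost with single-feature rules is a stand-in here and every name below is hypothetical.

```python
import math


def adaboost(features, labels, rounds):
    """Discrete AdaBoost over binary features.

    features: list of 0/1 feature vectors, one per (gene, experiment) example.
    labels: list of +1 / -1 (up- or down-regulated).
    Each weak rule predicts `sign` when feature j is 1 and -sign otherwise.
    Returns a list of (alpha, j, sign) weighted rules.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Pick the single-feature rule with the lowest weighted error.
        for j in range(len(features[0])):
            for sign in (1, -1):
                preds = [sign if x[j] else -sign for x in features]
                err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, j, sign, preds)
        err, j, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep alpha finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, sign))
        # Up-weight misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the alpha-weighted vote of all selected rules."""
    score = sum(alpha * (sign if x[j] else -sign) for alpha, j, sign in model)
    return 1 if score >= 0 else -1
```

Because each round selects a single feature, the list of selected rules doubles as a form of feature selection: only motif-regulator pairs that were picked in some round contribute to predictions on held-out experiments.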

Satoru Miyano, Human Genome Center, Institute of Medical Science, University of Tokyo

Title: Reverse Engineering of Gene Networks from Microarray Data with Heterogeneous Genome-Wide Biological Information

We developed a series of computational methods based on Bayesian networks for mining gene networks from microarray gene expression data. We combined the Bayesian network approach with nonparametric regression, where genes are regarded as random variables and the nonparametric regression enables us to capture both linear and nonlinear relationships between genes. To improve the biological accuracy of estimated gene networks, we extended this method into a general framework that can employ other genome-wide biological information, such as sequence information on promoter regions, protein-protein interactions, protein-DNA interactions, subcellular localization information, and literature.

The problem of finding an optimal Bayesian network is known to be computationally intractable. Our recent computational effort has made it possible to search and enumerate optimal and suboptimal Bayesian networks in feasible time on supercomputers. Computational experiments with this search algorithm have provided evidence of the biological rationality of our computational strategy.

Chris Myers, Cornell Theory Center, Cornell University

Title: Inference of gene regulation in bacterial pathogen

Many bacterial pathogens infect hosts by secreting effector proteins into host cells through the type III secretion system (TTSS). Through a collaboration between Cornell and USDA-ARS research groups, we are combining genome-wide experimental and computational methods to infer gene regulation networks underlying pathogenesis in Pseudomonas syringae, a plant pathogen that infects a broad range of hosts. Previous work has focused on regulation of the TTSS itself, but less attention has been paid to the regulatory connections linking pathogenicity to other bacterial processes (such as stress response, quorum sensing, etc.). We have been able to combine whole-genome microarrays and high-throughput proteomic analyses with machine learning techniques such as Gibbs sampling and motif classification to develop more predictive models of regulatory motifs than had previously been available. This talk will focus on the identification of promoter motifs recognized by various sigma factors, and the application of those computational models to infer regulation in different host-specific strains.

Dana Pe'er, Department of Genetics, Harvard University

Title: Inferring Molecular Pathways in Mammals: A Single Cell Approach

Molecular networks underlie the regulatory function and decision processing of cells -- constituting a cascade of information flow, triggered by signals, culminating in a cellular response. With the advent of high-throughput genomic and proteomic technologies, molecular biology is experiencing an explosion of new experimental results. We present computational tools that analyze high throughput genomic and proteomic data, to automatically map mammalian signaling pathways and obtain a systems level view of the cell's regulatory network. We use intracellular multicolor flow cytometry that allows for quantitative, simultaneous observation of multiple signaling molecules in many thousands of individual cells. A key distinction of our approach is the use of single cell measurements, thus avoiding population averaging, which often masks true activities.

We use Bayesian networks for automated elucidation of causal influences in protein signaling networks using simultaneous multivariate measurements of phosphoproteins on single cells. To automatically derive inter-molecular influence connections, we applied a probabilistic modeling algorithm to determine a graph of causal influences among phospho-signaling molecules in human primary CD4+ T cells. The approach identified a majority of classically reported signaling relationships and predicted novel influence connections, including inter-pathway crosstalk in the form of a causal influence of Erk1 on Akt, which we confirmed experimentally.

Norbert Perrimon, Department of Genetics, HHMI and Harvard Medical School

Title: Technical challenges associated with RNAi screen in Drosophila cells

High-throughput screens based either on RNA interference methods or small molecules are increasingly being used to identify and understand the molecular pathways responsible for key behaviors of cells. In these screens, one or more genes or pathways is individually perturbed in many samples of cells in parallel, allowing many pathway components relevant to a behavior of interest to be identified. A key need is for readouts that accurately reflect the behavior of each sample, and which must be fully automated to work effectively in high-throughput contexts. Recently, automated microscopy has been coupled with image analysis to generate readouts that provide unprecedented levels of information about structural, compositional, and population aspects of cells. Screens using these methods have characterized molecular networks responsible for cytokinesis, muscle and neurite development, and other behaviors. However, automated readouts that accurately reflect overall cell shape, a frequently critical aspect of cell function, have been unavailable because image analysis algorithms perform poorly at distinguishing and finding the correct boundaries of individual cells in crowded cell images and at extracting useful features to classify phenotypes. I will present our efforts at addressing some of these issues.

Nikolaus Rajewsky, Department of Biology and Mathematics, New York University

Title: Gene regulation by microRNAs

MicroRNAs are a large class of non-coding regulatory genes that are often differentially expressed in tissues and development, and often conserved over large evolutionary distances. MicroRNAs regulate the expression of target genes by binding to partially complementary sites in the messenger RNAs of the targets. Little is known about the biological function of microRNAs, although recent experiments have demonstrated that they can regulate insulin secretion, developmental timing, apoptosis, signaling pathways, etc. In this talk I describe our work on identifying targets of microRNAs, which led to our prediction that thousands of genes in each of the vertebrate, fly, and nematode genomes are regulated by microRNAs. Some of these predictions were experimentally validated. I also present insights into microRNA function by comparing targets between very distant clades. Finally, our understanding of microRNA:target recognition is still very limited and I will outline open problems.

John Reinitz, Applied Mathematics, SUNY Stony Brook

Title: Variation and Transcriptional Control in Drosophila Segment Determination

This talk will be concerned with two fundamental questions. The first is the determination of a morphogenetic field, and the second is the control of transcription in metazoan genes with large promoters.

One of the central ideas in animal development is that of the determination of cell fates in a morphogenetic field. A second central idea, or perhaps observation, is that morphogenetic fields are capable of regulation, a classical term for the correction of errors. In the past, regulation was investigated by surgical perturbation of embryos. In the modern context regulation can also be studied in the context of genetic perturbations or of individual variations in gene expression in an isogenic population. We consider this problem in the early embryo of the fruit fly Drosophila, a well characterized system for molecular developmental genetics which can also be used as a naturally grown differential display system for reverse engineering networks of genes. This system is being used by ourselves and others to address fundamental questions about the reliability of developmental processes.

In the Drosophila system which we study, determination of the morphogenetic field is implemented by means of differential regulation of transcription. The control of this process by groups of binding sites is as yet poorly understood. We present a new model of transcriptional control and show how it can be used to understand anomalous expression of even-skipped stripe 7 and to predict the results of site directed mutagenesis experiments.

Valerie Reinke, Department of Genetics, Yale University School of Medicine

Title: Genome organization, gene expression, and germline development in C. elegans

Genome-scale sequencing projects of multiple organisms have provided us with the means to gain higher-order views of genome organization. Recent work has clearly demonstrated that genes are not arranged randomly in the genome. Global gene expression studies in yeast, C. elegans, Drosophila, and humans demonstrate that genes within a genomic neighborhood often have similar expression profiles. For instance, sex chromosome(s) often have a paucity or enrichment of genes whose expression is regulated within the germ line or by sexual identity (Reinke et al., 2000; Wang et al., 2001; Parisi et al., 2003). Additionally, local chromosome domains, ranging in size from ten to several hundred kilobases, frequently contain genes that are co-expressed in specific tissues or under specific conditions (Cohen et al., 2000; Caron et al., 2001; Roy et al., 2002; Spellman and Rubin, 2002). An open question from these studies is whether these gene arrangements are passive, such that neighboring genes adopt similar expression states as a consequence of local chromatin conformation, or whether a selective advantage drives re-organization of the genome so that co-expressed genes cluster over an evolutionary time scale. We have found through gene expression profiling that genes expressed in a specific tissue in C. elegans, the germline, have a very striking genome organization. Genes expressed in the germline are lacking from the X chromosomes, cluster in local domains, and frequently appear in operons. Indeed, most operons contain primarily or exclusively genes expressed in the germline. These data suggest that germline-expressed genes have unique constraints on their organization in the genome relative to somatic genes.

Rob Schapire, Department of Computer Science, Princeton University

Title: Introduction to machine learning

Machine learning studies the design of computer algorithms that automatically make predictions about the unknown based on past observations. The focus of machine learning research is on the design of fully automatic methods that can be applied "off the shelf" to virtually any learning problem. This talk will introduce machine learning, particularly with regard to classification problems, briefly describing some of the main state-of-the-art machine learning techniques, and also discussing some of the key issues in the design of machine-learning systems, including avoidance of overfitting.

Anirvan Sengupta, Department of Physics, Rutgers University

Title: SVMs and probabilistic approaches for classifying promoters

We discuss how likelihood-based approaches to regulatory site detection, with minimal input from the biophysics of protein-DNA interaction, lead naturally to low-degree polynomial kernels for an appropriate embedding of sequences in R^n. We study the performance of these one-class SVMs on real and quasi-synthetic data. The method allows us to score sites as well as set a threshold to separate functional sites from non-functional ones.
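To make the kernel idea concrete, the sketch below embeds a DNA site of length L as a one-hot vector in R^(4L) and evaluates a polynomial kernel on two such embeddings. The one-hot embedding, the offset of 1, and the default degree of 2 are assumptions chosen for illustration; the abstract does not state which embedding or kernel parameters were used.

```python
BASES = "ACGT"


def one_hot(site):
    """Embed a DNA site of length L as a 0/1 vector in R^(4L)."""
    return [1.0 if base == b else 0.0 for base in site for b in BASES]


def polynomial_kernel(site_a, site_b, degree=2, offset=1.0):
    """Evaluate (x . y + offset) ** degree on the one-hot embeddings.

    For one-hot vectors the dot product x . y is simply the number of
    positions at which the two sites carry the same base.
    """
    x, y = one_hot(site_a), one_hot(site_b)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + offset) ** degree
```

With degree 1 the kernel corresponds to scoring positions independently, much like a weight matrix; with degree 2 its implicit feature space also contains products of pairs of positions, so dependencies between positions can influence the score.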

Combining evidence from heterogeneous sources (motifs, gene expression, phylogenetic comparison, etc.) to verify regulatory interactions is becoming the way to compensate for the limitations of inference based on individual data types. In many cases, each kind of data allows us to rank genes according to the likelihood of the gene being the target of regulation by a particular mechanism. Often, one has to choose cutoffs separately for each of these rankings and then use some meta-classifier to combine the results to decide whether the gene in question is appropriately regulated. We discuss a simple non-parametric method that combines ranked data to discover correlation between high ranks. The threshold is drawn on the combined data in a principled manner. We show how well this method works for a particular yeast dataset, where we have experimentally tested the predictions from this method.

Gustavo A. Stolovitzky, Functional Genomics & Systems Biology, IBM T.J. Watson Research Center

Title: Function-centric Mining of Gene Expression Data: Profiling Distinctions between Similar Cancer Subtypes

Different classes of cancer may be associated with differences in the behavior of a number of cellular processes. When these differences involve modifications of the expression of genes, transcriptional analysis can be used to uncover differentially regulated processes in these cancers. The typical way this has been done is by finding the genes that are differentially expressed in two classes of cancer and subsequently identifying those cellular processes enriched in the differentially expressed genes. We call this a Gene-Centric approach, to emphasize that the starting point is a list of differentially expressed genes. Processes that depend on gene interactions may be difficult to detect by a Gene-Centric approach alone. We propose a Function-Centric approach, in which the starting point is a catalogue of biological processes with their associated genes. We systematically explore all the processes in a given catalogue and look for those processes whose genes contain enough information (not necessarily in the form of over- or under-expression) to discriminate between the given cancer states. This approach is particularly useful if we deal with subtypes of related cancers in which only a modest number of genes exhibit substantial differential expression. We apply these ideas to the study of two sub-phenotypes in Chronic Lymphocytic Leukemia, one with a benign and the other with a more malignant disease course. The overall assessment indicates that cell signaling through receptor tyrosine kinases, vesicular transport, and nucleotide and carbohydrate metabolism behave differently in these two CLL subtypes.

Olga Troyanskaya, Lewis-Sigler Institute for Integrative Genomics, Princeton University

Title: Building network-level pathway models from diverse functional genomic data

I will describe a general probabilistic system for discovery of pathway-specific networks through integration of diverse genome-wide functional data, including high-throughput gene expression microarrays, physical and genetic interactions, protein co-localization, transcription factor co-regulation, and data from individual experiments curated from literature. This framework was validated by accurately modeling known networks for 31 biological processes in Saccharomyces cerevisiae. In addition to modeling known biology, the method identifies novel network components that include 1006 uncharacterized proteins. This method produces accurate and readily testable hypotheses, as demonstrated by our experimental verification of predictions for the process of chromosomal segregation. The system can also be used to study novel functional links among diverse biological processes.

Koji Tsuda, AIST (Japan)

Title: Selective Integration of Multiple Biological Data for Supervised Inference of Protein and Gene Networks

Inferring networks of genes and proteins from biological data is a central issue of computational biology. Most network inference methods, including Bayesian networks, take unsupervised approaches in which the network is totally unknown in the beginning, and all the edges have to be predicted. Yamanishi et al. (2004) recently proposed a more realistic supervised framework that assumes that a substantial part of the network is known. We propose a new kernel-based method for supervised graph inference based on multiple types of biological datasets such as gene expression, phylogenetic profiles, and amino acid sequences. Notably, our method assigns a weight to each type of dataset and thereby selects informative ones. Data selection is useful for reducing data collection costs. For example, when a similar network inference problem must be solved for other organisms, the datasets excluded by our algorithm need not be collected. Supervised network inference is formulated as a kernel matrix completion problem, where the inference of edges boils down to estimation of missing entries of a kernel matrix. Then, an EM algorithm is proposed to simultaneously infer the missing entries of the kernel matrix and the weights of multiple datasets. By introducing the weights, we can integrate multiple datasets selectively and thereby exclude irrelevant and noisy datasets. Our approach performed favorably in tests on two biological networks: a metabolic network and a protein interaction network.
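The role of per-dataset weights can be illustrated with the simplest form of kernel combination, an entrywise weighted sum of one kernel matrix per dataset. This is only an assumed stand-in: the abstract does not say how the weights enter the model, and the EM-based completion of missing entries is not shown. The function name is hypothetical.

```python
def combine_kernels(kernels, weights):
    """Entrywise weighted sum of equally sized kernel matrices.

    kernels: list of square matrices (lists of lists), one per dataset.
    weights: one non-negative weight per dataset; a weight of zero
    removes that dataset from the combined kernel entirely.
    """
    size = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(size)]
        for i in range(size)
    ]
```

Under this assumed form, a dataset whose learned weight is zero contributes nothing to the combined kernel, which mirrors the point that excluded datasets would not need to be collected for a related inference problem.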

Chris Wiggins, Department of Applied Mathematics, Columbia University

Title: Predicting evolution from topology: a machine learning approach

There has been a proliferation of models of "network mechanisms" (e.g., duplication-mutation, preferential attachment, small-world models) proposed recently to describe various biological networks. I'll discuss a recently developed machine learning approach for inferring the mechanism most accurately capturing a given network topology. We classify several biological networks, including transcriptional regulatory networks, with most attention paid to the protein-protein interaction network of Drosophila melanogaster. I hope also to discuss how this approach might be useful for studying to what extent the topologies of inferred networks might be consequences either of the inference approach or of systematic errors particular to the experimental technique on which they are based.
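For readers unfamiliar with these growth mechanisms, the sketch below generates a network by one of them, preferential attachment, in its simplest form: each new node attaches to a single existing node chosen with probability proportional to that node's degree. It illustrates only the kind of generative model being classified, not the classification approach of the talk; the one-edge-per-node variant, the seed handling, and the function names are assumptions.

```python
import random


def preferential_attachment(n_nodes, seed=0):
    """Grow a graph in which each new node links to one existing node,
    chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]  # each node appears once per incident edge
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)  # degree-proportional sampling
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges


def degrees(edges):
    """Map each node to its degree."""
    counts = {}
    for a, b in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts
```

Networks simulated under several such mechanisms could serve as labeled training examples, with topological statistics such as the degree sequence as features, for a classifier that is then applied to an observed biological network.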

Eric Xing, Department of Computer Science, Carnegie Mellon University

Title: In silico detection of cis-regulatory elements under complex genomic and evolutionary context: a probabilistic graphical model approach

A hallmark of the transcriptional regulatory sequences of higher eukaryotic genomes is the presence of highly sophisticated deterministic and stochastic constraints on the spatial distribution of cis-regulatory elements and diverse structural characteristics of the DNA-binding domains of the trans-regulatory elements, and the enormous complexity of the regulatory sequences in which motifs must be found. Most contemporary motif detection algorithms adopt simple assumptions on motif structure and organization, and are therefore incapable of identifying non-trivial regulatory structures such as enhancers against the complex background of higher eukaryotic genomes. In this talk, I will discuss a methodology based on probabilistic graphical models for modeling the transcriptional regulatory sequences in complex genomes. This approach uses a Bayesian formalism to capture the dependency structure of regulatory elements at two levels: the conservation dependencies between sites within motifs and the clustering of motifs into regulatory modules. It supports major queries related to in silico cis-regulatory , such as learning motif representations, model-based motif prediction, and de novo motif detection. I will also discuss some recent ideas on probabilistic models for motif and enhancer evolution, and outline a novel multi-resolution phylogenomic model for comparative cis-element detection.

Document last modified on August 1, 2005.