Jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology, the Columbia University Center for the Multiscale Analysis of Genetic Networks (MAGNet), and the NIH Roadmap Initiative
This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).
Title: Experimental Gold Standards for Reverse Engineering Network Connections
Many groups are developing methods to predict the networks of biomolecular interactions (small molecule - protein, protein - protein, protein - DNA) that are responsible for biological processes. How can we evaluate algorithm performance? The existence of curated annotation schemes, most famously the Gene Ontology (GO) categories, provides an immediate route to 'gold standard' assignments of genes and proteins to specific functions or pathways. These annotations do not always correspond, however, to properties that can be measured experimentally. This talk proposes network connections as pathway properties that are subject to direct experimental verification. Large-scale studies are generating increasing numbers of high-confidence protein-protein and protein-DNA interactions. These data sets, pre-release, could provide an ideal mechanism for testing predictive methods.
Title: Evaluating Algorithms for Learning Biological Networks
In our group we have often encountered the need to evaluate the efficacy of our reverse engineering algorithms. Our evaluation attempts can be divided into two categories: 1) evaluations using simulation studies of synthetically generated data and 2) evaluations using experimentally collected data. Here, we discuss our experiences with both these categories through a set of case studies. The case studies will describe some of the problems we have encountered and lessons we have learned. All these studies involve the reverse engineering of biological networks from data. Most examples are drawn from our experiences with learning regulatory networks, but we also discuss some ongoing work on learning protein-protein interaction networks. With respect to regulatory networks, we discuss the learning of dynamic and static regulatory networks from synthetic and experimental data. The two biological networks we examine are the cell cycle in yeast and the vocal communication system in the songbird brain. For both these examples we used graphical models, so our evaluation studies will be focused on the learning of such networks using graphical models. With respect to protein-protein interaction networks, we discuss some of the difficulties we have encountered in comparing our work with other algorithms that have been published in the literature.
Title: Benchmarking reverse-engineering strategies via a synthetic gene network in Saccharomyces cerevisiae
The possibility of measuring transcript levels has prompted the development of a variety of computational algorithms that claim to be able to infer a detailed map of gene-to-gene interactions from these data. However, no clear data exist on their real performance due to the lack of an in vivo network that is perfectly known and could be used as a benchmark. To this end we built a synthetic network composed of 5 genes in S. cerevisiae. The network contains transcriptional and protein-protein interactions with feedback loops. In order to isolate the synthetic network from the cellular environment, we chose non-essential and non-redundant genes and deleted the endogenous ones. The coding sequence of each gene was assembled with a non-self specific promoter and with C-terminal tags including GFP in order to monitor the protein in living cells. Each cassette was integrated by homologous recombination within the locus of another gene, thereby simultaneously deleting that gene. We have now perturbed the synthetic network by over-expressing each of the 5 genes from a tetracycline inducible promoter, and measured transcript levels following two different strategies: data collected at steady state and at different time points. The different network inference strategies will be compared and "benchmarked".
Title: Reverse Engineering Gene-Protein Networks
Cellular processes are governed by extensive, interconnected networks of genes and proteins. The complexity of these cellular networks can hinder attempts to elucidate their structure and function. To address this problem, we have developed integrated computational-experimental approaches that enable construction of quantitative models of gene-protein regulatory networks using expression measurements and no prior information on the network structure or function. In this talk, we present these methods, and discuss how the reverse-engineered network models, coupled to experiments, can be used: (1) to gain insight into the regulatory role of individual genes and proteins in the network, (2) to identify the pathways and gene products targeted by pharmaceutical compounds, and (3) to identify the genetic mediators of different diseases.
Title: Quantifying Reliability of Dynamic Bayesian Networks
We seek to quantify the failure and success of Dynamic Bayesian Networks (DBNs), a popular tool for reverse engineering networks from time-series data, including data which, although generated by continuous time processes (e.g. genetic expression), are sampled at discrete times. To facilitate analysis and interpretation, we employ a "minimal model" to generate arbitrary abundances of stochastic data from networks of known topologies, which are then subsampled and in some cases interpolated. We find that DBNs perform relatively poorly when given datasets comparable to those used for genetic network inference. Interpolation does not appear to improve inference success. We benchmark DBN performance against linear regression.
Title: Understanding Biological Function through Evaluation of Genome-scale Networks
My talk will be concerned with topics in proteomics, in particular predicting protein function on a genomic scale. We approach this through the prediction and analysis of biological networks -- both of protein-protein interactions and transcription-factor-target relationships. I will describe how these networks can be determined through integration of many genomic features and how they can be analyzed in terms of various simple topological statistics. I will discuss the accuracy of various reconstructed quantities.
Title: Genome-scale mapping and global validation of the E. coli transcriptional network using a compendium of Affymetrix expression profiles
Machine learning approaches offer the potential to systematically identify transcriptional regulatory interactions from a compendium of microarray expression profiles. However, experimental validation of the performance of these methods at the genome scale has remained elusive. Here we assess the global performance of four existing classes of inference algorithms using 445 Escherichia coli Affymetrix arrays and 3216 known Escherichia coli regulatory interactions from RegulonDB. We also develop and apply the CLR algorithm, a novel extension of the relevance networks class of algorithms. CLR demonstrates an average precision gain of 36% relative to the next-best performing algorithm. At a 60% true positive rate, CLR identifies 1103 regulatory interactions, of which 338 were previously known interactions and 765 were novel predictions. We tested the predicted interactions for three transcription factors with chromatin immunoprecipitation, confirming 21 interactions and verifying our RegulonDB-based performance estimates. CLR also identified a regulatory link providing central metabolic control of iron transport, which we confirmed with real-time quantitative PCR. The compendium of expression data compiled in this study, coupled with RegulonDB, provides a valuable model system for further improvement of network inference algorithms using experimental data.
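The kind of evaluation described above, scoring predicted interactions against a reference set of known ones, can be illustrated with a minimal sketch. The gene names, scores, threshold, and the function name precision_recall_at are hypothetical and are not taken from the study or from RegulonDB. Predictions absent from the reference set are counted as false positives here, although some of them may be genuine novel interactions.

```python
def precision_recall_at(scored_edges, known_edges, threshold):
    """Return (precision, recall) for edges scoring at or above threshold."""
    predicted = {edge for edge, score in scored_edges.items() if score >= threshold}
    true_positives = predicted & known_edges
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(known_edges) if known_edges else 0.0
    return precision, recall


# Hypothetical toy data: (regulator, target) -> confidence score.
scored_edges = {
    ("tfA", "g1"): 0.9,
    ("tfA", "g2"): 0.8,
    ("tfB", "g3"): 0.7,
    ("tfB", "g1"): 0.4,
    ("tfC", "g2"): 0.2,
}
# Hypothetical reference set standing in for a curated database.
known_edges = {("tfA", "g1"), ("tfB", "g3"), ("tfC", "g2"), ("tfC", "g4")}

precision, recall = precision_recall_at(scored_edges, known_edges, threshold=0.5)
```

Sweeping the threshold over all observed scores would trace out a full precision-recall curve for comparing algorithms.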
Title: Data requirements of reverse-engineering algorithms
High-throughput methods currently make it feasible to simultaneously collect data on all chemicals in large biochemical networks, such as gene regulatory networks, metabolic networks, or signal transduction networks. Still, data collection remains expensive, and reverse-engineering algorithms typically rely on small data sets. Thus reverse-engineering problems are typically vastly underdetermined in the sense that a huge number of network models are consistent with the available data. Reverse-engineering algorithms usually deal with this problem by selecting and returning one model that is consistent with the data. In view of the above, it will be extremely useful to develop a theory of data requirements for the most popular reverse-engineering algorithms. Ideally, such a theory would be able to predict the probability that a given algorithm returns the correct, or an approximately correct, model of the network from a given data set. Moreover, if an algorithm uses input parameters, the theory should provide some guidelines for the most promising choice of input parameters. In the first part of this presentation we outline some general issues that arise in developing such a theory. We concentrate on only one type of data, suitably discretized concentration vectors of the chemicals in the network, and one modeling paradigm, treatment of the network as a discrete-time, finite-state dynamical system. However, most of our remarks should also apply to reverse-engineering from other types of data and/or under different modeling paradigms. In the second part we present some theorems on data requirements of the reverse-engineering algorithms developed by Laubenbacher and collaborators.
Title: Learning regulatory programs that accurately predict differential expression with MEDUSA
Inferring gene regulatory networks from high-throughput genomic data is one of the central problems in computational biology and a principal focus of the DREAM initiative. In this paper, we describe a new predictive modeling approach for studying regulatory networks, based on a novel machine learning algorithm called MEDUSA. MEDUSA integrates promoter sequence, mRNA expression, and transcription factor occupancy data to learn gene regulatory programs that predict the differential expression of target genes. Instead of using clustering or correlation of expression profiles to infer regulatory relationships, MEDUSA determines condition-specific regulators and discovers regulatory motifs that mediate the regulation of target genes. In this way, MEDUSA meaningfully models biological mechanisms of transcriptional regulation. MEDUSA solves the problem of predicting the differential (up/down) expression of target genes by using boosting, a technique from statistical learning, which helps to avoid overfitting as the algorithm searches through the high dimensional space of potential regulators and sequence motifs. Experimental results demonstrate that MEDUSA achieves high prediction accuracy on held-out experiments (test data), i.e. data not seen in training. The motivating problem behind the DREAM initiative is the difficulty of validating reverse engineered networks in the absence of a gold standard. Our approach of learning regulatory programs provides at least a partial solution for the problem: MEDUSA's prediction accuracy on held-out data gives a concrete and statistically sound way to validate how well the algorithm performs. With MEDUSA, statistical validation becomes a prerequisite for hypothesis generation and network building rather than a secondary consideration.
Title: Single Nucleotides in the P53 Pathway
In the cells of the body the p53 gene and its protein respond to environmental stresses. After an exposure to radiation or a mutagen, the p53 protein is activated and initiates a program of cell death, or apoptosis, eliminating clones of cells that carry mutations and that could develop into cancers. Individuals with somatic or germ line mutations in the p53 gene develop cancers at a very high frequency and at an early age. This is why the p53 gene has been called a tumor suppressor gene. Single nucleotide polymorphisms (SNPs) in the p53 pathway have been identified that enhance or reduce the efficiency with which the p53 pathway eliminates potentially cancerous cells in the body. A polymorphism in the MDM-2 gene, which regulates p53 protein levels by degrading the p53 protein, has been identified and termed SNP 309. Most individuals have a T-residue at the promoter element of the MDM-2 gene and this produces low levels of the MDM-2 m-RNA and protein. A small percentage of individuals in the population have a G-residue in that location and this produces about fourfold more MDM-2 m-RNA and protein, which in turn lowers p53 levels and weakens the p53 apoptotic response to DNA damage. Individuals with the G/G genotype tend to develop cancers more frequently and at a younger age (about 10-12 years earlier) than those with a T/T genotype. The MDM-2 gene is also regulated by the estrogen receptor so that pre-menopausal females with a G/G genotype are at the highest risk for developing some types of cancers at earlier ages. These observations help to explain the genetic basis of gender differences in cancer. In addition, this type of information can form the basis for identifying those women at highest risk from taking hormone replacement therapy or identifying those individuals who should be screened for cancers at earlier ages.
Title: In Silico Gold Standards from Virtual Cell
Elaborate biochemical reaction networks control the activity of cells and their responses to environmental stimuli. The challenge of understanding the behavior of such networks can be guided by the predictions generated from reverse engineering approaches. But cells and their subcellular organelles have complex structures that provide a framework for the dynamic spatial distribution of signaling molecules. How this cellular architecture shapes and controls the response of cells to their environment must be incorporated in any attempt to reverse engineer a cellular process. The Virtual Cell is a computational modeling software environment that has been designed to address this need. The operation of the Virtual Cell will be illustrated with several potential "Gold Standard" models. (This work is supported by NIH Grants U54RR022232 and P41RR013186)
Title: Dynamic pathway modeling: Feasibility analysis and optimal experimental design
A major challenge in Systems Biology is to evaluate the feasibility of a relevant biological research agenda prior to its realization. Since experiments consume animals, money, and time, approaches allowing researchers to discriminate alternative hypotheses with a minimal set of experiments are highly desirable. Given a null-hypothesis model and an alternative model, as well as laboratory constraints like observable players, sample size, noise level and stimulation options, we suggest a methodology to obtain a list of the experiments required to significantly reject the null-hypothesis model M0 if a specified alternative model MA is realized. For this purpose, we estimate the power to detect a violation of M0 by means of Monte Carlo simulations. Iteratively, the power is maximized over all feasible stimulations of the system using multi-experiment fitting, leading to an optimal combination of experimental settings to discriminate the null-hypothesis and alternative models. We demonstrate the importance of simultaneous modeling of combined experiments with quantitative, highly sampled in vivo measurements from the Jak/STAT5 signaling pathway in fibroblasts, stimulated with Erythropoietin (Epo). Afterwards we apply the presented iterative experimental design approach to the Jak/STAT3 pathway of primary hepatocytes stimulated with IL6. Our approach offers the possibility to decide which scientific questions can be answered based on existing laboratory constraints. Concentrating on feasible questions, identified through inexpensive computational simulations, not only yields enormous cost and time savings but also helps to specify realizable, systematic research agendas in advance.
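The idea of estimating power by Monte Carlo simulation can be sketched in a deliberately simplified setting. The assumptions here are not from the abstract: the null model M0 predicts a mean of zero, the alternative MA predicts a mean of effect, noise is Gaussian with known standard deviation, and M0 is tested with a two-sided z-test at the 5% level. The multi-experiment fitting of pathway models described above is not reproduced, and the function name estimate_power and all parameter values are hypothetical.

```python
import math
import random


def estimate_power(effect, noise_sd, sample_size, n_sim=2000, z_crit=1.96, seed=0):
    """Fraction of experiments simulated under MA in which M0 is rejected."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    rejections = 0
    for _ in range(n_sim):
        # Simulate one experiment under the alternative model.
        data = [rng.gauss(effect, noise_sd) for _ in range(sample_size)]
        mean = sum(data) / sample_size
        # z statistic for the null hypothesis "mean is zero".
        z = mean / (noise_sd / math.sqrt(sample_size))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim


# Compare two hypothetical designs that differ only in sample size.
power_small = estimate_power(effect=0.5, noise_sd=1.0, sample_size=5)
power_large = estimate_power(effect=0.5, noise_sd=1.0, sample_size=40)
# With effect=0 the same routine estimates the false positive rate.
false_positive_rate = estimate_power(effect=0.0, noise_sd=1.0, sample_size=40)
```

A design search would repeat this estimate over the feasible experimental settings and keep the setting with the highest estimated power.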
Title: In Silico Models for Reverse Engineering - Complexity and Realism versus Well-Defined Metrics
In silico biochemical networks are important for establishing comparisons of reverse engineering algorithms. A major advantage of such networks is that they allow establishing many different types of (simulated) experiments, thus are able to provide data of diverse kinds. This is important because many algorithms require specialized types of experiments that are not compatible with the requirements of other algorithms. Another advantage of in silico networks is their utility in providing objective metrics for the efficiency of reverse engineering algorithms. This is due to the fact that these networks are known in detail and so it is possible to quantify several metrics, like ROC curves. An issue that arises from the use of in silico networks, though, is whether they can provide realistic data. Can in silico networks display behaviors that are essentially similar to those of real biochemical networks? I will present two in silico networks of different complexity and show that as we increase realism in these networks, their ability to provide well-defined metrics decreases.
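Because the true connections of an in silico network are known, every candidate edge can be labeled as present or absent and a ROC-based metric computed for an algorithm's edge scores. The following is a minimal sketch that computes the area under the ROC curve as the probability that a true edge outscores a non-edge. The scores and labels are hypothetical toy values, and the function name roc_auc is an illustrative choice.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one true edge and one non-edge")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for six candidate edges of a small simulated network.
edge_scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
is_true_edge = [True, True, False, True, False, False]
auc = roc_auc(edge_scores, is_true_edge)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of true edges from non-edges.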
Title: The gap gene system of Drosophila melanogaster: Model-fitting and validation
The gap gene system of Drosophila melanogaster, part of the segmentation network, is one of the most well-known developmental gene networks, and is an ideal system for benchmarking the performance of reverse engineering algorithms. Regulation in the system is thought to be almost entirely by means of transcription factors, and many of those interactions have been identified by standard laboratory techniques. Thus, the success of reverse engineering algorithms can be evaluated. In addition, low-noise, high spatial- and temporal-resolution expression data from wild-type organisms is publicly available. Several recent reverse engineering studies based on this data have established clear performance benchmarks, both in terms of quality and speed of fitting. Finally, while the gap gene system is relatively well understood, the regulation of genes further downstream in the segmentation network is not as well understood. Successful reverse engineering of these downstream genes would constitute a clear scientific contribution to our understanding of this system. We begin by giving an overview of the gap gene system and the expression data used in recent reverse engineering studies. We then summarize some of the findings and contributions of those studies, including a new optimization approach that promises to dramatically speed the onerous task of fitting differential equation models to time series data. Finally, we turn to the question of model validation, discussing several different notions of validation and how validation was approached in the aforementioned studies.
Title: Inferring Regulatory Pathways: Data and experimental design
Molecular networks underlie the decision processing of cells - constituting a cascade of information flow, triggered by signals, culminating in a cellular response. We demonstrate the application of probabilistic graphical model machine learning approaches to analyze high throughput proteomic and genomic data and automatically map regulatory pathways. We use intracellular multicolor flow cytometry to quantitatively and simultaneously measure the abundance of phosphorylated proteins and phospholipids in thousands of single primary human CD4+ T cells. Additionally, we use small molecules that activate or inhibit the measured molecules, facilitating inference of the direction of influence between them. Our Bayesian network algorithm identified a majority of classically reported signaling relationships, and predicted a novel causal influence of Erk1 on Akt that we confirmed experimentally. Our results demonstrate the feasibility of computational elucidation of causal influences in signaling networks from high throughput proteomic data. After presenting the main results we will discuss which qualities of the data made our inference so successful and how gene expression data sets can be better designed towards the goal of network inference.
Title: Nuclear Pore Complex: The hole picture?
Nucleocytoplasmic transport occurs through nuclear pore complexes (NPCs), macromolecular structures embedded within the nuclear envelope. Composed of nucleoporins (nups), NPCs mediate bi-directional trafficking between the nucleoplasm and cytoplasm, acting as a dynamic barrier to control access to the nucleus. Nuclear transport of macromolecules depends on the interplay between transport cargoes, their cognate soluble transport factors, and NPCs. Our group studies the structure and function of the NPC in the model eukaryote Saccharomyces (yeast). We have catalogued the NPC's composition and assigned fold types to ~98% of the nups, exposing a simple modularity in the architecture of the NPC; moreover, similarities between structures in coated vesicles and those in the NPC support our hypothesis for their common evolutionary origin in a progenitor protocoatomer. We have also determined the position, shape and stoichiometry of each nup, and have systematically isolated nup subcomplexes and analyzed their composition by mass spectrometry in order to determine the network of interactions they make. Together, this wealth of information represents thousands of spatial restraints, which we have used to create a three-dimensional map of the NPC's architecture using the computer program MODELLER. We have determined the position of every nup with a precision of ~5 nm, sufficient to resolve the molecular organization of the entire NPC. We see an arrangement of coaxial rings with lateral interactions; these define a set of cage-like structures lining the pore membrane and forming a tube faced and lined with docking sites for transport factors. Taken together, these data have allowed us to propose an evolutionary origin for the NPC, and also to propose a mechanism for nuclear transport. These approaches show great promise as a novel method for studying large, flexible or transient macromolecular complexes.
Title: Using Data Fusions and Biomolecular Modeling towards Improving the Results of Reverse Engineering in Biological Networks. The ENRICHed Approach
Biological networks have a number of unique and distinguishing features. Chief among them is, arguably, the fundamentally chemical nature of most underlying molecular reaction processes. While providing for their inherently high organizational and functional complexity, this characteristic property also imposes a number of strong constraints on the possible biological network structures that may correspond to the observed system dynamics and other traits. The advent of high-throughput experimental techniques has been making large quantities of empirical information about such dynamic behaviors, individual molecular interactions, general network robustness, etc. available for a broad range of organisms. Correspondingly, a number of powerful methods based upon various statistical, algebraic and other principles have been developed to take advantage of these types of data towards reverse engineering of their constitutive biological networks. Notably, the results of these methods must nonetheless satisfy the constraints imposed by the underlying chemical nature of biological molecular networks. As such criteria may not be an intrinsic part of the reconstruction algorithm, our ENRICH approach aims to provide an ability to synchronize these reverse engineering results with the underlying biochemical constraints by, among other things, relying on data fusions between structural information and other types of observation.
Title: Simulations and Multifactorial Gene Perturbation Experiments as a Way to Validate Reverse Engineered Gene Networks Reconstructed via the Integration of Genetic and Gene Expression Data
To dissect common human diseases like obesity and diabetes, a systematic approach is needed to study how genes interact with one another, and how genetic and environmental factors and interactions among and between these factors contribute to clinical end points or disease phenotypes. Bayesian networks provide a convenient framework for extracting relationships from noisy data and are frequently applied to large-scale data to derive causal relationships among variables of interest. Given the complexity of molecular networks underlying common human disease traits, large data sets are required to reconstruct and reliably extract information from these networks. However, increasing the number of subjects in an experiment is an expensive and time-consuming way to improve network reconstruction. In addition, biological networks can be rewired under varying environmental and genetic perturbations so that many experiments are needed to get a comprehensive view of the network. With limited resources, the balance of coverage of multiple perturbations and multiple subjects in a single perturbation needs to be considered in the experimental design. Further, the use of Bayesian network reconstruction methods to derive predictive models has not met with great success in life sciences and biomedical research. However, it has recently been demonstrated that combining genotypic and gene expression data in a segregating population leads to improved network reconstruction, which in turn may lead to better predictions of the effects of experimental perturbations on any given gene. Here we simulate data from biologically motivated networks and quantify the improvement in network reconstruction achieved using genotypic and gene expression data, compared to reconstruction using gene expression data alone. 
We demonstrate that networks reconstructed using the combined genotypic and gene expression data achieve a level of reconstruction accuracy that exceeds networks reconstructed from expression data alone, and that fewer subjects may be required to achieve this superior reconstruction accuracy. Given that many complex phenotypes like common human diseases may be emergent properties of biological networks that are themselves defined by a complex network of genetic and environmental perturbations, we must move beyond the concept of single gene perturbation experiments as a way to validate these networks and the complex phenotypes they induce. Therefore, we also discuss the role single and multifactorial gene perturbation experiments can play in validating the predictive power of our reconstructed networks. We discuss the type of data that could be generated towards this end (with some examples) to achieve validation on a large scale. More generally, we discuss the important role independent sets of experiments could play in cross-validating networks reverse engineered using one set of experiments, and subsequently validated using independent experiment sets.
Title: Computational Modeling of Fetal Erythroblasts Predicts Negative Autoregulatory Interactions Mediated by Fas and its ligand
Tissue development is regulated by multiple intercellular signaling interactions that control developmental rate and the number of differentiated progeny. Here we present a novel computational algorithm used to identify specific feedback and feedforward interactions between progenitors in developing erythroid tissue. The algorithm makes use of dynamic measurements of red cell progenitors between embryonic days 12 and 15 in the mouse. It selects for intercellular interactions that reproduce the erythroid developmental process and endow it with robustness to external perturbations. This analysis predicts that pivotal negative autoregulatory interactions arise between early erythroblasts of similar maturation stage. By studying 189 mouse embryos, both wild type and mutant for the death receptor Fas or for its ligand FasL, in which we measured the rate of Fas-mediated apoptosis in vivo, we show that Fas and FasL are the molecular mediators of a negative autoregulatory interaction predicted by the computational model. The presented algorithm is unique and compares favorably with existing network reconstruction algorithms. It is also the only network reconstruction algorithm proposed so far that is applied to a developmental network.
Title: Reverse Engineering of Network Topology
In this paper we consider the problem of reverse-engineering static models, so-called wiring diagrams, of biochemical networks, that is, directed graphs that represent the causal relationships between network variables. We present an algorithm which computes all possible minimal wiring diagrams for a given data set of measurements from a biochemical network and scores the diagrams. The algorithm uses computational algebra, namely primary decomposition of monomial ideals, as the principal tool. An application to the reverse-engineering of two gene regulatory networks is included.
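To make the notion of a minimal wiring diagram concrete, here is a rough sketch for a single network node. It relies on a formalization that is an assumption here, since the abstract does not spell out a definition: a set of input variables is consistent with the data if any two measurements that yield different outputs differ in at least one variable of the set, and a minimal wiring diagram is a consistent set with no consistent proper subset. The sketch enumerates these sets by brute force rather than by primary decomposition of monomial ideals, and it omits scoring. The function name minimal_wiring_sets and the toy data are hypothetical.

```python
from itertools import combinations


def minimal_wiring_sets(inputs, outputs):
    """Enumerate minimal sets of variable indices consistent with the data.

    inputs: list of equal-length tuples of discretized variable states.
    outputs: list giving the target node's state for each input.
    """
    n_vars = len(inputs[0])
    # For each pair of observations with different outputs, record the
    # variables on which the two inputs differ.
    difference_sets = []
    for (x, fx), (y, fy) in combinations(zip(inputs, outputs), 2):
        if fx != fy:
            diff = frozenset(i for i in range(n_vars) if x[i] != y[i])
            if not diff:
                raise ValueError("identical inputs with different outputs")
            difference_sets.append(diff)
    minimal = []
    # Search by increasing size so supersets of earlier hits are skipped.
    for size in range(n_vars + 1):
        for candidate in combinations(range(n_vars), size):
            cand = frozenset(candidate)
            if any(found <= cand for found in minimal):
                continue
            # A consistent set must intersect every difference set.
            if all(cand & diff for diff in difference_sets):
                minimal.append(cand)
    return minimal


# Hypothetical Boolean measurements of three variables and one target node.
inputs = [(0, 0, 0), (1, 1, 0), (0, 1, 1)]
outputs = [0, 1, 1]
diagrams = minimal_wiring_sets(inputs, outputs)
```

In this toy case the data admit two minimal explanations: the target depends on variable 1 alone, or on variables 0 and 2 together. Brute-force enumeration grows exponentially with the number of variables, which is one reason an algebraic method may be preferable for realistic networks.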
Title: The DREAM project and the goals of this conference
A number of methods have been developed, and continue to be developed, to disentangle the connectivity maps within the cell. To help the community understand the merits and pitfalls of one method versus another, we have started what we call the DREAM (Dialogue on Reverse Engineering Assessment Methods) project, which is composed of two inter-related thrusts. In the first thrust, we will create a repository of data, methods and tools to reverse engineer signaling, gene regulatory, metabolic, and developmental networks. The second thrust consists of periodic conferences in which the DREAM steering committee will curate data sets (actual measurements of different data types, as well as data produced in silico) of known but undisclosed network topology and parameters. The participants in this exercise will be challenged to infer the connectivity of the network underlying the curated data sets. In this way, we expect to enhance our understanding of the limitations and potential of specific methods as well as of the whole conception of reverse engineering from integrated data sets of cellular networks. The present conference has been designed to discuss the feasibility of the DREAM project.