DIMACS Workshop on Analysis of Gene Expression Data

October 24 - 26, 2001
DIMACS Center, Rutgers University, Piscataway, NJ

Laurie Heyer, Davidson College, laheyer@davidson.edu
Gustavo Stolovitzky, IBM, gustavo@us.ibm.com
Shibu Yooseph, Celera Genomics, shibu.yooseph@celera.com
Presented under the auspices of the Special Year on Computational Molecular Biology.



1. Selective Expression algorithm for class separation for DNA Microarrays

Virginie Aris, Peter Tolias, and Michael Recce, Center for Applied Genomics, 
Public Health Research Institute, Center for Computational Biology and 
Bioengineering, NJIT

DNA microarrays are powerful tools that enable the transcriptional
expression profiling of thousands of genes simultaneously.  They have a
substantial impact on research by increasing the throughput of gene
expression analysis, gene function annotation, drug screening, and disease
classification.  The analytical techniques vary depending on the nature of
the experimental design.  In addition, data from microarray experiments are
both quantitative (expression level) and qualitative (the gene is expressed
or not).  However, a major drawback of quantitative data is its limited
accuracy and precision.  Normalization (to enable comparisons between
chips) is performed by scaling the gene expression levels of one chip to a
control microarray, to a control target intensity, or to another color
standard (in the case of printed microarrays).  When the scaling factor
departs from one, the magnitude of the correction may affect the
accuracy of the resulting data.  For example, a scaling factor of two will
double the intensity levels in the corrected chip.

With respect to performing class separation, we found that selectively
expressed genes contain information pertinent to the classification problem
(Aris and Recce, 2001).  A selectively expressed gene is by definition
differentially expressed in two sets of samples.  Our method was developed
using data derived from GeneChips, high-density oligonucleotide probe
arrays produced by Affymetrix Inc.  Each gene is represented by 16 to 20
probe cells on a GeneChip containing 25mer oligonucleotides complementary
to the gene sequence (perfect match; PM), and a repeat of these probe cells
with a homomeric base mismatch (MM) at the central position.  The presence
or absence of a gene is then evaluated by the Affymetrix software, with a
decision matrix based on metrics comparing the intensity of the PM to the
MM.  We then convert absent and present calls into binary numbers, with one
corresponding to present and zero corresponding to absent.  The selectivity
of each gene is computed as the absolute value of the difference
between the real-valued averages of the binary values in the two
groups.  Selective genes are considered significant if they are twice as
likely to be expressed in one group as in the other (absolute difference is
larger than 0.5).  The genes are then ranked by their selectivity and the
most selective ones are used to construct exemplars.  By shuffling the
data, we can distinguish whether the grouping was random or significant.
A simple form of normalization can be applied to this algorithm to improve
the separation.  This normalization seeks to eliminate errors in the
assignment of absent calls, which may have occurred due to variation in the
processing of the samples.  We found that a high background or low
hybridization results in chips with a lower number of genes found
present.  Genes that are expressed at a low level are primarily affected as
they are close to background.  This normalization strategy reclassifies
present calls with low expression as absent calls on chips with a
higher-than-average number of present genes.
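
The selectivity score and the shuffling check described above can be sketched as follows.  This is a minimal illustration, not the authors' implementation: the function names and the toy data are hypothetical, and "shuffling the data" is assumed here to mean randomly permuting the sample labels.

```python
import random

def selectivity(calls_a, calls_b):
    """Absolute difference between the fraction of 'present' calls in each group."""
    mean_a = sum(calls_a) / len(calls_a)
    mean_b = sum(calls_b) / len(calls_b)
    return abs(mean_a - mean_b)

def rank_genes(calls, labels):
    """calls: {gene: [0/1 per sample]}, labels: ['A' or 'B' per sample].
    Returns (gene, selectivity) pairs sorted from most to least selective."""
    scores = []
    for gene, values in calls.items():
        group_a = [v for v, lab in zip(values, labels) if lab == "A"]
        group_b = [v for v, lab in zip(values, labels) if lab == "B"]
        scores.append((gene, selectivity(group_a, group_b)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def shuffled_top_scores(calls, labels, n_shuffles=45, seed=0):
    """Top selectivity obtained after randomly permuting the sample labels
    (assumed interpretation of the shuffling step)."""
    rng = random.Random(seed)
    tops = []
    for _ in range(n_shuffles):
        permuted = labels[:]
        rng.shuffle(permuted)
        tops.append(rank_genes(calls, permuted)[0][1])
    return tops

# Hypothetical toy data: 3 genes, 4 samples per group (1 = present, 0 = absent).
labels = ["A"] * 4 + ["B"] * 4
calls = {
    "gene1": [1, 1, 1, 1, 0, 0, 0, 0],  # present only in group A
    "gene2": [1, 0, 1, 0, 1, 0, 1, 0],  # equally present in both groups
    "gene3": [1, 1, 1, 0, 0, 0, 1, 0],  # mostly present in group A
}
ranking = rank_genes(calls, labels)
significant = [gene for gene, score in ranking if score > 0.5]
null_scores = shuffled_top_scores(calls, labels)
```

Comparing the observed selectivity at a given rank with the mean and standard deviation of the shuffled scores at the same rank gives the kind of significance check the text describes.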

Our algorithm was developed using data from Golub et al. (1999).  This
expression data was derived from bone marrow and peripheral blood samples
from patients suffering from acute lymphoblastic leukemia (ALL) and acute
myeloid leukemia (AML).  The data was split into two sets, a training
set and an independent set.  The training set was used to develop the
classification method and the independent set to test it.  Golub et al.
classified and predicted classes of leukemia by correlating the two disease
states with gene expression levels.  With our method, we found 121
significant selective genes, some of which had also been found by Golub et
al.  Some of these genes were relevant to the diseases involved (e.g.,
HOXA9, cyclin D3 and zyxin).  The selectivity of these genes was 3 to 9
standard deviations above the mean of the selectivity of genes at the same
rank in 45 shuffled random sets.  To classify the samples into the two
leukemia groups, we took the Euclidean distance of each gene's binary
value to the two exemplars.  We then averaged this distance over the 30 to
100 most selective genes and determined to which exemplar each sample was
closest.  We were able to classify the samples from the training set
accurately, and only sample 66 was incorrectly classified in the
independent data set (note: this sample was systematically misclassified
by all the analysis techniques presented at the CAMDA '00 conference, and
was speculated to be a diagnostic error).  We also performed the same
classification after normalization, which increased the separation between
the two clusters.

In summary, our selective approach is novel, is complementary to
quantitative differential expression analysis, and is well suited to
developing diagnostic microarrays.  Its power comes from its simplicity and
robustness.  The general usefulness of this method will be determined by
applying it to other datasets.
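
The exemplar-based classification step can be illustrated with a short sketch.  The abstract does not say how exemplars are constructed, so the code assumes each exemplar is the per-gene fraction of present calls within a training group; the names and the toy calls are hypothetical.

```python
def build_exemplar(samples):
    """samples: list of binary call vectors (one per sample) over the selected genes.
    Assumed exemplar: the per-gene fraction of 'present' calls in the group."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def mean_distance(sample, exemplar):
    """Per-gene Euclidean distance (absolute difference in one dimension),
    averaged over the selected genes."""
    return sum(abs(x - e) for x, e in zip(sample, exemplar)) / len(sample)

def classify(sample, exemplars):
    """Return the label of the closest exemplar."""
    return min(exemplars, key=lambda label: mean_distance(sample, exemplars[label]))

# Hypothetical training calls over four selective genes.
all_train = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]]
aml_train = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
exemplars = {"ALL": build_exemplar(all_train), "AML": build_exemplar(aml_train)}

first = classify([1, 1, 0, 0], exemplars)
second = classify([0, 0, 1, 1], exemplars)
```

In the setting described above, the vectors would hold the 30 to 100 most selective genes rather than four.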

Aris, V.M., and Recce, M., 2001. A Method to Improve Detection of Disease
Using Selectively Expressed Genes in Microarray Data.  CAMDA '00 Conference
Proceedings, in print.

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H,
Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular
classification of cancer: class discovery and class prediction by gene
expression monitoring.  Science 286(5439): 531-537, 1999.

2. Microarray Image Analysis and Expression Ratio Statistics
Yidong Chen, Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
To quantitatively analyze gene expression levels, two fluorescently labeled RNAs are routinely hybridized to arrayed cDNA probes on a glass slide. Ratios of gene expression levels arising from the two co-hybridized samples are obtained through image segmentation and signal detection methods. In our earlier report, the expression ratio was studied via ratio statistics, and a ratio confidence interval was established so that ratio outliers can be easily identified. Typically, we assume that the fluorescent background level does not interfere with the ratio measurement, which is derived via background subtraction and weak target elimination. However, many experimental results suggest that ratios derived from weak targets possess larger variation than those from strong targets. In this study, we propose a new signal and noise interaction model. Under the noisy environment, the ratio statistic is evaluated numerically and its self-adjusting confidence interval is introduced. The new confidence interval, which automatically adapts to different signal-to-noise ratios (SNR), provides a better criterion for further interrogating weak expression levels. In addition, a new quality metric for each ratio is proposed that provides a quality assessment of the expression ratio measurement based on factors such as spot size, saturation, and consistency. The quality factor effectively enables us to process gene expression data, without introducing a fixed intensity filter, in many analysis methods, such as clustering, fingerprinting analysis, and predictive analysis. A microarray image simulation program is proposed in order to understand the variance introduced by the image analysis methods. Not only will this simulation program help separate variations from various sources, but it will also provide a useful tool for evaluating existing software and designing better microarray image analysis software.
3. A Nonlinear Channel Normalization Approach for cDNA Microarray Data and Its Performance Analysis
Chao Deng, Aili Wang, Denong Wang, Peisen Zhang, Columbia Genome Center, Columbia University, New York City, NY 10032, USA
In microarray experiments, a variety of systematic errors affect measured gene expression levels. A normalization procedure is necessary to identify and remove such sources of systematic error. Current normalization methods apply linear modifications to the absolute or log intensities of gene expression, and they encounter many difficulties when several complex sources of systematic error are present. A nonlinear normalization method, the polynomial channel correctness normalization (PCCN) approach, is proposed in this paper to deal with the normalization problem of cDNA microarray data efficiently. PCCN uses a polynomial of limited order to approximate the nonlinear dependence between the two dye channels, so that the imbalance between the two channels can be largely compensated or corrected. Owing to its strong capacity for nonlinear compensation, PCCN achieved better normalization performance on cDNA microarray data than existing approaches in extensive experiments. PCCN's efficacy was confirmed by many experiments.
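
The idea of approximating the dependence between two dye channels with a low-order polynomial can be sketched as below. This is only an illustration under stated assumptions, not the PCCN procedure itself: it assumes the fit is done on log intensities, that most genes are not differentially expressed, and that the correction consists of removing the fitted trend from the second channel. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def polynomial_channel_correction(log_ch1, log_ch2, order=3):
    """Fit log_ch2 ~ p(log_ch1) with a polynomial p of the given order and
    return log_ch2 with the fitted dye trend replaced by the identity line.
    After correction, (corrected - log_ch1) equals the fit residuals."""
    coeffs = np.polyfit(log_ch1, log_ch2, order)
    trend = np.polyval(coeffs, log_ch1)
    return log_ch2 - trend + log_ch1

# Hypothetical synthetic data: channel 2 is a curved function of channel 1 plus noise.
rng = np.random.default_rng(0)
log_ch1 = rng.uniform(6.0, 14.0, size=500)
log_ch2 = 0.5 + 1.2 * log_ch1 - 0.02 * log_ch1 ** 2 + rng.normal(0.0, 0.1, size=500)

corrected = polynomial_channel_correction(log_ch1, log_ch2, order=2)
raw_ratio = log_ch2 - log_ch1        # log ratio before correction
new_ratio = corrected - log_ch1      # log ratio after correction
```

A single scaling factor would shift `raw_ratio` by a constant and leave its intensity-dependent curvature in place, which is the limitation of linear methods that the abstract points to.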
4. T. Gregory Dewey (Keck Graduate Institute of Applied Life Sciences, Claremont Graduate University), David Galas (Keck Graduate Institute of Applied Life Sciences, Claremont Graduate University), Ashish Bhan (Department of Mathematics, Claremont Graduate University)
Talk title: A Network Analysis of Expression Time Series
Abstract: A common approach to tackling complex phenomena is to first establish the extent and limitations of the linear response domain of the system. While there is no a priori reason to assume that the time series of a gene expression profile can be well described by a simple linear model, it is a reasonable starting point and can provide important clues to the origin of nonlinearity in the system. Linear models, in themselves, can often lead to surprisingly complicated responses, especially for multivariate systems. This provides a strong motivation for an in-depth exploration of such models. In this work, a dynamic, linear response model for analyzing time series of whole genome expression is presented. The simplest assumption about expression profile data is that the expression state represented in the data from one time point determines the expression state seen at the next time point. This assumption is equivalent to modeling the data by a first-order Markov process. The linear version of this model is described by a single transition matrix, L, defining the transitions from one state to the next. The expression levels of each of m genes at two points in time can be described by column vectors, a(t) and a(t-1), each of length m. The transition between the two states is modeled by $a_i(t) = \sum_{j=1}^{m} L_{ij}\, a_j(t-1)$, where $a_i(t)$ is the expression level of the ith gene at time t after some exposure or treatment. The transition coefficients are $L_{ij}$, which are the respective elements of the $m \times m$ transition matrix. The matrix elements represent the influence of the expression level of the jth gene on that of the ith gene. Using this model, we calculate a transition matrix for both cell-cycle and diauxic shift data in yeast. These models are statistically robust and lead naturally to a network of interactions reflected in the data. A graphical network can be readily derived from these results by thresholding the parameter $L_{ij}$ and replacing it with a 1 above a certain threshold and a 0 below the threshold. This procedure results in a sparse matrix that gives a network representation of the data. This approach provides a direct method of classifying genes according to their place in the resulting network and offers an alternative to traditional clustering approaches. These network groupings compare favorably with previously used methods like cluster analysis. The network derived by this method shows a hierarchical structure that is dominated by a collection of central hubs. These hubs are interconnected and have a cascade of tree-like structures attached to them. The statistical properties of these resulting networks were determined for a number of different time series data sets for yeast in the public domain. These results consistently show networks that have "small world" characteristics and show scale-free distributions of connectivities. A general class of network growth models has been derived that shows behavior consistent with the experimental results. Non-linear and higher order Markov behavior of the network can also be included by a self-consistent method. Networks derived from these more sophisticated models show similar behavior.
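
The linear first-order model a(t) = L a(t-1) and the thresholding step can be sketched as follows. The abstract does not state how the transition matrix is estimated, so this sketch assumes an ordinary least-squares fit through the pseudoinverse and thresholds the absolute value of each coefficient; both choices, the function names, and the toy system are assumptions for illustration.

```python
import numpy as np

def fit_transition_matrix(expression):
    """expression: array of shape (m genes, T time points).
    Least-squares estimate of L in a(t) = L a(t-1), via the pseudoinverse."""
    previous = expression[:, :-1]   # columns a(t-1)
    current = expression[:, 1:]     # columns a(t)
    return current @ np.linalg.pinv(previous)

def threshold_network(L, threshold):
    """Binary adjacency matrix: 1 where |L_ij| exceeds the threshold, else 0.
    Entry (i, j) = 1 reads as 'gene j influences gene i'."""
    return (np.abs(L) > threshold).astype(int)

# Hypothetical noise-free 3-gene system generated from a known transition matrix.
true_L = np.array([[0.9, 0.0, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.0, -0.4, 0.8]])
series = [np.array([1.0, 0.5, -1.0])]
for _ in range(9):
    series.append(true_L @ series[-1])
expression = np.column_stack(series)   # shape (3, 10)

L_hat = fit_transition_matrix(expression)
network = threshold_network(L_hat, threshold=0.2)
```

With whole-genome data the number of genes far exceeds the number of time points, so the least-squares problem is underdetermined and the pseudoinverse returns the minimum-norm solution; the toy system here avoids that issue by having more time points than genes.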
5. Vladimir Filkov and Sorin Istrail, Celera Genomics Talk title: Inferring Gene Transcription Networks: The Davidson Model Abstract: In 2001 Eric Davidson published the book "Genomic Regulatory Systems," where he reports on his and his colleagues' 30 years of work on developmental gene regulation of purple sea urchin. Their work resulted in a general experimental framework for the study of a gene's cis-regulatory region (an upstream DNA sequence containing a series of consecutive binding sites). Their approach consisted of performing systematic, almost exhaustive, mutations to individual binding sites of a gene's cis-region, and observing the corresponding transcription rates. They focused mostly on the endo16 gene. By quantitative analysis of the observed transcription rates, they were able to infer a complete set of minimal functional units of regulation and their interrelations. Hierarchically from those units, they uncovered "modularity" and "hardwired information processing logic" of that cis-region. Their extraordinary technology and the inference of the underlying cis-region's "network" for endo16 resulted in the most completely understood transcriptional system to date. It is quite remarkable how combinatorial and robust their approach is. In this paper we present an analysis and introduce a natural mathematical formalism of the Davidson transcriptional network inference framework together with combinatorial problems and algorithms related to it.
6. Nanxiang Ge (Aventis Pharmaceuticals), Fei Huang (Bristol-Myers Squibb), Peter Shaw (Bristol-Myers Squibb), C.F. Wu (University of Michigan, Ann Arbor)
Talk title: PIDEX: a Statistical Approach for Screening Differentially Expressed Genes Using Microarray Analysis
Abstract: Microarray technology is being applied in pharmaceutical drug discovery. A typical experiment compares the gene expression profiles under two different conditions, with the purpose of finding genes that are differentially expressed between the conditions. Common practice is to use fold change for detecting differential expression. However, use of fold change can generate many false positives because of the existence of genes with low or undetectable expression levels. A novel method to analyze differentially expressed genes is presented that combines the fold change, the change in the absolute intensity measurements, and data reproducibility. It produces p-values for identifying differentially expressed genes (PIDEX). The proposed methodology is demonstrated by analyzing the expression profiling data from a public data set and an internally conducted experiment comparing two cell lines (ES2 and WI38). Results from these analyses and a validation study using quantitative RT-PCR assays suggest that PIDEX outperforms the use of fold change alone.
7. Lessons from the Analysis of Gene Expression Data: A comparison of methods for inferring gene networks and different data sets
Thomas Heiman, George Mason University
A great deal of enthusiasm has been generated about microarray expression data analysis during the last couple of years. After an initial flood of methods developed to cluster gene expression data based upon similarity of mRNA expression level profiles, a number of techniques have been developed to reconstruct networks of biomolecular interactions, or gene networks, in order to create integrated and systematic models of the biological systems under study. A variety of approaches have been developed. However, it is important to clearly define and validate the results of different analytical tools given the variance in the methods and the amount of data currently available. The goal of this paper is to shed some light on this issue and to serve as a baseline by comparing the different approaches available, both discrete and continuous methods, to extract the Saccharomyces cerevisiae GAL gene network from four different sets of publicly available mRNA expression data.
8. MDL Gene Subset Selection for Classification Rebecka Jornsten, Department of Statistics, UC Berkeley rebecka@stat.berkeley.edu The scientific implication of microarray data lies in its revelation of biological information. Firstly, we are interested in finding groups of genes that function in a similar fashion under various experimental conditions. The conditions can correspond to tissue types, cell lines, time, or pathological contexts (e.g. type of cancer). For each gene, the vector of gene expression levels under different experimental conditions is also called the gene profile. Secondly, we would like to classify or group the conditions using the set or a subset of gene profiles. Thirdly, we aim to determine a functional relationship between gene expression profiles and experimental conditions, i.e. possible interactions and outcome models based on selected gene profiles. These three are tasks of statistical inference. Here we discuss the development of statistical methodologies using the Minimum Description Length (MDL) principle to deal with mainly the second and first task. When properly formulated, these tasks all fall within the sub-field of model selection in statistics. MDL is a general statistical modeling principle based on the data compression philosophy, that a good statistical model should compress the data well. It formalizes Occam's Razor and explicitly states that one should choose the model that gives the shortest description of the data (Rissanen, 1978). The key question in MDL research is which description length to use for a particular model class, which is addressed by optimal universal coding theorems. For regular parametric families, two-stage, mixture, predictive, and NML (normalized maximum likelihood) codes have been shown to achieve the universal optimality. MDL has had its major impact in statistical inference for model selection problems. 
With high dimensional gene profile data, an appropriate account of the complexity of a model is crucial for gene subset selection. The coding framework that MDL relies on provides a natural way to account for the model complexity. Rissanen, J. (1978), Modeling by shortest data description, Automatica, 14, 465-471
9. Richard Karp, University of California, Berkeley Talk Title: Combinatorial and Information-Theoretic Approaches to Mining Gene Expression Data Abstract: Gene expression data from a set of microarray experiments is typically presented as a matrix in which the rows correspond to genes, the columns to experiments, and each entry to the expression level of a given gene in a given experiment. Clustering methods are often used to find sets of genes, or sets of experiments, with similar patterns of expression. One can then explore the biological reasons for such similarities. In the case of genes, the similarity may arise because the genes are regulated by the same transcription factors or environmental conditions. In the case of experiments on different clinical samples, similarity of expression may occur because the samples are taken from tissues in similar disease states. Related to clustering is the problem of supervised classification, in which each experiment corresponds to a clinical sample, and a clustering of the samples according to phenotype is given. Here the challenge is to devise a rule that distinguishes the different clusters on the basis of their gene expression patterns, and can be used to classify further clinical samples. The problem of feature selection is of central importance, both in clustering and in supervised classification. This problem arises because of the large number of genes that can be measured in a single microarray experiment. The expression level of each gene can be used as a feature influencing the clustering or classification of clinical samples, but usually a clustering or classification rule is biologically plausible, comprehensible and robust only if it is based on a relatively small number of informative features. I shall describe joint work with Eric Xing and Michael Jordan (1,2) which uses information-theoretic principles as a guide to feature selection. 
The resulting algorithms have been successful in classifying leukemia samples, both in the setting of clustering and in the setting of supervised classification. Another approach to finding structure in a matrix of gene expression data is to look for two-dimensional patterns. Such a pattern is specified by selecting both a set of genes and set of experiments, such that the expression levels of the selected genes within the selected experiments exhibit some regularity or uniformity. For example, one might look for patterns in which every selected gene has a high level of expression in every selected experiment. I will report on joint work in progress with Amir Ben-Dor, Benny Chor and Zohar Yakhini directed towards finding patterns in which the expression levels of the selected genes are similarly ordered within the selected set of experiments. References [1] Eric P. Xing and Richard M. Karp, "CLIFF: Clustering of High-Dimensional Microarray Data via Iterative Feature Filtering Using Normalized Cuts," ISMB Conference (2001). [2] Eric P. Xing, Michael I. Jordan and Richard M. Karp, "Feature Selection for High-Dimensional Genomic Microarray Data," Machine Learning Conference (2001).
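
The notion of genes being "similarly ordered" within a selected set of experiments can be made concrete with a small check. The sketch below only groups genes by the ordering they induce on a given set of experiments; it does not attempt the search over gene and experiment subsets, which is the hard part of the problem. The names and the toy matrix are hypothetical, and ties in expression level are assumed not to occur.

```python
def column_order(values, columns):
    """Return the selected experiments sorted by increasing expression value."""
    return tuple(sorted(columns, key=lambda c: values[c]))

def similarly_ordered_genes(matrix, columns):
    """Group genes by the ordering their expression induces on the selected
    experiments. Returns {ordering: [gene, ...]}."""
    groups = {}
    for gene, values in matrix.items():
        groups.setdefault(column_order(values, columns), []).append(gene)
    return groups

# Hypothetical expression matrix: genes x 4 experiments.
matrix = {
    "g1": [0.1, 2.0, 1.0, 5.0],
    "g2": [10.0, 40.0, 30.0, 90.0],  # different scale, same ordering as g1
    "g3": [3.0, 1.0, 2.0, 0.5],
    "g4": [0.2, 0.9, 0.4, 0.1],      # agrees with g1 on experiments 0, 1, 2 only
}
groups = similarly_ordered_genes(matrix, columns=[0, 1, 2])
largest = max(groups.values(), key=len)
all_columns = similarly_ordered_genes(matrix, columns=[0, 1, 2, 3])
```

Restricting the experiments from four to three enlarges the group of consistently ordered genes, which illustrates why both the gene set and the experiment set have to be selected together.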
10. Alvaro Mateos (Bioinformatics Unit, Centro Nacional de Investigaciones Oncologicas (CNIO), Ctra. Majadahonda-Pozuelo, km. 2, Majadahonda, 28200, Madrid), Joaquin Dopazo (Bioinformatics Unit, Centro Nacional de Investigaciones Oncologicas (CNIO), Ctra. Majadahonda-Pozuelo, km. 2, Majadahonda,28200, Madrid), Yuhai Tu (IBM Computational Biology Center, T.J. Watson Research Center), Ronald Jansen (Department of Molecular Biophysics and Biochemistry, Yale University), Mark Gerstein (Department of Molecular Biophysics and Biochemistry, Yale University), Gustavo Stolovitzky (IBM Computational Biology Center, T.J. Watson Research Center) Talk title: What can be "learned" from Gene Expression Arrays? Abstract: Recent advances in microarray technologies have allowed for the study of gene expression from a genomic perspective. One important application of this technology is functional annotation. When cells are treated under different conditions, genes will change their pattern of expression according to their cellular role, and this data can be used to assess their biological function. There are a number of possible data analysis techniques to deal with this type of data (see [1] for a review of methods). Typically, unsupervised clustering of patterns of expression only provides information on genes that co-express. However, genes belonging to the same functional class may display more complex behaviours, undetectable by these techniques. This difficulty can be overcome with supervised learning algorithms. These algorithms use the prior knowledge of gene function to extract out of the expression patterns the signature(s) corresponding to the different functional classes. Support vector machines (SVM) and other machine learning methods have recently been applied for this purpose [2]. We will review this previous work and elaborate on their reach and limitations. 
We have explored the use of supervised neural networks for the purpose of gene functional annotation using gene expression data reported in [3]. We have studied the ability of our machine-learning scheme to systematically learn one hundred functional classes catalogued in the MIPS database [4], and found that fewer than 10% of these classes are learned, based on a score of low rate of false negatives. We then turned our attention to the question of why the remaining 90% are poorly learned. The answer lies in the fact that there are genes that belong to more than one biological pathway, thus confounding the signatures that ought to be learned for a unique class. We studied in detail how the fact that different functional classes have non-null intersection influences the learning ability of our scheme. Finally, an iterative scheme is proposed that recruits the false positives of iteration i as true positives in iteration i+1. The iteration starts with all the genes assigned to a given class and proceeds until the rate of false positives reaches a low pre-assigned threshold. We show that this process converges in a few steps to a class that can be learned with considerably lower rates of false positives and false negatives. Furthermore, the new set of genes thus created contains genes whose functional classes are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle.
[1] D'haeseleer P, Liang S, Somogyi R. "Genetic network inference: from co-expression clustering to reverse engineering." Bioinformatics. 2000 Aug;16(8):707-26.
[2] Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M Jr, Haussler D. "Knowledge-based analysis of microarray gene expression data by using support vector machines." Proc Natl Acad Sci USA. 2000; 97:262-7.
[3] Eisen MB, Spellman PT, Brown PO, Botstein D. "Cluster analysis and display of genome-wide expression patterns." Proc Natl Acad Sci USA. 1998; 95:14863-14868.
[4] Munich Information Center for Protein Sequences. URL: http://mips.gsf.de/proj/yeast/catalogues/funcat
11. Felix Naef (Laboratory of Mathematical Physics, The Rockefeller University), Daniel Lim (Laboratory of Neurogenesis, The Rockefeller University), Nila Patil (Perlegen Sciences Inc./Affymetrix Inc.), Marcelo Magnasco (Laboratory of Mathematical Physics, The Rockefeller University)
Talk Title: From features to expression: High-density oligonucleotide array analysis revisited
Abstract: High-density oligonucleotide GeneChip arrays are among the most popular tools for large scale gene expression studies. These currently have 16-20 small probes ("features") for evaluating the transcript abundance of each gene. In addition, each probe is accompanied by a mismatched probe (MM) designed as a control for non-specificity. An algorithm is presented to compute comparative expression levels from the intensities of the individual features, based on a statistical study of their distribution. Interestingly, MM probes need not be included in the analysis. We show that our algorithm improves significantly upon the current standard and leads to a substantially larger number of genes brought above the noise floor for further analysis.
12. Computational Resequencing by Universal Microarrays
Itsik Pe'er, School of Computer Science, Tel Aviv University, izik@post.tau.ac.il
Abstract: We have developed a new computational method that combines hybridization data from a universal chip with prior approximate knowledge of the target DNA sequence and determines the exact variations in the target sequence. In contrast to many other SNP-genotyping techniques, the sequence variations detected are not restricted to previously known polymorphic loci. This task is most prominent in the analysis of somatic genetic variants, where mutations are observed at arbitrary sites. "DNA chips" allow probing a DNA sequence for the presence of all possible short oligonucleotides of a certain length. Universal DNA microarrays with all possible 8-mers, generated by spotting or photolithography, were constructed in the past and used to sequence short molecules de novo. Our method allows using such microarrays, or any parallel probing technology, for resequencing. Our computational approach is to model probabilistically both the prior knowledge about the sequence and the information produced by the hybridization reaction. To capture prior sequence knowledge we use Hidden Markov Models. The hybridization results are described by a graph-theoretic construct, the de Bruijn graph, extended to accommodate noise probabilities. We have developed an algorithm for reconstructing the target sequence given the two data sources. Our algorithm is capable of handling insertions and deletions in addition to substitutions, as well as fragmented targets (e.g., different exons). Preliminary simulations with real DNA sequences are very promising: with 8-mer hybridization data, our method can accurately determine virtually all SNPs in a target of length 2kb, even when the hybridization reaction has false positive and false negative rates of 5%. The average reconstruction error rate is 1 base in 30,000. Naturally, there is a performance tradeoff between target length on the one hand and tolerated polymorphism rate and reaction inaccuracy on the other.
Joint work with Naama Arbili and Ron Shamir.
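
To illustrate the data a universal 8-mer chip provides, the sketch below computes the set of 8-mers (the "spectrum") of a sequence, the corresponding de Bruijn graph edges, and the probes whose signal is expected to change after a single substitution. It covers only this noise-free bookkeeping; the Hidden Markov Model prior, the noise probabilities, and the reconstruction algorithm are not modeled. The sequences and function names are hypothetical.

```python
def spectrum(sequence, k=8):
    """Set of all length-k substrings (k-mers) of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def de_bruijn_edges(kmers):
    """Edges of the de Bruijn graph: each k-mer links its (k-1)-prefix
    to its (k-1)-suffix."""
    return {(kmer[:-1], kmer[1:]) for kmer in kmers}

# Hypothetical reference, and a target with one substitution at index 12 (G -> T).
reference = "ACGTTGCAAGCTGATCCGTAGGCTA"
target = reference[:12] + "T" + reference[13:]

ref_kmers = spectrum(reference)
target_kmers = spectrum(target)
lost = ref_kmers - target_kmers      # probes expected to lose signal
gained = target_kmers - ref_kmers    # probes expected to gain signal
edges = de_bruijn_edges(target_kmers)
```

In this example a single substitution away from the sequence ends changes eight 8-mers in each direction, which gives some intuition for why one SNP leaves a redundant footprint in the hybridization pattern.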
13. Geometrical analysis of gene expression dynamics
Scott A. Rifkin, Department of Ecology and Evolutionary Biology, PO Box 208106; Junhyong Kim, Department of Molecular, Cellular, and Developmental Biology, PO Box 208103, Department of Statistics, PO Box 208290; Yale University, New Haven, CT 06520
Corresponding Author: Junhyong Kim, Department of Ecology and Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT 06520; junhyong.kim@yale.edu
Abstract: During physiological and developmental processes of an organism, the molecular state of any given cell undergoes a cascade of changes in coordination with other cells and the environment. These molecular interactions have an inherent temporal structure: molecular interactions obey dynamical rules. Recent advances such as microarray technology (1) have the potential to characterize molecular interaction dynamics at the whole genome level. However, compared to typical time-series data, microarray data is characterized by a relatively high degree of noise, an extremely large number of variables, and a small number of measurements, requiring non-traditional approaches. Parametric approaches such as Fourier analysis or wavelet analysis can be difficult to apply and may not reveal crucial aspects of the data, while an analysis based on the static structure of the data such as singular value decomposition (SVD) can be inadequate for revealing dynamical features. In this paper, we introduce two techniques to aid dynamical analysis of gene expression data: dynamical structure visualization and non-parametric geometrical analysis of periodic dynamics. We apply our analyses to the well-analyzed Saccharomyces cerevisiae cell cycle data as an example and demonstrate the strength of our method using numerical simulations. We used Mathematica 4.0 (2) to analyze the three yeast datasets (alpha-factor, cdc15, and cdc28) with adjustments for missing data as described in Rifkin et al. (3), providing us with data for 5541 genes (see (4) and (5) for details of the experiments). The cdc15 time-series has samples every 10 minutes from 10 to 290, except for 5 timepoints (these we estimated by linear interpolation). The different datasets arise from three different ways in which the yeast cell cycle was arrested and synchronized prior to measurement of gene expression over time. Dynamically, they represent three different initial conditions for the system. Given the periodic nature of the cell cycle, it is evident that the activity of some subset of the genes or linear combinations of gene expression levels will show simple periodic dynamics with the frequency of the cell cycle. If the dominant expression dynamics of these genes is governed by a low-dimensional periodic component, it will lie in some unknown subspace of the high-dimensional state space of all the genes (~6200 genes for yeast). We investigated the geometry of these dynamics with two new tools for (a) locating and visualizing the possible periodic dynamics in the state space and (b) projecting it onto subspaces to estimate dominant periodic dynamics and phase relation of the genes. We also demonstrate how previous results (3) arise from the geometry of cell cycle dynamics.
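
The linear interpolation used to fill missing time points can be sketched as below. The abstract does not say which five time points were missing, so the missing positions and the expression profile in this example are hypothetical, and the function assumes that the first and last time points were measured.

```python
def interpolate_missing(times, values):
    """Fill None entries by linear interpolation between the nearest measured
    neighbors. Assumes the first and last time points are measured."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            fraction = (times[i] - times[left]) / (times[right] - times[left])
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled

times = list(range(10, 300, 10))   # 10, 20, ..., 290 minutes
# Hypothetical profile for one gene; None marks a missing time point.
values = [t / 10 for t in times]
for missing_index in (4, 9, 10):
    values[missing_index] = None
filled = interpolate_missing(times, values)
```

The same function would be applied to each gene's profile separately before any analysis that needs evenly spaced samples.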
14. Mat Soukup and Jae K. Lee, University of Virginia

Talk title: Identifying Multiple-Factor Genes and Evaluating Classification Probability for Distinct Biological Groups Based on Gene Expression Data: Stepwise Cross-Validated Discriminant Analysis

Abstract: Thanks to recent advances in gene chip technology, various genome-wide gene expression studies have been performed to discover important gene factors that enable us to discern between two or more biological conditions, disease subtypes, or critical time points in a biological pathway. For instance, performing a gene expression study for two subtypes of leukemia patients, Golub et al. (1999) tried to predict and distinguish the two subtypes of patients that show quite different prognoses and require fundamentally different medical treatment courses. They proposed to use a so-called gene-voting method, which aggregates the prediction power by gradually adding multiple gene factors. This approach, however, can neither identify individual gene factors that are most critical in predicting the two subtypes nor provide a prediction probability of their tumor classification.

We propose a new method for simultaneously identifying important gene factors and evaluating their predictive power for two groups of gene expression data using a stepwise cross-validated discriminant analysis approach (SCVD). Applying leave-one-out cross-validation and quadratic discriminant analysis in a stepwise fashion, we identify all multiple gene factors that play an important role in discriminating two biological groups by their gene expression patterns. Our SCVD approach is as follows. At each stage, the additional gene that significantly reduces the misclassification error and provides the lowest misclassification rate is retained. In the case of a tie, the gene (or model) with the highest predictive power (the highest posterior classification probability) is chosen. Each gene in the current model is then validated in turn by a backward evaluation of classification power for possible elimination. The above process continues until the misclassification rate is no longer lowered. We iterate our search, dropping the genes found in the previous model, until we can find no more gene models within a preset threshold of misclassification error.

We applied our method to Golub's leukemia data, for which we converged upon a model containing only two genes, Zyxin and Azurocidin, which correctly classified 37 of 38 patients in the training set and 31 of 34 in the independent validation set. Thus, our approach provided misclassification error rates equivalent to or better than those of the gene voting models, which typically contained 50 to 100 gene predictors. Evaluating the posterior probability of each patient's tumor classification, we could also more carefully assess each patient's subtype of leukemia. Our method can be applied to more than two groups of differentially expressed array data sets for discrimination and prediction. An extensive simulation study is in progress.
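
The forward part of such a stepwise, leave-one-out search can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the SCVD implementation: it uses a quadratic discriminant with a diagonal covariance per class, accepts any strict reduction in leave-one-out errors instead of a significance test, breaks ties by gene index instead of posterior probability, and omits the backward elimination step. All function names and the toy data are hypothetical.

```python
import math


def gaussian_log_density(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def fit_class(rows, genes):
    """Per-gene mean and variance for one class (diagonal covariance)."""
    stats = []
    for g in genes:
        vals = [r[g] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / max(len(vals) - 1, 1)
        stats.append((mean, max(var, 1e-6)))  # floor avoids zero variance
    return stats


def predict(sample, genes, train_x, train_y):
    """Label with the highest log prior plus Gaussian log likelihood."""
    best_label, best_score = None, -math.inf
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        stats = fit_class(rows, genes)
        score = math.log(len(rows) / len(train_x))
        score += sum(gaussian_log_density(sample[g], m, v)
                     for g, (m, v) in zip(genes, stats))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loo_errors(genes, x, y):
    """Leave-one-out misclassification count for a given gene set."""
    errors = 0
    for i in range(len(x)):
        train_x, train_y = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        if predict(x[i], genes, train_x, train_y) != y[i]:
            errors += 1
    return errors


def forward_select(x, y):
    """Add the gene that most lowers the leave-one-out error; stop when
    no remaining gene lowers it further."""
    selected, best = [], len(x) + 1
    while True:
        candidates = [(loo_errors(selected + [g], x, y), g)
                      for g in range(len(x[0])) if g not in selected]
        if not candidates:
            break
        err, gene = min(candidates)
        if err >= best:
            break
        selected.append(gene)
        best = err
    return selected, best


# Toy data: 6 samples x 3 genes; only gene 1 separates the two groups.
x = [[5.0, 1.0, 3.0], [4.0, 1.2, 7.0], [6.0, 0.9, 5.0],
     [5.5, 3.0, 4.0], [4.5, 3.2, 6.0], [5.2, 2.9, 5.5]]
y = ["A", "A", "A", "B", "B", "B"]
print(forward_select(x, y))  # → ([1], 0)
```

On the toy data the search stops after one gene because no second gene can lower a leave-one-out error count that is already zero.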
15. Gary Stormo, Washington University Medical School

Using Expression Data to Learn About Regulatory Networks Via Promoter Analysis

One of the uses of expression data is to identify sets of coregulated genes. From those sets one can try to discover the regulatory sites that are involved in the regulatory processes, which can then lead to identifying the proteins involved and help in understanding the complete regulatory network. Methods for discovering the sites have been around for over 15 years, and current methods are reasonably good. A brief overview of those methods will be presented. But in higher eukaryotes it is rarely the case that genes are regulated by single transcription factors; rather, they tend to be controlled by combinations of factors working in concert. Examples will be given of such combinatorial regulation that we've uncovered using some of the methods we and others have developed. In addition, improved methods with higher sensitivity for identifying combinatorial factors will be described. Finally, recent work on identifying regulatory sites within RNA sequences, which may be composed of both structure and sequence constraints, will be described. Such post-transcriptional regulatory mechanisms can influence the results of expression analysis by altering the half-lives of mRNAs, and can also influence protein levels without changing the mRNA abundances, and may therefore explain some of the discrepancies observed between mRNA and protein levels.
16. Emerging Technologies for Gene Expression Analysis

Peter Tolias, Ph.D., Director, Center for Applied Genomics, Public Health Research Institute, Newark NJ. Associate Professor, Dept. of Microbiology & Molecular Genetics, UMDNJ-New Jersey Medical School, Newark NJ.

In recent years, we have seen revolutionary advances in technology related to gene expression studies. Individual measurements of steady state levels of mRNA have been replaced by multiplexing strategies. Technologies permitting genomic-scale analysis of gene expression in a single experiment, such as DNA microarrays and GeneChips, are now widely used by researchers in both industry and academia. Finally, there are several emerging gene expression technologies in various stages of development. This tutorial will review the major platforms that are currently available for gene expression analysis and provide a glimpse of the emerging technology of the future.
17. A Brief Review of Methods used in the Analysis of Gene Expression Data

Yuhai Tu, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

Abstract

Several microarray technologies that monitor the levels of expression of a large number of genes have recently emerged and promise to revolutionize genetics research in the post-genomic era. Some of the practical applications of microarray technology include gene function annotation, disease characterization, pharmacogenomics, and the study of gene regulatory networks. Due to the huge amount of data being produced and the noisy nature of the data, sensible analysis methods become crucial in deciphering the true signal in the massive gene expression data. In this talk, we will give a general review of the existing analysis methods that have been used in different problems. The analysis methods are divided into two categories: unsupervised and supervised. For methods in the first category, which are mostly used in class discovery, we will discuss various clustering algorithms, including hierarchical clustering, K-means, self-organizing maps (SOM), and superparamagnetic clustering. For the methods in the second category, which are mostly used in classification, we will discuss individual gene based methods, support vector machines (SVM), and pattern discovery based methods. There is no universally good analysis method that can be used in dealing with all gene expression data; the methods of choice will depend on the biological system being studied and the questions being asked. Therefore it is crucial for biologists and bioinformaticians to understand the pros and cons of the existing methods. In this review, we will evaluate the existing methods critically, emphasizing the need for statistical analysis and consideration of noise.
18. Honghui Wan, NCGR/NIH

Talk title: Gene Expression Analysis System from Multiple Biological Information Resources

Abstract: Microarrays have become the most effective, broadly used tools in the genomics revolution. The development of microarray technology has advanced the ability to perform genome-wide analysis by simultaneously monitoring gene expression and identifying genes related to complex diseases and multi-cellular responses. The impact on human health can be studied using microarrays to determine the effects on the expression pattern of genes. The ability to integrate access to such a wide variety of public biological resources with that of comparative gene expression data is of great value to critically organize, archive, analyze, and visualize intrinsic gene expression profiles. Gene expression data is useless unless biologically meaningful information can be extracted and presented in some readily understandable fashion. The production of this meaningful information, involving many facets of statistical analyses associated with multiple resources, is only possible with computers running sophisticated software. The advancements in information technology provide the ability to design data management and analysis systems that not only warehouse information, but facilitate relational integration and interpretation of large-scale microarray gene expression data with outputs from multiple heterogeneous, synthesized, and distributed biological resources.

We develop an integrated and comprehensive gene expression data management and analysis system from heterogeneous, synthesized, and distributed biological databases and resources, such as:

* Gene annotation information,
* Clinical data about different cell lines,
* Motif information,
* Protein localization information,
* Protein classification information,
* Biological pathway information,
* Experimental data relating how a microarray experiment was carried out,
* Related textual biological data stored in databases such as MEDLINE,
* Phylogenetic profiles that are derived from a comparison between a given gene and a collection of complete genomes.

This system can be applied in a creative fashion to discover knowledge and understanding of genes associated with complex diseases.
19. Amir Ben-Dor, Agilent Laboratories

Talk title: Overabundance Analysis with Applications in Cancer Sub-classification

Abstract: Recent studies (e.g., Alizadeh et al., Nature 2000; Bittner et al., Nature 2000; Golub et al., Science 1999) on molecular level classification of cancer cells produced results that strongly indicate the potential of gene expression assays as diagnostic and segmentation tools and as a basis for the discovery of putative disease subtypes. We will describe methods that enable data analysis in various stages of such studies.

Classified gene expression data consists of tissue samples (for which expression profiles are measured) that are labeled as belonging to certain classes (such as tumor or normal, particular kinds of tumors, phase, differentiation stage, etc.). Some of the genes measured play a major role in the processes that underlie the differences between the classes or are dramatically affected by the differences. Such genes are highly relevant to the studied phenomenon. On the other hand, the expression levels of many other genes are irrelevant to the distinction between the tissue types under consideration. We will examine ways of measuring the relevance of a gene, or a set of genes, to the studied phenomenon. We will discuss some corresponding statistical benchmarking techniques and see how these can be applied to the more complicated challenge of class discovery. This term refers to the process of trying to identify statistically significant subclasses of tissues in gene expression data, in an unsupervised manner.

Specifically, we will consider the null model where each sample is labeled as '+' or '-', depending on class membership. Some genes have dramatic '+' to '-' expression level differences. Under a null model where a vector of labels of the appropriate composition is uniformly drawn, we can assign p-values to all '+' to '-' expression level differences. For actual biological classes we typically observe an overabundance of differentially expressed genes (compared to the null model). Efficient methods for calculating exact score distributions under the above null model therefore allow for a novel approach to class discovery. For candidate partitions of the sample set we compute the abundance of differentially expressed genes and assign a statistical significance to this observed abundance. Search heuristics (simulated annealing, genetic algorithms) find the highest scoring partitions. Thus, grouping is based on subsets of the genes rather than on the entire set. The calculations are accurate and efficient, in contrast to sampling based methods.

We will discuss statistical and algorithmic approaches. We will use actual gene expression data to demonstrate the relevance scoring process and the discovery process. This is joint work with Amir Ben-Dor and other collaborators.
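
The idea of an exact p-value under a uniformly drawn label vector of fixed composition can be illustrated with a small sketch. This Python example rests on several assumptions that the abstract does not state: the score used here is the smallest number of misclassifications achievable by a single expression threshold, the p-value is obtained by brute-force enumeration of all labelings (practical only for small sample counts, and not the efficient exact method the abstract refers to), and expression values are assumed to be distinct. The function names are hypothetical.

```python
from itertools import combinations


def threshold_errors(values, labels):
    """Fewest misclassifications made by any single threshold on the
    expression value; labels are 1 for '+' and 0 for '-'.
    Assumes distinct expression values (ties are not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_labels = [labels[i] for i in order]
    n, n_pos = len(labels), sum(labels)
    n_neg = n - n_pos
    best = min(n_pos, n_neg)  # threshold placed below every sample
    pos_left = 0
    for k, lab in enumerate(sorted_labels, start=1):
        pos_left += lab
        neg_left = k - pos_left
        # Rule "left side is '-', right side is '+'".
        err_low_neg = pos_left + (n_neg - neg_left)
        # Rule "left side is '+', right side is '-'".
        err_low_pos = neg_left + (n_pos - pos_left)
        best = min(best, err_low_neg, err_low_pos)
    return best


def exact_p_value(values, labels):
    """Fraction of labelings with the same number of '+' samples whose
    score is at least as good (as low) as the observed one."""
    n, n_pos = len(values), sum(labels)
    observed = threshold_errors(values, labels)
    hits = total = 0
    for pos_idx in combinations(range(n), n_pos):
        chosen = set(pos_idx)
        relabeled = [1 if i in chosen else 0 for i in range(n)]
        total += 1
        if threshold_errors(values, relabeled) <= observed:
            hits += 1
    return hits / total


# Toy gene: perfectly separates three '-' samples from three '+' samples.
values = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
labels = [0, 0, 0, 1, 1, 1]
print(exact_p_value(values, labels))  # → 0.1 (2 of the 20 labelings)
```

Counting how many genes fall below a given p-value, and comparing that count with what the null model predicts, gives the kind of overabundance measure the abstract describes.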

Document last modified on October 17, 2001.