DIMACS Workshop on Analysis of Gene Expression Data
October 24 - 26, 2001
DIMACS Center, Rutgers University, Piscataway, NJ
- Organizers:
- Laurie Heyer, Davidson College, laheyer@davidson.edu
- Gustavo Stolovitzky, IBM, gustavo@us.ibm.com
- Shibu Yooseph, Celera Genomics, shibu.yooseph@celera.com
Presented under the auspices of the
Special Year on Computational Molecular Biology.
Abstracts:
1.
Selective Expression algorithm for class separation for DNA Microarrays
Virginie Aris, Peter Tolias, and Michael Recce, Center for Applied Genomics,
Public Health Research Institute, Center for Computational Biology and
Bioengineering, NJIT
DNA microarrays are powerful tools that enable the transcriptional
expression profiling of thousands of genes simultaneously. They have a
substantial impact on research by increasing the throughput of gene
expression analysis, gene function annotation, drug screening, and disease
classification. The analytical techniques vary depending on the nature of
the experimental design. In addition, data from microarray experiments are
both quantitative (expression level) and qualitative (the gene is expressed
or not). However, a major drawback of using quantitative data is its
limited accuracy and precision. Normalization (to enable comparisons
between chips) is performed by scaling the gene expression levels of one
chip to a control microarray, to a control target intensity, or to another
color standard (in the case of printed microarrays). When the scaling
factor departs from one, the magnitude of the correction may affect the
accuracy of the resulting data. For example, a scaling factor of two will
increase the intensity levels in the corrected chip two-fold.
With respect to performing class separation, we found that selectively
expressed genes contain information pertinent to the classification problem
(Aris and Recce, 2001). A selectively expressed gene is by definition
differentially expressed in two sets of samples. Our method was developed
using data derived from GeneChips, high-density oligonucleotide probe
arrays produced by Affymetrix Inc. Each gene is represented by 16 to 20
probe cells on a GeneChip containing 25mer oligonucleotides complementary
to the gene sequence (perfect match; PM), and a repeat of these probe cells
with a homomeric base mismatch (MM) at the central position. The presence
or absence of a gene is then evaluated by the Affymetrix software, with a
decision matrix based on metrics comparing the intensity of the PM to the
MM. We then convert absent and present calls into binary numbers, with one
corresponding to present and zero corresponding to absent. The selectivity
of each of the genes is computed as the absolute value of the difference
between the real-valued average of the binary values for each of the two
groups. Selective genes are considered significant if they are twice as
likely to be expressed in one group as in the other (absolute difference is
larger than 0.5). The genes are then ranked by their selectivity and the
most selective ones are used to construct exemplars. By shuffling the
data, we can distinguish whether the grouping was random or significant.
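The core computation is simple enough to state in a few lines. The sketch
below (an illustration written for this summary, not the authors'
implementation; the function names and the 45-shuffle default are
placeholders taken from the description above) computes selectivity from
binary present/absent calls, ranks the genes, and builds the shuffled
reference used to judge significance.

    import numpy as np

    def selectivity(calls, labels):
        """calls: (genes x samples) binary present/absent matrix (1 = present).
        labels: length-samples array of 0/1 group membership.
        Selectivity = |mean presence in group 1 - mean presence in group 0|."""
        calls = np.asarray(calls, dtype=float)
        labels = np.asarray(labels)
        return np.abs(calls[:, labels == 1].mean(axis=1)
                      - calls[:, labels == 0].mean(axis=1))

    def significant_selective_genes(calls, labels, cutoff=0.5):
        """Genes twice as likely to be expressed in one group as in the other
        (absolute difference > 0.5), ranked by decreasing selectivity."""
        s = selectivity(calls, labels)
        idx = np.where(s > cutoff)[0]
        return idx[np.argsort(-s[idx])]

    def shuffled_reference(calls, labels, n_shuffles=45, rng=None):
        """Mean and standard deviation of selectivity at each rank over
        shuffled label sets, used to decide whether a grouping is random."""
        if rng is None:
            rng = np.random.default_rng(0)
        ranked = [np.sort(selectivity(calls, rng.permutation(labels)))[::-1]
                  for _ in range(n_shuffles)]
        ranked = np.array(ranked)
        return ranked.mean(axis=0), ranked.std(axis=0)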
A simple form of normalization can be applied to this algorithm to improve
the separation. This normalization seeks to eliminate errors in the
assignment of absent calls, which may have occurred due to variation in the
processing of the samples. We found that a high background or low
hybridization results in chips with a lower number of genes found
present. Genes that are expressed at a low level are primarily affected as
they are close to background. This normalization strategy reclassifies
present calls with low expression as absent calls on chips with a
higher-than-average number of present genes.
Our algorithm was developed using data from Golub et al (1999). This
expression data was derived from bone marrow and peripheral blood samples
from patients suffering with acute lymphoblastic leukemia (ALL) and acute
myeloid leukemia (AML). The data was then split into 2 sets, a training
set and an independent set. The training set was used to develop the
classification method and the independent set to test it. Golub et al.
classified and predicted the leukemia classes by correlating the two
disease states with gene expression levels. With our method, we found 121
significant selective genes, some of which had also been found by Golub et
al. Some of these genes were relevant to the diseases involved (e.g.,
HOXA9, cyclin D3 and zyxin). The selectivity of these genes was 3 to 9
standard deviations above the mean of the selectivity of genes at the same
rank in 45 shuffled random sets. To classify the samples into the two
leukemia groups, we took the Euclidean distance of each gene's binary value
to the two exemplars. We then averaged this distance over the 30 to 100
most selective genes and determined the exemplar to which each sample was
closest. We were able to classify the samples from the training set
accurately, and only sample 66 was incorrectly classified in the
independent data set (note: this sample was systematically misclassified by
all the analysis techniques presented at the CAMDA '00 conference, and was
speculated to be a diagnostic error). Performing the same classification
after normalization increased the separation between the two clusters.
In summary, our selective approach is novel, complementary to quantitative
differential expression analysis, and is a well-suited tool to develop
diagnostic microarrays. Its power comes from its simplicity, and
robustness. The general usefulness of this method will be determined by
applying it to other datasets.
Aris, V.M., and Recce, M., 2001. A Method to Improve Detection of Disease
Using Selectively Expressed Genes in Microarray Data. CAMDA '00 Conference
Proceedings, in print.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H,
Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular
classification of cancer: class discovery and class prediction by gene
expression monitoring. Science 286(5439): 531-537, 1999.
2.
Microarray Image Analysis and Expression Ratio Statistics
Yidong Chen
Cancer Genetics Branch, National Human Genome Research Institute,
National Institutes of Health, Bethesda, Maryland
To quantitatively analyze gene expression levels, two fluorescently labeled
RNA samples are routinely hybridized to arrayed cDNA probes on a glass
slide. Ratios of gene expression levels arising from the two co-hybridized
samples are obtained through image segmentation and signal detection
methods. In our earlier report, the expression ratio was studied via ratio
statistics, and a ratio confidence interval was established so that ratio
outliers can be easily identified. Typically, we assume the fluorescent
background level does not interfere with the ratio measurement, which is
derived via background subtraction and weak target elimination. However,
many experimental results suggest that ratios derived from weak targets
possess larger variation than those from strong targets. In this study, we
propose a new signal and noise interaction model. In this noisy
environment, the ratio statistic is evaluated numerically and a
self-adjusting confidence interval is introduced. The new confidence
interval, which automatically adapts to different signal-to-noise ratios
(SNR), provides a better criterion for further interrogating weak
expression levels.
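As a toy illustration of the effect motivating the new confidence interval
(a simulation written for this summary, not the ratio-statistics model of
the abstract; the signal levels and noise standard deviation are
arbitrary), additive channel noise inflates the spread of ratios at low
SNR:

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 50.0                                 # additive noise per channel
    for signal in (200.0, 1000.0, 10000.0):      # weak to strong targets
        red = signal + rng.normal(0, sigma, 10000)
        green = signal + rng.normal(0, sigma, 10000)
        ok = (red > 0) & (green > 0)             # drop targets lost in background
        log_ratio = np.log2(red[ok] / green[ok])
        q1, q3 = np.percentile(log_ratio, [25, 75])
        print(f"SNR={signal / sigma:5.1f}  log2-ratio IQR={q3 - q1:.3f}")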
In addition, a new quality metric is proposed for each ratio that provides
a quality assessment (e.g., spot size, saturation, consistency, etc.)
for the expression ratio measurement. The quality factor effectively enables
us to process gene expression data, without introducing a fixed intensity
filter, in many analysis methods such as clustering, fingerprinting
analysis, and predictive analysis. A microarray image simulation program is
also proposed in order to understand the variance introduced by the image
analysis methods. Not only will this simulation program help separate
variation arising from different sources, but it will also provide a useful
tool to evaluate existing software and to design better microarray image
analysis software.
3.
A Nonlinear Channel Normalization Approach for cDNA Microarray Data and Its
Performance Analysis
Chao Deng, Aili Wang, Denong Wang, Peisen Zhang
Columbia Genome Center, Columbia University,
New York City, NY 10032, USA
In microarray experiments, a variety of systematic errors affect measured
gene expression levels. A normalization procedure is necessary to identify
and remove such sources of systematic error. Current normalization methods
are largely linear modifications of the absolute or log intensities of gene
expression, and they encounter difficulties when faced with complex sources
of systematic error. A nonlinear normalization method, the polynomial
channel correctness normalization (PCCN) approach, is proposed in this
paper to deal efficiently with the normalization problem of cDNA microarray
data. PCCN uses a polynomial of limited order to approximate the nonlinear
dependence between the two dye channels, so that the imbalance between the
channels can be largely compensated or corrected. Owing to its capacity for
nonlinear compensation, in extensive experiments the PCCN method achieved
better normalization performance on cDNA microarray data than existing
approaches.
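The general idea of the channel correction can be sketched as follows (a
sketch under the assumption that a low-order polynomial is fit between the
log intensities of the two channels; the published PCCN procedure may
differ in its fitting details):

    import numpy as np

    def polynomial_channel_normalize(red, green, order=3):
        """Fit a low-order polynomial mapping log2(green) -> log2(red) and use
        it to correct the green channel, so that normalized log-ratios are
        centered around zero for non-differentially expressed genes."""
        log_r, log_g = np.log2(np.asarray(red)), np.log2(np.asarray(green))
        coeffs = np.polyfit(log_g, log_r, order)     # nonlinear channel dependence
        corrected_green = np.polyval(coeffs, log_g)  # green mapped onto red's scale
        return log_r - corrected_green               # normalized log-ratios

    # Example with a synthetic, nonlinearly imbalanced dye pair.
    rng = np.random.default_rng(0)
    true = rng.uniform(6, 14, 2000)                  # true log2 expression
    red = 2.0 ** (true + rng.normal(0, 0.1, true.size))
    green = 2.0 ** (0.9 * true + 0.5 + rng.normal(0, 0.1, true.size))
    print("mean normalized log-ratio:",
          round(float(polynomial_channel_normalize(red, green).mean()), 3))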
4.
T. Gregory Dewey (Keck Graduate Institute of Applied Life Sciences,
Claremont Graduate University), David Galas (Keck Graduate Institute of
Applied Life Sciences, Claremont Graduate University), Ashish Bhan
(Department of Mathematics, Claremont Graduate University)
Talk title: A Network Analysis of Expression Time Series
Abstract:
A common approach to tackling complex phenomena is to first establish the
extent and limitations of the linear response domain of the system. While
there is no a priori reason to assume that the time series of a gene
expression profile can be well related via a simple linear model, it is a
reasonable starting point and can provide important clues to the origin of
nonlinearity in the system. Linear models, in themselves, can often lead to
surprisingly complicated responses, especially for multivariate systems.
This provides a strong motivation for an in-depth exploration of such
models. In this work, a dynamic, linear response model for analyzing time
series of whole genome expression is presented. The simplest assumption
about expression profile data is that the expression state represented in
the data from one time point determines the expression state seen at the
next time point. This assumption is equivalent to modeling the data by a
first-order Markov process. The linear version of this model is described by
a single transition matrix, L, defining the transitions from one state to
the next. The expression levels of each of m genes at two points in time
can be described by column vectors, a(t) and a(t-1), each of length m. The
transition between the two states is modeled by a(t) = L a(t-1), or in
component form a_i(t) = sum_j L_ij a_j(t-1), where a_i(t) is the expression
level of the ith gene at time t after some exposure or treatment. The
transition coefficients L_ij are the respective elements of the m x m
transition matrix L. The matrix elements represent the influence of the
expression level of the jth gene on that of the ith gene.
Using this model, we calculate a transition matrix for both cell-cycle and
diauxic shift data in yeast. These models are statistically robust and lead
naturally to a network of interactions reflected in the data. A graphical
network can be readily derived from these results by thresholding each
matrix element L_ij, replacing it with a 1 above a certain threshold and a
0 below it. This procedure results in a sparse matrix that gives a network
representation of the data. This approach provides a direct method of
classifying genes according to their place in the resulting network and
offers an alternative to traditional clustering
approaches. These network groupings compare favorably with previously used
methods like cluster analysis. The network derived by this method shows a
hierarchical structure that is dominated by a collection of central hubs.
These hubs are interconnected and have a cascade of tree-like structures
attached to them. The statistical properties of these resulting networks
were determined for a number of different time series data sets for yeast in
the public domain. These results consistently show networks that have
"small world" characteristics and show scale free distributions of
connectivities. A general class of network growth models has been derived
that shows behavior consistent with the experimental results. Non-linear and
higher order Markov behavior of the network can also be included by a
self-consistent method. Networks derived from these more sophisticated
models show similar behavior.
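A minimal sketch of the two steps described above, assuming the transition
matrix is fit by ordinary least squares (the authors' estimation procedure
and threshold choice may differ):

    import numpy as np

    def fit_transition_matrix(X):
        """X: (m genes x T timepoints). Least-squares fit of L in
        a(t) = L a(t-1), i.e. X[:, 1:] ~ L X[:, :-1]."""
        prev, curr = X[:, :-1], X[:, 1:]
        return curr @ np.linalg.pinv(prev)      # minimizes ||curr - L prev||_F

    def network_from_transitions(L, threshold):
        """Binary adjacency matrix: edge j -> i where |L_ij| exceeds the threshold."""
        return (np.abs(L) > threshold).astype(int)

    # Toy example with 5 "genes" and 20 timepoints.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(5, 20))
    A = network_from_transitions(fit_transition_matrix(X), threshold=0.3)
    print(A)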
5.
Vladimir Filkov and Sorin Istrail,
Celera Genomics
Talk title: Inferring Gene Transcription Networks: The Davidson Model
Abstract:
In 2001 Eric Davidson published the book "Genomic Regulatory Systems," where
he reports on his and his colleagues' 30 years of work on developmental gene
regulation of the purple sea urchin. Their work resulted in a general
experimental framework for the study of a gene's cis-regulatory region (an
upstream DNA sequence containing a series of consecutive binding sites).
Their approach consisted of performing systematic, almost exhaustive,
mutations to individual binding sites of a gene's cis-region, and observing
the corresponding transcription rates. They focused mostly on the endo16
gene. By quantitative analysis of the observed transcription rates, they
were able to infer a complete set of minimal functional units of regulation
and their interrelations. Hierarchically from those units, they uncovered
"modularity" and "hardwired information processing logic" of that
cis-region. Their extraordinary technology and the inference of the
underlying cis-region's "network" for endo16 resulted in the most completely
understood transcriptional system to date.
It is quite remarkable how combinatorial and robust their approach is. In
this paper we present an analysis and introduce a natural mathematical
formalism of the Davidson transcriptional network inference framework
together with combinatorial problems and algorithms related to it.
6.
Nanxiang Ge (Aventis Pharmaceuticals),
Fei Huang (Bristol-Myers Squibb),
Peter Shaw (Bristol-Myers Squibb),
C.F. Wu (University of Michigan, Ann Arbor)
Talk title: PIDEX: a Statistical Approach for Screening Differentially
Expressed Genes Using Microarray Analysis
Abstract:
Microarray technology is being applied in pharmaceutical drug discovery. A
typical experiment is conducted to compare the gene expression profiles
under two different conditions and the purpose is to find genes
differentially expressed under the conditions. Common practice is to use
fold change for detecting differential expression. However, use of fold
change can generate many false positive errors because of the existence of
genes with low or undetectable expression levels. A novel method to analyze
differentially expressed genes is presented that combines the fold change,
change in the absolute intensity measurements and data reproducibility. It
produces p-values for identifying differentially expressed genes (PIDEX).
The proposed methodology is demonstrated by analyzing the expression
profiling data from a public data set and an internally conducted experiment
comparing two cell lines (ES2 and WI38). Results from these analyses and a
validation study using quantitative RT-PCR assays suggest that PIDEX
outperforms the use of fold change alone.
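The flavor of such a combined screen can be sketched as follows (an
illustration only: the thresholds are arbitrary and a per-gene t-test
stands in for the actual PIDEX p-value construction):

    import numpy as np
    from scipy import stats

    def screen_genes(A, B, min_fold=2.0, min_intensity=100.0, alpha=0.01):
        """A, B: (genes x replicates) intensity matrices for two conditions.
        Combines fold change, change in absolute intensity, and replicate
        reproducibility (here, a per-gene two-sample t-test)."""
        mean_a, mean_b = A.mean(axis=1), B.mean(axis=1)
        lo = np.maximum(np.minimum(mean_a, mean_b), 1e-9)
        fold = np.maximum(mean_a, mean_b) / lo
        intensity_change = np.abs(mean_a - mean_b)
        _, pvals = stats.ttest_ind(A, B, axis=1)
        return np.where((fold >= min_fold)
                        & (intensity_change >= min_intensity)
                        & (pvals <= alpha))[0]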
7.
Lessons from the Analysis of Gene Expression Data:
A comparison of methods for inferring gene networks and different data sets
Thomas Heiman
George Mason University
A great deal of enthusiasm has been generated about microarray expression
data analysis over the last couple of years. After an initial flood of
methods developed to cluster the gene expression data based upon similarity
of mRNA expression level profile, a number of techniques have been
developed to reconstruct networks of biomolecular interactions, or gene
networks, in order to create integrated and systematic models of the
biological systems under study. A variety of approaches have been
developed.
However, it is important to clearly define and validate the results of
different analytical tools given the variance in the methods and amount of
data currently available. The goal of this paper is to shed some light on
this issue and to serve as a baseline by comparing the different approaches
available, both discrete and continuous, for extracting the Saccharomyces
cerevisiae GAL gene network from four different sets of publicly available
mRNA expression data.
8.
MDL Gene Subset Selection for Classification
Rebecka Jornsten, Department of Statistics, UC Berkeley
rebecka@stat.berkeley.edu
The scientific value of microarray data lies in the biological information
it reveals. Firstly, we are interested in finding groups of
genes that function in a similar fashion under various experimental
conditions. The conditions can correspond to tissue types, cell lines,
time, or pathological contexts (e.g. type of cancer). For each gene, the
vector of gene expression levels under different experimental conditions is
also called the gene profile. Secondly, we would like to classify or group
the conditions using the set or a subset of gene profiles. Thirdly, we aim
to determine a functional relationship between gene expression profiles and
experimental conditions, i.e. possible interactions and outcome models
based on selected gene profiles. These three are tasks of statistical
inference. Here we discuss the development of statistical methodologies
using the Minimum Description Length (MDL) principle to deal mainly with
the second and first tasks. When properly formulated, these tasks all fall
within the sub-field of model selection in statistics.
MDL is a general statistical modeling principle based on the data
compression philosophy, that a good statistical model should compress the
data well. It formalizes Occam's Razor and explicitly states that one
should choose the model that gives the shortest description of the data
(Rissanen, 1978). The key question in MDL research is which description
length to use for a particular model class, which is addressed by optimal
universal coding theorems. For regular parametric families, two-stage,
mixture, predictive, and NML (normalized maximum likelihood) codes have
been shown to achieve universal optimality. MDL has had its major
impact in statistical inference for model selection problems. With high
dimensional gene profile data, an appropriate account of the complexity of
a model is crucial for gene subset selection. The coding framework that MDL
relies on provides a natural way to account for the model complexity.
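As a concrete, deliberately simplified example of such a description
length, a two-stage code for a gene subset under a diagonal Gaussian class
model charges the negative log-likelihood of the data plus (k/2) log n bits
for its k real-valued parameters; sharper universal codes (mixture, NML)
refine this penalty. The following sketch is an illustration only, not the
methodology of the talk:

    import numpy as np

    def two_stage_description_length(X, y, gene_subset):
        """Simplified two-stage MDL score for a gene subset (Rissanen, 1978):
        -log-likelihood under a per-class diagonal Gaussian model on the
        selected genes, plus (k/2) log n for the k estimated parameters."""
        X, y = np.asarray(X), np.asarray(y)
        Xs = X[gene_subset, :]                   # (selected genes x n samples)
        n = Xs.shape[1]
        neg_loglik, n_params = 0.0, 0
        for label in np.unique(y):
            Z = Xs[:, y == label]
            mu = Z.mean(axis=1, keepdims=True)
            var = Z.var(axis=1, keepdims=True) + 1e-9
            neg_loglik += 0.5 * np.sum(np.log(2 * np.pi * var) + (Z - mu) ** 2 / var)
            n_params += 2 * Xs.shape[0]          # mean and variance per gene, per class
        return neg_loglik + 0.5 * n_params * np.log(n)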
Rissanen, J. (1978), Modeling by shortest data description, Automatica,
14, 465-471
9.
Richard Karp,
University of California, Berkeley
Talk Title: Combinatorial and Information-Theoretic Approaches to Mining
Gene Expression Data
Abstract:
Gene expression data from a set of microarray experiments is typically
presented as a matrix in which the rows correspond to genes, the columns to
experiments, and each entry to the expression level of a given gene in a
given experiment. Clustering methods are often used to find sets of genes,
or sets of experiments, with similar patterns of expression. One can then
explore the biological reasons for such similarities. In the case of genes,
the similarity may arise because the genes are regulated by the same
transcription factors or environmental conditions. In the case of
experiments on different clinical samples, similarity of expression may occur
because the samples are taken from tissues in similar disease states.
Related to clustering is the problem of supervised classification, in which
each experiment corresponds to a clinical sample, and a clustering of the
samples according to phenotype is given. Here the challenge is to devise a
rule that distinguishes the different clusters on the basis of their gene
expression patterns, and can be used to classify further clinical samples.
The problem of feature selection is of central importance, both in
clustering and in supervised classification. This problem arises because of
the large number of genes that can be measured in a single microarray
experiment. The expression level of each gene can be used as a feature
influencing the clustering or classification of clinical samples, but
usually a clustering or classification rule is biologically plausible,
comprehensible and robust only if it is based on a relatively small number
of informative features. I shall describe joint work with Eric Xing and
Michael Jordan (1,2) which uses information-theoretic principles as a guide
to feature selection. The resulting algorithms have been successful in
classifying leukemia samples, both in the setting of clustering and in the
setting of supervised classification.
Another approach to finding structure in a matrix of gene expression data is
to look for two-dimensional patterns. Such a pattern is specified by
selecting both a set of genes and a set of experiments,
such that the expression levels of the selected genes within the selected
experiments exhibit some regularity or uniformity. For example, one
might look for patterns in which every selected gene has a high level of
expression in every selected experiment. I will report on joint work in
progress with Amir Ben-Dor, Benny Chor and Zohar Yakhini directed
towards finding patterns in which the expression levels of the selected
genes are similarly ordered within the selected set of experiments.
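The check at the heart of such a pattern is easy to state, even though
finding good gene and experiment subsets is the hard combinatorial part. A
small sketch (written for this summary; not the algorithm of the work in
progress):

    import numpy as np

    def similarly_ordered(X, genes, experiments):
        """True if every selected gene ranks the selected experiments in the
        same order. X: (genes x experiments) expression matrix."""
        sub = X[np.ix_(genes, experiments)]
        orders = np.argsort(sub, axis=1)     # per-gene ordering of experiments
        return bool(np.all(orders == orders[0]))

    # Three genes whose expression rises across the same four experiments.
    X = np.array([[1.0, 2.0, 3.0, 4.0],
                  [0.2, 0.5, 0.9, 1.3],
                  [5.0, 6.0, 8.0, 9.0]])
    print(similarly_ordered(X, genes=[0, 1, 2], experiments=[0, 1, 2, 3]))  # True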
References
[1] Eric P. Xing and Richard M. Karp, "CLIFF: Clustering of
High-Dimensional Microarray Data via Iterative Feature Filtering Using
Normalized Cuts," ISMB Conference (2001).
[2] Eric P. Xing, Michael I. Jordan and Richard M. Karp, "Feature Selection
for High-Dimensional Genomic Microarray Data," Machine Learning Conference
(2001).
10.
Alvaro Mateos (Bioinformatics Unit, Centro Nacional de Investigaciones
Oncologicas (CNIO), Ctra. Majadahonda-Pozuelo, km. 2, Majadahonda,
28200, Madrid), Joaquin Dopazo (Bioinformatics Unit, Centro Nacional de
Investigaciones Oncologicas (CNIO), Ctra. Majadahonda-Pozuelo, km. 2,
Majadahonda, 28200, Madrid), Yuhai Tu (IBM Computational Biology Center,
T.J. Watson Research Center), Ronald Jansen (Department of Molecular
Biophysics and Biochemistry, Yale University), Mark Gerstein (Department of
Molecular Biophysics and Biochemistry, Yale University), Gustavo Stolovitzky
(IBM Computational Biology Center, T.J. Watson Research Center)
Talk title: What can be "learned" from Gene Expression Arrays?
Abstract:
Recent advances in microarray technologies have allowed for the study of
gene expression from a genomic perspective. One important application of
this technology is functional annotation. When cells are treated under
different conditions, genes will change their pattern of expression
according to their cellular role, and this data can be used to assess their
biological function. There are a number of possible data analysis techniques
to deal with this type of data (see [1] for a review of methods). Typically,
unsupervised clustering of patterns of expression only provides information
on genes that co-express. However, genes belonging to the same functional
class may display more complex behaviours, undetectable by these techniques.
This difficulty can be overcome with supervised learning algorithms. These
algorithms use the prior knowledge of gene function to extract out of the
expression patterns the signature(s) corresponding to the different
functional classes. Support vector machines (SVM) and other machine learning
methods have recently been applied for this purpose [2]. We will review this
previous work and elaborate on its reach and limitations.
We have explored the use of supervised neural networks for the purpose of
gene functional annotation using gene expression data reported in [3]. We
have studied the ability of our machine-learning scheme to systematically
learn one hundred functional classes catalogued in the MIPS database [4],
and found that less than 10% of these classes are learned, based on a score
of low rate of false negatives. We then turned our attention to the question
of why the remaining 90% are poorly learned. The answer lies in the fact
that there are genes that belong to more than one biological pathway, thus
confounding the signatures that ought to be learned for a unique class. We
studied in detail how the fact that different functional classes have
non-null intersections influences the learning ability of our scheme. Finally,
an iterative scheme is proposed that recruits the false positives of
iteration i as true positives in iteration i+1. The iteration starts with
all the genes assigned to a given class and proceeds until the rate of false
positives reaches a low pre-assigned threshold. We show that this process
converges in a few steps to a class that can be learned with considerably
lower rates of false positives and false negatives. Furthermore, the new set
of genes thus created contains genes whose functional classes are
biologically related to the original class, allowing for a coarse
reconstruction of the interactions between associated biological pathways.
We exemplify this methodology using the well-studied tricarboxylic acid
cycle.
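The iterative recruiting loop can be sketched generically as follows (a
sketch around an off-the-shelf classifier; the authors use supervised
neural networks, and the stopping threshold here is a placeholder):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def iterative_class_expansion(X, initial_members, max_fp_rate=0.05, max_iter=10):
        """X: (genes x features) expression profiles; initial_members: indices
        of genes initially assigned to the functional class. Repeatedly
        recruits the classifier's false positives into the positive set until
        the false-positive rate drops below max_fp_rate."""
        members = set(initial_members)
        for _ in range(max_iter):
            y = np.array([1 if g in members else 0 for g in range(X.shape[0])])
            clf = LogisticRegression(max_iter=1000).fit(X, y)
            pred = clf.predict(X)
            false_pos = np.where((pred == 1) & (y == 0))[0]
            fp_rate = len(false_pos) / max(1, int((y == 0).sum()))
            if fp_rate <= max_fp_rate:
                break
            members.update(false_pos.tolist())   # recruit false positives
        return sorted(members)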
[1] D'haeseleer P, Liang S, Somogyi R. "Genetic network inference: from
co-expression clustering to reverse engineering." Bioinformatics. 2000
Aug;16(8):707-26.
[2] Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M
Jr, Haussler D. "Knowledge-based analysis of microarray gene expression data
by using support vector machines." Proc Natl Acad Sci USA. 2000; 97:262-7.
[3] Eisen M.B., Spellman P.T., Brown P.O. & Botstein D. "Cluster analysis
and display of genome-wide expression patterns." Proc Natl Acad Sci USA 1998;
95:14863-14868.
[4] Munich Information Center for Protein Sequences. URL:
http://mips.gsf.de/proj/yeast/catalogues/funcat
11.
Felix Naef (Laboratory of Mathematical Physics, The Rockefeller
University), Daniel Lim (Laboratory of Neurogenesis, The Rockefeller
University), Nila Patil (Perlegen Sciences Inc./AffyMetrix Inc.),
Marcelo Magnasco (Laboratory of Mathematical Physics, The Rockefeller
University)
Talk Title: From features to expression: High-density oligonucleotide
array analysis revisited
Abstract:
One of the most popular tools for large-scale gene expression studies is
the high-density oligonucleotide GeneChip array. These arrays currently have
16-20 small probes ("features") for evaluating the transcript abundance of each
gene. In addition, each probe is accompanied by a mismatched probe (MM)
designed as a control for non-specificity. An algorithm is presented to
compute comparative expression levels from the intensities of the individual
features, based on a statistical study of their distribution. Interestingly,
MM probes need not be included in the analysis. We show that our algorithm
improves significantly upon the current standard and leads to a substantially
larger number of genes brought above the noise floor for further analysis.
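As a point of reference for what a feature-level, PM-only summary can look
like, the sketch below uses a trimmed mean of log intensities (a
placeholder summary written for this abstract booklet; the algorithm of the
talk is based on a statistical study of the feature distributions and
differs in detail):

    import numpy as np

    def pm_only_expression(pm, trim=0.25):
        """pm: perfect-match feature intensities for one gene on one chip.
        Robust log-scale summary: trimmed mean of log2(PM), ignoring MM probes."""
        log_pm = np.sort(np.log2(np.asarray(pm, dtype=float)))
        k = int(len(log_pm) * trim)
        core = log_pm[k:len(log_pm) - k] if len(log_pm) > 2 * k else log_pm
        return float(core.mean())

    print(pm_only_expression([120, 150, 90, 2000, 130, 140, 110, 160]))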
12.
Computational Resequencing by Universal Microarrays
Itsik Pe'er
School of Computer Science, Tel Aviv University
izik@post.tau.ac.il
Abstract
We have developed a new computational method that combines hybridization
data from a universal chip with prior approximate knowledge on the target
DNA sequence and determines the exact variations in the target sequence. In
contrast to many other SNP-genotyping techniques, the sequence variations
detected are not restricted to previously known polymorphic loci. This task
is most prominent in the analysis of somatic genetic variants, where
mutations are observed at arbitrary sites.
"DNA chips" allow probing a DNA sequence for the presence of all possible
short oligonucleotides of a certain length. Universal DNA microarrays with
all possible 8-mers, generated by spotting or photolithography, were
constructed in the past and used to sequence short molecules de novo. Our
method allows using such microarrays, or any parallel probing technology,
for resequencing.
Our computational approach is to model probabilistically both the prior
knowledge about the sequence, and the information produced by the
hybridization reaction. To capture prior sequence knowledge we use Hidden
Markov Models. The hybridization results are described by a graph theoretic
construct, the de-Bruijn graph, extended to accommodate noise probabilities.
We have developed an algorithm for reconstructing the target sequence given
the two data sources. Our algorithm is capable of handling insertions and
deletions in addition to substitutions, as well as fragmented targets (e.g.,
different exons).
Preliminary simulations with real DNA sequences are very promising: with
8-mer hybridization data, our method can accurately determine virtually all
SNPs in a target of length 2 kb, even when the hybridization reaction has
false positive and false negative rates of 5%. The average reconstruction error
rate is 1 base in 30,000. Naturally, there is a performance tradeoff between
target length versus tolerated polymorphism rate and reaction inaccuracy.
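For readers unfamiliar with the construct, a minimal sketch of a de Bruijn
graph built from an error-free k-mer spectrum is given below (the method
above extends this with hybridization noise probabilities and the HMM
prior):

    from collections import defaultdict

    def kmer_spectrum(sequence, k=8):
        """All k-mers present in the sequence (what an ideal universal chip reports)."""
        return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

    def de_bruijn_graph(kmers):
        """Nodes are (k-1)-mers; each k-mer contributes an edge prefix -> suffix."""
        graph = defaultdict(list)
        for kmer in kmers:
            graph[kmer[:-1]].append(kmer[1:])
        return graph

    seq = "ACGTACGGTACGTTAGC"
    graph = de_bruijn_graph(sorted(kmer_spectrum(seq, k=4)))
    for node, successors in sorted(graph.items()):
        print(node, "->", ", ".join(successors))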
Joint work with Naama Arbili and Ron Shamir
13.
Geometrical analysis of gene expression dynamics
Scott A. Rifkin
Department of Ecology and Evolutionary Biology, PO Box 208106
Junhyong Kim
Department of Molecular, Cellular, and Developmental Biology, PO Box
208103, and Department of Statistics, PO Box 208290, Yale University, New
Haven, CT 06520
Corresponding Author: Junhyong Kim, Department of Ecology and
Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT 06520;
junhyong.kim@yale.edu
Abstract
During physiological and developmental processes of an organism, the
molecular state of any given cell undergoes a cascade of changes in
coordination with other cells and the environment. These molecular
interactions have an inherent temporal structure molecular interactions
obey dynamical rules. Recent advances such as microarray technology (1)
have the potential to characterize molecular interaction dynamics at the
whole genome level. However, compared to typical time-series data
microarray data is characterized by a relatively high degree of noise, an
extremely large number of variables, and a small number of measurements,
requiring non-traditional approaches. Parametric approaches such as Fourier
analysis or wavelet analysis can be difficult to apply and may not reveal
crucial aspects of the data, while an analysis based on the static
structure of the data such as singular value decomposition (SVD) can be
inadequate for revealing dynamical features. In this paper, we introduce
two techniques to aid dynamical analysis of gene expression data: dynamical
structure visualization and non-parametric geometrical analysis of periodic
dynamics. We apply our analyses to the well-analyzed Saccharomyces
cerevisiae cell cycle data as an example and demonstrate the strength of
our method using numerical simulations.
We used Mathematica 4.0 (2) to analyze the three yeast
datasets (alpha-factor, cdc15, and cdc28), with adjustments for missing data
as described in Rifkin et al. (3), providing us with data for 5541 genes
(see (4) and (5) for details of the experiments). The cdc15 time-series has
samples every 10 minutes from 10 to 290 minutes, except for 5 timepoints
(which we estimated by linear interpolation). The different datasets arise from three
different ways in which the yeast cell cycle was arrested and synchronized
prior to measurement of gene expression over time. Dynamically, they
represent three different initial conditions for the system. Given the
periodic nature of the cell cycle, it is evident that the activity of some
subset of the genes or linear combinations of gene expression levels will
show simple periodic dynamics with the frequency of the cell cycle. If the
dominant expression dynamics of these genes is governed by a
low-dimensional periodic component, it will lie in some unknown subspace of
the high-dimensional state space of all the genes (~6200 genes for yeast).
We investigated the geometry of these dynamics with two new tools for (a)
locating and visualizing the possible periodic dynamics in the state space
and (b) projecting it onto subspaces to estimate dominant periodic dynamics
and phase relation of the genes. We also demonstrate how previous results
(3) arise from the geometry of cell cycle dynamics.
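A minimal sketch of one such projection (assuming the two leading singular
vectors are used to define the plane; the visualization and non-parametric
analysis of the paper go further):

    import numpy as np

    def leading_plane_phases(X):
        """X: (genes x timepoints). Project the timepoints onto the plane of
        the two leading singular vectors of the time-centered data and return
        a phase angle per timepoint; periodic dynamics appear as a steady
        progression of these angles."""
        Xc = X - X.mean(axis=1, keepdims=True)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        plane = Vt[:2, :] * s[:2, None]          # 2-D coordinates of each timepoint
        return np.degrees(np.arctan2(plane[1], plane[0]))

    # Toy example: 50 noisy sinusoidal "genes" sampled at 24 timepoints.
    rng = np.random.default_rng(3)
    t = np.linspace(0, 2 * np.pi, 24, endpoint=False)
    X = np.array([np.sin(t + p) for p in rng.uniform(0, 2 * np.pi, 50)])
    X += rng.normal(0, 0.2, X.shape)
    print(np.round(leading_plane_phases(X), 1))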
14.
Mat Soukup and Jae K. Lee
University of Virginia
Talk title: Identifying Multiple-Factor Genes and Evaluating Classification
Probability for Distinct Biological Groups Based on Gene Expression Data:
Stepwise Cross-Validated Discriminant Analysis
Abstract:
Thanks to recent advances in gene chip technology, various genome-wide gene
expression studies have been performed to discover important gene factors
that enable us to discern between two or more biological conditions, disease
subtypes, or critical time points in a biological pathway. For instance,
performing a gene expression study for two subtypes of leukemia patients,
Golub et al. (1999) tried to predict and distinguish the two subtypes of
patients that show quite different prognoses and require fundamentally
different medical treatment courses. They proposed to use a so-called
gene-voting method, which aggregates the prediction power by gradually
adding multiple gene factors. This approach, however, can neither identify
individual gene factors that are most critical in predicting the two
subtypes nor provide a prediction probability of their tumor classification.
We propose a new method for simultaneously identifying important gene
factors and evaluating their predictive power for two groups of gene
expression data using a stepwise cross-validated discriminant analysis
approach (SCVD). Applying leave-one-out cross-validation and quadratic
discriminant analysis in a stepwise fashion, we identify all multiple gene
factors that play an important role in discriminating two biological groups
by their gene expression patterns. Our SCVD approach is as follows. At
each stage, an additional gene that significantly improves the
misclassification error and provides the lowest misclassification rate is
retained. In the case of a tie, the gene (or model) with the highest
predictive power (highest posterior classification probability) is chosen.
Each gene in the current model in turn is then validated by a backward
evaluation of classification power for possible elimination. The above
process continues until the misclassification rate is no longer lowered. We
iterate our search, dropping the genes found in the previous model, until
no further gene model can be found within a preset threshold of
misclassification error. We applied our method to Golub's leukemia data, for which we
converged upon a model containing only two genes, Zyxin and Azurocidin,
which correctly classified 37 of 38 patients in the training set and 31 out
of 34 in the independent validation set. Thus, our approach provided
misclassification error rates equivalent to or better than those of the gene
voting models, which typically contained 50 to 100 gene predictors.
Evaluating the posterior probability of each patient's tumor classification,
we could also more carefully assess each patient's subtype of leukemia. Our
method can be applied to more than two groups of differentially expressed
array data sets for discrimination and prediction. An extensive simulation
study is in progress.
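A single forward step of such a procedure might look as follows (a sketch
using leave-one-out cross-validation with quadratic discriminant analysis;
the full SCVD procedure adds the tie-breaking by posterior probability,
backward elimination, and repeated searches described above):

    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def forward_step(X, y, current_genes, candidate_genes):
        """Return the candidate gene whose addition to the current model gives
        the lowest leave-one-out misclassification rate, and that rate.
        X: (samples x genes), y: class labels."""
        best_gene, best_error = None, np.inf
        for g in candidate_genes:
            cols = list(current_genes) + [g]
            acc = cross_val_score(QuadraticDiscriminantAnalysis(),
                                  X[:, cols], y, cv=LeaveOneOut()).mean()
            if 1.0 - acc < best_error:
                best_gene, best_error = g, 1.0 - acc
        return best_gene, best_error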
15.
Gary Stormo, Washington University Medical School
Using Expression Data to Learn About Regulatory Networks
Via Promoter Analysis
One of the uses of expression data is to identify sets of coregulated
genes. From those sets one can try to discover the regulatory sites that are
involved in the regulatory processes, which can then lead to identifying the
proteins involved and help in understanding the complete regulatory network.
Methods for discovering the sites have been around for over 15 years, and
current methods are reasonably good. A brief overview of those methods will
be presented. But in higher eukaryotes it is rarely the case that genes
are regulated by single transcription factors; rather, they tend to be
controlled by combinations of factors working in concert. Examples will be
given of such combinatorial regulation that we've uncovered using some of
the methods we and others have developed. In addition, improved methods
with higher sensitivity for identifying combinatorial factors will be
described.
Finally, recent work on identifying regulatory sites within RNA sequences,
which may be composed of both structure and sequence constraints, will be
described. Such post-transcriptional regulatory mechanisms can influence
the results of expression analysis by altering the half-lives of mRNAs, and
can also influence protein expression levels without changing the
mRNA abundances, and may therefore explain some of the discrepancies observed
between mRNA and protein levels.
16.
Emerging Technologies for Gene Expression Analysis
Peter Tolias, Ph.D.,
Director, Center for Applied Genomics, Public Health Research Institute,
Newark NJ.
Associate Professor, Dept. of Microbiology & Molecular Genetics,
UMDNJ-New Jersey Medical School, Newark NJ.
In recent years, we have seen revolutionary advances in technology related
to gene expression studies. Individual measurements of steady state
levels of mRNA have been replaced by multiplexing strategies. Technologies
permitting genomic-scale analysis of gene expression in a single experiment,
such as DNA microarrays and GeneChips, are now widely used by researchers
in both industry and academia. Finally, there are several emerging gene
expression technologies in various stages of development. This tutorial
will review the major platforms that are currently available for gene
expression analysis and provide a glimpse of the emerging technology of the
future.
17.
A Brief Review of Methods used in the Analysis of Gene Expression Data
Yuhai Tu
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Abstract
Several microarray technologies that monitor the levels of expression of a
large number of genes have recently emerged and promise to revolutionize
genetics research in the post-genomic era. Some of the practical
applications of microarray technology include gene function annotation,
disease characterization, pharmacogenomics, and the study of gene
regulatory networks.
Due to the huge amount of data being produced and also the noisy nature of
the data, sensible analysis methods become crucial in deciphering the true
signal in the massive gene expression data. In this talk, we will give a
general review of the existing analysis methods that have been used for
different problems. The analysis methods are divided into two categories:
unsupervised and supervised. For methods in the first category, which are
mostly used in class discovery, we will discuss various clustering
algorithms, including hierarchical clustering, K-means, self-organizing
maps (SOM), and superparamagnetic clustering. For methods in the second
category, which are mostly used in classification, we will discuss
individual gene-based methods, support vector machines (SVM), and
pattern-discovery-based methods.
There is no universally good analysis method that can be used with all gene
expression data; the method of choice will depend on the biological system
being studied and the questions being asked. Therefore it is crucial for
biologists and bioinformaticians to understand the pros and cons of the
existing methods. In this review, we will evaluate the existing methods
critically, emphasizing the need for statistical analysis and consideration
of noise.
18.
Honghui Wan,
NCGR/NIH
Talk title: Gene Expression Analysis System from Multiple Biological
Information Resources
Abstract:
Microarrays have become the most effective, broadly used tools in the
genomics revolution. The development of microarray technology has advanced
the ability to perform genome-wide analysis by simultaneously monitoring the
gene expression and identifying genes related to complex diseases and
multi-cellular responses. The impact on human health can be studied using
microarrays to determine the effects on the expression pattern of genes. The
ability to integrate access to such a wide variety of public biological
resources with that of comparative gene expression data, is of great value
to critically organize, archive, analyze, and visualize intrinsic gene
expression profiles. Gene expression data is useless unless biologically
meaningful information can be extracted and presented in some readily
understandable fashion. The production of this meaningful information,
involving many facets of statistical analyses associated with multiple
resources, is only possible with computers running sophisticated software.
The advancements in information technology provide the ability to design
data management and analysis systems that not only warehouse information
but also facilitate relational integration and interpretation of
large-scale microarray gene expression data with outputs from multiple
heterogeneous,
synthesized, and distributed biological resources.
We develop an integrated and comprehensive gene expression data
management and analysis system from heterogeneous, synthesized, and
distributed biological databases and resources, such as:
* Gene annotation information,
* Clinical data about different cell lines,
* Motif information,
* Protein localization information,
* Protein classification information,
* Biological pathway information,
* Experimental data describing how a microarray experiment was carried
out,
* Related textual biological data stored in databases such as MEDLINE,
* Phylogenetic profiles that are derived from a comparison between a
given gene and a collection of complete genomes.
This system can be applied in a creative fashion to discover knowledge and
understanding of genes associated with complex diseases.
19.
Amir Ben-Dor,
Agilent Laboratories
Talk title: Overabundance Analysis with Applications in Cancer
Sub-classification
Abstract:
Recent studies (e.g., Alizadeh et al., Nature 2000; Bittner et al., Nature 2000;
Golub et al., Science 1999) on molecular-level classification of cancer cells
produced results that strongly indicate the potential of gene expression
assays as diagnostic and segmentation tools and as a basis for the discovery
of putative disease subtypes. We will describe methods that enable data
analysis in various stages of such studies.
Classified gene expression data consists of tissue samples (for which
expression profiles are measured) that are labeled as belonging to certain
classes (such as tumor or normal, particular kinds of tumors, phase,
differentiation stage, etc). Some of the genes measured play a major role in
the processes that underlie the differences between the classes or are
dramatically affected by the differences. Such genes are highly relevant to
the studied phenomenon. On the other hand, the expression levels of many
other genes are irrelevant to the distinction between the tissue types under
consideration. We will examine ways of measuring the relevance of a gene, or
a set of genes, to the studied phenomenon. We will discuss some
corresponding statistical benchmarking techniques and see how these can be
applied to the more complicated challenge of class discovery. This term
refers to the process of trying to identify statistically significant
subclasses of tissues in gene expression data, in an unsupervised manner.
Specifically, we will consider the null model where each sample is labeled
as '+' or '-', depending on class membership. Some genes have dramatic '+'
to '-' expression level differences. Under a null model where a vector of
labels of the appropriate composition is uniformly drawn, we can assign
p-values to all '+' to '-' expression level differences. For actual
biological classes we typically observe an overabundance of differentially
expressed genes (compared to the null model).
Efficient methods for calculating exact score distributions, under the above
null model, allow, therefore, for a novel approach to class discovery. For
candidate partitions of the sample set, we compute the abundance of
differentially expressed genes and assign a statistical significance to this observed
abundance. Search heuristics (simulated annealing, genetic algorithms) find
the highest scoring partitions. Thus, grouping is based on subsets of the
genes rather than on the entire set. The calculations are accurate and
efficient, in contrast to sampling based methods.
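For intuition, a sampling-based version of the same idea is sketched below
(the methods above compute the relevant distributions exactly; the
difference score and cutoff here are placeholders):

    import numpy as np

    def de_count(X, labels, score_cutoff):
        """Number of genes whose |mean('+') - mean('-')| exceeds the cutoff.
        X: (genes x samples); labels: boolean array, True for '+' samples."""
        diff = np.abs(X[:, labels].mean(axis=1) - X[:, ~labels].mean(axis=1))
        return int((diff > score_cutoff).sum())

    def overabundance_pvalue(X, labels, score_cutoff, n_perm=1000, rng=None):
        """Fraction of random labelings of the same composition that show at
        least as many differentially expressed genes as the candidate partition."""
        if rng is None:
            rng = np.random.default_rng(0)
        observed = de_count(X, labels, score_cutoff)
        hits = sum(de_count(X, rng.permutation(labels), score_cutoff) >= observed
                   for _ in range(n_perm))
        return (hits + 1) / (n_perm + 1)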
We will discuss statistical and algorithmic approaches. We will use actual
gene expression data to demonstrate the relevance scoring process and the
discovery process.
This is joint work with Amir Ben-Dor and other collaborators.