Evaluation of the DIMACS Special Year on
Mathematical Support for Molecular Biology
1994-1998
Committee: Daniel Gusfield, Sampath Kannan (chair), Pavel Pevzner
June 11, 1998
1. Introduction
Over the past few years computational biology and bio-informatics have
matured to the point that biologists routinely use techniques from
these fields to complement or even replace their wet lab experiments.
On the computer science side, computational biology research has
become a significant component in the research on algorithms.
Similarly bio-informatics research has acquired a prominent place in
the field of database research. While the origins of these new and
exciting fields can be traced back to the human genome project, the
DIMACS Special Year on Mathematical Support for Molecular Biology has
greatly contributed to the recent rapid growth in these areas.
This report examines the goals and accomplishments of this effort. In
this report, when we refer to the "special year" we are referring to
the entire period between 1994 and 1998.
A large number of biologists, computer scientists and discrete
mathematicians have been involved. Participants from biology
have mostly been senior researchers who are already involved
with computational and mathematical techniques. Participants
from the other disciplines have included undergraduates,
graduate students, post-doctoral fellows, junior and senior
researchers.
In our view the goals of the DIMACS special year
were the following.
- To train computer scientists and mathematicians in biology in
order to enable them to perform relevant and interesting
research in computational biology.
- To help build bridges
between people in computer science and people in biology so
as to make possible fruitful future collaborations.
- To serve as the forum for actually fostering research
in these interdisciplinary areas.
- To initiate collaboration between researchers in this field
and industries.
In keeping with the asymmetric
relationship between biology and computer science in
computational biology, there
does not seem to have been a significant effort to train
young researchers in biology in the techniques of computer
science. We will examine each of the above goals in turn
and describe how the special year went about accomplishing
these goals.
2. Training
The foremost accomplishment has been the training of several
post-doctoral fellows who have gone on to successful careers
in the new field. A list of such fellows along with their
terms at DIMACS and their current position is given below:
- Richa Agarwala, 1994-96.
Currently at R.O.W. Sciences.
- Vineet Bafna, 1994-96. Research Scientist at
SmithKline Beecham Pharmaceuticals Research and Development.
- Scott Decatur, 1996-97. Currently at Grantham, Mayo,
Van Otterloo and Co.
- Dannie Durand, 1995-96. Currently choosing between various
career options in computational biology at universities.
- Sridhar Hannenhalli, Spring 1996. Research Scientist
at SmithKline Beecham Pharmaceuticals Research and Development.
- S. Muthukrishnan, 1994-96. Currently at Bell Laboratories.
- Ramamoorthi Ravi, 1994-95. Assistant Professor at
Carnegie-Mellon University.
- Mona Singh, 1995-97. Currently at Whitehead Institute for
Biomedical Research at MIT.
- Elizabeth Sweedyk, 1996-present.
- Shibu Yooseph, 1997-present.
During the peak period of funding for the special year from 1994
to 1996 there were at least 4 postdocs in residence at DIMACS
at all times. They benefitted immensely from the variety of
DIMACS-sponsored workshops, seminars, and conferences, from
extended visits by senior researchers in biology and computer
science, and from interactions with each other. The tangible
benefits will be described later in this report.
In addition to training post-doctoral fellows, the special year
was used for educational outreach to undergraduate and
even high school students. Post-doctoral fellows and permanent
members of DIMACS have given lectures at local high schools
and to undergraduates.
A number of extended tutorials have been held at DIMACS
and at other locations to help with the training aspect.
To kickoff the special year, in August of 1994
a 4 week long tutorial with
lectures by William Sofer, Martin Farach, Mick Noordewier,
and David Axelrod was held to introduce participants to
molecular biology and to current mathematical and
computational approaches to biological problems.
Another tutorial on computational biology was held
in July 1997 in Buenos Aires. A number of other
events such as workshops, distinguished lectures,
and conferences have been held throughout the period
from 1994 to the present. While these events have
had a significant training component we describe them
later in the report.
The training and general research environment
for computational biology provided by the special year
benefitted not only the post-doctoral fellows but also
graduate students at Rutgers and other places throughout
the country. DIMACS hosted visits by several graduate
students working in this area in the course of their
Ph.D.s. Such visitors included Daniella Martin
and Fengzhu Sun from
the University of Southern California, Shibu Yooseph
from the University of Pennsylvania, Andris Ambainis
from the University of Latvia (currently a graduate
student at UC Berkeley), Mary Cryan from the University
of Warwick, and Dario Robak and Christian Rodriguez from
the University of Buenos Aires. In addition graduate
students at Rutgers who benefitted from the training
included Jaime Cohen, Rick Desper, and Vincenzo Liberatore.
Finally, several senior researchers from around
the country were able to use their visits to DIMACS
to develop a research direction in computational
biology or to further their research interests in
this area. The list of such visitors is long
and the dramatic increase in the number of
theoretical computer scientists working in computational
biology attests to the success of the special year
in this regard.
The success of this effort at training computer scientists
in biology depended in large part on the participation
of biologists. Many prominent biologists both at Rutgers
and from elsewhere in the country participated in the
workshops and distinguished lectures organized during
the special year. During the first year several of
the post-docs at DIMACS visited Leroy Hood's sequencing
laboratory at the University of Washington and gained
invaluable first-hand experience with wet lab techniques.
Until about five years ago there were no good textbooks
available for computational biology.
A book titled, "Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology", Cambridge Press 1998
was authored by Dan Gusfield. The time that the author spent at
DIMACS during the special year helped shape this book. The visit to
DIMACS not only gave the author time to work on the book but also
gave him exposure to a wide variety of
topics and helped shape his choice of content and the intended
audience.
Another important book by Michael Waterman titled "Introduction to
Computational Biology" (Chapman and Hall, 1995) also benefitted from
the author's presence at DIMACS.
These books provide an invaluable training tool and will begin to
standardize the syllabus for a course on this subject --- a sign
of the maturing of computational biology.
3. Bridges
The special year helped launch a number of cross-disciplinary
collaborations. Prior to the special year research in
computational biology had been rather compartmentalized.
The work of computer scientists did not usually find its way
into the biologist's toolkit. In addition, computer scientists
were working on a small set of cleanly formulated problems
without a full knowledge of their biological relevance.
It was therefore important to greatly encourage collaborative
research.
We list below some of the important collaborative projects
that arose out of the special year.
- Computer Modeling of Ultraselfish Genes in Mice:
- Lee Silver, a Princeton biologist and Dannie Durand, a DIMACS
member from Bellcore have used Monte Carlo computer simulation to
suggest new hypotheses for the persistence of a deleterious gene in
mice. The collaboration is ongoing and has led to a number of
important research papers already.
- DNA Computing:
- A model of computation that has generated a lot of excitement in
recent years is the DNA Computer. The goal is to use biological
processes undergone by DNA and other biomolecular sequences to perform
useful and difficult computation. Laura Landweber, a Princeton
biologist and Richard Lipton, a permanent member of DIMACS from
Princeton are working together in this area with the goal of using DNA
computers to solve difficult computational problems about DNA
themselves. Their work has progressed to the point of beginning wet
lab experiments.
- Models of Duplication:
- A process of gene duplication (and subsequent mutation) is
thought to be responsible for the great diversity of genes that are
observed in extant organisms. Boris Mirkin, a long-term visitor to
DIMACS has been collaborating with Martin Vingron at the National
Cancer Center of Germany in Heidelberg to investigate theoretical
models of gene duplication.
- Prediction of Functional Domains on a Herpesvirus Gene:
- Lynn Enquist, a Princeton biologist, DIMACS member Dannie Durand,
and DIMACS postdocs Vineet Bafna and R. Ravi collaborated
on prediction of functional domains on a herpesvirus gene.
The research aimed to find similar sequences in other
species by using new similarity algorithms and search techniques.
Although this research did not produce positive results it
led the researchers to seek similarity based on structure
rather than sequence. The collaboration led Ravi to supervise
an undergraduate student at Princeton.
- Phylogenetic Tree Construction:
- David Swofford, a
Smithsonian biologist and DIMACS permanent member
Martin Farach have collaborated on algorithms for
finding maximum agreement subtrees --- a useful
tool in phylogeny construction. Swofford has
coded up and included Farach's algorithm for this
problem in his popular package, PAUP.
These projects represent some of the collaborations during the special
year at DIMACS. Perhaps the more important function of DIMACS has been
in training young computer scientists in computational biology so that
they go on to establish collaborations later in their career. The
handful of post-docs who have gone on to research positions in
Pharmaceutical companies attests to success in this regard.
4. Research
The DIMACS special year provided a stimulating research environment
for all of its participants. This is attributable to a number of
reasons. The steady stream of high quality visitors, post-docs, and
students produced a "critical mass" of enthusiastic researchers at
all times. This led to several fruitful collaborations that started
out as conversations in the corridors or at a talk.
There was also a steady diet of organized workshops and distinguished
lectures. In the year 1994-95 alone there were the following meetings
and talks --- Workshop on Combinatorial Methods in DNA Mapping
(Oct. 6--9), Mini-Workshop on Combinatorial Structures in Molecular
Biology (Nov. 4), Workshop on Sequence Alignment (Nov. 10--12),
Workshop on DNA Topology and Structure (Dec. 9), Phylogeny Workshop
(Feb. 6--8), Workshop on Global Minimization of Nonconvex Energy
Functions: Molecular Conformation and Protein Folding (Mar. 20--21),
Mini-Workshop on Sequence-Based Methods for Protein Folding (Mar. 24),
DNA Based Computers: A Mini Workshop (Apr. 4), Workshop on HIV
Sequence Analysis (May 3--5), Mini-Workshop on Geometrical Methods for
Conformational Modeling (Aug. 2--3), Algorithm Implementation
Challenge Workshop: Two Problems in Computational Biology: Fragment
Assembly and Genome Rearrangements (Sept. 14--15), and Mini-Workshop
on Mathematics of Drug Discovery (Sept. 21--22).
The above list gives an idea of the level of activity and the breadth
of scope of the special year. The number of workshops organized in
subsequent years seems to have dropped significantly but this is
perhaps related to the level of funding.
Speaking from personal experience and from the record of research
accomplishments emanating from DIMACS, we would say that the research
program was an unqualified success. A few of the tangible
consequences are listed below:
- The field of DNA computing which was at a nascent stage at the
time of the first workshop, received a significant boost from it and
from subsequent workshops organized with DIMACS participation at
Princeton University and at the University of Pennsylvania. This
workshop has now become an annual event and the field has a number of
active researchers from computer science, biology, biochemistry and
other fields.
- Early work in computational biology prior to 1994 was largely
focused on a small set of problems such as alignments, physical
mapping, and reconstructing phylogenies. The special year has
stimulated work on other important problems such as gene finding and
motif recognition, protein and RNA folding, protein structure
prediction, and linkage analysis. In addition computer scientists
have begun to turn their attention to specific problems motivated by
particular biological scenarios. This is no small feat considering
that the natural proclivity of computer scientists is to work at the
greatest level of abstraction possible. A large part of this change
was initiated by the talks given by prominent biologists at various
DIMACS fora.
- New biotechnologies such as sequencing by hybridization, optical
mapping, and EST (expressed sequence tags) have been quickly brought
to the attention of computer scientists so that they have been able to
formulate and research the right questions concerning these
technologies.
- An important part of algorithmic research in computational
biology will be experimental validation of the efficacy of the
algorithms, since theoretical analysis often yields overly pessimistic
answers. This has been facilitated by the setting up of benchmark
biological data at least in some problem domains. Thus algorithm
designers can test how well their algorithms perform on real
instances.
It is difficult to list all the research results for which the special
year provided the backdrop. We mention a few below:
- Approximating Numerical Taxonomy:
- DIMACS post-doctoral fellows Richa Agarwala, Vineet Bafna, and
Babu Narayanan, in collaboration with Martin Farach, a permanent
DIMACS member and visitors Mikkel Thorup, and Mike Paterson presented
an algorithm for inferring an evolutionary tree from a matrix of
interleaf distances in such a manner that the difference between the
tree distances and the matrix distances is at most three times the
best possible. The experimental evaluation of this algorithm performed
by Farach and a student, Jaime Cohen, indicate that a good
implementation of this algorithm often outperforms neighbor-joining,
one of the popular distance-based methods. This result was used by
Farach and Kannan to design an efficient algorithm for finding good
approximations to the maximum likelihood tree.
- Sequencing by Hybridization:
- Sequencing by hybridization (SBH) is a promising alternative to
the classical DNA sequencing approaches. Roughly, the technology
involves choosing a set S of "short" DNA strings and finding out
what subset of S occurs in an unknown DNA strand. The goal is
to use this information to reconstruct the unknown
strand. Unfortunately, it is difficult to get enough resolving power
to sequence long strands using this technique. A promising variant
called positional SBH allows one to determine not only what strings
occur, but also approximately where they occur in the unknown strand.
Special year visitors Sridhar Hannenhalli, Pavel Pevzner, and Steven
Skiena together with Herbert Lewis have studied the Positional
Eulerian Path Problem and shown that it is NP-complete in
general. However, when the positions can be determined fairly
accurately, they showed that the problem can be solved in polynomial
time. Such a result could serve to motivate a refinement of the
technology to determine positions more accurately.
- Phylogenetic Tree Reconstruction from Quartet Splits:
- Special year visitors Peter Erdos, Mike Steel, Laszlo, Szekeley,
and Tandy Warnow have worked on phylogenetic tree reconstruction
under stochastic models of nucleotide substitution. While
it is well-known that the set of valid quartet splits determines
the tree, Erdos, Steel, Szekeley, and Warnow strengthened
this result by showing that local quartets (i.e., quartets
with small diameter) can be used to infer the tree as well.
The authors and various other collaborators have performed
several experimental studies of their algorithm on
real and simulated data.
- Combinatorial Approaches to Gene Finding:
- The problem of identifying the introns and exons constituting a
gene is an important one. Traditional approaches to this problem have
typically been statistical. Pavel Pevzner in collaboration with
Gelfand has proposed a new dynamic programming algorithm. Experimental
evaluation of this algorithm has shown that it is very successful at
prediction.
- Parametric Alignment:
- Dan Gusfield, a visitor to the Special year, and his collaborators
have been studying the problem of parametric alignment. The task is
to design an alignment algorithm which finds optimal alignments for
all possible settings of the values of parameters such as the cost of
insertion/deletion, the cost of a mismatch, the cost for a gap, and
the score for a match. Gusfield has designed efficient algorithms for
this problem and has implemented these in a software package.
5. Industrial Connections
Biotech companies have been heavily involved in the Special year.
Participating companies include CuraGen, Seq Limited, Roche Molecular
Systems, Hoffman LaRoche, Pharmagenics, SmithKline Beecham, Johnson
and Johnson, Biosym Tech., Ciba Geigy, Merck, Perkin-Elmer, Genome
Therapeutics and the Institute for Genomic Research. These companies
have provided financial support, speakers, committee members,
organizers, attendees, and data and feedback for the algorithm
implementation challenge.
In addition, there are several people trained at DIMACS or associated
with DIMACS who have gone on to take important positions in these and
other companies, further strengthening the connections.
6. Conclusions
The special year has been enormously successful
in meeting its goals. The field of computational biology
is obviously an important one with the demand for
qualified people far exceeding the supply. DIMACS
has played an important role in exciting interest in
this area, training new researchers, and retooling
researchers in related areas to work in this field.
However, specific aspects still need improvement.
While there have been several successful cases of
collaboration between individuals in computer
science and biology, there is still a systemic
difficulty in the osmosis of ideas between
these two disciplines. It still appears
hard for researchers in these disciplines to arrive
at problems of mutual interest. DIMACS and other
institutions still have work to do in trying to
bring about a better synthesis.
The focus so far has been on training computer scientists
in biology. It is important to reverse roles and train
biologists, chemists and other scientists
in computer science as well for a couple of
reasons given below.
The emerging field of DNA computing reverses the roles of
computer science and biology --- computer science is
the client here. Scientists with training
in computer science can better appreciate what computer
science can and cannot do for them.
The focus of DIMACS training has been at the post-doctoral
level although several attempts have been made to
involve high-school, undergraduate, and graduate students.
In order to achieve true synthesis it may be important
to focus more on graduate students and undergraduates.
New curricula and degree programs can produce students
who are capable of understanding both fields, performing
relevant research in computational biology, and
bringing the two communities closer together.
DIMACS home page
Contacting the Center
Document last modified on June 11, 1998.