Evaluation of the DIMACS Special Year on
Mathematical Support for Molecular Biology

Committee: Daniel Gusfield, Sampath Kannan (chair), Pavel Pevzner
June 11, 1998

1. Introduction

Over the past few years computational biology and bio-informatics have matured to the point that biologists routinely use techniques from these fields to complement or even replace their wet lab experiments. On the computer science side, computational biology research has become a significant component in the research on algorithms. Similarly bio-informatics research has acquired a prominent place in the field of database research. While the origins of these new and exciting fields can be traced back to the human genome project, the DIMACS Special Year on Mathematical Support for Molecular Biology has greatly contributed to the recent rapid growth in these areas.

This report examines the goals and accomplishments of this effort. In this report, when we refer to the "special year" we are referring to the entire period between 1994 and 1998.

A large number of biologists, computer scientists and discrete mathematicians have been involved. Participants from biology have mostly been senior researchers who are already involved with computational and mathematical techniques. Participants from the other disciplines have included undergraduates, graduate students, post-doctoral fellows, junior and senior researchers.

In our view the goals of the DIMACS special year were the following.

In keeping with the asymmetric relationship between biology and computer science in computational biology, there does not seem to have been a significant effort to train young researchers in biology in the techniques of computer science. We will examine each of the above goals in turn and describe how the special year went about accomplishing these goals.

2. Training

The foremost accomplishment has been the training of several post-doctoral fellows who have gone on to successful careers in the new field. A list of such fellows along with their terms at DIMACS and their current position is given below: During the peak period of funding for the special year from 1994 to 1996 there were at least 4 postdocs in residence at DIMACS at all times. They benefitted immensely from the variety of DIMACS-sponsored workshops, seminars, and conferences, from extended visits by senior researchers in biology and computer science, and from interactions with each other. The tangible benefits will be described later in this report.

In addition to training post-doctoral fellows, the special year was used for educational outreach to undergraduate and even high school students. Post-doctoral fellows and permanent members of DIMACS have given lectures at local high schools and to undergraduates.

A number of extended tutorials have been held at DIMACS and at other locations to help with the training aspect. To kickoff the special year, in August of 1994 a 4 week long tutorial with lectures by William Sofer, Martin Farach, Mick Noordewier, and David Axelrod was held to introduce participants to molecular biology and to current mathematical and computational approaches to biological problems. Another tutorial on computational biology was held in July 1997 in Buenos Aires. A number of other events such as workshops, distinguished lectures, and conferences have been held throughout the period from 1994 to the present. While these events have had a significant training component we describe them later in the report.

The training and general research environment for computational biology provided by the special year benefitted not only the post-doctoral fellows but also graduate students at Rutgers and other places throughout the country. DIMACS hosted visits by several graduate students working in this area in the course of their Ph.D.s. Such visitors included Daniella Martin and Fengzhu Sun from the University of Southern California, Shibu Yooseph from the University of Pennsylvania, Andris Ambainis from the University of Latvia (currently a graduate student at UC Berkeley), Mary Cryan from the University of Warwick, and Dario Robak and Christian Rodriguez from the University of Buenos Aires. In addition graduate students at Rutgers who benefitted from the training included Jaime Cohen, Rick Desper, and Vincenzo Liberatore.

Finally, several senior researchers from around the country were able to use their visits to DIMACS to develop a research direction in computational biology or to further their research interests in this area. The list of such visitors is long and the dramatic increase in the number of theoretical computer scientists working in computational biology attests to the success of the special year in this regard.

The success of this effort at training computer scientists in biology depended in large part on the participation of biologists. Many prominent biologists both at Rutgers and from elsewhere in the country participated in the workshops and distinguished lectures organized during the special year. During the first year several of the post-docs at DIMACS visited Leroy Hood's sequencing laboratory at the University of Washington and gained invaluable first-hand experience with wet lab techniques.

Until about five years ago there were no good textbooks available for computational biology. A book titled, "Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology", Cambridge Press 1998 was authored by Dan Gusfield. The time that the author spent at DIMACS during the special year helped shape this book. The visit to DIMACS not only gave the author time to work on the book but also gave him exposure to a wide variety of topics and helped shape his choice of content and the intended audience. Another important book by Michael Waterman titled "Introduction to Computational Biology" (Chapman and Hall, 1995) also benefitted from the author's presence at DIMACS.

These books provide an invaluable training tool and will begin to standardize the syllabus for a course on this subject --- a sign of the maturing of computational biology.

3. Bridges

The special year helped launch a number of cross-disciplinary collaborations. Prior to the special year research in computational biology had been rather compartmentalized. The work of computer scientists did not usually find its way into the biologist's toolkit. In addition, computer scientists were working on a small set of cleanly formulated problems without a full knowledge of their biological relevance. It was therefore important to greatly encourage collaborative research.

We list below some of the important collaborative projects that arose out of the special year.

Computer Modeling of Ultraselfish Genes in Mice:
Lee Silver, a Princeton biologist and Dannie Durand, a DIMACS member from Bellcore have used Monte Carlo computer simulation to suggest new hypotheses for the persistence of a deleterious gene in mice. The collaboration is ongoing and has led to a number of important research papers already.

DNA Computing:
A model of computation that has generated a lot of excitement in recent years is the DNA Computer. The goal is to use biological processes undergone by DNA and other biomolecular sequences to perform useful and difficult computation. Laura Landweber, a Princeton biologist and Richard Lipton, a permanent member of DIMACS from Princeton are working together in this area with the goal of using DNA computers to solve difficult computational problems about DNA themselves. Their work has progressed to the point of beginning wet lab experiments.

Models of Duplication:
A process of gene duplication (and subsequent mutation) is thought to be responsible for the great diversity of genes that are observed in extant organisms. Boris Mirkin, a long-term visitor to DIMACS has been collaborating with Martin Vingron at the National Cancer Center of Germany in Heidelberg to investigate theoretical models of gene duplication.

Prediction of Functional Domains on a Herpesvirus Gene:
Lynn Enquist, a Princeton biologist, DIMACS member Dannie Durand, and DIMACS postdocs Vineet Bafna and R. Ravi collaborated on prediction of functional domains on a herpesvirus gene. The research aimed to find similar sequences in other species by using new similarity algorithms and search techniques. Although this research did not produce positive results it led the researchers to seek similarity based on structure rather than sequence. The collaboration led Ravi to supervise an undergraduate student at Princeton.

Phylogenetic Tree Construction:
David Swofford, a Smithsonian biologist and DIMACS permanent member Martin Farach have collaborated on algorithms for finding maximum agreement subtrees --- a useful tool in phylogeny construction. Swofford has coded up and included Farach's algorithm for this problem in his popular package, PAUP.

These projects represent some of the collaborations during the special year at DIMACS. Perhaps the more important function of DIMACS has been in training young computer scientists in computational biology so that they go on to establish collaborations later in their career. The handful of post-docs who have gone on to research positions in Pharmaceutical companies attests to success in this regard.

4. Research

The DIMACS special year provided a stimulating research environment for all of its participants. This is attributable to a number of reasons. The steady stream of high quality visitors, post-docs, and students produced a "critical mass" of enthusiastic researchers at all times. This led to several fruitful collaborations that started out as conversations in the corridors or at a talk.

There was also a steady diet of organized workshops and distinguished lectures. In the year 1994-95 alone there were the following meetings and talks --- Workshop on Combinatorial Methods in DNA Mapping (Oct. 6--9), Mini-Workshop on Combinatorial Structures in Molecular Biology (Nov. 4), Workshop on Sequence Alignment (Nov. 10--12), Workshop on DNA Topology and Structure (Dec. 9), Phylogeny Workshop (Feb. 6--8), Workshop on Global Minimization of Nonconvex Energy Functions: Molecular Conformation and Protein Folding (Mar. 20--21), Mini-Workshop on Sequence-Based Methods for Protein Folding (Mar. 24), DNA Based Computers: A Mini Workshop (Apr. 4), Workshop on HIV Sequence Analysis (May 3--5), Mini-Workshop on Geometrical Methods for Conformational Modeling (Aug. 2--3), Algorithm Implementation Challenge Workshop: Two Problems in Computational Biology: Fragment Assembly and Genome Rearrangements (Sept. 14--15), and Mini-Workshop on Mathematics of Drug Discovery (Sept. 21--22).

The above list gives an idea of the level of activity and the breadth of scope of the special year. The number of workshops organized in subsequent years seems to have dropped significantly but this is perhaps related to the level of funding.

Speaking from personal experience and from the record of research accomplishments emanating from DIMACS, we would say that the research program was an unqualified success. A few of the tangible consequences are listed below:

It is difficult to list all the research results for which the special year provided the backdrop. We mention a few below:

Approximating Numerical Taxonomy:
DIMACS post-doctoral fellows Richa Agarwala, Vineet Bafna, and Babu Narayanan, in collaboration with Martin Farach, a permanent DIMACS member and visitors Mikkel Thorup, and Mike Paterson presented an algorithm for inferring an evolutionary tree from a matrix of interleaf distances in such a manner that the difference between the tree distances and the matrix distances is at most three times the best possible. The experimental evaluation of this algorithm performed by Farach and a student, Jaime Cohen, indicate that a good implementation of this algorithm often outperforms neighbor-joining, one of the popular distance-based methods. This result was used by Farach and Kannan to design an efficient algorithm for finding good approximations to the maximum likelihood tree.

Sequencing by Hybridization:
Sequencing by hybridization (SBH) is a promising alternative to the classical DNA sequencing approaches. Roughly, the technology involves choosing a set S of "short" DNA strings and finding out what subset of S occurs in an unknown DNA strand. The goal is to use this information to reconstruct the unknown strand. Unfortunately, it is difficult to get enough resolving power to sequence long strands using this technique. A promising variant called positional SBH allows one to determine not only what strings occur, but also approximately where they occur in the unknown strand. Special year visitors Sridhar Hannenhalli, Pavel Pevzner, and Steven Skiena together with Herbert Lewis have studied the Positional Eulerian Path Problem and shown that it is NP-complete in general. However, when the positions can be determined fairly accurately, they showed that the problem can be solved in polynomial time. Such a result could serve to motivate a refinement of the technology to determine positions more accurately.

Phylogenetic Tree Reconstruction from Quartet Splits:
Special year visitors Peter Erdos, Mike Steel, Laszlo, Szekeley, and Tandy Warnow have worked on phylogenetic tree reconstruction under stochastic models of nucleotide substitution. While it is well-known that the set of valid quartet splits determines the tree, Erdos, Steel, Szekeley, and Warnow strengthened this result by showing that local quartets (i.e., quartets with small diameter) can be used to infer the tree as well. The authors and various other collaborators have performed several experimental studies of their algorithm on real and simulated data.

Combinatorial Approaches to Gene Finding:
The problem of identifying the introns and exons constituting a gene is an important one. Traditional approaches to this problem have typically been statistical. Pavel Pevzner in collaboration with Gelfand has proposed a new dynamic programming algorithm. Experimental evaluation of this algorithm has shown that it is very successful at prediction.

Parametric Alignment:
Dan Gusfield, a visitor to the Special year, and his collaborators have been studying the problem of parametric alignment. The task is to design an alignment algorithm which finds optimal alignments for all possible settings of the values of parameters such as the cost of insertion/deletion, the cost of a mismatch, the cost for a gap, and the score for a match. Gusfield has designed efficient algorithms for this problem and has implemented these in a software package.

5. Industrial Connections

Biotech companies have been heavily involved in the Special year. Participating companies include CuraGen, Seq Limited, Roche Molecular Systems, Hoffman LaRoche, Pharmagenics, SmithKline Beecham, Johnson and Johnson, Biosym Tech., Ciba Geigy, Merck, Perkin-Elmer, Genome Therapeutics and the Institute for Genomic Research. These companies have provided financial support, speakers, committee members, organizers, attendees, and data and feedback for the algorithm implementation challenge.

In addition, there are several people trained at DIMACS or associated with DIMACS who have gone on to take important positions in these and other companies, further strengthening the connections.

6. Conclusions

The special year has been enormously successful in meeting its goals. The field of computational biology is obviously an important one with the demand for qualified people far exceeding the supply. DIMACS has played an important role in exciting interest in this area, training new researchers, and retooling researchers in related areas to work in this field.

However, specific aspects still need improvement. While there have been several successful cases of collaboration between individuals in computer science and biology, there is still a systemic difficulty in the osmosis of ideas between these two disciplines. It still appears hard for researchers in these disciplines to arrive at problems of mutual interest. DIMACS and other institutions still have work to do in trying to bring about a better synthesis.

The focus so far has been on training computer scientists in biology. It is important to reverse roles and train biologists, chemists and other scientists in computer science as well for a couple of reasons given below. The emerging field of DNA computing reverses the roles of computer science and biology --- computer science is the client here. Scientists with training in computer science can better appreciate what computer science can and cannot do for them.

The focus of DIMACS training has been at the post-doctoral level although several attempts have been made to involve high-school, undergraduate, and graduate students. In order to achieve true synthesis it may be important to focus more on graduate students and undergraduates. New curricula and degree programs can produce students who are capable of understanding both fields, performing relevant research in computational biology, and bringing the two communities closer together.

DIMACS home page
Contacting the Center
Document last modified on June 11, 1998.