DIMACS SPECIAL YEAR IN
MATHEMATICAL SUPPORT
FOR
MOLECULAR BIOLOGY

ANNUAL REPORT
JANUARY 1996

The DIMACS "Special Year" in Mathematical Support for Molecular Biology was started in August 1994 and will continue until August 1996. The chairs are Joachim Messing, Director of the Waksman Institute for Molecular Genetics at Rutgers and Fred Roberts, Professor of Mathematics at Rutgers; the co-chairs are Larry Shepp of AT&T Bell Labs and Michael Waterman, Professor of Biology and Mathematics at the University of Southern California; and the Assistant Chair is Martin Farach of the Rutgers CS Department. A calendar of special year events, the list of the steering committee, and a list of special year visitors are appended.

In some sense, this has been like a "normal" DIMACS special year. There have been workshops, mini-workshops, an algorithm implementation challenge, a seminar series, a distinguished lecture series, a visitor program, a postdoc program, special graduate courses, technical reports and books, and involvement of "DIMACS types". However, this special year is very different.

It is different in duration. It is lasting 25 months, beginning in August 1994 with a tutorial series, having a full focused program from September 1994 to August 1995, and continuing with postdoc training, a seminar series, and workshops during the period September 1995 to August 1996. It is also different in size. A "normal" special year has been budgeted at between $250K to $450K. This "year" was budgeted at between $1.1M and $1.2M. Why the extra money? The main reason is that there were four postdocs instead of the usual one or two, and the postdocs are staying for two years instead of one. Also, there were biological scientists who visited, and there were long-term visits from prominent computational biologists to supplement the small number of DIMACS members who specialize in this field. With so many postdocs and visitors, DIMACS has been "bursting at the seams". Where have the funds come from? About $375K has come from the DIMACS STC award, the Rutgers STC matching, and the NJ Commission on Science and Technology. A sum of $360K came from an additional NSF award from computational biology and other programs. The National Center for Human Genome Research has contributed $100K and is expecting to contribute another $100K. We have also received miscellaneous amounts from Rutgers, DOE, Sandia, Centers for Disease Control, National Institute of Allergy and Infectious Diseases, Biosym Tech., Ciba Geigy, Merck Research, Roche Molecular Systems, Smith Kline Beecham, Cura-Gen, etc. This certainly connects to the STC goal of developing a strong base of support beyond the base STC award.

This special year has also been different in postdoc training. We are really introducing postdocs to new fields. We have gotten senior people from different disciplines and different institutions to be involved in mentoring. We have run a tutorial program for p'docs. The increased number of postdocs has created a critical mass, leading to a great deal of interaction and scientific progress.

A crucial difference in this special year is in its real outreach to another community. This is in line with the STC goal of creating "linkages that involve significant intellectual exchange with other appropriate communities and institutions beyond the sponsors". Here is a summary of some of the new partnerships established through the special year. First is the Waksman Institute for Molecular Genetics at Rutgers. Joachim Messing, Director of Waksman, is a chair of the year. Many Waksman faculty have been involved in mentoring, tutoring, organizing. Many events have been held at Waksman. Waksman has made available its Molecular Biology Computing Laboratory. A second partnership has been with the Center for Molecular Biotechnology, our sister STC at the University of Washington. We have planned postdoc visits to that center. Leroy Hood, Director of the center, is on the steering committee for the year and has played a central role in convincing us that the year was a good idea. There have been visits by by Hood, Maynard Olson, Phil Green, and their postdocs. Still a third partnership has been with the Genome Center and with the Department of Computer Science at the University of Pennsylvania. One mini-workshop was held through the auspices of the Genome Center; our seminar series has been held there occasionally; a second seminar series (on phylogeny) was initiated there and held in collaboration with DIMACS; DIMACS is collaborating in a conference to be held at Penn in the Spring of 1996; and Sampath Kannan, David Searls, and Tandy Warnow from Penn are on the special year steering committee, Searls representing the Genome Center. The Center for Theoretical and Applied Genetics at Rutgers has been an active partner in the year. Its Director, Robert Vrijenhoek, and its Associate Director, Peter Smouse, are involved in our seminar series, have served on workshop committees, and are participating in joint research. Postdocs from there are involved and faculty from there have posed problems for mathematical scientists. We have formed active and growing partnerships with the Departments of Molecular Biology and of Ecology and Evolutionary Biology at Princeton. Scientists there have been busy formulating problems for mathematicians and computer scientists. We visited those departments. as part of our tutorial program and those visits have led to major collaborations (details below).

Biotech and pharmaceutical companies have also been involved heavily in the special year. Participating companies include CuraGen, Seq Limited, Roche Molecular Systems, Hoffman LaRoche, Pharmagenics, Smith Kline Beecham, Johnson and Johnson, Biosym Tech., Ciba Geigy, Merck, Perkin-Elmer, Genome Therapeutics, and The Institute for Genomic Research. These companies have provided financial support, speakers, committee members, organizers, attendees, and data and feedback for our algorithm implementation challenge. To give just one example of industrial participation, we note that in our miniworkshop on mathematical problems in drug discovery, held in September 1995, the majority of speakers and participants were from industry. There was financial sponsorship from Biosym Tech., Ciba Geigy, Merck Research, and Smith Kline Beecham. This miniworkshop was designed to acquaint pharmaceutical and computational chemists with the latest developments in such fields as discrete and computational geometry and to introduce discrete mathematicians and theoretical computer scientists to the problems of the field coming to be known as "combinatorial chemistry," a field that so far consists largely of heuristic algorithms based upon chemical intuition. The workshop led to lots of discussion and to some promising beginning contacts, for example between a Chiron Corp. chemist and DIMACS mathematician who have common interests and ideas about the use of large-scale cluster analysis in drug design.

Prominent biologists are playing a leadership role in all special year activities. On the steering committee for the year are Joshua Lederberg, Leroy Hood, Charles Cantor, Michael Bulmer, and other biological scientists. Each workshop has had two principal organizers, a mathematical scientist and a biological scientist. Each workshop committee has had additional biologists.

The biological/biotech participation in our algorithm implementation challenge was noteworthy. The challenge, fourth in the line of such challenges, was devoted to "shotgun sequencing" and "genome rearrangements." Participants were challenged to test their algorithms against both random and real data sets. A workshop describing the challenge participants' results was held in September 1995. Ellson Chen, a "cutting edge biological sequencer," defined real data sets and provided intermediate biological feedback. Participants are highly motivated to continue coordinated work on testing their algorithms on common benchmark data. The addition of further, large, and difficult data sets was agreed upon. A major result was that scientists from industry not only participated actively, but helped with the benchmark data sets by giving away their raw data at a very early stage -- something rather unusual. We received data from Applied Biosystems, a Division of Perkin-Elmer and Genome Therapeutics and anticipate more from The Institute for Genomic Research. These are among the leading genome sequencing companies. Since participants from both academia and industry tested their programs in the same way, there was a lively discussion. It has become apparent that current experimental techniques lead to a need for extremely powerful software and this is now strongly influencing ongoing research.

In addition to visitors from the mathematical scientists, many biological scientists have been special year visitors. Prominent biological visitors include Leroy Hood, Herbert Hauptman, Charles Cantor, Temple Smith, Gary Stormo, and Maynard Olson. We have also gotten inquiries and requests to visit from less prominent biologists around the world.

Many biological scientists have been speakers in our weekly seminar series. In the Fall of 1995 alone, we have scheduled Laura Landweber, Princeton, Ecology and Evolution - on molecular evolution; Fred Hughson, Princeton Chemistry - on protein structure; Doug Deutschman, Cornell Ecology - on max likelihood models of forest ecology; and Lee Silver, Princeton Molecular Biology - on statistical significance in the genetics of addition.

This remarkably successful outreach effort underlines our achievement of a major STC goal: Demonstrating the value of the center mode of funding over individual research grants: This kind of interdisciplinary program could not be accomplished through individual research grants.

The special year has had a number of specific goals. These are to expose a large community of discrete mathematicians /theoretical computer scientists to problems of molecular biology that are fundamentally DM/CS problems; provide an opportunity for intensive collaboration to discrete mathematicians/theoretical computer scientists who have already discovered these problems; introduce some out- standing young people to the field in a very concentrated way; forge lasting partnerships between mathematical scientists and biological scientists; involve biological scientists in a primary role in defining the agenda of computational biology; and introduce molecular biologists to a community of mathematical scientists who are interested in helping them with their problems. We note how these goals relate to STC goals: They involve "a unifying theme that requires a center;" achieving them will lead to "nationally and internationally recognized research" and should have a lasting effect "on the infrastructure of science."

We have had many significant research accomplishments during the special year. Perhaps most important are the results of or beginnings of significant interdisciplinary collaborations. We give some examples of these.

Lee Silver, a Princeton biologist, and Dannie Durand, a DIMACS member from Bellcore, have been working on the t-haplotype regions in mice. Silver and Durand met through the DIMACS tutorial. They began modeling work on the effect of environmental and hereditary factors on the "t-haplotype" region on the chromosome of mice; this region has unusual sperm killing properties. They initiated modeling of evolution of t-haplotypes. This requires self-modifying code, which raises interesting computational questions related, for example to compiler theory.

Lynn Enquist, a Princeton biologist, Dannie Durand, and DIMACS postdocs Vineet Bafna and R. Ravi, collaborated on the prediction of functional domains on a herpesvirus gene. Enquist sought help with a sequence comparison problem involving the herpesvirus gene. He hoped to find similar sequences in other species and that that would help him identify the function of the virus proteins, but searches through the GenBank data base were not very revealing. The interaction led to negative results: Bafna, Durand, and Ravi did not find new comparison algorithms for better searches or discover that the methods used in the searches were faulty. However, they learned a lot about interaction with biologists. Moreover, the negative results have pointed us and the biologists in the direction of seeking similarities in the 3-dimensional structures of the proteins rather than in their amino acid sequences. It has pointed us in the direction of new research questions about protein structure. As a side note: This led to Ravi supervising an undergraduate senior thesis at Princeton.

Fredda Ginsberg-Fellner, a physician at Mt. Sinai in New York City, Larry Shepp, DIMACS member from AT&T Bell Labs, Martin Farach, (DIMACS member from Rutgers CS, and Mick Noordewier, DIMACS member from Rutgers CS and Waksman), have worked on diabetes and auto-immune diseases. Shepp had the idea that the existence of auto-immune diseases itself might provide the long-sought window into understanding how DNA and function are related. It is now possible to perform an experiment to test the hypothesis that there is a similarity between the triggering virus and some part of the DNA in those individuals susceptible to these diseases. It is hoped that by sequencing the DNA of the viruses and the presumed relevant parts of the human genome, a match can be found at the DNA level rather than the protein level. Shepp got Ginsberg, Farach, and Noordewier interested. Ginsberg presented a lecture on this topic at DIMACS . She will be applying for a grant from the Juvenile Diabetes Foundation to support work on acquiring the sequence data and Shepp, Farach, and Noordewier will search for the matches and design further experiments.

David Swofford, a Smithsonian biologist, and Martin Farach have collaborated on phylogenetic tree reconstruction. Farach had developed a theoretical algorithm for tree reconstruction. Farach and Swofford met at our workshop on Phylogeny. Swofford returned home and coded up the algorithm for practical implementation ("a theoretician's dream").

Peter Smouse, Rutgers biologist, Richa Agarwala, DIMACS postdoc, and Martin Farach have collaborated on phylogenetic trees and minimum spanning tree problems. When analyzing data, Smouse came up with the problem: Given a distance matrix M, find an edge-weighted spanning tree in which the distance D(i,j) in the tree between i and j is such that

sum[D(i,j) - M(i,j)2]
is minimized. We have shown that the problem is NP-complete and are working on coming up with good lower bounds for it.

The year has led to many other research results, not so directly related to collaborations between mathematical scientists and biological scientists, but motivated by problems of the biological sciences. We mention a few of these. DIMACS postdocs V. Bafna, S. Muthukrishnan, and R. Ravi have found efficient algorithms for exact and approximate matching of RNA strings. DIMACS postdocs B. Naruyanan and R. Ravi, motivated by fragment assembly problems, have studied the problem of finding a large number of axis parallel rectangles whose projections on both axes are disjoint. (The rectangles correspond to regions of high local similarity.) They have found tight upper and lower bounds on performance for a class of heuristics based on local search. DIMACS visitor J. Kececioglu and postdoc R. Ravi have studied the problem of aligning sequences related via a given evolutionary tree and provided a good approximation algorithm for this problem when the tree is regular. Kececioglu and Ravi have studied the basic problem of determining the minimum number of events that transfer one genome into another. Specifically, they made the first algorithmic study of genome rearrangement by translocation, exchange of material at the end of two chromosomes within a genome, and found an approximation algorithm for determining the translocation distance between two sets of strings. DIMACS postdoc R. Agarwala and visitor D. Fernandez-Baca have developed fast algorithms for inferring evolutionary trees. Agarwala and Fernandez-Baca have also developed fast and simple algorithms for determining if a set of species has a "perfect phylogeny," which leads to a good algorithm for triangulating colored graphs. DIMACS visitors Ming Li, Ron Shamir, and Jiang Tao have found a polynomial algorithm for the problem of "shortest common supersequence of sets," arising in mapping with nonunique probes. DIMACS visitors Pavol Hell and Ron Shamir have developed a dynamic algorithm for unit interval graphs, motivated by a problem in physical mapping with cosmid clones when data is changing according to experiments.

Some interesting outcomes of the special year have been in undergraduate and high school education, in course and seminar contacts between mathematical and biological scientists, and in impacts on careers of established scientists. DIMACS postdoc R. Ravi has supervised the senior thesis of Princeton student Jennifer Perone. The thesis deals with a profile-building method that they are developing to employ in homology search, and was developed as a result of the interaction with Princeton biologist Lynn Enquist. DIMACS member Dannie Durand gave a talk to high school teachers in our Leadership program, on the subject of evolutionary tree algorithms. DIMACS postdoc R. Ravi gave a 5-lecture session in the Leadership program in the summer of 1994. He did a follow-up lecture in December 1994. Ravi delivered a guest lecture at a high school in High Valley, NJ, in February 1995. DIMACS graduate student Reuben Settergren, who is doing research on computational biology under the direction of DIMACS member Farid Alizadeh, has gotten involved in editing a book on "discrete math in the schools," together with DIMACS members Debbie Franzblau, Joe Rosenstein, and Fred Roberts. Several DIMACS visitors, including Dan Gusfield, sat in on David Axelrod's course in the biology of cancer at Waksman and have interacted with Axelrod in his research. DIMACS visitors, postdocs, and permanent members met Jody Hey of Waksman through his seminar in our seminar series, and took or audited his course on molecular evolution. This has given them an opportunity to work with real data and find out what concerns biologists when working with evolutionary trees. Some senior mathematical scientists have gotten heavily involved in computational biology, partly as a result of their exposure to it during the special year. At least one permanent DIMACS member has made a commitment to a major change in career as a result of our special year.

Appendices

Appendix I:

Special Year in Mathematical Support for Molecular Biology
1994-96 Calendar

Locations of Events:

Events are to be held at CoRE Building, Busch Campus, Rutgers University, except as noted.

Special Year Seminar Series:

Starting in September, 1994, there will be a special year seminar every Monday at 3 P.M., except when there is a workshop, miniworkshop, or Distinguished Lecture Series Lecture scheduled. Starting in September, 1995, the seminar will be on Tuesday at 1 P.M. Most seminars will be at DIMACS-Rutgers, but some will be at the Princeton University Computer Science Building or at the Waksman Institute for Molecular Genetics at Rutgers, and others will be at the Genome Center at University of Pennsylvania.

Algorithm Implementation Challenge:

Ongoing program through September 1995; participants are challenged to develop algorithms dealing with fragment assembly and genome rearrangement.

August 1994:

Tutorials in Molecular Biology and Computational Molecular Biology
Week of Aug. 15: Introduction to Molecular Biology for Non-Experts (lectures by William Sofer)
Weeks of Aug. 22 and 29: Current Computational and Mathematical Approaches to Biological Problems and Current Topics in Biocomputing (Lectures by Martin Farach and Michiel Noordewier)
Week of Sept. 5: Biological Maps: A Guided Tour (Lectures by David Axelrod)

Sept. 1994:

16: Kickoff of Distinguished Lecture Series: Michael Waterman
22, 23: Distinguished Lecture Series: Leroy Hood (Sept. 22 at Waksman)
28: Distinguished Lecture Series: Daniel Gusfield

Oct. 1994:

6-9: Workshop on Mapping and Sequencing
20, 21: Distinguished Lecture Series: Samuel Karlin (Oct. 20 at Waksman)

November 1994:

3: Distinguished Lecture Series: Temple Smith
4: MiniWorkshop on Combinatorial Structures in Molecular Biology
10-12: Workshop on Sequence Alignment
(To be held at Computer Science Building, Princeton U.)

December 1994:

9: MiniWorkshop on DNA Topology and Structure

January 1995:

20: Distinguished Lecture Series: Joachim Messing (at Waksman)

February 1995:

3: Distinguished Lecture Series: Richard Karp
6-8: Workshop on Phylogeny
(To be held at 83 Prospect Ave. - Stevenson Hall - Princeton U.)

March 1995:

20-24: Two MiniWorkshops on protein structure.
20-21: Global Minimization of Nonconvex Energy Functions including a Distinguished Lecture Series Lecture by Herbert Hauptman
24: Sequence-Based Methods for Protein Folding

April 1995:

4: MiniWorkshop on DNA Based Computing (at Princeton University)
17, 18: Distinguished Lecture series (two lectures) by Charles Cantor (April 17 at Waksman)

May 1995:

3-5: International Conference on HIV Sequence Variation and Statistical Methods

August 1995:

2-3: MiniWorkshop on Geometrical Methods for Conformational Modeling

September 1995:

13-15: Algorithm Implementation Challenge Workshop (fragment assembly and genome rearrangement)
21-22: MiniWorkshop on Mathematics of Drug Discovery

October 1995:

13-15: MiniWorkshop on Gene-Finding and Gene Structure Prediction
(To be held at the Genome Center at Univ. of Pennsylvania)

March 1996:

4-6: Second Sandia National Laboratories Workshop on Computational Molecular Biology, organized in collaboration with DIMACS
(To be held in Albuquerque, N.M.)

May 1996:

17-19: Workshop on Computational Biology, as part of the 50th Anniversary of ENIAC, organized in collaboration with DIMACS (To be held at University of Pennsylvania)

June 1996:

10-12: Second Conference on DNA and Computing, Princeton
Dates to be set: MiniWorkshop on Mathematical Hierarchy and Biology

Organizers: Joachim Messing and Fred Roberts (chairs), Lawrence Shepp and Michael Waterman (co-chairs), Martin Farach (Assistant Chair)

Further information: special@dimacs.rutgers.edu, (908)455-5928


Appendix II:

DIMACS Special Year in Mathematical Support for Molecular Biology
List of Visitors, 1994-96


Appendix III:

DIMACS Special Year in Mathematical Support for Molecular Biology
Steering Committee