University of Toronto, Department of Computer Science, Toronto, Canada M5S 1AI

A common step in the analysis of a genome is the generation of a
*physical map*, i.e., an ordering of overlapping fragments of the genome
and a specification of the binding sites of certain probes. An
important data structure for this problem is a graph in which each
node represents a DNA fragment. Edges represent experimental or
derived relationships between the fragments, such as the overlap of
two fragments. For data derived from ideal experiments, *interval
graphs* are particularly important, since they capture the linear
structure of a chromosome. Many problems that are NP-complete for
general graphs, have efficient algorithms for interval
graphs. Unfortunately, interval graphs rarely arise in practice
because many non-linearities are introduced by experimental error and
noise in the data. Nevertheless, the graphs are not arbitrary, and
they are derived from a linear structure (a chromosome). The challenge
is to understand the sources of non-linearity and to extract a linear
genome map.

To address this problem, we are developing a system that attacks map assembly from three angles: logic programming, data visualization and graph theory. A key idea in our approach is that the layout and display of graphs can ameliorate many of the problems caused by problematic data. Our implementation platform is Hy+, a data visualization system developed at the University of Toronto. Hy+ integrates a logic- programming backend with numerous graph layout algorithms. We have tested the system on real and simulated mapping data provided by the Whitehead Institute/MIT Center for Genome Research. Because the initial results are encouraging, Whitehead has installed the system in their laboratory for further evaluation.

The data we are using contains two main sources of experimental error:
chimerism and false negatives. Informally, chimerism leads to graphs
that are a collection of "linear" subgraphs tied together by "weak"
edges. It is not difficult to identify the weak edges and to break up
the graph into many connected components, each having a linear
structure and representing a contiguous portion of DNA (a
contig). Unfortunately, because of the high rate of false negatives in
the data (up to 40%), these components are not interval graphs. False
negatives also lead to a high degree of ambiguity, so that many
different maps may be consistent with the experimental data, even for
a single contig. Situations easily arise in which *n* data points have
2^{n} possible maps.

These problems lead to several computational challenges: (i) to succinctly represent the large number of maps that are consistent with the data; (ii) to efficienty answer queries about these maps; and (iii) to develop algorithms and heuristics for exploiting the underlying linear structure of the graphs, especially algorithms for display and layout. Robust algorithms for interval graphs should be useful since although our graphs are not interval graphs, they are "approximately" interval and "locally" interval. Formalizing these notions, and the notion of "linear structure", is an important open problem.

Program

DIMACS Homepage

Contacting the Center

Document last modified on March 28, 2000.