Department of Computer Science, University of California, Davis, CA 95616
The trend towards large DNA sequencing projects, such as those being undertaken as part of the Human Genome program, necessitates the development of robust and efficient algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. We present a three-phase approximation algorithm based on a decomposition into three subproblems: computing overlaps between pairs of fragments, selecting overlaps to form a shortest layout of the fragments, and determining a multiple sequence alignment from the layout. The approach can accomodate high sequencing error rates and list alternate solutions in the event that several appear equally good. Experiments on simulated data indicate that non-repetitive sequences of length 50,000 can be successfully reconstructed even when sampled at error rates of as high as 10 percent.
This is joint work with Eugene Myers.