An improved sequence assembly program

Xiaoqiu Huang
Department of Computer Science
Michigan Technological University
Houghton, MI 49931

We have made a number of improvements to a sequence assembly
program named CAP (Huang, 1992). These improvements are:
(1) identification of repetitive fragments and
    resolution of ambiguities in assembly of those fragments,
(2) automated refinement of poorly aligned regions of fragment alignments,
(3) identification of chimeric fragments,
(4) generation of fragment-specific error vectors and
    use of the vectors in evaluation of overlap strength, and
(5) design of a more efficient algorithm for filtering fragments.

The improved assembly method consists of three phases. In phase 1,
each pair of fragments is first screened by a filter, and for each pair
that passes, the overlap between the two fragments is computed by a dynamic
programming algorithm. The error vector of each fragment is calculated
using the overlaps involving the fragment. Chimeric fragments are identified
using error vectors and overlaps. The overlaps that are not strong
relative to the error vectors of the corresponding fragments are removed.
In phase 2, an initial assembly of fragments in proper orientation
is produced by a greedy strategy. The overlaps that are inconsistent
with the initial assembly are identified. These inconsistencies are used
to find the fragments that are from copies of a repetitive sequence.
For each repetitive sequence, an alignment of the fragments from the copies
of the sequence is constructed. By making use of the differences in
the alignment, the fragments are partitioned into groups such that
the fragments in a group are from the same copy of the repetitive sequence.
The initial assembly is adjusted accordingly. In phase 3, an alignment
of fragments in each contig is constructed. Then the alignment of
each contig is refined. The consensus sequence of each contig is produced.
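
As a rough illustration of the overlap computation in phase 1 and the
greedy step in phase 2, the following Python sketch scores suffix-prefix
overlaps with a simple dynamic programming recurrence and then accepts
overlaps in decreasing order of score. It is not the CAP2 implementation:
the function names, the scoring values (match, mismatch, gap), and the
score threshold are assumptions chosen for the example, and the filter,
error vectors, chimera detection, fragment orientation, and repeat
resolution described above are all omitted.

```python
from itertools import permutations


def overlap_score(a, b, match=2, mismatch=-5, gap=-6):
    """Best score of aligning some suffix of `a` with some prefix of `b`.

    The scoring values are illustrative assumptions, not CAP2's parameters.
    """
    # prev[j] holds the best score for a suffix of a[:i] against b[:j].
    # With i = 0 the suffix is empty, so b[:j] is aligned to j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0 stays at 0: the overlap may start anywhere in `a`.
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # a[i-1] against a gap
                         cur[j - 1] + gap)  # b[j-1] against a gap
        prev = cur
    # The overlap must reach the end of `a` but may stop anywhere in `b`.
    return max(prev)


def greedy_layout(fragments, min_score=10):
    """Chain fragments by accepting the strongest overlaps first.

    Returns a list of contigs, each a list of fragment indices in order.
    `min_score` is a hypothetical threshold standing in for the strength
    test that the text bases on error vectors.
    """
    n = len(fragments)
    candidates = []
    for i, j in permutations(range(n), 2):
        score = overlap_score(fragments[i], fragments[j])
        if score >= min_score:
            candidates.append((score, i, j))
    candidates.sort(key=lambda c: -c[0])  # strongest overlap first

    succ, pred = {}, {}
    root = list(range(n))  # union-find parents, used to reject cycles

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for score, i, j in candidates:
        # Skip overlaps that conflict with the layout built so far.
        if i in succ or j in pred or find(i) == find(j):
            continue
        succ[i], pred[j] = j, i
        root[find(i)] = find(j)

    contigs = []
    for start in range(n):
        if start in pred:
            continue  # not the leftmost fragment of its chain
        chain = [start]
        while chain[-1] in succ:
            chain.append(succ[chain[-1]])
        contigs.append(chain)
    return contigs
```

In this sketch, the overlaps skipped inside the greedy loop play the role
of the overlaps that are inconsistent with the initial assembly; a fuller
version would keep them for the repeat analysis instead of discarding them.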

The improved program (CAP2) produces satisfactory results on
the data set of Seto et al. (1993), the data sets provided
by Hershel Safer (WWW: http://www.cric.com/), and the data sets
generated by the GenFrag package (Engle and Burks, 1993) from
the human beta-like globin sequence (GenBank Locus: HUMHBB).

This project is supported in part by the Applied Biosystems Division of
Perkin-Elmer. The author thanks Timothy Burcham for help and discussion.

References:

Engle, M.L. and Burks, C. (1993).
Artificially Generated Data Sets for Testing DNA Fragment Assembly Algorithms.
Genomics 16, 286-288.

Huang, X. (1992).
A contig assembly program based on sensitive detection of fragment overlaps.
Genomics 14, 18-25.

Seto, D., Koop, B.F. and Hood, L. (1993).
An experimentally derived data set constructed
for testing large-scale DNA sequence assembly algorithms.
Genomics 15, 673-676.