An improved sequence assembly program

Xiaoqiu Huang
Department of Computer Science
Michigan Technological University
Houghton, MI 49931

We have made a number of improvements to a sequence assembly program named CAP (Huang, 1992). These improvements are: (1) identification of repetitive fragments and resolution of ambiguities in assembly of those fragments, (2) automated refinement of poorly aligned regions of fragment alignments, (3) identification of chimeric fragments, (4) generation of fragment-specific error vectors and use of the vectors in evaluation of overlap strength, and (5) design of a more efficient algorithm for filtering fragments.

The improved assembly method consists of three phases.

In phase 1, for each pair of fragments, if the pair passes through a filter, then the overlap between the two fragments is computed by a dynamic programming algorithm. The error vector of each fragment is calculated using the overlaps involving the fragment. Chimeric fragments are identified using error vectors and overlaps. The overlaps that are not strong relative to the error vectors of the corresponding fragments are removed.

In phase 2, an initial assembly of fragments in proper orientation is produced by a greedy strategy. The overlaps that are inconsistent with the initial assembly are identified. These inconsistencies are used to find the fragments that are from copies of a repetitive sequence. For each repetitive sequence, an alignment of the fragments from the copies of the sequence is constructed. By making use of the differences in the alignment, the fragments are partitioned into groups such that the fragments in a group are from the same copy of the repetitive sequence. The initial assembly is adjusted accordingly.

In phase 3, an alignment of fragments in each contig is constructed. Then the alignment of each contig is refined. The consensus sequence of each contig is produced.
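As an illustration of the phase 1 overlap computation, the following is a minimal sketch of a suffix-prefix overlap alignment by dynamic programming. It is not the CAP2 algorithm: the abstract does not give the recurrence or the scoring scheme, so the function name `overlap_score` and the match, mismatch, and gap values are assumptions chosen for illustration, and the error vectors and filter are not modeled.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score the best alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, j), where j is the length of the prefix of `b`
    that takes part in the overlap. Scoring values are illustrative
    assumptions, not the parameters used by CAP2.
    """
    n, m = len(a), len(b)
    # Row 0: an empty suffix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # cur[0] = 0: the overlap may start at any position in `a` for free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] against a gap
                cur[j - 1] + gap,    # b[j-1] against a gap
            )
        prev = cur
    # The overlap must reach the end of `a` but may stop anywhere in `b`.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j
```

For example, `overlap_score("ACGTTGCA", "TGCAGGTC")` returns `(4, 4)`: the last four bases of the first fragment match the first four bases of the second. A pair with no useful overlap, such as `"AAAA"` and `"CCCC"`, scores `(0, 0)`. In a full assembler a score like this would be compared against a threshold before the overlap is kept.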
The improved program (CAP2) produces satisfactory results on the data set of Seto et al. (1993), the data sets provided by Hershel Safer (WWW: http://www.cric.com/), and the data sets generated by the GenFrag package (Engle and Burks, 1993) from the human beta-like globin sequence (GenBank Locus: HUMHBB).

This project is supported in part by Applied Biosystems Division of Perkin-Elmer. The author thanks Timothy Burcham for help and discussion.

References:

Engle, M.L. and Burks, C. (1993). Artificially generated data sets for testing DNA fragment assembly algorithms. Genomics 16, 286-288.

Huang, X. (1992). A contig assembly program based on sensitive detection of fragment overlaps. Genomics 14, 18-25.

Seto, D., Koop, B.F. and Hood, L. (1993). An experimentally derived data set constructed for testing large-scale DNA sequence assembly algorithms. Genomics 15, 673-676.