1 Genome Therapeutics Corporation, 100 Beaver Street, Waltham, MA 02154, USA
2 MIT, Cambridge, MA 02139, USA
The full sequence of a clone can frequently be determined automatically from the sequences of shotgun fragments using currently available software. These programs, however, perform poorly on sequences with certain characteristics, especially repeats longer than the fragment read length.
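The difficulty with long repeats can be illustrated by a toy example. The sketch below is not taken from any assembly program; the sequences, lengths, and the helper `find_all` are hypothetical and chosen only to show that a read lying wholly inside a repeat matches every copy equally well, whereas a read that extends into unique flanking sequence has a single placement.

```python
from typing import List


def find_all(genome: str, read: str) -> List[int]:
    """Return every start position at which `read` occurs exactly in `genome`."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy clone: two identical copies of a 16-base repeat
# separated and flanked by unique sequence.
REPEAT = "ACGTTGCAAGGCTTAC"
GENOME = "TTGACC" + REPEAT + "GGATCCAT" + REPEAT + "CAGTTA"

# A 10-base read that falls entirely inside the repeat.
read_inside_repeat = REPEAT[2:12]

# A 12-base read that starts in unique flank and runs into the first copy.
read_spanning_edge = GENOME[2:14]

print(find_all(GENOME, read_inside_repeat))  # two equally good placements
print(find_all(GENOME, read_spanning_edge))  # one placement
```

When the repeat is longer than any read, many reads behave like `read_inside_repeat`, so overlap information alone cannot decide which copy a read belongs to.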
The large-scale sequencing project at CRI has worked with various cosmid clones containing repeated regions of 1 kb or longer, far longer than the read lengths generally achieved in current production projects. Our goal is to develop a program that can assemble these clones automatically and correctly. The software must remain reliable at the rates of problematic data observed in practice, such as chimeric subclone inserts and sequencing errors.
The sequencing protocol used at CRI reads fragments from both ends of each subclone insert. We have developed an algorithm for contig building that uses this double-ended sequence information to assemble correctly in the presence of long repeats. It can also order contigs in assemblies that cannot be joined into a single contig. The algorithm uses the overlap graph representation introduced by Gene Myers and is being implemented as an extension of his Fragment Assembly Kernel.
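As an illustration of how double-ended reads can help, the following minimal sketch filters the candidate placements of a read that falls inside a repeat by checking each one against the placement of its mate from the other end of the same insert. It is not the algorithm described here and does not use the overlap graph or the Fragment Assembly Kernel; the `Placement` type, the function names, the coordinates, and the insert-size bounds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Placement:
    """Hypothetical position of one read in a candidate layout."""
    start: int      # leftmost layout coordinate covered by the read
    length: int     # read length in bases
    forward: bool   # True if the read lies on the forward strand


def implied_insert_size(a: Placement, b: Placement) -> Optional[int]:
    """Insert length implied by two end reads, or None if they cannot
    come from opposite ends of one insert (same strand, or facing apart)."""
    if a.forward == b.forward:
        return None
    fwd, rev = (a, b) if a.forward else (b, a)
    size = (rev.start + rev.length) - fwd.start
    return size if size > 0 else None


def consistent_placements(candidates: List[Placement], mate: Placement,
                          min_insert: int, max_insert: int) -> List[Placement]:
    """Keep only the candidates whose implied insert size is in range."""
    keep = []
    for cand in candidates:
        size = implied_insert_size(cand, mate)
        if size is not None and min_insert <= size <= max_insert:
            keep.append(cand)
    return keep


# Hypothetical case: a 400-base read matches both copies of a repeat,
# while its mate is placed uniquely just downstream of the first copy.
candidates = [Placement(10_500, 400, True), Placement(30_500, 400, True)]
mate = Placement(12_100, 400, False)

# Assumed insert-size range of 1.5 to 3 kb; real bounds depend on the library.
resolved = consistent_placements(candidates, mate, 1_500, 3_000)
print(resolved)
```

Only the first candidate implies a plausible insert length, so the mate constraint selects one repeat copy where overlaps alone could not. The same kind of constraint, applied to mates that fall in different contigs, gives evidence about the relative order and orientation of those contigs.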
Early results are promising. We have, for example, completely assembled a 44 kb mycobacterial cosmid containing two copies of a 1.6 kb repeat. The five popular programs that we tested on this clone merged the two copies of the repeat and assembled the clone into three contigs. We are continuing to improve our algorithm and are integrating the program into the production environment.