Methods for assembly and error identification in mapping and sequencing

Phil Green

Department of Molecular Biotechnology, University of Washington, FJ-20, Seattle, WA 98195

A major computational challenge in current large-scale DNA mapping and sequencing projects is to develop efficient methods for dealing with noisy data, where the "noise" arises from both technical (e.g. chimeric clones, laboratory error) and biological (e.g. genomic repeats) sources. Assembly algorithms must be robust to noise, and - perhaps even more important - errors or anomalies in the data must be identified as precisely as possible, so that they can be efficiently reviewed, corrected and/or used to guide additional data collection. I will describe our strategies for constructing STS content maps of YAC contigs (implemented in the program SEGMAP) and for DNA sequence assembly (implemented in PHRAP). SEGMAP uses a flexible scoring system (with the ability to incorporate several types of ancillary information, including information from cytogenetic and genetic mapping) as the basis for comparing site orders, and searches a large set of locally permuted orders in attempting to maximize the score. It uses a greedy approach to find "minimal inconsistent sets", which must contain data errors. It has been in use for several years in several Genome Centers doing large scale YAC mapping. PHRAP uses a greedy assembly algorithm (somewhat similar to that in X. Huang's CAP) with several refinements to help avoid misassembly due to repeats or chimeras and to systematically identify sites where the assembly is ambiguous (these can be given a simple theoretical characterization) and additional data collection is therefore required. It makes use of the full reads and does not require prescreening to remove less accurate regions or repeats. Data quality information (generated by our base-calling program PHRED) is used in generating the "consensus" as a mosaic of the highest quality parts of the highest quality reads; our approach to this formulates the problem in terms of weighted directed graphs, permitting the use of efficient classical algorithms due to Tarjan. Following assembly, PHRAP performs a number of accuracy checks, flagging repeated (or possibly misassembled) regions, candidate chimeric reads, and forward/reverse read pairs with inconsistent orientation or spacing. In tests with real data PHRAP correctly assembled several C. elegans and human cosmids with complex repeats. It is rapid enough (< 1 minute to assemble an 800-read cosmid on a workstation) that repeated, interactive reassembly is feasible.

DIMACS Homepage
Contacting the Center
Document last modified on March 28, 2000.