Tools for Automated DNA Sequence Assembly

C. Burks1, M.L. Engle1, R.J. Parsons2,J.E. Stewart3

1 Theoretical Biology and Biophysics Group, Los Alamos National Laboratory
2 Department of Computer Science, University of Central Florida
3 Computer Science Department, University of New Mexico

We are focusing on the assembly of DNA sequence fragments in the context of genomic sequencing:

Stochastic Assembly Optimization Methods. Traditional tools for the assembly of sequence data in the context of random (shotgun) sequencing have relied primarily on greedy algorithms, and more recently approximations of exact algorithms. We are exploring alternative approaches based on stochastic search strategies, including relaxation [3], simulated annealing [1,3], and genetic algorithms [2,4]. Our focus has been, first, on developing a model for assembly optimization under these strategies and, second, on implementing them and comparing the effects of different search structures and schedules.

Artificial Data Sets. As we began developing algorithms and testing them, it became clear to us that the systematic and independent variation of input data parameters - extremely useful in the early stages of evaluating and correcting software implementations - was very difficult with the experimental data sets that are available. Thus, we developed software, GenFrag, that allows one to fragment known "parent" sequences, specifying coverage, mean fragment length, distribution of errors, and repeat complexity [5,6].

Characterization of Data Parameters in Assembly Optimization. One aspect of DNA sequencing technology currently being focussed on is increasing the length of individual sequence reads. We have chosen an empirical approach to characterize the effects of this methodological improvement: start with known DNA sequences in databases; use GenFrag to generate artificial fragment sets drawn from the parent; and accumulate statistics on the false positive rates in scoring pairwise overlaps among the fragments as read length and other parameters (e.g., repeat complexity) are varied.

Integration of Ancillary Assertions. DNA assembly relies primarily on the input fragment sequences, but often also relies on several streams of related but distinct kinds of data for completeness and accuracy of the final construction. These secondary data sets, which we term ancillary information, (i) usually contain errors (as do the primary data. sets, therefore creating the possibility of conflict between data sets), (ii) often arise from different experimental protocols and correspond to different scales of measurement, and (iii) occasionally include non- quantitative statements about the data. We are developing an approach for integration of ancillary assertions in the optimization of genome assembly, based on simultaneous balancing among the primary and secondary data sets [7], and have begun implementing software for testing this approach on several different sequencing strategies.

[1] Churchill, et al. (1993) Los Alamos National Laboratory, Technical Report LA-UR-93-2287.

[2] Parsons, et al. (1993) In "Proceedings: !st Int. Conf. on Intelligent Systems for Molecular Biology", Hunter, et al., Eds., AAAI Press, Menlo Park, CA, 310-318.

[3] Burks, et al. (1994) In "Automated DNA Sequencing and Analysis", Adams et al., Eds., Academic Press, publication.

[4] Parsons et al., (1994) Machine Learning (accepted)

[5] Engle and Burks (1993) Genomics, 16, 286-288.

[6] Engle and Burks (1994) CABIOS, in press.

[7] Burks, et al. (1994) In "Proceedings: Second International Conference on Intelligent Systems for Molecular Biology", Searls, et al., Eds., AAAI Press, Menlo Park, CA, in press.

DIMACS Homepage
Contacting the Center
Document last modified on March 28, 2000.