Integral Genetics Group, Center for Mechanistic Biology and Biotechnology, Argonne National Laboratory, Argonne, IL 60439
Sequencing by hybridization (SBH) requires sophisticated computational procedures for data acquisition and evaluation, and also for DNA screening, mapping and sequencing applications. We have been developing algorithms and programs and performing simulations to establish what SBH can do beyond the sequencing of short DNA fragments. For example, a 10- to 50-fold increase in SBH efficiency has been achieved by using overlapped and similar sequences in the assembly process. Furthermore, partial sequences obtained with 100 to 1000 probes are sufficient for gene identification and for recognition of overlapped and similar sequences.
Recently we have started to produce large sets of hybridization data using our facilities for scoring filters containing 31,000 DNA dots with 24 oligomer probes per day. Several programs have been developed to make use of these data: SCORES for data evaluation and normalization; CLUSTERS for the identification of groups of similar clones; and CORD for the ordering of shotgun clones. These programs are written in C for the UNIX platform with an X-Windows interface. They are based on heuristic rules and resemble expert systems that are resistant to the common experimental imprecisions. All the programs use hybridization intensities directly, without conversion to a 0/1 form.
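As an illustration of grouping clones by their real-valued hybridization intensities, the following Python fragment is a minimal sketch only. It is not the SCORES or CLUSTERS code, which is written in C and based on heuristic rules not described here. The function names, the per-probe median normalization, the Pearson correlation measure, and the 0.9 threshold are all assumptions chosen for the example.

```python
from math import sqrt
from statistics import mean, median


def normalize(intensities):
    """Divide each probe column by its median across all clones (assumed scheme)."""
    n_probes = len(intensities[0])
    medians = [median(row[j] for row in intensities) for j in range(n_probes)]
    return [[row[j] / medians[j] for j in range(n_probes)] for row in intensities]


def pearson(a, b):
    """Pearson correlation of two intensity profiles; 0.0 if either is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def cluster(profiles, threshold=0.9):
    """Single-linkage grouping: clones whose profiles correlate at or above
    the (hypothetical) threshold end up in the same group."""
    parent = list(range(len(profiles)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if pearson(profiles[i], profiles[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(profiles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster(normalize(raw))` on a list of per-clone intensity rows returns lists of clone indices. The intensities are compared as real numbers throughout, with no thresholding to a 0/1 form. The all-pairs comparison is quadratic in the number of clones, so a data set of the size described here would need a faster scheme.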
By using SCORES and CLUSTERS, 25,000 cDNA clones have been sorted into 13,000 groups, which have been confirmed by partial sequencing. The goal is to screen one million cDNA clones from 10 tissues. The CORD program defines contigs of 1- to 2-kb clones hybridized with 200 probes and provides a sequence-ready map. In various simulation experiments, CORD has tolerated more hybridization errors than we observe in our experiments, as well as the abundance of Alu repeats found in many human sequences. Restriction mapping has confirmed the predicted clone overlaps defined for 860 M13 subclones of a human cosmid hybridized with 250 heptamers and octamers. The maps allow selection of a nonredundant set of clones for inexpensive complete sequencing by matching single-pass gel data and hybridization data of 3000 probes. The projected cost is 10 cents per base pair and can be further reduced by comparative sequencing of similar genomes.
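The idea of ordering overlapping clones from their hybridization profiles can be sketched as follows. This is a hypothetical greedy heuristic written for illustration, not the CORD algorithm. The cosine similarity measure, the function names, and the seed-and-extend strategy are assumptions, and the sketch handles only a single contig, with no treatment of repeats or hybridization errors.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two real-valued intensity profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def order_clones(profiles):
    """Greedy ordering: seed with the most similar pair of clones, then
    repeatedly attach the unplaced clone most similar to either chain end."""
    n = len(profiles)
    if n < 2:
        return list(range(n))

    sim = [[cosine(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]

    # Seed the chain with the pair of clones sharing the most signal.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i0, j0 = max(pairs, key=lambda p: sim[p[0]][p[1]])
    chain = [i0, j0]
    unplaced = set(range(n)) - {i0, j0}

    while unplaced:
        left, right = chain[0], chain[-1]
        best = max(unplaced, key=lambda k: max(sim[left][k], sim[right][k]))
        # Attach to whichever end the chosen clone resembles more.
        if sim[left][best] > sim[right][best]:
            chain.insert(0, best)
        else:
            chain.append(best)
        unplaced.remove(best)
    return chain
```

The returned list gives clone indices in their inferred linear order. The orientation is arbitrary, so the reverse of the list is an equally valid answer.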
This work was supported by the U.S. Department of Energy, Office of Health and Environmental Research, under Contract No. W-31-109-ENG-38.