Genome Structure Group
Center for Mechanistic Biology and Biotechnology
Argonne National Laboratory
Argonne, Illinois 60439-4833
A technology for performing massive hybridization experiments has been developed as part of the sequencing by hybridization project. Arrays of tens of thousands of clones are interrogated with short oligomer probes in order to reconstruct a DNA sequence. While the approach could in principle work on random DNA sequences, several outstanding problems limit its practical applicability: several thousand probes need to be hybridized to a clone in order to determine its sequence exactly; repeated occurrences of short oligomers within a clone may preclude sequence determination; and errors in the interpretation of hybridization data may cause false positive scores and thus impede sequence reconstruction. One of the main uses of a reconstructed DNA sequence is in a similarity search against databases of known DNA. We argue that sequence reconstruction is not only unnecessary, but even harmful, for this particular purpose. We show that oligomer lists obtained from hybridization experiments should be used directly for similarity searches. We consider a similarity search method that takes full advantage of the subword structure of positively identified oligomers within a clone while avoiding the main problems inherent in sequencing by hybridization. To enable direct sequence recognition, we apply a recently developed method of sequence comparison that is based on minimal length encoding and algorithmic mutual information. In essence, we ask how many bits of information about a particular sequence in the database are revealed by hybridization experiments; the algorithmic significance method tells us that d bits of information can be revealed by chance with a probability of at most 2^{-d}. The approach has been extensively tested on real data and has led to correct identification of clones based on hybridizations with 110 short oligomer probes.
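As a rough illustration of the significance bound quoted above (d bits of information revealed by chance with probability at most 2 to the power -d), the following minimal Python sketch converts between bits and the probability bound. The function names `significance_bound` and `bits_needed` are hypothetical, chosen only for this example; the sketch covers the arithmetic of the bound alone and does not implement the minimal length encoding that produces d.

```python
import math


def significance_bound(d_bits: float) -> float:
    """Upper bound on the chance probability of revealing d_bits of information.

    Implements only the bound 2**(-d) stated in the abstract; how d is
    obtained from hybridization data is outside this sketch.
    """
    if d_bits < 0:
        raise ValueError("d_bits must be non-negative")
    return 2.0 ** (-d_bits)


def bits_needed(alpha: float) -> float:
    """Smallest d for which the bound 2**(-d) does not exceed alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return -math.log2(alpha)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    for d in (5, 10, 20):
        print(f"d = {d:2d} bits -> chance probability <= {significance_bound(d):.3g}")
    print(f"bits needed for alpha = 0.001: {bits_needed(0.001):.2f}")
```

Under this bound, each additional bit halves the probability that a match arose by chance, so a threshold on d translates directly into a significance level.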