Department of Computer Science
University of Arizona, Tucson, AZ 85721
There is a long precedent of formulating the problem of assembling a collection of shotgunned DNA fragments as shortest common superstring problem over "noisy" data. While this formulation is adequate when the original subject sequence is relatively random, it consistently produces over-compressed solutions to problems involving repetitive DNA. Given that we are finding the DNA of higher eukaryotes to be quite repetitive (e.g. stretches of human DNA have been found that are 30% Alu), it is clear that the need for a better formulation of the problem is necessary. We present such a formulation based on the one-sided Kolmogorov-Smirnov statistic and sketch preliminary results on how we lever this to deliver a most-likely reconstruction as opposed to a most-parsimonious one.