For Drosophila, we collected 1.6 million read pairs whose summed read lengths are roughly 13 times the length of the genome (~120Mbp), a so-called 13X shotgun data set. For the human genome, 12.5 million pairs for a 4.5X data set were generated at Celera, and an additional 2X of faux reads was then added by shredding the rough draft data obtainable at GenBank, for an aggregate 6.5X data set consisting of 37 million reads totalling 20Gbp of sequence.
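The "NX" coverage figures above are simply total sequenced bases divided by genome length. As a minimal sketch of that arithmetic, the Python below back-computes the mean read length implied by the rounded Drosophila numbers. The function names are illustrative assumptions, and the derived read length is an estimate from rounded figures, not a value reported here.

```python
def coverage(num_reads: int, mean_read_len: float, genome_len: float) -> float:
    """Fold coverage: total bases sequenced divided by genome length."""
    return num_reads * mean_read_len / genome_len


def implied_mean_read_length(cov: float, genome_len: float, num_reads: int) -> float:
    """Mean read length needed to reach a given coverage with num_reads reads."""
    return cov * genome_len / num_reads


# Figures quoted in the text (all rounded).
drosophila_pairs = 1_600_000
drosophila_reads = 2 * drosophila_pairs   # each pair contributes two reads
drosophila_genome = 120e6                 # ~120 Mbp

# Assumption: coverage is exactly 13X, so the result is only approximate.
mean_len = implied_mean_read_length(13, drosophila_genome, drosophila_reads)
print(f"implied mean read length: {mean_len:.1f} bp")

# Round trip: the implied read length reproduces the stated coverage.
print(f"coverage: {coverage(drosophila_reads, mean_len, drosophila_genome):.1f}X")
```

The same relation applied to the human figures (20Gbp at 6.5X) implies a genome length of roughly 3Gbp, consistent with the aggregate numbers quoted.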
By layering the ideas of uncontested interval graph collapsing, confirmed read pairs, and mutually confirming paths, one obtains an assembly algorithm that makes remarkably few errors. The assembler correctly identifies all unique stretches of a genome, correctly building contigs for each and ordering them into scaffolds that span each of the chromosomes. Thus all useful proteomic information is firmly assembled. For Drosophila, with a 13X data set, the results of assembly, without any of the finishing effort that ensues for all projects, meet the community standards set by Chromosome 22 and C. elegans for completion and accuracy of finished sequence. This assembler also completed a preliminary whole genome assembly of the 6.5X human data set in 3 weeks using 160 Alpha processors and 64GB of memory.
To cross-validate the results of our human whole genome assembly, we have also built a regional assembler that combines Celera's data with the BAC-localized contigs being produced by the Human Genome Project. This assembler orders the contigs, fills roughly two-thirds of the resulting gaps, and then, with the help of user curation, tiles the assembled BACs into meta-level assemblies that cover megabase-sized regions of the genome.