Abstract:

Whole Genome Assemblies of the Drosophila and Human Genomes
Gene Myers, Celera Genomics

We report on the design of a whole genome shotgun assembler and its application to the sequencing of the Drosophila and human genomes. Celera's whole genome strategy consists of randomly sampling pairs of sequence reads, each 500-600bp long, that lie at approximately known distances from each other: short pairs at a distance of 2Kbp, long pairs at 10Kbp, and ultra-long pairs at 50-150Kbp. Reads are collected in a 1-to-1 ratio of short to long pairs, and it is desirable to collect enough ultra-long pairs to give 20-30X clone coverage. The experimental accuracy of the read sequences is roughly 99.5%, with all but 1 in 10,000 reads being better than 98% accurate. Given such a data set, the computational problem is to infer the sequence of the euchromatic portion of the genome.

For Drosophila, we collected 1.6 million pairs, for which the sum of the read lengths is roughly 13 times the length of the genome (~120Mbp), a so-called 13X shotgun data set. For the human genome, 12.5 million pairs forming a 4.5X data set were generated at Celera, and an additional 2X of faux reads was then added by shredding the rough draft data obtainable at GenBank, for an aggregate 6.5X data set consisting of 37 million reads totalling 20Gbp of sequence.
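
The coverage figures above follow from simple arithmetic, which the short Python sketch below reproduces. The function names are illustrative only. The mean read length of 500bp and the ultra-long insert length of 100Kbp are assumptions picked from within the ranges quoted in the abstract, not values it reports.

```python
import math


def sequence_coverage(num_pairs: int, mean_read_len_bp: float,
                      genome_len_bp: float) -> float:
    """Total bases in all reads (two reads per pair) divided by genome length."""
    return 2 * num_pairs * mean_read_len_bp / genome_len_bp


def pairs_for_clone_coverage(target_coverage: float, insert_len_bp: float,
                             genome_len_bp: float) -> int:
    """Number of pairs whose inserts, laid end to end, span the genome
    target_coverage times over."""
    return math.ceil(target_coverage * genome_len_bp / insert_len_bp)


# Drosophila: 1.6 million pairs over a ~120Mbp genome.
# ASSUMPTION: mean read length of 500bp (the abstract gives only 500-600).
drosophila_x = sequence_coverage(1_600_000, 500, 120_000_000)
print(f"Drosophila sequence coverage: about {drosophila_x:.1f}X")

# Human: 37 million reads totalling 20Gbp implies this mean read length.
human_mean_read_bp = 20e9 / 37e6
print(f"Implied mean human read length: about {human_mean_read_bp:.0f}bp")

# Ultra-long pairs needed for 20X clone coverage of a 120Mbp genome.
# ASSUMPTION: 100Kbp inserts (the abstract gives only 50-150Kbp).
ultra_long_pairs = pairs_for_clone_coverage(20, 100_000, 120_000_000)
print(f"Ultra-long pairs for 20X clone coverage: {ultra_long_pairs}")
```

Under these assumptions the Drosophila figure comes out near the quoted 13X, and the implied human read length falls inside the stated 500-600bp range.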

By layering the ideas of uncontested interval graph collapsing, confirmed read pairs, and mutually confirming paths, one obtains an assembly algorithm that makes remarkably few errors. The assembler correctly identifies all unique stretches of a genome, correctly building contigs for each and ordering them into scaffolds that span each of the chromosomes. Thus all useful proteomic information is firmly assembled. For Drosophila, with a 13X data set, the results of assembly, without any of the finishing effort that ensues for all projects, meet the community standards for completion and accuracy of finished sequence set by Chromosome 22 and C. elegans. This assembler also completed a preliminary whole genome assembly of the 6.5X human data set in 3 weeks using 160 Alpha processors and 64GB of memory.
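
The abstract does not define "confirmed read pairs". As a rough illustration only, the Python sketch below assumes one plausible reading: a link between two contigs is trusted only when at least two independent read pairs support it. The function name, the input layout, and the threshold of two are all assumptions made here, and the sketch ignores read orientation and the expected pair distances, which a real assembler would also check.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def confirmed_links(mate_pairs: Iterable[Tuple[str, str]],
                    placement: Dict[str, str],
                    min_support: int = 2) -> Set[Tuple[str, ...]]:
    """Return contig-to-contig links backed by at least min_support read pairs.

    mate_pairs: (read_id, read_id) tuples, one per sampled pair.
    placement:  hypothetical map from read id to the contig holding that read.
    """
    support: Counter = Counter()
    for left, right in mate_pairs:
        a, b = placement.get(left), placement.get(right)
        # Skip pairs with an unplaced read or with both reads in one contig.
        if a is None or b is None or a == b:
            continue
        # Sort so that (c1, c2) and (c2, c1) count as the same link.
        support[tuple(sorted((a, b)))] += 1
    return {link for link, count in support.items() if count >= min_support}


# Toy data: two pairs join c1 and c2, but only one pair joins c2 and c3.
placement = {"r1": "c1", "r2": "c2", "r3": "c1",
             "r4": "c2", "r5": "c2", "r6": "c3"}
pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
links = confirmed_links(pairs, placement)
print(links)
```

In this toy input only the c1-c2 link survives, since the c2-c3 link rests on a single pair that could be a chimeric clone or a misplaced read.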

To cross-validate the results of our human whole genome assembly, we have also built a regional assembler that combines Celera's data with the BAC-localized contigs being produced by the Human Genome Project. This assembler orders the contigs, fills roughly two-thirds of the resulting gaps, and then, with the help of user curation, tiles the assembled BACs into meta-level assemblies that cover megabase-sized regions of the genome.

For more information on this event please see: http://dimacs.rutgers.edu/Workshops/CMBkickoff/announcement.html