DIMACS Workshop on Next Generation Sequencing: Making the most of what you read

August 25, 2010
DIMACS Center, CoRE Building, Rutgers University

Organizers: Alexander Schliep, Rutgers University, schliep at cs.rutgers.edu

Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology.

Abstracts:

Title: Statistical mechanics and next-generation sequence assembly

De novo assembly of genomes from short reads generated from high throughput sequencing platforms remains a significant challenge. Given the small size of the contigs built from these reads, it is essential to use some additional information to stitch these contigs together into large scaffolds. One way is to utilize the mate pair technology, which provides pairs of short reads separated by an approximately known distance and orientation along the genome. The problem is that a part of the mate-pair information could be false or misleading. To deal with this problem, we have developed SOPRA, a scaffold building tool which manages to select a consistent and reliable subset of mate pair constraints.

Scaffold assembly is presented as a series of optimization problems for variables associated with the vertices and edges of the contig connectivity graph. The vertices of this graph are individual contigs, with edges drawn between contigs connected by mate pairs. These optimization problems are related to finding ground states of well-known Hamiltonians in statistical physics. We show that, for the typical structure of the graphs generated by real sequence data, these optimization problems can be solved quite satisfactorily by our method. Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 Kb) with very few errors introduced in the process.
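The N50 figure quoted above is a standard summary of assembly contiguity: the length of the scaffold at which the scaffolds, taken from longest to shortest, first account for at least half of the total assembled length. The following is a minimal sketch of that calculation, not part of SOPRA; the function name `n50` and the scaffold lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths in base pairs (not data from the talk).
scaffold_lengths = [200_000, 150_000, 80_000, 40_000, 30_000]
print(n50(scaffold_lengths))  # prints 150000
```

Comparing `2 * running` with `total` keeps the arithmetic in integers, so the result does not depend on floating-point rounding.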

Slimane Ben Miled

Title: Life Complexity and Data Processing: an experience of a transversal master diploma in Tunisia

In 2006 we created a transversal biomathematics master's program, Life Complexity and Data Processing, at the National Engineering School of Tunis (http://www.enit.rnu.tn) of the University of Tunis el Manar (http://www.utm.rnu.tn) (see the enclosed presentation of the master). The degree is awarded jointly with University Paris Descartes (France).

Our goals in this talk are:

To present the master diploma.
To discuss the choice of program.
To present the difficulties we encounter.
To present an evaluation of the program.

Dana Price

Title: Single-cell genomes reveal the dynamic nature of the marine environment

Single-cell amplified genomes (SAGs) uncover remarkably dynamic biotic interactions among marine microorganisms without the need for cultivation. I will discuss cell isolation, amplification and analysis of the genomes of two such unicellular organisms: (1) the Picobiliphyta, a group of eukaryotic algae recently described as being among the smallest photosynthetic picoplankton, and (2) the genus Paulinella, a lineage of thecate amoeboids that contains both phototrophic and heterotrophic species. Using next-generation sequencing techniques, I will present new discoveries about the biology and ecology of the Picobiliphytes, and, using Paulinella as a model, help elucidate how one of the most important events to shape our planet (the acquisition of photosynthesis by eukaryotes) may have progressed.

Ariella Sasson

Title: From Millions to One: De Novo Assembly

One of the most significant advances in biology has been the ability to sequence the DNA of organisms. While the technology to sequence genomes has been around for decades, large-scale sequencing has only recently been made possible by the development of modern computing techniques, such as shotgun sequence assemblers like Velvet. The Sanger method, considered the gold standard of DNA sequencing, has long been the dominant approach, but it is still costly even though it has been around for over 20 years. Even if cost were not an issue, and even considering the various improvements in techniques and automation that have been made, it is still time-consuming to sequence a large genome. Even today, after the human genome project has been labeled as completed, problems still lurk in the current shotgun method. Intractable regions, regions of repetitive sequence in the chromosomes that result in gaps in the genome assembly, remain unsequenced. New whole-genome sequencing technologies are needed to reach the goal of the $1000 genome. The next generation of sequencing technologies is now emerging, capable of generating far cheaper, but at the same time far shorter, reads (50 to [?]0 bp instead of 800 to 1000 bp), presenting new computational problems and opportunities. Although greater coverage depths are thus affordable (100-300x instead of 2-10x), de novo sequence assembly with these shorter sequences is significantly more complex. The question arises: can an accurate de novo assembly of a genome be computed at acceptable computational cost? Several things must be considered: 1) memory costs are an issue when dealing with so many elements, and 2) the short read length implies that the assembler must be able to deal with numerous ambiguous overlaps. In addition, the assembler must be able to deal with the correction of sequence errors and the assembly of reads containing mismatches.

A few assemblers have been developed or modified to assemble short read sequences; however, each has its limitations, and while some have shown success on smaller bacterial artificial chromosomes (BACs), larger genomes still prove challenging. While completing a single genome without gaps still proves difficult, assemblies of small genomes aid in understanding various aspects of sequencing and their interactions. The information learned from these assemblies helps devise strategies for tackling larger de novo projects.
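To make the trade-off between read length and coverage depth concrete, the sketch below uses the usual relation depth = (number of reads x read length) / genome size. It is an illustration only, not taken from the talk: the function names and the 5 Mb genome size are assumptions, while the read lengths and depths are picked from the ranges quoted in the abstract.

```python
import math


def coverage_depth(num_reads, read_length, genome_size):
    """Average number of reads covering each base of the genome."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


GENOME_SIZE = 5_000_000  # hypothetical bacterial genome, in base pairs

# Short reads at deep coverage versus long reads at shallow coverage.
short_reads = reads_needed(target_depth=100, read_length=50, genome_size=GENOME_SIZE)
long_reads = reads_needed(target_depth=10, read_length=800, genome_size=GENOME_SIZE)

print(short_reads)  # prints 10000000
print(long_reads)   # prints 62500
```

Under these assumed numbers the short-read data set contains 160 times as many reads as the long-read one, which is one way to see why memory use and ambiguous overlaps become central concerns for short-read assemblers.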

Cheong Xin Chan

Title: Deciphering the complicated history of eukaryote evolution using phylogenomics

The explanatory power of evolutionary genomics among the eukaryotes is often hindered by limited taxon sampling, particularly of genes derived from mesophilic or non-parasitic taxa. Current sequencing technology enables inexpensive genome sequencing of these missing taxa in a timely manner. Here I explore the comprehensive phylogenomic approach for assessing eukaryote evolution, and the application of next-generation sequencing technology to such an approach. As an example, I present the use of red algal genomic data to assess three key issues in eukaryote evolution: (a) support for Plantae monophyly, (b) the extent of horizontal gene sharing, and (c) the putative functions of shared genes. I will highlight the impact of taxon sampling on testing diverse controversial aspects of eukaryotic evolution and on understanding their often complex patterns of inheritance.

Anna Zdepski

Title: Polymorphic Marker Discovery in Strawberry: An Application of Next-Generation Sequencing

Two inbred lines of diploid strawberry were used to create reduced representation SOLiD libraries. A combination of restriction enzymes, size selection, and specially designed adapters was used to sequence a specific fraction of the strawberry genome. SOLiD sequences from these two inbred lines were compared to locate thousands of SNPs (Single Nucleotide Polymorphisms), potential polymorphic markers for the creation of a genetic map for the diploid strawberry Fragaria vesca.
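The comparison step can be pictured with a small sketch. This is not the pipeline used in the study: it assumes that each inbred line has already been reduced to one consensus base per covered position, stored in a dictionary keyed by a hypothetical (fragment, offset) pair, and it simply reports positions covered in both lines where the bases differ. The function name and the toy data are made up for illustration.

```python
def candidate_snps(calls_a, calls_b):
    """Return positions covered in both lines where the consensus bases differ.

    Each argument maps a position key to a single consensus base.
    The result is a sorted list of (position, base_in_a, base_in_b).
    """
    shared = calls_a.keys() & calls_b.keys()
    return sorted(
        (pos, calls_a[pos], calls_b[pos])
        for pos in shared
        if calls_a[pos] != calls_b[pos]
    )


# Toy consensus calls for two lines (hypothetical, not data from the talk).
line_a = {("frag1", 10): "A", ("frag1", 11): "C", ("frag2", 5): "G"}
line_b = {("frag1", 10): "A", ("frag1", 11): "T", ("frag2", 7): "G"}

print(candidate_snps(line_a, line_b))  # prints [(('frag1', 11), 'C', 'T')]
```

Restricting the comparison to positions covered in both lines matters for reduced representation libraries, where only a fraction of the genome is sequenced and the two lines need not cover exactly the same sites.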

Document last modified on August 24, 2010.