Title: Constructing and Clustering of Similarity Graphs from Large-Scale Metagenomic Collections
Speaker: Jaroslaw Zola, Rutgers Discovery Informatics Institute
Date: Tuesday, March 24, 2014 11:00am - 12:00pm
Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ
Abstract:
Metagenomics is the study of a population of organisms by fragmenting and sequencing their collective DNA. With the advent of next-generation high-throughput DNA sequencing, large-scale metagenomic studies have become routine, producing data collections with millions of DNA reads. Metagenomic clustering is a strategy for organizing such collections by identifying the taxonomic units from which the reads were obtained.
In this talk, I will present a parallel graph-based approach to metagenomic clustering. The method exploits sketching techniques to construct large-scale similarity graphs while alleviating the prohibitive cost of all-pairs comparisons. It also employs carefully designed dynamic load balancing techniques to scale to parallel machines with thousands of cores. I will then show how the metagenomic clustering problem can be posed as that of identifying dense subgraphs, and will describe a MapReduce realization of the corresponding heuristic.
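For readers unfamiliar with sketching, the idea of avoiding all-pairs comparisons can be illustrated with a small serial example. The abstract does not say which sketching technique the talk uses; the sketch below assumes a bottom-s min-hash over k-mers, a common choice, and is not the speaker's implementation. All names (kmers, sketch, similarity_graph) and parameters (k, size, threshold) are hypothetical. Only reads that share at least one sketch value are compared, and an edge is kept when the Jaccard similarity of their k-mer sets reaches the threshold.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def stable_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (built-in hash() is salted per run)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(read, k, size):
    """Assumed sketch: the `size` smallest hash values among the read's k-mers."""
    return set(sorted(stable_hash(x) for x in kmers(read, k))[:size])


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(reads, k=4, size=8, threshold=0.5):
    """Return {(i, j): similarity} for candidate pairs at or above the threshold."""
    # Inverted index: sketch value -> indices of the reads whose sketch contains it.
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        for value in sketch(read, k, size):
            buckets[value].append(idx)

    # Only reads that collide in some bucket become candidate pairs,
    # which replaces the quadratic all-pairs enumeration.
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))

    # Validate each candidate with an exact Jaccard similarity on k-mer sets.
    edges = {}
    for i, j in candidates:
        sim = jaccard(kmers(reads[i], k), kmers(reads[j], k))
        if sim >= threshold:
            edges[(i, j)] = sim
    return edges


if __name__ == "__main__":
    toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "TTTTTTTTTT"]
    print(similarity_graph(toy_reads))
```

In this toy input the first two reads are identical and so share every sketch value, while the third shares no k-mers with them and is never compared against either. A parallel version would additionally have to distribute the buckets and the candidate validation across cores, which is where the load balancing mentioned in the abstract comes in.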