Interdisciplinary Seminar Series

Title: Indel-sensitive Read Mapping with 2 and 3-gram Frequencies and Cache Oblivious kd-Trees

Speaker: Pavel Mahmud, Rutgers University

Date: Monday, April 16, 2012 11:00am - 12:00pm

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University, Busch Campus, Piscataway, NJ


Mapping billions of reads from Next Generation Sequencing (NGS) experiments to reference genomes is a crucial task which requires thousands of hours of running time on a single CPU even for the fastest known implementations. The most frequently used methods rely on the existence of exact q-gram matches, which are either used explicitly in the form of q-gram filtering, or implicitly to accelerate computations. Current approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics.

We demonstrate the effectiveness of embedding genomes and reads in a vector space by mapping fixed-length reads and sub-sequences of the reference genomes to q-gram frequency vectors for q = 2 or 3. The approximate matching problem is thus rephrased as one of finding nearest neighbors using spatial index data structures; all q-grams in a putative match are considered simultaneously. The L_1 distance between frequency vectors also has the benefit of providing lower bounds for an edit distance with affine gap-costs. With the use of a cache-oblivious kd-tree we realize running times which match the state-of-the-art. Our approach however is the first which can map large proportions of reads containing insertions and deletions of length 1--15bp. Additionally, running time and memory requirements are constant for read lengths between 100--1000bp.

DIMACS/CCICADA Interdisciplinary Series, Complete Spring Calendar 2012