DIMACS Theoretical Computer Science Seminar

Title: Algorithmic Embeddings for Comparing Large Text Streams

Speaker: Graham Cormode, DIMACS, Rutgers University

Date: September 29, 2003 3:30-4:30pm

Location: DIMACS Center, CoRE Bldg, Room 433, Rutgers University, Busch Campus, Piscataway, NJ


Texts are ubiquitous in daily life, varying in size from small (SMS and 
email) to potentially immense (automatically generated reports, biological 
sequences).  When scanning for similarities between new data and  
previously stored examples, we need a model that takes account of how  
texts are changed: pieces are inserted or deleted, and sections are moved 
around.  With a large number of large texts, trying to compare all pairs 
is not feasible, and for the largest texts we may not even be able to hold 
the whole of any text in memory at once. I will describe an approach to
approximately comparing large texts quickly and in sublinear space.  It
relies on finding combinatorial structures present in any string of
characters, and generating a vector representation.  This allows rapid
comparison of sequences based on a succint representation, and the
application of clustering, nearest neighbor searching, and other data
mining techniques.