Algorithmic embeddings for comparing large text streams, June 2002. CCR/DIMACS Workshop/Tutorial on Mining Massive Data Sets and Streams: Mathematical Methods and Algorithms for Homeland Defense.

Texts are ubiquitous in daily life, varying in size from small (SMS and email) to potentially immense (automatically generated reports, biological sequences). When scanning for similarities between new data and previously stored examples, we need a model that accounts for how texts change: pieces are inserted or deleted, and sections are moved around. With a large number of large texts, comparing all pairs is not feasible, and for the largest texts we may not even be able to hold the whole of any one text in memory at once. We describe an approach to comparing large texts approximately, quickly, and in sublinear space. It relies on finding combinatorial structures present in any string of characters and generating a vector representation from them. This allows rapid comparison of sequences based on a succinct representation, and the application of clustering, nearest neighbor searching, and other data mining techniques.
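
The general idea of comparing strings through vectors can be illustrated with a small sketch. It is not the embedding presented in the talk: it substitutes a much simpler, well-known stand-in, counting every length-q substring (q-gram) and comparing the count vectors under the L1 distance. It also stores the full vector, so it does not achieve sublinear space. The function names `qgram_vector` and `l1_distance`, the choice q = 3, and the sample strings are all assumptions made for illustration.

```python
from collections import Counter


def qgram_vector(text, q=3):
    """Count every length-q substring of text, giving a sparse count vector."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def l1_distance(u, v):
    """L1 distance between two sparse count vectors (missing keys count as 0)."""
    return sum(abs(u[k] - v[k]) for k in u.keys() | v.keys())


# Hypothetical sample texts: the second swaps two blocks of the first.
original = "the quick brown fox jumps over the lazy dog"
moved = "jumps over the lazy dog the quick brown fox"
unrelated = "an entirely different sentence with other words"

base = qgram_vector(original)
d_moved = l1_distance(base, qgram_vector(moved))
d_unrelated = l1_distance(base, qgram_vector(unrelated))
```

Moving a block only disturbs the q-grams that straddle the block boundaries, so `d_moved` stays small while `d_unrelated` is large. Once texts are mapped to vectors in this way, standard clustering and nearest neighbor methods can be applied to the vectors rather than to the raw strings.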
