Algorithms for processing massive data at network line speed, March 2004. Talk at U. Iowa; U. Minnesota; Dartmouth; Google; AT&T; CWRU; Poly.

Many fundamental data mining problems gain additional complexity when applied to massive amounts data that is being generated at high speed. Networks are a driving force in massive data analysis, and require new approaches to produce the analyses network managers require. This motivates the data stream approach: using a small amount of space to process data very quickly in one pass. This has resulted in many strong and deep algorithmic results in this model of computation. But there is frequently a gap between theory and practice, since existing algorithms are too slow for typical high data rates by several orders of magnitude.

I will describe my recent work aimed at bridging the gap. Firstly, I will describe the Count-Min sketch, which is a simple and hence fast data structure that can be used to approximate entries of a high dimensional vector in very small space. It can be used as a ”black box” to solve several problems, including finding frequent items, and quantiles of the data distribution, and gives provable guarantees about their results. Secondly, I will describe and analyze a solution to the change detection problem, where the aim is to identify items (eg IP addresses) whose behavior has changed between two periods. Both these methods have been implemented and are running on network data at network line speeds with high accuracy.

bib | .pdf ] Back

This file was generated by bibtex2html 1.92.