Abstracts of Papers

What's New: Finding Significant Differences in Network Data Streams

Graham Cormode, S. Muthukrishnan

Monitoring and analyzing network traffic usage patterns is vital for managing IP networks. An important problem is to provide network managers with information about changes in traffic, informing them about ``what's new''. Specifically, we focus on the challenge of finding significantly large differences in traffic: over time, between interfaces, and between routers. We introduce the idea of a deltoid: an item that has a large difference, whether the difference is absolute, relative, or variational.

We present novel algorithms for finding the most significant deltoids in high speed traffic data, and prove that they use small space and very small time per update, and are guaranteed to find significant deltoids with pre-specified accuracy. In experimental evaluations with real network traffic, our algorithms perform well and recover almost all deltoids. This is the first work to provide solutions capable of working over the data in one pass, at network traffic speeds.
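
To make the notion of an absolute deltoid concrete, here is a minimal exact baseline (not the paper's small-space streaming algorithm): compare the item counts of two traffic periods and report every item whose count changes by at least a threshold.

```python
from collections import Counter

def absolute_deltoids(stream_before, stream_after, threshold):
    """Exact baseline for absolute deltoids: report items whose count
    changes by at least `threshold` between two periods. The paper's
    algorithms achieve this goal approximately in small space with one
    pass; this exact version merely illustrates the definition."""
    before, after = Counter(stream_before), Counter(stream_after)
    items = set(before) | set(after)
    return {i for i in items if abs(after[i] - before[i]) >= threshold}

# Item 'b' surges from 2 occurrences to 4; 'a' and 'c' change by only 1.
print(absolute_deltoids("aabbc", "abbbb", 2))  # {'b'}
```

Relative and variational deltoids would use a different test in the set comprehension, but are found the same way.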

Improved Data Stream Summaries: The Count-Min Sketch and its Applications

Graham Cormode, S. Muthukrishnan
Journal of Algorithms. Extended Abstract in LATIN 2004.

We introduce a new sublinear space data structure --- the Count-Min Sketch --- for summarizing data streams. Our sketch allows fundamental queries in data stream summarization, such as point, range, and inner product queries, to be approximately answered very quickly; in addition, it can be applied to solve several important problems on data streams, such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known --- typically reducing the dependence on the accuracy parameter eps from 1/eps^2 to 1/eps.
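
The structure itself is simple enough to sketch in a few lines. The following is an illustrative Python version (using Python's built-in `hash` as a stand-in for the pairwise-independent hash functions the analysis assumes):

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters. Each update adds to one
    counter per row; a point query returns the minimum over the rows,
    which never underestimates and overestimates the true count by at
    most eps*N with probability 1-delta, for w ~ e/eps and
    d ~ ln(1/delta)."""
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # Per-row salts: a simple stand-in for pairwise-independent hashes.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _cell(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._cell(r, item)] += count

    def query(self, item):
        return min(self.table[r][self._cell(r, item)]
                   for r in range(self.depth))
```

For example, width 272 and depth 5 give roughly eps = 0.01 and delta < 0.01, independent of the stream length.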

Estimating Dominance Norms of Multiple Data Streams

Graham Cormode, S. Muthukrishnan
Proceedings of European Symposium on Algorithms, 2003

There is much focus in the algorithms and database communities on designing tools to manage and mine data streams. Typically, data streams consist of multiple signals. Formally, a stream of multiple signals consists of items (i, a_i,j), where the i's correspond to the domain, the j's index the different signals, and a_i,j >= 0 gives the value of the jth signal at point i. We study the problem of finding norms that are cumulative over the multiple signals in the data stream.

For example, consider the max-dominance norm, defined as sum_i max_j {a_i,j}. It may be thought of as estimating the norm of the "upper envelope" of the multiple signals, or alternatively, as estimating the norm of the "marginal" distribution of tabular data streams. It is used in applications to estimate the "worst case influence" of multiple processes, for example in IP traffic analysis, electrical grid monitoring, and the financial domain. In addition, it is a natural measure, generalizing the union of data streams or counting distinct elements in data streams. We present the first known data stream algorithms for estimating the max-dominance of multiple signals. In particular, we use workspace and time-per-item that are both sublinear (in fact, poly-logarithmic) in the input size. In contrast, other notions of dominance on streams a, b --- min-dominance (sum_i min_j {a_i,j}), count-dominance (|{i | a_i > b_i}|) or relative-dominance (sum_i a_i/max{1,b_i}) --- are all impossible to estimate accurately with sublinear space.
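
The definition is easy to state in code. This is the exact (non-streaming) computation, purely to illustrate what the paper estimates in polylogarithmic space; each signal is modelled as a mapping from domain point i to value a_i,j:

```python
def max_dominance(signals):
    """Exact max-dominance norm sum_i max_j a_{i,j}: take the upper
    envelope of the signals over the domain, then sum it. The paper
    approximates this quantity in sublinear space over a stream."""
    envelope = {}
    for sig in signals:
        for i, a in sig.items():
            envelope[i] = max(envelope.get(i, 0), a)
    return sum(envelope.values())

# Two signals over domain {1, 2, 3}; the upper envelope is {1:5, 2:4, 3:2}.
print(max_dominance([{1: 5, 2: 1}, {1: 3, 2: 4, 3: 2}]))  # 11
```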

Finding Hierarchical Heavy Hitters in Data Streams

Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava
In Proceedings of 29th International Conference on Very Large Databases, 2003

Aggregation along hierarchies is a critical summary technique in a large variety of online applications including decision support, and network management (e.g., IP clustering, denial-of-service attack monitoring). Despite the amount of recent study that has been dedicated to online aggregation on sets (e.g., quantiles, hot items), surprisingly little attention has been paid to summarizing hierarchical structure in stream data.

The problem we study in this paper is that of finding Hierarchical Heavy Hitters (HHH): given a hierarchy and a fraction phi, we want to find all HHH nodes whose total number of descendants in the data stream is larger than a phi fraction of the total number of elements in the data stream, after discounting the descendant nodes that are themselves HHH nodes. The resulting summary gives a topological "cartogram" of the hierarchical data. We present deterministic and randomized algorithms for finding HHHs, which build upon existing techniques by incorporating the hierarchy into the algorithms. Our experiments demonstrate several factors of improvement in accuracy over the straightforward approach, due to making the algorithms hierarchy-aware.

Stable Distributions for Stream Computations: it's as easy as 0,1,2

Graham Cormode
In Proceedings of Workshop on Management and Processing of Data Streams (MPDS) 2003. [slides]

A surprising number of data stream problems are solved by methods involving computations with stable distributions. This paper will give a short summary of some of these problems, and how the best known solutions depend on use of stable distributions; it also lists some related open problems.
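
One canonical example of this technique is estimating the L1 norm of a stream-maintained vector with 1-stable (Cauchy) projections. The sketch below is a self-contained illustration (parameters chosen loosely, not tuned for tight accuracy):

```python
import math
import random
import statistics

def estimate_l1(vector, k=201, seed=7):
    """Estimate ||x||_1 via 1-stable random projections: each of the k
    projections sum_i x_i * c_{r,i} with standard Cauchy c_{r,i} is
    distributed as ||x||_1 times a standard Cauchy, and the median of
    |Cauchy| is 1, so the median absolute projection estimates ||x||_1.
    In a true streaming setting the c_{r,i} are generated pseudorandomly
    on demand rather than stored, and updates adjust the k projections."""
    rng = random.Random(seed)
    proj = [0.0] * k
    for x in vector:
        for r in range(k):
            # Standard Cauchy via the inverse CDF: tan(pi * (u - 1/2)).
            proj[r] += x * math.tan(math.pi * (rng.random() - 0.5))
    return statistics.median(abs(p) for p in proj)

# A vector with L1 norm 10 should give an estimate near 10.
print(estimate_l1([1] * 10))
```

Replacing the Cauchy draws with Gaussian (2-stable) draws estimates the L2 norm instead; 0-stable-like constructions handle distinct elements, hence "0, 1, 2".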

What's Hot and What's Not: Tracking Frequent Items Dynamically

Graham Cormode and S. Muthukrishnan
In Proceedings of Principles of Database Systems, 2003. [slides]

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications. We present a new algorithm for dynamically determining the hot items at any time in a relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small space data structure that monitors the transactions on the relation, and when required, quickly outputs all hot items, without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of ``group testing'', is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithm is remarkably accurate in dynamically tracking the hot items, independent of the rate of insertions and deletions.
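
One way to see the group-testing idea is through its simplest building block: recovering a majority item under both insertions and deletions using one counter per bit of the item's identifier. This sketch illustrates only that building block, not the full algorithm, which runs many such tests over random subsets to find all hot items:

```python
class MajorityTracker:
    """Keep a counter per bit of the item identifier plus a total count,
    all updated under insertions (+1) and deletions (-1). If some item
    occurs more than half the time, bit j of its identifier is 1 exactly
    when the j-th bit counter exceeds half the total, so the item can be
    reconstructed without rescanning the relation."""
    def __init__(self, bits=32):
        self.bits = bits
        self.total = 0
        self.counters = [0] * bits

    def update(self, item, delta):
        """item: nonnegative integer identifier; delta: +1 or -1."""
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.counters[j] += delta

    def candidate(self):
        """Identifier of the majority item, if one exists."""
        return sum(1 << j for j in range(self.bits)
                   if 2 * self.counters[j] > self.total)
```

For example, after five inserts of item 5, two inserts and one delete of item 3, item 5 holds 5 of the 6 remaining occurrences and `candidate()` returns 5.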

Sequence distance embeddings

PhD Thesis

Sequences represent a large class of fundamental objects in Computer Science --- sets, strings, vectors and permutations are all considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems. Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large scale experiments which demonstrate that, given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.

Comparing Data Streams Using Hamming Norms

Graham Cormode, Mayur Datar, Piotr Indyk, S. Muthukrishnan
In Proceedings of 28th International Conference on Very Large Data Bases, 2002. [slides]

Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases, and instead must be processed ``on the fly'' as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams, and hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently.

We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalises ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have multiple uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "l_0 sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
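
The Hamming norm itself is straightforward to state exactly; what follows is the exact (non-streaming) computation on a pair of streams, to illustrate the quantity the paper's l_0 sketch approximates in small space:

```python
from collections import Counter

def hamming_norm(stream_a, stream_b):
    """Exact Hamming norm of a pair of streams: the number of items
    whose total counts differ between the two. Comparing a stream
    against the empty stream recovers its number of distinct items."""
    a, b = Counter(stream_a), Counter(stream_b)
    return sum(1 for item in set(a) | set(b) if a[item] != b[item])

# The streams disagree on the counts of 'c' and 'd' only.
print(hamming_norm("abbc", "abbd"))  # 2
print(hamming_norm("abbc", ""))      # 3 distinct items
```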

The String Edit Distance Matching Problem with Moves

Graham Cormode, S. Muthukrishnan
In the proceedings of the Symposium on Discrete Algorithms (SODA), 2002. [slides]
A longer version of this paper is available as a DIMACS Tech Report

The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. A well known dynamic programming algorithm takes time O(n m) to solve this problem, and it is an important open problem in Combinatorial Pattern Matching to significantly improve this bound.
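
The O(n m) dynamic program mentioned above can be written down directly. In this standard matching variant (not a contribution of the paper), the first row is all zeros so that a match may begin at any position of the text:

```python
def edit_distance_matching(pattern, text):
    """Classic O(n*m) dynamic program for string edit distance matching:
    D[i][j] is the smallest edit distance between pattern[:i] and any
    substring of text ending at position j. Row i is computed from row
    i-1, so only two rows are kept; the answer is min over the last row."""
    m, n = len(pattern), len(text)
    prev = [0] * (n + 1)                  # row 0: a match can start anywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete from pattern
                         cur[j - 1] + 1,      # insert into pattern
                         prev[j - 1] + cost)  # change, or free match
        prev = cur
    return min(prev)

print(edit_distance_matching("abc", "xxaxcyy"))  # 1: "axc" needs one change
```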

We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we approximate the string edit distance up to a factor of O(log n log* n). Our result is a near linear time deterministic algorithm for this version of the problem. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L_1 vector space using a simplified parsing technique we call Edit Sensitive Parsing (ESP). This embedding is approximately distance preserving, and we show many applications of this embedding to string proximity problems including nearest neighbors, outliers, and streaming computations with strings.

Fast Mining of Massive Tabular Data via Approximate Distance Computations

Graham Cormode, Piotr Indyk, Nick Koudas and S. Muthukrishnan
In Proceedings of the 18th International Conference on Data Engineering (ICDE), 2002. [slides]

Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. For example, consider the geographic distribution of cell phone traffic at different base stations across the country (which can be represented by a table indexed by latitude and longitude), or the evolution of traffic at Internet routers over time (which can be represented by a table indexed by time, source and destination IP addresses). Detecting similarity patterns in such data sets (e.g., which geographic regions have similar cell phone usage distribution, which IP subnet traffic distributions over time intervals are similar, etc.) is of great importance. Identification of such patterns poses many conceptual challenges (what is a suitable similarity distance function for two ``regions''?) as well as technical challenges (how to perform similarity computations efficiently as massive tables get accumulated over time) that we address.

We present methods for determining similar regions in massive tabular data. Our methods are for computing the "distance" between any two subregions of tabular data: they are approximate, but highly accurate, as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general, since these distance computations can be applied to any mining or similarity algorithms that use L_p norms. A novelty of our distance computation procedures is that they work for any L_p norm --- not only the traditional p=2 or p=1, but all p <= 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior. We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T's data stores, and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and are able to detect interesting patterns by varying the value of p.

Permutation Editing and Matching via Embeddings

Graham Cormode, S. Muthukrishnan, Cenk Sahinalp, 2001
In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP) 2001.
Lecture Notes in Computer Science volume 2076 (this paper © Springer-Verlag) [slides]

If the genetic maps of two species are modelled as permutations of (homologous) genes, the number of chromosomal rearrangements in the form of deletions, block moves, inversions, etc. needed to transform one such permutation into another can be used as a measure of their evolutionary distance. Motivated by such scenarios, we study problems of computing distances between permutations, as well as matching permutations in sequences and finding the most similar permutation in a collection (``nearest neighbor''). We adopt a general approach: embed the permutation distances of relevance into well-known vector spaces in an approximately distance-preserving manner, and solve the resulting problems in those well-known spaces. Our results are as follows:
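
As a generic illustration of the embedding approach (a simple, well-known construction, not necessarily one of the paper's results): a permutation maps to the 0/1 vector of its pairwise orderings, and the Hamming distance between such vectors is exactly the Kendall tau (inversion) distance between the permutations.

```python
from itertools import combinations

def order_vector(perm):
    """Embed a permutation as a 0/1 vector indexed by the pairs (i, j)
    with i < j: the component for (i, j) is 1 exactly when i appears
    before j in the permutation."""
    pos = {v: idx for idx, v in enumerate(perm)}
    return [1 if pos[i] < pos[j] else 0
            for i, j in combinations(sorted(perm), 2)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# Hamming distance of the embeddings = number of pairwise disagreements.
print(hamming(order_vector([0, 1, 2]), order_vector([2, 1, 0])))  # 3
```

This embedding is exact but quadratic in size; the interest in the paper's setting is in embeddings that are small or computable under stronger constraints.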

Communication Complexity of Document Exchange

Graham Cormode, Mike Paterson, Cenk Sahinalp, Uzi Vishkin
An earlier version of this paper was published in the proceedings of the SODA'00 conference. [slides]

We have two users, A and B, who hold documents x and y respectively. Neither of the users has any information about the other's document. They exchange messages so that B computes x; it may be required that A compute y as well. Our goal is to design communication protocols with the main objective of minimizing the total number of bits they exchange; other objectives are minimizing the number of rounds and the complexity of internal computations. An important notion which determines the efficiency of the protocols is how one measures the distance between x and y. We consider several metrics for measuring this distance, namely the Hamming metric, the Levenshtein metric (edit distance), and a new LZ metric, which is introduced in this paper. We show how to estimate the distance between x and y using a single message of logarithmic size. For each metric, we present the first communication-efficient protocols, which often match the corresponding lower bounds. A consequence of these is error-correcting codes for these error models which correct up to d errors in n characters using O(d log n) bits. Our most interesting methods use a new histogram transformation that we introduce to convert edit distance to L1 distance.
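
The flavour of relating edit distance to L1 distance can be seen in a toy, flat version of the histogram idea (the paper's transformation is hierarchical and much stronger; this is only an illustration): since one edit operation changes a document's character-frequency histogram by at most 2 in L1, half the histogram L1 distance lower-bounds the edit distance.

```python
from collections import Counter

def histogram_l1(x, y):
    """L1 distance between the flat character-frequency histograms of
    two strings. Each insert/delete shifts one count by 1, and each
    substitution shifts two counts by 1, so histogram_l1(x, y) / 2 is a
    lower bound on the edit distance between x and y."""
    hx, hy = Counter(x), Counter(y)
    return sum(abs(hx[c] - hy[c]) for c in set(hx) | set(hy))

# Histograms differ on 'o', 'l' vs 't', 'h': L1 distance 4, so the edit
# distance is at least 2 (it is in fact 3).
print(histogram_l1("karolin", "kathrin"))  # 4
```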

Electronic Books in Digital Libraries

Tekin Ozsoyoglu, Hurkan Balkir, Graham Cormode, Meral Ozsoyoglu
Published in Proceedings of IEEE Advances in Digital Libraries 2000.

An Electronic Book is an application with a multimedia database of instructional resources, which include hyperlinked text, instructor's audio/video clips, slides, animation, still images, etc., as well as content-based information about these data, and metadata such as annotations, tags, and cross-referencing information. Electronic books on the Internet or on CDs today are not easy to learn from. We propose the use of a multimedia database of instructional resources in constructing and delivering multimedia lessons about topics in an electronic book.

We introduce an electronic book data model containing (a) topic objects and (b) instructional resources, called instruction modules, which are multimedia presentations, possibly capturing real-life lectures of instructors. We use the notion of topic prerequisites for topics at different detail levels to allow electronic book users to request/compose multimedia lessons about topics in the electronic book. We present automated construction of the "best" user-tailored lesson (as a multimedia presentation).

See also the listings from DBLP and its co-authors list.