Abstracts of Papers
What's New: Finding Significant Differences
in Network Data Streams
Graham Cormode, S. Muthukrishnan
INFOCOM 2004
Monitoring and analyzing network traffic usage patterns is vital for
managing IP Networks.
An important problem is to provide network managers with information
about changes in traffic, informing them about "what's new".
Specifically, we focus on the challenge of finding significantly large
differences in traffic: over time, between interfaces and between
routers.
We introduce the idea of a deltoid: an item that has a large
difference,
whether the difference is absolute, relative or
variational.
We present novel algorithms for finding the most significant deltoids in
high
speed traffic data, and prove that they use small space, very small time
per
update, and are guaranteed to find significant deltoids with
pre-specified accuracy.
In experimental evaluation with real network traffic, our algorithms
perform well and recover
almost all deltoids.
This is the first work to provide solutions capable of working over
the data in one pass, at network traffic speeds.
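To make the measures concrete, here is a minimal Python baseline (an
illustration, not the paper's algorithm: it stores exact per-item
counts, which the streaming setting forbids, and its relative-difference
formula is just one plausible choice) reporting absolute and relative
deltoids between two periods:

```python
# Hedged sketch: an exact, linear-space baseline for deltoids;
# the paper finds these in one pass in small space.

def deltoids(counts_then, counts_now, abs_thresh, rel_thresh):
    """Return items whose traffic changed significantly between periods.

    counts_then, counts_now: dicts mapping item -> traffic count.
    """
    absolute, relative = [], []
    for x in set(counts_then) | set(counts_now):
        a, b = counts_then.get(x, 0), counts_now.get(x, 0)
        if abs(a - b) >= abs_thresh:               # absolute difference
            absolute.append(x)
        if abs(a - b) / max(a, 1) >= rel_thresh:   # one plausible relative measure
            relative.append(x)
    return absolute, relative

then = {"10.0.0.1": 100, "10.0.0.2": 60}
now = {"10.0.0.1": 500, "10.0.0.3": 40}
print(deltoids(then, now, abs_thresh=100, rel_thresh=3.0))
```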
Improved Data Stream Summaries: The Count-Min Sketch and its
Applications
Graham Cormode, S. Muthukrishnan
Journal of Algorithms. Extended Abstract in LATIN 2004.
We introduce a new sublinear space data structure---the
Count-Min Sketch---
for summarizing data streams.
Our sketch allows fundamental queries in data stream summarization
such as point, range, and inner product queries to be
approximately answered very quickly; in addition, it can be applied
to solve several important problems in data streams
such as finding quantiles, frequent items, etc.
The time and space bounds we show for using the CM sketch
to solve these problems significantly improve those previously
known --- typically reducing the factor from 1/eps^2 to 1/eps.
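A minimal sketch of the data structure follows, with d = O(log 1/delta)
rows of w = O(1/eps) counters each, and assuming Python's built-in hash
as a stand-in for the pairwise independent hash functions the analysis
requires:

```python
import math, random

class CountMinSketch:
    """Minimal Count-Min sketch: point-query estimates exceed the true
    count by at most eps * (total count), with probability >= 1 - delta."""

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)           # counters per row
        self.d = math.ceil(math.log(1 / delta))    # number of rows
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cells(self, item):
        for row, s in enumerate(self.seeds):
            yield row, hash((s, item)) % self.w

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def point_query(self, item):
        # the minimum over rows is the least-contaminated counter
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch(eps=0.01, delta=0.01)
for item in ["a", "b", "a", "c", "a"]:
    cms.update(item)
print(cms.point_query("a"))    # >= 3, and close to 3 w.h.p.
```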
Estimating Dominance Norms of Multiple Data Streams
Graham Cormode, S. Muthukrishnan
Proceedings of European Symposium on Algorithms, 2003
There is much focus in the algorithms and database communities on
designing tools to manage and mine data streams.
Typically, data streams consist of multiple signals.
Formally, a stream of multiple signals consists of items (i, a_i,j),
where the i's correspond to the domain, the j's index the different
signals, and a_i,j >= 0 gives the value of the jth signal at
point i.
We study the problem of finding norms that are cumulative of
the multiple signals in the data stream.
For example, consider the max-dominance norm,
defined as sum_i max_j {a_i,j}.
It may be thought of as estimating the norm of the "upper
envelope" of the multiple signals, or alternatively, as
estimating the norm of the "marginal" distribution of tabular data
streams.
It is used in applications to estimate the "worst case influence" of
multiple processes, for example in IP traffic analysis, electrical
grid monitoring and the financial domain.
In addition, it is a natural measure, generalizing the union of data
streams or counting distinct elements in data streams.
We present the first known data stream algorithms for estimating
max-dominance of multiple signals.
In particular, we use workspace and time-per-item that are both sublinear
(in fact, poly-logarithmic) in the input size. In contrast, other
notions of dominance on streams a, b --- min-dominance
(sum_i min_j {a_i,j}),
count-dominance (|{i | a_i > b_i}|) or relative-dominance
(sum_i a_i/max{1,b_i}) --- are all impossible to estimate
accurately with sublinear space.
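The definition is easy to state exactly in linear space; the point of
the paper is to approximate it in poly-logarithmic space in one pass.
A hedged illustration (the (i, j, a) tuple format is an assumption made
for this demo):

```python
from collections import defaultdict

def max_dominance(stream):
    """Exact max-dominance norm sum_i max_j {a_i,j}, i.e. the norm of
    the upper envelope of the signals. Linear space; the paper
    approximates this in poly-logarithmic space over a one-pass stream."""
    upper = defaultdict(int)              # i -> max_j a_{i,j} seen so far
    for i, j, a in stream:
        upper[i] = max(upper[i], a)
    return sum(upper.values())

stream = [(0, "sig1", 3), (0, "sig2", 7), (1, "sig1", 2)]
print(max_dominance(stream))              # max(3,7) + max(2) = 9
```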
Finding Hierarchical Heavy Hitters in Data Streams
Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava
In Proceedings of 29th International Conference on Very Large Databases,
2003
Aggregation along hierarchies is a critical summary
technique in a large variety of online applications including
decision support, and network management
(e.g., IP clustering, denial-of-service attack monitoring).
Despite the amount of recent study that has been
dedicated to online aggregation on sets (e.g., quantiles,
hot items), surprisingly little attention has been paid
to summarizing hierarchical structure in stream data.
The problem we study in this paper is that of finding
Hierarchical Heavy Hitters (HHH): given a hierarchy
and a fraction phi, we want to find all nodes whose count of
descendants in the data stream, after discounting the counts of
descendant nodes that are themselves HHH nodes, exceeds a fraction
phi of the total number of elements in the data stream.
The resulting summary gives a topological "cartogram" of the
hierarchical data. We present deterministic and randomized algorithms
for finding HHHs, which build upon existing techniques by
incorporating the hierarchy into the algorithms.
Our experiments demonstrate several-fold improvements
in accuracy over the straightforward approach, due
to making the algorithms hierarchy-aware.
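The discounted-count definition can be stated exactly in code; the
sketch below (a hedged illustration over a toy IP-prefix hierarchy,
using exact linear-space counts rather than the paper's small-space
stream summaries) computes HHHs bottom-up:

```python
from collections import Counter

def hierarchical_heavy_hitters(items, phi):
    """Exact discounted HHHs over a prefix hierarchy.

    items: tuples such as IP octets (10, 0, 0, 1); every prefix of a
    tuple is an ancestor node. A node is an HHH if its count, after
    discounting counts already claimed by descendant HHHs, reaches
    phi * len(items).
    """
    threshold = phi * len(items)
    counts = Counter()
    for item in items:
        for depth in range(len(item) + 1):   # the item plus all prefixes
            counts[item[:depth]] += 1
    uncovered = dict(counts)
    hhh = set()
    for node in sorted(counts, key=len, reverse=True):   # deepest first
        if uncovered[node] >= threshold:
            hhh.add(node)
            for depth in range(len(node)):   # discount at every ancestor
                uncovered[node[:depth]] -= uncovered[node]
    return hhh

items = [(10, 0, 0, i % 2) for i in range(80)] + \
        [(10, 0, 1, i) for i in range(20)]
print(hierarchical_heavy_hitters(items, phi=0.25))
# the two leaves (10,0,0,0) and (10,0,0,1) each cover 40% of the stream
```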
Stable Distributions for Stream Computations: it's as easy as 0,1,2
Graham Cormode
In Proceedings of Workshop on Management and Processing of Data Streams
(MPDS) 2003.
A surprising number of data stream problems are solved by methods
involving computations with stable distributions.
This paper gives a short summary of some of these problems, and how
the best known solutions depend on the use of stable distributions;
it also lists some related open problems.
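As one concrete example of the "1" in the title: sketching a stream
vector with 1-stable (Cauchy) projections estimates its L1 norm,
because each projection is distributed as ||a||_1 times a standard
Cauchy variable, whose absolute value has median 1. A hedged,
non-streaming demo (a real stream algorithm would generate the Cauchy
values per coordinate via hashing rather than materializing the vector):

```python
import math, random, statistics

def l1_sketch_demo(vector, k=201, seed=1):
    """Estimate ||a||_1 from k independent 1-stable projections:
    median(|sum_i a_i * Cauchy_i|) over repetitions approaches ||a||_1."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        # a standard Cauchy variate is tan(pi * (U - 1/2))
        y = sum(a * math.tan(math.pi * (rng.random() - 0.5))
                for a in vector)
        estimates.append(abs(y))
    return statistics.median(estimates)

vec = [3, -1, 0, 2]                       # ||vec||_1 = 6
print(l1_sketch_demo(vec))                # close to 6
```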
What's Hot and What's Not: Tracking Frequent Items Dynamically
Graham Cormode and S. Muthukrishnan
In Proceedings of Principles of Database Systems, 2003.
Most database management systems maintain statistics
on the underlying relation. One of the important statistics
is that of the "hot items" in the relation: those
that appear many times (most frequently, or more than some threshold).
For example, end-biased histograms keep the hot items as part
of the histogram
and are used in selectivity estimation.
Hot items are used as simple outliers in data mining, and
in anomaly detection in networking applications.
We present a new algorithm for dynamically
determining the hot items at any time in a relation
that is undergoing deletion operations as well as inserts.
Our algorithm maintains a small space data structure that
monitors the transactions on the relation, and when required, quickly
outputs
all hot items, without rescanning the relation in the database.
With user-specified probability, it is able to report all hot items.
Our algorithm relies on the idea of
"group testing", is simple to implement, and has provable
quality, space and time guarantees.
Previously known algorithms for this problem that make similar quality and
performance guarantees cannot handle deletions, and those that handle
deletions cannot make similar guarantees without rescanning the
database. Our experiments with real and synthetic data show that our
algorithm is remarkably accurate in dynamically tracking the hot items,
independent of the rate of insertions and deletions.
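A hedged sketch of the core bit-testing trick behind the group-testing
approach: maintain the total count plus one counter per bit position of
the item identifiers. Deletions simply decrement; if some item holds a
strict majority, its identifier can be read off bit by bit. The full
algorithm hashes items into many groups and applies a test like this
within each group to recover all hot items, not just a majority one.

```python
class MajorityTracker:
    """Core of the group-testing idea (illustration, not the full
    algorithm): counters per bit position, supporting deletions."""

    def __init__(self, bits=32):
        self.bits = bits
        self.total = 0
        self.bit_counts = [0] * bits

    def update(self, item, delta):        # delta = +1 insert, -1 delete
        self.total += delta
        for b in range(self.bits):
            if (item >> b) & 1:
                self.bit_counts[b] += delta

    def majority_candidate(self):
        # each bit of a majority item must be set in more than half
        # of the current transactions
        return sum(1 << b for b in range(self.bits)
                   if 2 * self.bit_counts[b] > self.total)

t = MajorityTracker()
for x in [5, 7, 5, 5, 9]:
    t.update(x, +1)
t.update(9, -1)                            # a deletion
print(t.majority_candidate())              # 5 holds 3 of the 4 remaining
```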
PhD Thesis: Sequence Distance Embeddings
Sequences represent a large class of
fundamental objects in Computer Science ---
sets, strings, vectors and permutations can all be considered sequences.
Distances between sequences measure their similarity, and computations
based on distances are ubiquitous: either to compute the distance, or
to use distance computation as part of a more complex problem.
This thesis takes a
very specific approach to solving questions of sequence distance:
sequences are embedded into other distance measures, so that distance in the
new space approximates the original distance.
This allows the solution of a variety of problems including:
- Fast computation of short "sketches" in a variety of computing models,
which allow sequences to be compared in constant time and space
irrespective of the size of the original sequences.
- Approximate nearest neighbor and clustering problems,
significantly faster than the naïve exact solutions.
- Algorithms to find approximate occurrences of pattern sequences in
long text sequences in near linear time.
- Efficient communication schemes to approximate the distance between,
and exchange, sequences in close to the optimal amount of communication.
Solutions are given for these problems for a variety of distances, including
fundamental distances on sets and vectors; distances inspired by biological
problems for permutations; and certain text editing distances for strings.
Many of these embeddings are computable in a streaming model where the data
is too large to store in memory, and instead has to be processed as
and when it arrives, piece by piece.
The embeddings are also shown to be practical, with a series of large scale
experiments which demonstrate that given only a small space, approximate
solutions to several similarity and clustering problems can be found that are
as good as or better than those found with prior methods.
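As one concrete instance of the sketching idea for sets (minhash for
Jaccard similarity, a classic scheme used here purely as an
illustration, not a construction specific to the thesis), two sketches
of fixed size k can be compared in O(k) time whatever the original set
sizes:

```python
import random

def minhash_sketch(items, k=64, seed=7):
    """Constant-size sketch of a set: one minimum per hash function."""
    rng = random.Random(seed)
    seeds = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((s, x)) for x in items) for s in seeds]

def estimated_jaccard(sk1, sk2):
    # the fraction of agreeing coordinates estimates the Jaccard
    # similarity of the original sets
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

A = set(range(0, 80))
B = set(range(20, 100))                   # true Jaccard = 60/100 = 0.6
print(estimated_jaccard(minhash_sketch(A), minhash_sketch(B)))
```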
Comparing Data Streams Using Hamming Norms (How to Zero In?)
Graham Cormode, Mayur Datar, Piotr Indyk, S. Muthukrishnan
In Proceedings of 28th International Conference on Very Large Data Bases,
2002.
Massive data streams are now fundamental to many data processing
applications. For example, Internet routers produce large scale
diagnostic data streams. Such streams are rarely stored in
traditional databases, and instead must be processed "on the fly" as
they are produced. Similarly, sensor networks produce multiple data
streams of observations from their sensors. There is growing focus on
manipulating data streams, and hence, there is a need to identify
basic operations of interest in managing data streams, and to support
them efficiently.
We propose computation of the Hamming norm as a basic operation of
interest. The Hamming norm formalises ideas
that are used throughout data processing. When applied to a single stream,
the Hamming norm gives the number of distinct items that are
present in that data stream, which is a statistic of great interest in
databases. When applied to a pair of streams, the Hamming norm gives
an important measure of (dis)similarity: the number of unequal item
counts in the two streams. Hamming norms have multiple uses
in comparing data streams.
We present a novel approximation technique for estimating the Hamming norm
for massive data streams; this relies on what we call the "l_0 sketch"
and we prove its accuracy. We test our approximation method on a large
quantity of synthetic and real stream data, and show that the
estimation is accurate to within a few percentage points.
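For reference, both quantities are simple to compute exactly in linear
space; the l_0 sketch approximates them in small space, and since the
sketch is linear, the pairwise case reduces to the norm of the
difference of the two streams:

```python
from collections import Counter

def hamming_norm(stream):
    """Exact Hamming norm of one stream: the number of distinct items
    with a nonzero net count (counts may rise and fall)."""
    counts = Counter()
    for item, delta in stream:
        counts[item] += delta
    return sum(1 for c in counts.values() if c != 0)

def hamming_distance(stream_a, stream_b):
    """Number of items whose net counts differ between two streams."""
    ca, cb = Counter(), Counter()
    for item, d in stream_a:
        ca[item] += d
    for item, d in stream_b:
        cb[item] += d
    return sum(1 for i in set(ca) | set(cb) if ca[i] != cb[i])

a = [("x", 2), ("y", 1), ("y", -1)]       # net: x=2, y=0
b = [("x", 2), ("z", 3)]                  # net: x=2, z=3
print(hamming_norm(a), hamming_distance(a, b))   # 1 1
```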
The String Edit Distance Matching Problem with Moves
Graham Cormode, S. Muthukrishnan
In the proceedings of the Symposium on Discrete Algorithms (SODA), 2002.
A longer version of this paper is available as a DIMACS Tech Report.
The edit distance between two strings S and R
is defined to be the minimum number of character inserts,
deletes and changes needed to convert R to S.
Given a text string t of length n, and a pattern string p
of length m, informally,
the string edit distance matching problem is to
compute the smallest edit distance between p and
substrings of t.
A well known dynamic programming algorithm
takes time O(n m) to solve this problem, and it is an important
open problem in Combinatorial Pattern Matching to
significantly improve this bound.
We relax the problem
so that (a) we allow an additional operation, namely,
substring moves, and (b) we approximate the string edit distance
up to a factor of O(log n log* n).
Our result is a near linear time deterministic
algorithm for this version of the problem.
This is the first
known significantly subquadratic algorithm for a string
edit distance problem in which the distance involves
nontrivial alignments.
Our results are obtained by embedding strings into L_1
vector space using a simplified parsing technique we call
Edit Sensitive Parsing (ESP).
This embedding is
approximately distance preserving, and we show many applications
of this embedding to string proximity problems including nearest neighbors,
outliers, and streaming computations with strings.
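For contrast, the quadratic dynamic program referred to above is short
to state in code (the near linear time algorithm via ESP is
considerably more involved and is not reproduced here):

```python
def edit_distance_matching(p, t):
    """The well-known O(n*m) dynamic program: the minimum edit distance
    between pattern p and any substring of text t. Row 0 is all zeros,
    so a match may start anywhere in t."""
    m, n = len(p), len(t)
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,          # skip a character of p
                         cur[j - 1] + 1,       # skip a character of t
                         prev[j - 1] + (p[i - 1] != t[j - 1]))  # change
        prev = cur
    return min(prev)        # best alignment may end anywhere in t

print(edit_distance_matching("abc", "xxabdxx"))   # 1 (change c -> d)
```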
Fast Mining of Massive Tabular Data via Approximate Distance
Computations
Graham Cormode, Piotr Indyk, Nick Koudas
and S. Muthukrishnan
In Proceedings of the 18th International Conference
on Data Engineering (ICDE), 2002.
Tabular data abound in many data stores: traditional
relational databases store tables, and
new applications also generate massive tabular
datasets.
For example, consider the geographic distribution of cell
phone traffic at different base stations across the
country
(which can be represented by a table indexed by
latitude and longitude),
or the evolution of traffic at Internet
routers over time
(which can be represented by a table
indexed by time, source and destination IP addresses).
Detecting similarity patterns in such data
sets (e.g., which geographic regions have similar cell phone
usage distributions, which IP subnet traffic distributions
over time intervals are similar, etc.) is of great importance.
Identification of such patterns poses many
conceptual challenges (what is a suitable similarity
distance function for two "regions") as well as
technical challenges (how to perform similarity computations
efficiently as massive tables get accumulated over time) that
we address.
We present methods for determining
similar regions in massive tabular data.
Our methods are for computing the "distance" between
any two subregions of tabular data: they are approximate,
but highly accurate as we prove mathematically, and they are
fast, running in time nearly linear in the table size. Our methods
are general since these distance computations can be
applied to any mining or
similarity algorithms that use L_p norms.
A novelty of our distance computation procedures is that they
work for any L_p norm --- not only the traditional
p=2 or p=1, but for all p <= 2; the choice of p,
say a fractional p,
provides an interesting alternative similarity behavior!
We use our algorithms in a detailed
experimental study of the clustering patterns in real tabular
data obtained from one of AT&T's data stores and show that our
methods are substantially faster than straightforward
methods while remaining highly accurate, and able to detect
interesting patterns by varying the value of p.
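A hedged illustration of the effect of varying p (computing the norms
exactly here; the paper's contribution is approximating them quickly
via sketches): with a fractional p, a single large cell difference
accounts for a smaller share of the total distance.

```python
def lp_distance(u, v, p):
    """Exact L_p distance between two equal-size table regions,
    flattened to vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

region_a = [10, 12, 11, 95]        # one large outlying cell
region_b = [11, 12, 10, 10]
for p in (2, 1, 0.5):
    # smaller p damps the relative influence of the single outlier
    print(p, round(lp_distance(region_a, region_b, p), 2))
```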
Permutation Editing and Matching via Embeddings
Graham Cormode, S. Muthukrishnan, Cenk Sahinalp, 2001
In Proceedings of
International Colloquium on Automata, Languages and Programming
(ICALP) 2001.
Lecture Notes in Computer Science volume 2076 (this paper © Springer-Verlag).
If the genetic maps of two species are modelled as permutations of
(homologous) genes, the number of chromosomal rearrangements
in the form of deletions, block moves, inversions etc. to transform one
such permutation to another can be used as a measure of their evolutionary
distance. Motivated by such scenarios, we study problems of
computing distances between permutations
as well as matching permutations in
sequences, and finding the most similar permutation from a collection
("nearest neighbor").
We adopt a
general approach: embed permutation distances of relevance into well-known
vector spaces in an approximately distance-preserving manner,
and solve the resulting problems on the well-known spaces.
Our results are as follows:
- We present the first known approximately distance preserving
embeddings of these permutation distances into well-known spaces.
- Using these embeddings, we obtain several results, including
the first known efficient solution for approximately solving
nearest neighbor problems with permutations, and the first known
algorithms for finding permutation distances in the "data stream"
model.
- We consider a novel class of problems called permutation matching
problems, which are similar to string matching problems, except that
the pattern is a permutation (rather than a string). We present
linear or near-linear time algorithms for approximately solving
permutation matching problems; in contrast, the corresponding string
problems take significantly longer.
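One classic example of such an embedding (illustrative only; the
paper's embeddings cover richer distances such as block moves and
inversions): a permutation maps to the 0/1 vector of its pairwise
orderings, under which Hamming distance equals the number of
discordant pairs, i.e. the Kendall tau (inversion) distance.

```python
from itertools import combinations

def pair_order_embedding(perm):
    """Embed a permutation of 0..n-1 into a 0/1 vector indexed by
    element pairs: coordinate (a, b) is 1 iff a precedes b."""
    pos = {v: i for i, v in enumerate(perm)}
    n = len(perm)
    return [int(pos[a] < pos[b]) for a, b in combinations(range(n), 2)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

P, Q = [0, 1, 2, 3], [1, 0, 2, 3]          # one adjacent swap
print(hamming(pair_order_embedding(P), pair_order_embedding(Q)))  # 1
```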
Communication Complexity of Document Exchange
Graham Cormode, Mike Paterson, Cenk Sahinalp, Uzi Vishkin
An earlier version of this paper was published in the
proceedings of the SODA'00 conference.
We have two users, A and B, who hold documents x and y
respectively. Neither of the users has any information about the
other's document. They exchange messages so that B computes x; it may
be required that A compute y as well. Our goal is to design
communication protocols with the main objective of minimizing the
total number of bits they exchange; other objectives are minimizing
the number of rounds and the complexity of internal computations. An
important notion which determines the efficiency of the protocols is
how one measures the distance between x and y. We consider several
metrics for measuring this distance, namely the Hamming metric, the
Levenshtein metric (edit distance), and a new LZ metric, which is
introduced in this paper. We show how to estimate the distance between
x and y using a single message of logarithmic size. For each metric,
we present the first communication-efficient protocols, which often
match the corresponding lower bounds. A consequence of these protocols is
error-correcting codes for these error models, which correct up to d
errors in n characters using O(d log n) bits. Our most interesting
methods use a new histogram transformation that we introduce to
convert edit distance to L1 distance.
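A hedged, one-level toy version of that idea (the paper's
transformation is hierarchical): map each string to the histogram of
its k-grams; a single character edit then changes the L1 distance
between the histograms by at most 2k.

```python
from collections import Counter

def kgram_histogram(s, k=3):
    """Histogram of overlapping k-grams (a one-level toy version of the
    paper's hierarchical histogram transformation)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def l1(h1, h2):
    # each character edit alters at most k of the k-grams in each
    # string, so L1 grows by at most 2k per edit
    return sum(abs(h1[g] - h2[g]) for g in set(h1) | set(h2))

x = "the quick brown fox"
y = "the quick brown fax"                 # one character change
print(l1(kgram_histogram(x), kgram_histogram(y)))   # at most 6 for k=3
```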
Electronic Books in Digital Libraries
Tekin Ozsoyoglu, Hurkan Balkir, Graham Cormode, Meral
Ozsoyoglu
Published in Proceedings of IEEE Advances in Digital Libraries 2000.
An Electronic Book is an application with a multimedia database of
instructional resources, which include hyperlinked text,
instructors' audio/video clips, slides, animation, still images,
etc., as well as content-based information about these data, and
metadata such as annotations, tags, and cross-referencing information.
Electronic books on the Internet or on CDs today are not easy to learn
from.
We propose the use of a multimedia database of instructional resources
in constructing and delivering multimedia lessons about topics
in an electronic book.
We introduce an electronic book data model containing (a)
topic objects and (b) instructional resources, called
instruction modules, which are multimedia presentations possibly
capturing real-life lectures of instructors.
We use the notion of topic prerequisites for topics at
different detail levels, to allow electronic book users to
request and compose multimedia lessons about topics in the electronic
book. We present automated construction of the "best" user-tailored
lesson (as a multimedia presentation).