We study the algorithmic complexity of constructing workload-optimal histograms on data streams, and expose the effect of the workload on the complexity of the histogram construction problem.
Consider the case when the workload is explicitly stored and the input data is streamed in the time series model---the input is a vector of N components, and we are presented with the components, one at a time, in order. We present an algorithm that uses polylogarithmic space, processes each new item in constant time, and, in polylogarithmic post-processing time, determines a (1+epsilon)-approximation to the optimal histogram. This matches, up to polylogarithmic factors, the space complexity of histogram construction on a stream when the workload is uniform [Guha, Indyk, Muthukrishnan, and Strauss]. To obtain this result, we identify and exploit the notion of linearly robust histograms. This is the first known algorithmic result with provable guarantees for workload-optimal histogram construction on streams.
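For concreteness, here is a minimal offline sketch of the quantity the streaming algorithm (1+epsilon)-approximates: the minimum workload-weighted squared error of a B-bucket histogram, computed by the standard O(N^2 B) dynamic program. This is an illustrative baseline only, not the paper's polylogarithmic-space streaming algorithm; the function name and the use of per-point weights w[i] to model the workload are our assumptions.

```python
def optimal_histogram_error(a, w, B):
    """Minimum workload-weighted squared error of a B-bucket histogram.

    a: data vector of length N; w: positive per-point workload weights.
    Illustrative offline DP, not the paper's streaming algorithm.
    """
    N = len(a)
    # Prefix sums of w, w*a, w*a^2 allow O(1) bucket-error queries.
    pw = [0.0] * (N + 1)
    pwa = [0.0] * (N + 1)
    pwa2 = [0.0] * (N + 1)
    for i in range(N):
        pw[i + 1] = pw[i] + w[i]
        pwa[i + 1] = pwa[i] + w[i] * a[i]
        pwa2[i + 1] = pwa2[i] + w[i] * a[i] * a[i]

    def bucket_err(l, r):
        # Error of one bucket covering a[l:r]; the optimal
        # representative is the weighted mean of the bucket.
        sw = pw[r] - pw[l]
        swa = pwa[r] - pwa[l]
        swa2 = pwa2[r] - pwa2[l]
        return swa2 - (swa * swa) / sw if sw > 0 else 0.0

    INF = float("inf")
    # E[k][j]: best error covering the prefix a[0:j] with k buckets.
    E = [[INF] * (N + 1) for _ in range(B + 1)]
    E[0][0] = 0.0
    for k in range(1, B + 1):
        for j in range(1, N + 1):
            E[k][j] = min(E[k - 1][l] + bucket_err(l, j) for l in range(j))
    return E[B][N]
```

With a uniform workload (all w[i] equal) this reduces to the classical V-optimal histogram problem the cited prior streaming work addresses.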
Now consider the case when the workload is not stored fully, but is
compressed. As we show, trivial lossy compression can be used (one
can drop low-order bits of individual weights), but algorithmic
results such as the above are not possible if any nontrivial lossy
compression is used. However, we show that our algorithmic results
extend efficiently to the case when the workload is compressed
without loss, for example by a universal compression scheme such as
Lempel-Ziv. This requires supporting a symbol-range-count data
structure on compressed data, which may be of independent interest.
The direction of exploring data stream problems when one of the
inputs (the workload) is compressed losslessly is also novel.
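To make the query interface concrete, here is a hypothetical sketch of a symbol-range-count structure answering "how many times does symbol c occur in positions [l, r)?" via per-symbol position lists. This illustrates only the interface; the paper's contribution is supporting such queries directly on losslessly (e.g. Lempel-Ziv) compressed data, which this sketch, built over the uncompressed sequence, does not attempt. The class and method names are our own.

```python
import bisect
from collections import defaultdict

class SymbolRangeCount:
    """Answer symbol-range-count queries over a sequence.

    Illustrative uncompressed version; the paper requires the same
    interface over Lempel-Ziv-compressed input.
    """

    def __init__(self, seq):
        # occ[c]: sorted list of positions where symbol c occurs.
        self.occ = defaultdict(list)
        for i, c in enumerate(seq):
            self.occ[c].append(i)

    def count(self, c, l, r):
        """Number of occurrences of symbol c in seq[l:r]."""
        pos = self.occ.get(c, [])
        return bisect.bisect_left(pos, r) - bisect.bisect_left(pos, l)
```

Each query costs O(log N) via binary search over the occurrence list of the queried symbol.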
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2005/2005-19.ps.gz