Histograms are used widely in conventional databases and in data stream processing to summarize massive data distributions. Previous work on constructing histograms over data streams with provable guarantees has not taken into account the workload characteristics of databases, under which some parts of the distribution are queried more frequently than others. On the other hand, previous work on constructing histograms that does make use of workload characteristics---and has demonstrated the significant advantage of exploiting workload information---comes with no provable guarantees on the accuracy of the histograms, or on the time and space needed to obtain reasonable accuracy.

We study the algorithmic complexity of constructing workload-optimal histograms on data streams, and expose the effect of the workload on the complexity of the histogram construction problem.

Consider the case when the workload is explicitly stored and the input data is streamed in the time series model: the input is a vector of N components, and we are presented with the components one at a time, in order. We present an algorithm that uses polylogarithmic space, processes each new item in constant time, and, in polylogarithmic post-processing time, determines a (1+epsilon)-approximation to the optimal histogram. This matches, up to polylogarithmic factors, the space complexity of histogram construction on streams with a uniform workload [Guha, Indyk, Muthukrishnan, and Strauss]. To obtain this result, we identify and exploit the notion of linearly robust histograms. This is the first known algorithmic result with provable guarantees for workload-optimal histogram construction on streams.
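For context on the objective the streaming algorithm approximates: a workload-optimal B-bucket histogram minimizes the workload-weighted sum-squared error, where each point's error is scaled by its query weight. Offline, this optimum can be computed exactly by the classical V-optimal dynamic program generalized to weights. The sketch below is that O(N^2 B) offline baseline, not the paper's streaming algorithm; function names are illustrative:

```python
def weighted_voptimal(v, w, B):
    """Minimum workload-weighted sum-squared error of a B-bucket
    piecewise-constant histogram for data v with weights w."""
    n = len(v)
    # Prefix sums of w, w*v, w*v^2 let us evaluate any bucket's
    # error in O(1): the best constant for a bucket is its
    # weighted mean, giving error sum(w*v^2) - (sum(w*v))^2 / sum(w).
    pw = [0.0] * (n + 1)
    pwv = [0.0] * (n + 1)
    pwv2 = [0.0] * (n + 1)
    for i in range(n):
        pw[i + 1] = pw[i] + w[i]
        pwv[i + 1] = pwv[i] + w[i] * v[i]
        pwv2[i + 1] = pwv2[i] + w[i] * v[i] * v[i]

    def bucket_err(a, b):  # error of one bucket covering v[a:b]
        W = pw[b] - pw[a]
        if W == 0:
            return 0.0
        s = pwv[b] - pwv[a]
        return (pwv2[b] - pwv2[a]) - s * s / W

    INF = float("inf")
    # dp[b][j] = min error covering the first b points with j buckets
    dp = [[INF] * (B + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, B + 1):
        for b in range(1, n + 1):
            for a in range(b):
                cost = dp[a][j - 1] + bucket_err(a, b)
                if cost < dp[b][j]:
                    dp[b][j] = cost
    return dp[n][B]
```

With uniform weights this reduces to the standard V-optimal histogram; a skewed workload pulls bucket boundaries toward the heavily queried region.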

Now consider the case when the workload is not stored fully, but is
compressed. As we show, trivial lossy compression can be used (one
can drop low-order bits of individual weights), but algorithmic
results such as the above are not possible if any nontrivial lossy
compression is used. However, we show that our algorithmic results
extend efficiently to the case when the workload is compressed
without loss, for example by a universal compression scheme such as
Lempel-Ziv. This requires supporting a symbol-range-count data
structure on compressed data, which may be of independent interest.
Moreover, the direction of exploring data stream problems when one
of the inputs (the workload) is compressed losslessly is novel.
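For intuition, a symbol-range-count query asks how many occurrences of a symbol c fall in a position range [l, r) of a sequence. The sketch below answers such queries on uncompressed data via per-symbol prefix counts; the nontrivial part in the paper is supporting the same query directly on the Lempel-Ziv-compressed representation without decompressing. Names are illustrative:

```python
def build_prefix_counts(seq):
    """For each symbol c, pref[c][i] = occurrences of c in seq[:i]."""
    alphabet = set(seq)
    pref = {c: [0] * (len(seq) + 1) for c in alphabet}
    for i, s in enumerate(seq):
        for c in alphabet:
            pref[c][i + 1] = pref[c][i] + (1 if s == c else 0)
    return pref


def range_count(pref, c, l, r):
    """Occurrences of symbol c in positions [l, r), in O(1) time."""
    if c not in pref:
        return 0
    return pref[c][r] - pref[c][l]
```

This baseline uses O(|alphabet| * n) space on the raw sequence; the point of the compressed-data structure is to avoid materializing anything proportional to the uncompressed length.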

Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2005/2005-19.ps.gz
