Optimal Parsing Schemes for Dynamic Dictionary Compression

Suleyman Cenk Sahinalp

                       Bell Laboratories
                    Murray Hill, NJ 07974

A dictionary data compression algorithm parses an input string to phrases from the dictionary in use, and outputs the corresponding codeword for each of the phrases. In dynamic dictionary algorithms, the dictionary is constructed adaptively as the input is read and parsed. In this paper we address the issues of dictionary construction and the parsing method separately, and consider the following question: Given a dictionary construction scheme, what is the optimal parsing method, i.e., the one which outputs the smallest number of codewords possible on any given input. In addition, we investigate how good is the greedy parsing method, which is employed by almost all dynamic dictionary compression algorithms found in the literature including the well-known Lempel-Ziv algorithms and their variants.

We first show that greedy parsing can be far from optimal for some of the most popular dictionary construction schemes, including the popular Lempel-Ziv-78 scheme and its LZW variant: for some input strings that can be parsed into $n$ phrases, greedy parsing obtains $\Omega(n \sqrt{n/\log n})$ phrases. We then investigate whether greedy parsing can be improved by a few steps lookahead. Perhaps our most surprising result is that greedy parsing with a single step lookahed - which we call flexible parsing (\FP), is optimal for all dictionary construction schemes that satisfy the prefix property; those include the Lempel-Ziv algorithms and their variants. More generally, we provide the necessary and sufficient conditions of a dictionary construction scheme for which greedy parsing with any $k\geq 0$ lookaheads is optimal. As a result, we show the greedy parsing (without any lookaheads) is optimal for dictionary construction schemes that satisfy the suffix property (and is identical to \FP). Those dictionaries include the Lempel-Ziv-77 scheme (which is known to be superior to the Lempel-Ziv-78 algorithm for some applications, and inferior for others).

We finally provide a data structure that enables efficient implementation of \FP, which, in terms of running time and space, is as good as the best data structure for greedy parsing: for the case of Lempel-Ziv-78 or the LZW dictionaries our algorithm runs in $O(1)$ time per character, and require space proportional to the number of phrases in the dictionary.

Joint work with Yossi Matias