Bell Laboratories Murray Hill, NJ 07974 jenk@research.bell-labs.comAbstract:
We first show that greedy parsing can be far from optimal for some of the most popular dictionary construction schemes, including the popular Lempel-Ziv-78 scheme and its LZW variant: for some input strings that can be parsed into $n$ phrases, greedy parsing obtains $\Omega(n \sqrt{n/\log n})$ phrases. We then investigate whether greedy parsing can be improved by a few steps lookahead. Perhaps our most surprising result is that greedy parsing with a single step lookahed - which we call flexible parsing (\FP), is optimal for all dictionary construction schemes that satisfy the prefix property; those include the Lempel-Ziv algorithms and their variants. More generally, we provide the necessary and sufficient conditions of a dictionary construction scheme for which greedy parsing with any $k\geq 0$ lookaheads is optimal. As a result, we show the greedy parsing (without any lookaheads) is optimal for dictionary construction schemes that satisfy the suffix property (and is identical to \FP). Those dictionaries include the Lempel-Ziv-77 scheme (which is known to be superior to the Lempel-Ziv-78 algorithm for some applications, and inferior for others).
We finally provide a data structure that enables efficient implementation of \FP, which, in terms of running time and space, is as good as the best data structure for greedy parsing: for the case of Lempel-Ziv-78 or the LZW dictionaries our algorithm runs in $O(1)$ time per character, and require space proportional to the number of phrases in the dictionary.
Joint work with Yossi Matias