## Optimal Parsing Schemes for Dynamic Dictionary Compression

### Suleyman Cenk Sahinalp

Bell Laboratories
Murray Hill, NJ 07974
jenk@research.bell-labs.com

**Abstract:**

A dictionary data compression algorithm parses an input string to phrases
from the dictionary in use, and outputs the corresponding codeword for each of
the phrases.
In dynamic dictionary algorithms, the dictionary is constructed adaptively
as the input is read and parsed.
In this paper we address the issues of dictionary
construction and the parsing method separately, and consider the following
question:
Given a dictionary construction scheme, what is the optimal parsing
method, i.e., the one which outputs the smallest number of codewords possible
on any given input.
In addition, we investigate how good is the greedy parsing method, which is
employed by
almost all dynamic dictionary compression algorithms found in the literature
including
the well-known Lempel-Ziv algorithms and their variants.
We first show that greedy parsing can be far from optimal for some of the most
popular dictionary construction schemes, including the popular Lempel-Ziv-78
scheme
and its LZW variant: for some input strings that can be parsed into $n$
phrases,
greedy parsing obtains $\Omega(n \sqrt{n/\log n})$ phrases.
We then investigate whether greedy parsing can be improved by a few steps
lookahead.
Perhaps our most surprising result is that greedy parsing with a single step
lookahed - which we call flexible parsing (\FP),
is optimal for all dictionary construction schemes that satisfy the prefix
property;
those include the Lempel-Ziv algorithms and their variants.
More generally, we provide the necessary and sufficient conditions of a
dictionary
construction scheme for which greedy parsing with any $k\geq 0$ lookaheads is
optimal.
As a result, we show the greedy parsing (without any lookaheads) is optimal
for dictionary construction schemes that satisfy the suffix property
(and is identical to \FP).
Those dictionaries include the Lempel-Ziv-77 scheme (which is known to
be superior to the Lempel-Ziv-78 algorithm for some applications, and inferior
for others).

We finally provide a data structure that enables efficient implementation of
\FP,
which, in terms of running time and space, is as good as the best data
structure for
greedy parsing: for the case of Lempel-Ziv-78 or the LZW dictionaries our
algorithm
runs in $O(1)$ time per character, and require space proportional to the
number
of phrases in the dictionary.

Joint work with Yossi Matias