DIMACS Working Group on The Burrows-Wheeler Transform: Ten Years Later

DIMACS Working Group on The Burrows - Wheeler Transform: Ten Years Later

August 19 - 20, 2004
DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ

Organizers:: Paolo Ferragina, University of Pisa; Giovanni Manzini, University of Piemonte Orientale; S. Muthukrishnan, Rutgers University, muthu@cs.rutgers.edu

Presented under the auspices of the Special Focus on Special Focus on Data Analysis and Mining.

Abstracts:

Alberto Apostolico, University of Padova, Italy

Title: Fast Gapped Variants of LZW compression

This talk presents variants of classical data compression paradigms by Ziv, Lempel, and Welch in which the phrases used in compression are strings of intermittently solid and wild characters. Such strings are suitably selected among patterns that are produced in turn, in a deterministic fashion, by the self-correlation of the source string. In particular, adaptations and extensions of the classical LZ78 paradigm as implemented by Welch are developed along these lines, and they are seen to be susceptible of simple linear time implementation. Both lossy and lossless schemata are considered, and preliminary analyses of performance are attempted.

Mike Burrows, Google
(Joint with David Wheeler, University of Cambridge)

Title: The Pre-history and Future of the Block-Sorting Compression Algorithm

I have organized the talk in two parts. The first is about the years before the algorithm was published, and the second looks forward to the next ten years.

Unfortunately, my co-author (and the algorithm's inventor) David Wheeler is unable to come to the workshop. However, he has given me some notes about the invention of the algorithm. I will begin by recounting some of his early thoughts; some have stood the test of time, while others have not. I'll also describe our efforts to make a practical compressor based on the algorithm, and various things that didn't make it into the original paper.

The compression achieved by the algorithm was good from the first, and further gains have been achieved since its publication. However, its compression and decompression speed have received less attention. This is curious, because it is its speed rather than its compression ratio that potential users find most irksome. One might think that the steady improvement in CPU performance would lead to fewer concerns about its performance; certainly, larger memories have made its memory footprint less troublesome that it was. Nevertheless, the unusual structure of the algorithm has rendered recent advances in CPU microarchitecture largely ineffective at improving its performance, while other compression algorithms have benefitted. In contrast, I suspect that the next ten years of improvements in hardware are more likely to favour block-sorting over competing techniques, given suitable algorithmic changes. I hope to report on one such promising change (joint work with my colleagues Sean Quinlan and Sean Doward at Google).

Neva Cherniavsky, University of Washington, USA
(Joint with Richard Ladner, University of Washington)

Title: Grammar-based Compression of DNA Sequences

Grammar-based compression methods have shown success for many different types of data. The central idea behind these methods is to use a context-free grammar to represent the input text. Grammars can capture repetitions occurring far apart in the data. This is a limitation on sliding window or block sorting algorithms, such as LZ77 or bzip2. Compression of DNA sequences is a notoriously hard problem. DNA contains only four symbols, and so can be represented by two bits per symbol. It is very hard to beat the bound of two bits per symbol. However, DNA is also known to have long repetitions that a compressor could hope to capture. Furthermore, DNA has a unique kind of repetition, because it contains both exact matches and reverse complement matches. We explore the effectiveness of grammar-based compression on DNA sequences. We exploit the different types of repetition in DNA by modifying a successful grammar inference algorithm, Sequitur [3]. We then endeavor to maximize our gain in the next two stages of grammar compression: grammar encoding and entropy encoding. We present several different techniques for grammar encoding, ranging from very simple methods to more complicated pointer-based methods. All skew the data in some way to make the entropy encoding step more effective. We implement a custom arithmetic coder as the entropy coder, which is specifically designed to work well with our symbol stream. After evaluating our algorithm, we return to the grammar inference portion of the algorithm and improve the grammar by quantifying the efficiency of the rules. After striving to optimize each stage of our compression algorithm, we conclude that grammar-based compression does not work well on DNA sequences. The best compressors for DNA are GenCompress [2] and DNACompress [1]. These algorithms, while achieving a bit rate smaller than two bits per symbol, still do not compress much more than a standard first order entropy coder. We pose this as a challenge to the compression community: is there a clever algorithm which can compress DNA significantly better than a simple first order arithmetic coder?

References
[1] Chen, X., Li, M., Ma, B., and Tromp, J. DNACompress: Fast and effective DNA sequence compression. Bioinformatics 18, 12 (Dec. 2002), 1696-1698.
[2] Grumbach, S., and Tahi, F. A new challenge for compression algorithms: genetic sequences. Information Processing and Management 30, 6 (1994), 875-886.
[3] Nevill-Manning, C. G., and Witten, I. H. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research 7 (Sept. 01 1997), 67-82.

Fred Douglis, IBM T.J. Watson, USA

Title: Toward Ubiquitous Compression

Introduction

Mark Weiser once defined ubiquitous computing as computers that blend into the environment so well that they are effectively invisible [14]. Basic compression (such as with zip) is starting to achieve the same level of ubiquity, but it has far to go. This paper considers the evolution of on-line compression in a number of real-world applications and conjectures on ways in which compression should continue to evolve. It is written from the perspective of a consumer of compression technologies, not one who writes the compressors in the first place.

About Compression

In this paper, compression refers to any technique that encodes data more compactly. Thus it includes not only "traditional" compression [7], which encodes an object by reducing the redundancy within it or by encoding the object relative to a well-known dictionary, but also approaches that encode different objects relative to each other. Examples include duplicate suppression [1], which replaces identical objects (at some granularity) with references to an existing copy, and delta-encoding [5], which encodes an object with respect to another specific object.

Tradeoffs

Compression requires computing resources, such as processor cycles and memory. In some environments the expenditure of these resources is clearly justified. This may be because the savings from compression are substantial, because the compressed object will be saved in compressed form for a prolonged time, or other reasons. Some forms of compression are automatic and implicit. For example, modems typically compress data being transmitted over a phone line. By doing the compression in hardware, modems ensure that the performance of compression is comparable to the speed of transmission; .i.e, transmission is not delayed by the act of compressing or uncompressing data. For stored data, the notion of compressing data that will not be accessed for a long time has been well understood for decades.

Quite some time ago, I expounded on the trends in processing speed and suggested that as processors got faster more quickly than networks, distributed systems should consider automatically compressing data in software in an end-to-end network connection [3]. While I did not anticipate the rapid increase in network bandwidth that accompanied the ever-increasing processor performance, and memory bandwidth is proving to be another important factor, the general approach still applies:

Systems should automatically compress data whenever the benefits from transmitting or storing compressed data outweigh the costs.

Just as modems compress transparently, distributed systems, storage systems, and especially distributed storage systems should be cognizant of the tradeoffs in dynamic compression and integrate compression techniques.

Where We Are

On-line compression, where everything that goes over a link or into or out of a file system is compressed and later uncompressed on the "critical path" to accessing the data, is by now a well-understood and commonly used technique. For several years, users have had an option to compress everything written to a Windows file system. More recently, there have been examples of other techniques to trade processing for data size, using technology beyond simple zip-style data compression, such as:

Rsync [13], which allows a user to synchronize versions of a file on two different systems by identifying common blocks. It uses a rolling checksum to match blocks on the sender that are already contained in the receiver's copy. The same rolling checksum technique can be applied to storage systems to find duplication across files [2].
Link-level duplicate detection [12], which stores fingerprints representing occasional substrings of streamed data and then matches repeated data by seeing the same fingerprint again.
Storage-level duplicate detection, which uses strong checksums to find when files or blocks are stored multiple times and instead save only one copy. Blocks can be fixed-size [11] or with boundaries defined by the content itself [8], the latter preventing changes in one block from affecting subsequent ones but at a higher processing cost [10].
Delta-encoding [5], which compresses a file or block against another file or block that is similar, by representing only the differences. It can use resemblance detection to find similar files [4, 9] or blocks [6].

I make two observations about these techniques. First, they are generally done in isolation. A system that does block-level duplicate detection may not simultaneously perform compression at the level of entire files, even though whole-file compression could exploit inter-block redundancy and dramatically reduce overall storage consumption [6]. Second, the techniques are applied all-or-none: a system that does one type of compression most likely always does that type.

Where We Should Go

Not all data are created equal. Some are accessed often; some are write-once, read-rarely-if-ever. Some compress well with "traditional" compressors; some contain duplication with other pieces of data. The same is true of computing environments, where the speed of processing, accessing a disk, or communicating over a network can be extremely variable. It is the combination of all these factors-data contents, access patterns, and execution environment-that determines how compression should best be applied, if at all. One size does not fit all.

Ideally, one can give systems (and applications) the ability to pick and choose automatically among a suite of data reduction techniques. With my colleagues at IBM Research, I have explored and quantified the benefits of a number of techniques across a variety of data sets [6]. Ultimately, one would offer systems the ability to select compression techniques dynamically as a function of the data and the execution environment. Eventually, we may enable compression and other data reduction techniques to fade into the background, just like Weiser's ubiquitous computers.

Acknowledgments

Jason LaVoie and John Tracey provided helpful comments on this paper.

References
[1] William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur. Single instance storage in windows 2000. In Proceedings of the 4th USENIX Windows Systems Symposium, August 2000.
[2] Timothy E. Denehy and Windsor W. Hsu. Reliable and efficient storage of reference data. Technical Report RJ10305, IBM Research, October 2003.
[3] Fred Douglis. On the role of compression in distributed systems. In Proceedings of the Fifth ACM SIGOPS European Workshop. ACM, September 1992. Also appears in ACM Operating Systems Review, 27(2):88-93, April 1993.
[4] Fred Douglis and Arun Iyengar. Application-specific delta-encoding via resemblance detection. In Proceedings of 2003 USENIX Technical Conference, June 2003.
[5] David G. Korn and Kiem-Phong Vo. Engineering a differencing and compression data format. In Proceedings of the 2002 Usenix Conference. USENIX Association, June 2002.
[6] Purushottam Kulkarni, Fred Douglis, Jason LaVoie, and John M. Tracey. Redundancy elimination within large collections of files. In Proceedings of the 2004 Usenix Conference, June 2004. To appear.
[7] D. A. Lelewer and D. S. Hirschberg. Data compression. ACM Computing, Springer Verlag (Heidelberg, FRG and NewYork NY, USA)-Verlag Surveys, ; ACM CR 8902-0069, 19(3), 1987.
[8] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A low-bandwidth network file system. In Symposium on Operating Systems Principles, pages 174-187, 2001.
[9] Zan Ouyang, Nasir Memon, Torsten Suel, and Dimitre Trendafilov. Cluster-based delta compression of a collection of files. In International Conference on Web Information Systems Engineering (WISE), December 2002.
[10] Calicrates Policroniades and Ian Pratt. Alternatives for detecting redundancy in storage systems data. In Proceedings of the 2004 Usenix Conference, June 2004.
[11] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In Proceedings of the First USENIX Conference on File and Storage Technologies, Monterey,CA, 2002.
[12] Neil T. Spring and David Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of ACM SIGCOMM, August 2000.
[13] Andrew Tridgell. Efficient Algorithms for Sorting and Synchronization. PhD thesis, Australian National University, 1999.
[14] Mark Weiser. Some computer science issues in ubiquitous computing. Communications of the ACM, 36(7):74-84, July 1993.

Martin Farach-Colton, Rutgers University, USA

Title: A Survey of Suffix Sorting

Suffix trees are the fundamental data structure of stringology. The suffix array is a low-memory cousin of the suffix tree. In the last 10 years, we have come to realize that building such data structures is equivalent to sorting the suffixes of a string. Such suffix sorting is the key algorithmic component of the Burrows-Wheeler transform. We will show the equivalences between suffix trees, suffix arrays and suffix sorting, as well as the algorithm that suffix sorts in the same time taken to simply sort the characters of the string. The field has matured to the point where we may well have the suffix-sorting algorithm "from the book".

Peter Fenwick, Auckland University, NZ

Title: Using the Burrows Wheeler Transform for PPM compression without escapes

High performance lossless data compression is dominated by two algorithms. The older algorithm, PPM or Prediction by Partial Matching, gives slightly better compression, but requires rather more complex data structures and has rather slower execution. The newer algorithm, based on the Burrows Wheeler transform, uses simpler data structures and is faster but with slightly poorer compression.

Although PPM is now a relatively mature technique, Burrows Wheeler compression is still not well understood, although knowledge is certainly improving. (It must be recognized that PPM at the same "age", 10 years, was also still largely a collection of ad hoc rules.) Some previous work has tried to combine Burrows Wheeler and PPM techniques, usually by adapting PPM to process the permuted output of the Burrows Wheeler transformation and bypassing the more usual Move-to-Front recoding of Burrows Wheeler. This work has generally given little advantage, largely because the permuted output has a context structure which is not easily exploited by PPM techniques.

The performance of PPM is limited largely by the handling of escapes, to control the coding of symbols that have not been seen in a higher order. An escape is a symbol that must be encoded as any other, according to its probability. But whereas we can give a good estimate of the probabilities of known symbols (which we have seen), it is much more difficult to estimate the probability of encountering an unexpected symbol. Improvements in PPM have largely followed from improvements in estimating escape probabilities.

This paper discusses work-in-progress on a new technique for combining Burrows Wheeler and PPM. While most prior work on Burrows Wheeler compression emphasises data manipulation after the forward transform and before its inverse, the current approach looks much more closely at the inverse transform.

In conventional Burrows Wheeler compression the reverse transform generates a single text string, corresponding to the original compressed text. But the reverse transform can do much more than this and can generate contexts for all of the symbols in the permuted data (and therefore contexts for every input symbol). The usual transform generates the context for either the last input symbol (with forward contexts) or the first symbol (with reverse contexts and producing reversed output). Conventional Burrows Wheeler produces its single output context to an unbounded order, so recovering the entire input text.

Here we use the reverse transform to generate contexts for every symbol, but only to some small constant order, say 6 or 8. Because we generate all contexts for the input, a PPM compressor can operate at constant order, without escapes. With contexts needed to only say 6 or 8 symbols, the forward transform can be simpli?ed to use comparisons to only a similar length. A slight modification to the forward comparison allows like symbols to be grouped within their contexts, with run-length encoding improving the information density.

The compressor starts by using a normal forward Burrows Wheeler transform, but with limited comparison lengths. Its permuted output is encoded (in any "standard Burrows Wheeler" manner) as the context-definition part of the compressed output. The permuted output is also used to generate the complete suite of constant order PPM contexts, using the modified reverse transform. Finally, the original data is processed by the constant order PPM compressor, using the contexts generated from the Burrows Wheeler processing. The PPM output forms the second part of the compressed output.

The decompressor first creates the contexts, using the same code as was used in compression. It then uses the second part of the compressed data to drive the PPM decoder and recover the original text. Note here that PPM codes are generated and emitted only as needed; many PPM contexts are deterministic and their one symbol can be emitted immediately with no guidance from the compressor.

Results so far do not give the expected improvements, because it turns out not that much cheaper to send the reduced context information than the full Burrows Wheeler output, and also because many symbols are encoded twice, once for the contexts and once for PPM. But two lines of future study are suggested -

1. It is often possible to develop contexts rather longer than the Burrows Wheeler comparison. This may reduce the costs of sending the contexts.

2. It is suggested that there may be a connection between error correction coding and compression which would allow some PPM codes to be replaced by erasure symbols. Techniques allied to trellis or Viterbi coding may then allow the contexts to be recovered.

James J. Hunt, Fzi, Germany

Title: Block Sorting Lossless Delta Compression Algorithms

The ideas behind block sorting lossless data compression algorithm have influence not only file compression but also delta compression. Modern delta compression algorithms based on Ziv-Lempel techniques significantly outperform diff, a popular but older delta compressor, in terms of compression ratio. The modern compressors also correlate better with the actual difference between files without sacrificing performance.

Two different delta compression algorithms will be presented. The algorithms used different strategies for finding common blocks. These strategies will also be compared and their relationship to the Burrows-Wheeler Transform will be presented. Performance issues will also be addressed.

The performance of a delta algorithm is dependent upon the size of the difference between pairs of files. A metric using the Longest Common Subsequence (LCS) as the reference against which to measure compression will be presented. This metric applies mainly to one-dimensional data, but it may also apply to higher dimensional data that can be linearized without fragmenting typical changes.

Delta algorithms are also related to software differencing and merging. By using tokens instead of byte, one can use delta algorithms to improve differencing and merging. In order to attain accurate results, differencing and merging must respect the block structure of programs. A method to extend delta algorithms to this domain will also be presented.

Giovanni Manzini, University Piemonte Orientale, Italy
(Joint with Paolo Ferragina, University of Pisa, Italy, Raffaele Giancarlo, University of Palermo, Italy, Marinella Sciortino, University of Palermo, Italy)

Title: Compression Boosting Using the Burrows-Wheeler Transform

We discuss a general boosting technique for data compression. Qualitatively, our technique takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. Our technique displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the "best possible" contexts; (b) it is very simple and optimal in terms of time; (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties.

Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically we show that there exists a proper partition of the Burrows-Wheeler Transform of a string s that shows a deep combinatorial relation with the k-th order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost.

Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster we get better compression algorithms than some of the best existing ones, i.e., LZ77, LZ78, PPMC and the ones derived from the Burrows-Wheeler Transform. Further, we settle analytically some long standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process.

Nasir Memon, Polytechnic University, USA
(Joint with Lee Butterman, Brown University, USA)

Title: An Error-Resilient Blocksorting Compression Algorithm

One key limitation of adaptive lossless compression systems is the susceptibility to errors in the compressed bit stream. The inherent design of these techniques often requires discarding all data subsequent to the error. This is particularly problematic in the Burrows-Wheeler Blocksorting Transform (BWT), which is used in practice to compress files around 1MB at a time. In the original BWT, each corrupted block must be completely discarded, but decreasing blocksize increases the bit-per-character (bpc) rates dramatically. Error-correcting codes (such as Reed-Solomon codes) can be used, but their design allows for a maximum pre-fixed error rate, and if the channel errors exceed this, the whole block is again lost. In this paper we present an error-resilient version of the BWT that has error-free output in low channel noise, and gracefully degrades output quality as errors increase by scattering output errors, thereby avoiding the significant error propagation typical with adaptive lossless compression systems (and BWT in particular). The techniques we develop also yields interesting new insights into this popular compression algorithm. We show how to perform the inverse blocksort to decode both forwards and backwards by inverting the original permutation vector T . We periodically save the values of T as overhead. When the inverse blocksorter encounters an error, it can start at a specific position that corresponds to one of the saved T anchors, and it can bi-directionally decode until reaching an error. When there is more than one error between two anchors, we can decode forwards up to the first error and backwards from the next anchor until the last error between the anchors. This also generates subloops, decodable text between the anchors that is not connected. The problem of reconnecting them can be formulated as a TSP problem and can be solved by an appropriate heuristic.

S. Cenk Sahinalp, Simon Fraser University, CA

Title: Comparing Sequences with Segment Rearrangements

Computational genomics involves comparing DNA sequences based on similarity/distance computations for detecting evolutionary and functional relationships. Until recently available portions of published genome sequences were fairly short and sparse and the similarity between such sequences was measured based on character level differences. With the advent of whole genome sequencing technology there is emerging consensus that the measure of similarity between long DNA sequences must capture segmental rearragements found in abundance in the human genome. In this talk we will focus on block edit distance, which is defined as the smallest number of character edits (replacement and indel) and block edits (segmental duplication, deletion and translocation) to transform one sequence into another. Although it is NP hard to compute the block edit distance between two sequences, we show that it can be approximated within a constant factor in linear time via a simple one pass algorithm. The approximation method, based on the Lempel-Ziv'77 compressibility of one sequence when concatenated with the other, enhances the long suspected link between mutual compressibility and evolutionary relationship between genomic sequences.

Marinella Sciortino, University of Palermo, Italy
(Joint with Sabrina Mantaci, Antonio Restivo, University of Palermo, Italy)

Title: Generalized Burrows-Wheeler Transform

Michael Burrows and David Wheeler introduced in 1994 (cf. [1]) a reversible transformation on strings (BWT from now on) that arouses considerable interest and curiosity in the field of Data Compression.

Most of the studies on the Burrows Wheeler Transform have been experimental and developed within the Data Compression community. Very recently some combinatorial aspects of this transform have been investigated (cf. [4, 5]). For instance, by using the Burrows-Wheeler transform, one can derive a further characterization of Standard words, that are very important objects in the field of Combinatorics on Words: more specifically, Standard words correspond to the extremal case, for binary alphabets, of BWT,inthe sense that the transform produces a total clustering of all the instances of any character.

Moreover it has been proved (cf. [2]) that there exists a very close relation between BWT and a transformation, described by Gessel and Reutenauer in [3], based on a bijection between finite words and the set of permutations with a given cyclic structure and a given descent set. Actually in this formalism, BWT corresponds to the special case in which the considered permutations are cyclic and the cardinality of its descent set is less than the cardinality of the alphabet.

Now, by starting from the Gessel and Reutenauer transform, we can define a generalized Burrows-Wheeler transform (denoted by GBW T) applied to a set of n words. This transformation involves an order relation between words that is different from the lexicographic one. The sorting process derived from such an order relation is realized by using the Fine and Wilf Theorem for n periods (cf. [6]).

Moreover the GBW T allows to introduce a new notion of similarity between words that takes into account how equal factors of two words appear in similar contexts.

References
[1] M. Burrows and D.J. Wheeler. A block sorting data compression algorithm. Technical report, DIGITAL System Research Center, 1994.
[2] M. Crochemore, J. D�esarm�enien, and D. Perrin. A note on the Burrows-Wheeler transformation. Theoret. Comput. Sci. to appear.
[3] I. M. Gessel and C. Reutenauer. Counting permutations with given cycle structure and descent set. J. Combin. Theory Ser. A, 64(2):189-215, 1993.
[4] S. Mantaci, A. Restivo, and M. Sciortino. Burrows-Wheeler transform and Sturmian words. Informat. Proc. Lett., 86:241-246, 2003.
[5] S. Mantaci,A.Restivo,and M. Sciortino. Combinatorial aspects of the Burrows-Wheeler transform. TUCS (Turku Center for Computer Science) General Pubblication, 25:292-297, 2003. proc. WORDS 2003.
[6] R. Tijdeman and L.Zamboni. Fine and Wilf words for any periods. Indag. Math., 14(1):135-147, 2003.

Torsten Suel, Polytechnic University, USA

Title: Remote file and data synchronization: State-of-the-art and open problems

Remote file and data synchronization tools are used to efficiently maintain large replicated collections of files or record-based data in distributed environments with limited bandwidth. Common application scenarios include synchronization of data between accounts or mobile devices, content distribution and web caching networks, web site mirroring, storage networks, database replication, and large scale web search and mining. A number of techniques have been studied by academic researchers over the last few years, and many companies have developed and deployed tools for various applications.

After a general introduction, we will focus on the remote file synchronization problem, which in simplified form is stated as follows: given two versions of a file on different machines, say an outdated and a current version, how can we update the outdated version with minimum communication cost, by exploiting the significant similarity that often exists between versions? A well-known open source tool for this problem called rsync is widely used in practice, and recently researchers have proposed several possible improvements over the rsync protocol. We will describe theoretical and practical approaches to the file synchronization problem, discuss relationships to compression and coding theory, and list open challenges. We will also discuss some of our own recent work on the problem. We hope to convince the audience that these problems are important in practice and of fundamental interest, and that the compression and algorithms communities in particular have much to contribute.

Kiem-Phong Vo, AT&T Labs Research, USA

Title: Vcodex: A Platform of Data Transformers

Modern data compression methods are often built upon other existing techniques. However, such works tend to focus on the theoretical side of algorithm engineering, i.e., the design and analysis of the underlying data structures and algorithms while implementation quality is mostly constrained to producing working tools and often only for the few hardware/OS platforms used by the tool developers. Not much attention is given to software engineering, i.e., the design and implementation of standardized interfaces to make such data structures and algorithms reusable. In addition, virtually every compressor uses its own ad-hoc data format and little thought is ever given to data engineering so that data produced by one tool can be further processed by another. Inadequate engineering considerations make it hard to experiment with different combinations of compression techniques in solving particular data needs. Below are a few relevant aspects of algorithm, software and data engineering in building and using compression tools:

Algorithm engineering: Algorithm design aims at achieving desired computational bounds while producing good compression rates. But, to maximize performance, attention must also be paid to tuning the performance of such algorithms by making effective use of hardware and OS features such as specialized instructions or available cache and memory.
Software engineering: Data compressors are routinely built from smaller components. For example, the popular gzip compressor combines a front-end Ziv-Lempel compressor and a back-end Huffman coder. Similarly, Burrows-Wheeler compressors often contains phases based on variations of move-to-front, run-length and entropy coding. Such data transformers should be encapsulated in standardized interfaces allowing easy composition in building new compression tools.
Data engineering: Large gains can be obtained by classifying data based on certain common characteristics and designing algorithms to match such classifications. Examples of this are specialized compressors such as the Pzip compressor by Buchsbaum et al. for compressing table data and the Xmill compressor by Liefke and Suciu to compress XML data. Another aspect of data engineering is to design the persistent data, i.e., the compressed output data, so that they can be easily extended and used as new data transforming techniques are invented.

Vcodex is a software platform of data transforming algorithms that can be used as building blocks to construct data compressors as well as other data processing tools. Its main constributions include an extensible software architecture allowing easy additions of new data transformers, a flexible and self-describing data encoding format, and a number of new and efficient algorithms and heuristics for suffix sorting, compressing table data, etc. This talk discusses the software and data architectures provided by Vcodex. Examples will be given to show how to build common types of compressors based on generalized Ziv-Lempel and Burrows-Wheeler transforms as well as other exotic compressors dealing with different types of table data. Time permitting, the talk will also give overviews of the algorithms and data structures underlying the various techniques.

Previous: Program

Workshop Index

DIMACS Homepage

Contacting the Center
Document last modified on July 15, 2004.