Abstracts: 1. Information Content of Junk DNA Cenk Sahinalp Case Western Reserve Univerity The human genome contains several classes of repeat segments, i.e., sequences that appear more than once in the genome. Two well identified repeat classes are tandem and interspersed repeats. Both of these repeats tend to be short, have high copy numbers and do not consist genetic material; they are part of the "junk DNA". They differ mainly by the mechanisms responsible for their propagation: Tandem repeats are believed to be generated by unequal cross-overs or replication slippage whereas interspersed repeats are propagated via mechanisms of retrotransposition. There are also less well-understood classes of repetitive DNA such as the long repeat segments within the pericentromeric regions of the chromosomes. These repeats usually have low copy numbers (2 to 10) and high similarity (over 90%) and may include partial or complete gene segments and hence are only partially junk. In the course of meiosis, these repeats may lead to misalignment between sister chromatids, which may result in an excess or lack of developmentally important genes and cause genomic disorders. It is conjectured that there may be unique mechanisms for duplication and dispersal of such segments, different from the mechanisms mentioned above. In fact, there are potentially many more types of structural rearrangement mechanisms waiting to be discovered. This massive task calls for novel sequence search, discovery and evolutionary analysis algorithms for understanding the propagation of repeats. Furthermore, the abundance and the complexity of repeats provide opportunities for the development of new algorithms for their compression.
In this talk we will explore the potential interplay between discovery and evolutionary analysis of genome repeats and their compression. We will discuss possible applications of combinatorial pattern matching tools for lossless compression and HMMs for controlled lossy compression of repeats as a measure of the information content of the (partially)junk DNA sequences.
2. Correlation Patterns in Genomic Sequences Khalid Sayood University of Nebraska - Lincoln The explosive growth of information in the form of DNA sequences provides an obvious application for compression technologies. Unfortunately, the letters that form the sequence have an almost uniform distribution and the sequences do not seem to have the word structure that makes dictionary compression schemes effective. Thus rendering most standard compression algorithms useless. However, the lack of the ``standard'' forms of structure does not mean that these sequences do not possess structure. In fact there is a significant amount of structure in the sequences in the form of repeats and palindromes. Approaches that make use of these facts have shown significant success in providing compression. Clearly there is a need for investigating the kinds of structures peculiar to these sequences. In this work we investigate the correlation profile of genomic sequences. We empirically demonstrate the existence of correlation patterns in genomic sequences which are tied to species of origin. We speculate that these patterns could lead to context based compression algorithms where the context is the species from which the DNA was extracted. (Joint work with Mark Bauer)
3. Experimental Results on the Compressed Suffix Array for Human DNA Kunihiko Sadakane University of Tokyo We show experimental results on the compressed suffix array for human DNA. Because the length of human DNA is about three billion, it is difficult to use the suffix array of it for sequence search. The compressed suffix array we construct has size about 2 gigabytes. Therefore we can store it in main memory of a workstation. The search time for exact match is quite fast. We can find a pattern of length 1000 in 0.018 seconds from the human DNA.
4. Compressed Bloom Filters and Compressing the Web Graph Michael Mitzenmacher Harvard University Following the theme of compression, I plan to compress two talks into one (time permitting). Both talks are meant for general audiences. Compressed Bloom Filters (in PODC 01) A Bloom filter is a simple space-efficient randomized data structure for representing a set that supports approximate membership queries. Bloom filters are increasing being used in distributed systems; for example, in a distributed Web cache, proxies do not need to share exact the URL lists of their caches, but instead periodically broadcast Bloom filters respresenting their cache. We introduce compressed Bloom filters, which improve on Bloom filters in distributed settings. Compressing the Web Graph (joint work with Micah Adler, in the 2001 Data Compression Conference) We consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by recently proposed random graph models for describing the Web. We also provide hardness results demonstrating limitations on natural extensions of our approach.
5. Application-Independent, Stream-Based Compression in the Transport Layer Adam Buchsbaum AT&T Labs - Research Intuitively, the best place to compress data being sent over a network is at the stream level, before data are broken into packets, which present less context for compression algorithms to exploit. To achieve universal compression at the application layer, however, where streams are easily accessed, is problematic: every target application must be modified, and modified applications may not be able to communicate with unmodified counterparts on remote hosts. We present a method to install arbitrary stream filters at the TCP layer, independent of the calling applications. Which filters to install are negotiated within the standard, three-way TCP handshake, so no additional latency is incurred. The method is incrementally deployable, so that enhanced TCP clients can talk to vanilla TCP clients with no further modification. In the talk, I will describe the filter method and its application to compression. I will also present data that validate the intuition that compressing at the stream level outperforms compressing at the packet level. This is joint work with Steve Bellovin and S. Muthukrishnan.
6. Applications of delta compression and file synchronization techniques Torsten Suel Polytechnic University Delta compression and file synchronization techniques are concerned with efficient data transfer over a slow communication link in the case where the receiver already has similar data. This case arises naturally, e.g., when distributing updated versions of software or synchronizing files between different accounts and devices, and may become increasingly common in a highly networked environment where data and content are widely replicated, frequently modified, and cut and reassembled in different contexts and packagings. In this talk, we give an overview of delta compression and file synchronization techniques and software tools, and discuss a few applications in the context of web access, account synchronization, and peer-to-peer information retrieval.
7. Fragile Watermarks for LZ-77 Stefano Lonardi Univ California, Riverside. The massive distribution of documents allowed by the present technology makes it increasingly important to devise new methods to ensure authenticity. In this talk, we describe a simple variation on the classic Lempel-Ziv 77 algorithm that allows one to hide enough information to guarantee the authenticity of the compressed document. The design is based on cryptographically-secure pseudo-random generators, in such a way that the hidden data cannot be retrieved in a reasonable amount of time by an attacker unless the secret key is known. The embedding is totally transparent: to a casual observer, the augmented compressed file appears perfectly normal. In fact, it can still be decompressed by the original LZ-77 algorithm. Preliminary experiments show also the degradation in compression due to the embedding is almost negligible.
8. Pattern Discovery and the Algorithmics of Surprise Alberto Apostolico University of Padova and Purdue University The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. The discovery, particularly on a massive scale, of significant patterns and correlations thereof poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and descriptors grow faster and bigger than the phenomena they are meant to encapsulate. This talk reviews some results at the crossroads of statistics, pattern matching and combinatorics on words that enable us to control such paradoxes, and presents related constructions, implementations and empirical results.
9. Practical Compressed Suffix Array in Sublinear Space for Full-Text Searching Roberto Grossi University of Pisa. We describe a simple implementation of compressed suffix arrays that efficiently reduces the complex problem of full-text indexing to a simple problem on a special form of compressed inverted lists. We also show some heuristics to improve the index size and access time. (Joint work with Kunihiko Sadakane, Jeffrey Scott Vitter)
10. Higher Order Entropy Analysis of Compressed Suffix Arrays Ankur Gupta Duke University (Joint work with Roberto Grossi and Jeff Vitter)
11. Prediction via Data Compression Jeff Vitter Duke University We introduce universal predictors that are optimal in the limit in terms of error rate. Our novel approach is to use data compression techniques that are both theoretically optimal and good in practice. The motivating intuition is that in order to compress data effectively, you have to be able to predict future data well, and thus good data compressors trained on previous events should predict future events well. We can use our predictor for applications such as prefetching pages from slow memory, deciding when to spin down a disk on a mobile computer, and estimating query output sizes in a database system. For prefetching on a sequence of page requests produced by a Markov source or mth-order Markov source, we show analytically that the page fault rates incurred by our prefetching algorithms are optimal in the limit for almost all sequences of page accesses. We strengthen the result to the worst case and derive a randomized algorithm that we prove converges almost surely to the optimal page fault rate for every sequence, with respect to the class of finite state prefetchers. In particular we make no assumptions at all about how the sequence of page requests is generated. We also achieve the important computational goal of implementing our prefetcher in optimal constant expected time per prefetched page, using a novel application of the optimal and practical dynamic discrete random variate generator of Matias, Vitter, and Ni. Simulations using data bases and CAD traces and taking into account practical considerations show that moderate use of this prefetching approach can reduce the page fault rate from the 80-100 percent range (with LRU caching only) to around 25-50 percent, achieving better rates than currently used prefetchers. Joint work with P. Krishnan and part with K. Curewitz.
12. Low Delay Audio Coding for Communications Applications Gerald Schuller Fraunhofer Group for Electronic Media Technology, Ilmenau, Germany. New communication networks like next generation wireless networks enable high quality communications scenarios such as high quality teleconferencing, or new applications like musicians playing together over long distances. A requirement for success is a very low end-to-end delay, which includes the network delay and the delay necessary for data coding/decoding. For the audio part this leads to problems because conventional audio coders (like MP3, PAC, or AAC) introduce a considerable delay, unsuitable for communications applications. One suitable candidate is the MPEG-4 low delay audio coder. But its principle leads to a trade-off between the delay and the compression performance. Whereas it is suitable for two-way communications like teleconferencing, its delay is not low enough, for instance, for the musicians scenario. In the talk I will present the basic workings of current audio coders such as the MPEG-4 low delay coder, show the causes of their delay, and a new coding approach to achieve a very low delay while obtaining state-of-the art compression ratios. The new method uses adaptive cascaded prediction and noise shaping according to perceptual criteria at its core.
13. Preordering Binary Images for Better Compression G. Sampath College of New Jersey, Ewing, NJ The possibility of increasing the compression ratio of a bilevel image matrix by first transforming it into an ordered form is considered. This preordering step is lossless, its output is the ordered matrix and a pair of permutations (which are to be applied to the decompressed data to recover the original image). With m rows and n columns this step adds O(m log m + n log n) bits to the compressed data. Experiments with test images have shown increases of 25% to 70% in the compression ratio achieved with gzip and compress over that with the JBIG compression program pbmtojbg for certain types of images. The preordering also serves to encrypt the image matrix, with the permutation pair acting as the decryption key. The technique can be extended in a simple way to grayscale and color images.
14. Compression From The Database Perspective Zhiyuan Chen Cornell University Compression has been widely used in many fields, and the goal of using compression in those fields is to save space or transfer cost. However, the main function of database systems is to answer queries posted by users. As a result, in database systems, compression is often used to improve performance rather than to save space. This different goal of using compression introduces some novel issues, and thus compression has not been widely used in commercial database systems. In this talk, I will first give an overview of some of these issues and then I will deal with one of the issues that is to balance the savings of compression and the overheads of compressing and decompressing information stored in databases. My solution to this issue consists of two approaches that are complementary to each other. The first approach is to choose compression methods that have low decompression overhead and take advantage of the structure of the information stored in database. The second approach is to choose the appropriate point of time to decompress the information during query execution. Experimental results demonstrate that the combination of these two approaches achieve up to an order speed up over existing database systems.
15. SPARTAN: A Model-based Semantic Compression System for Massive Data Tables Rajeev Rastogi Bell Labs, Lucent Tech. While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this talk, I'll describe SPARTAN, a system that exploits data semantics to perform lossy compression of massive data tables. SPARTAN has two distinct characteristics that enable it to outperform existing compression algorithms: (1) for a certain subset of attributes, referred to as predicted attributes, no values are explicitly stored in the compressed table; instead concise Classification and Regression Tree (CaRT) models that predict the values of each predicted attribute are stored in the final compressed form, and (2) users are allowed to specify bounds on the "loss" that can be tolerated on individual data attributes. I'll also present the results of some experiments we carried out with real-life data sets which demonstrate the effectiveness of SPARTAN`s model-based approach.
16. Surfing Wavelets on Streams: Compressing Dynamic Data Using Wavelets. Yannis Kotidis AT&T Labs - Research, NJ. We present techniques for computing small space representations of massive data streams. These are inspired by traditional wavelet-based approximations that consist of specific linear projections of the underlying data. We describe general "sketch" based methods for capturing various linear projections of the data and use them to provide point-wise and range-sum estimation of data streams. These methods use small amounts of space and per-item processing time while streaming through the data, and provide for an accurate representation as our experiments with real data streams show. (Joint work with A.Gilbert, S.Muthukrishnan and M.J. Strauss)