DIMACS Center, CoRE Building, Rutgers University

**Organizers:**
**Emina Soljanin**, Bell Labs, emina@lucent.com
**Paul Siegel**, University of California - San Diego, psiegel@ucsd.edu
**Bane Vasic**, University of Arizona, vasic@ece.arizona.edu
**Adriaan J. van Wijngaarden**, Bell Laboratories, alw@research.bell-labs.com

Title: Can We Explain the Faithful Communication of Genetic Information?

Genomes are conserved with astonishing faithfulness through geological ages. To explain this fact, we hypothesize that the genetic message borne by DNA involves powerful error-correction means, consisting of 'nested codes' which provide stronger protection to the most basic (and oldest) parts of the genomic message. It turns out that, assuming very long codes and performance close to that promised by the channel coding theorem of information theory, many basic features of the living world and of biological evolution can be explained: the existence of discrete species, the possibility of their taxonomy, and the trend of evolution towards complexity. Moreover, answers to controversial points are proposed, e.g., that evolution proceeds by jumps, which suggests a non-Darwinian mechanism for the origin of species. We assume that the hypothesized error-correction means use 'soft codes', a soft code being defined as a set of constraints which link symbols together. We consider not only deterministic constraints expressed by operations in mathematical structures such as finite fields, but also probabilistic constraints, or rules stating incompatibilities. We only require that these constraints modify the conditional probability of any symbol given subsets of the other symbols. Decoding (or, more precisely, regeneration) then consists of reassessing the symbol probabilities so as to take account of all their mutual constraints. We examine how soft codes can arise from mechanical, geometrical and chemical constraints obeyed by the DNA molecule as well as, inside a gene, by the protein that the DNA specifies by means of the 'genetic code'. Moreover, the genome being a 'blueprint' for the development and maintenance of a living thing, it must use some kind of language, implying morphologic and syntactic constraints which can be interpreted as soft codes. Although the paper is mainly speculative, we show that experimental results agree with the hypotheses made. 
Some decoding (regeneration) principles are outlined but the detailed mechanisms for their implementation, which involve chemical means, remain to be discovered.
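As a minimal illustration of how a single soft constraint can reassess a symbol probability (a toy sketch, not the mechanism proposed in the talk): take three binary symbols linked by an even-parity constraint, and update the first symbol's probability from the probabilities of the other two.

```python
def regenerate(prior0, p1, p2):
    """Update P(b0 = 1) under the even-parity constraint b0 = b1 XOR b2,
    given P(b1 = 1) = p1 and P(b2 = 1) = p2 (a toy 'soft code')."""
    # Probability that the parity of the other two symbols forces b0 = 1.
    ext = (1 - (1 - 2 * p1) * (1 - 2 * p2)) / 2
    num = prior0 * ext
    return num / (num + (1 - prior0) * (1 - ext))

# Uninformative neighbours leave the prior unchanged; certain ones force b0.
print(regenerate(0.5, 0.5, 0.5))  # 0.5
print(regenerate(0.5, 1.0, 0.0))  # 1.0
```

This is exactly the probability-reassessment step that iterative decoders repeat over many interleaved constraints.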

Title: A Nanotechnology-based Approach for Highly-parallel, Ultra-dense Data Storage

Ultrahigh storage densities of up to 1 Tb per square inch or more can be achieved by using local-probe techniques to write, read back, and erase data in very thin polymer films. The thermomechanical scanning-probe-based data-storage concept, internally dubbed "millipede", combines ultrahigh density, small form factor, and high data rates. High data rates are achieved by parallel operation of large 2D arrays of thousands of micro/nanomechanical cantilevers/tips that can be batch-fabricated by silicon surface-micromachining techniques. The very high precision required to navigate the probe tips over the storage medium is achieved by MEMS-based x/y actuators that position the large arrays of probe tips for parallel write/read/erase operations. After illustrating the principles of operation of the "millipede", a channel model for the analysis of the readback process is introduced, and analytical results are compared with experimental data. Furthermore, the arrangement of data-storage fields as well as dedicated fields for servo and timing control is discussed, and system aspects related to the readback process, multiplexing, synchronization, and position-error-signal generation for tracking are introduced. The inherent parallelism, the ultrahigh areal densities, and the small form factor may open up new perspectives and opportunities for application in areas beyond those envisaged today.

Title: Coding and Signal Processing for Two-Dimensional Optical Storage

A new concept for two-dimensional optical storage (TwoDOS) is currently being developed, in which the information on the disc fundamentally has a two-dimensional character. The aim is to realize an increase over the third generation of optical storage (Blu-ray Disc, BD, with a wavelength of 405 nm and a numerical aperture NA = 0.85) by a factor of 2 in data density and a factor of 10 in data rate (for the same physical parameters of the optical read-out system).

In this new concept, the bits are organized in a broad spiral. Such a spiral consists of a number of bit-rows stacked upon each other with a fixed phase relation in the radial direction, so that the bits are arranged on a 2D lattice. A 2D close-packed hexagonal ordering of the bits is chosen because it has a 15% higher packing fraction than the square lattice. Successive revolutions of the broad spiral are separated by a guard band consisting of one empty bit row. A multi-spot light-path for parallel readout is realized, where each spot has BD characteristics. Bits with a value "1" are represented physically as circular pit-holes on the disc, with a pit-hole size that is smaller than the hexagonal bit-cell in order to reduce the non-linearity of the channel.
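The 15% figure follows directly from the circle-packing fractions of the two lattices; a quick check:

```python
import math

# Packing fraction of equal circles on a hexagonal vs. a square lattice.
hex_fraction = math.pi / (2 * math.sqrt(3))  # ~0.9069
square_fraction = math.pi / 4                # ~0.7854

gain = hex_fraction / square_fraction - 1    # = 2/sqrt(3) - 1 ~ 0.1547
print(f"hexagonal packing gain: {gain:.1%}")  # hexagonal packing gain: 15.5%
```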

The signal processing with equalization, timing recovery and bit-detection is carried out in a 2D fashion, that is, jointly over all the bit-rows within the broad spiral. A stripe-wise Viterbi detector is proposed to perform 2D bit-detection with a limited state complexity of the trellis. Only two iterations are required to reach a BER of 1E-04 at practical SNR levels. Other techniques for limiting the hardware complexity will also be discussed. A 2D modulation code is applied to eliminate "worst-case bit-patterns" that yield a high probability of erroneous detection: the modulation code has a very high code rate, and enumerative coding is applied together with a Bliss-like scheme for the combination with the error-correction coding.

Title: A Survey of Coding Technology for Optical Recording

Codes based on runlength-limited (RLL) sequences have been the cornerstone of three generations of optical disc recording systems. The length of time, usually expressed in channel bits, between consecutive transitions is known as the runlength. For instance, the runlengths in the word 0111100111000 are of length 1, 4, 2, 3, and 3. Runlength-limited sequences are characterized by two parameters, (d+1) and (k+1), which stipulate the minimum (with the exception of the very first and last runlength) and maximum runlength, respectively, that may occur in the sequence. The parameter d controls the highest transition frequency and has a bearing on intersymbol interference when the sequence is transmitted over a bandwidth-limited channel. In the transmission of binary data it is desirable that the received signal is self-synchronizing. The maximum runlength parameter k ensures adequate frequency of transitions for synchronization of the read clock.
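The runlength computation is easy to verify mechanically; a small helper (illustrative, not part of any standard) extracts the runlengths of a word and checks the (d, k) bounds, exempting the first and last run from the minimum:

```python
from itertools import groupby

def runlengths(word):
    """Lengths of maximal runs of identical channel bits."""
    return [len(list(g)) for _, g in groupby(word)]

def satisfies_dk(word, d, k):
    """Check that no run exceeds k+1 and every interior run is at
    least d+1 (the first and last runs are exempt from the minimum)."""
    runs = runlengths(word)
    if any(r > k + 1 for r in runs):
        return False
    return all(r >= d + 1 for r in runs[1:-1])

print(runlengths("0111100111000"))  # [1, 4, 2, 3, 3]
```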

Recording codes that are based on RLL sequences have found almost universal application in disc recording practice. We have the EFM code (rate 8/17, d=2, k=10) used in the Compact Disc (CD), the EFMPlus code (rate 8/16, d=2, k=10) used in the DVD, and the 17PP code (rate 2/3, d=1, k=7) used in the Blu-ray Disc.

In our presentation we will discuss state-of-the-art coding technology, and we will show how the information-theoretic limits on code performance can be approached.
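The information-theoretic limit of a (d, k) constraint is its Shannon capacity, the base-2 logarithm of the largest eigenvalue of the constraint's transfer matrix. A short power-iteration sketch (state = number of 0s written since the last 1) estimates it numerically:

```python
import math

def dk_capacity(d, k, iters=500):
    """Estimate the Shannon capacity of the (d, k) runlength constraint:
    at least d and at most k zeros between consecutive ones."""
    v = [1.0] * (k + 1)  # state s = zeros written since the last 1
    lam = 0.0
    for _ in range(iters):
        new = [0.0] * (k + 1)
        for s in range(k + 1):
            if s < k:
                new[s + 1] += v[s]  # emit a 0
            if s >= d:
                new[0] += v[s]      # emit a 1
        lam = sum(new)
        v = [x / lam for x in new]  # normalize; lam -> largest eigenvalue
    return math.log2(lam)

# EFM's (d=2, k=10) constraint has capacity ~0.54 (vs. code rate 8/17 ~ 0.47);
# 17PP's (d=1, k=7) constraint has capacity ~0.68 (vs. rate 2/3).
print(round(dk_capacity(2, 10), 4), round(dk_capacity(1, 7), 4))
```

The gap between each code's rate and the corresponding capacity is the headroom the presentation's "approaching the limits" refers to.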

Title: Ghostbusting - Constrained Coding for Optical Communication

In this talk, we consider a variety of constrained coding techniques that can be used to combat a nonlinear effect in the optical fiber channel that causes the formation of spurious pulses called ghost pulses. If b(0) b(1) ... b(n-1) is a sequence of bits being sent across the optical channel, such that b(k) = b(l) = b(m) = 1 for some k,l,m, but b(k+l-m) = 0, then the ghost pulse effect causes b(k+l-m) to change to a 1, creating an error. Such errors do not occur if the sequence of bits were to satisfy the following binary ghost pulse (BGP) constraint: for all integers k,l,m such that b(k) = b(l) = b(m) = 1, b(k+l-m) is also 1. However, we show that the capacity of the BGP constraint is zero, implying that sequences satisfying this constraint cannot carry much information. Consequently, we consider a more sophisticated coding scheme which uses ternary sequences satisfying a certain ternary ghost pulse (TGP) constraint, but there are reasons to believe that this scheme too has zero capacity. We further relax these constraints by ignoring the interactions between symbols that are more than a certain distance t apart in the transmitted sequence. Analysis of the resultant BGP(t) and TGP(t) constraints shows that these have nonzero capacities, and furthermore, the TGP(t) codes significantly outperform the BGP(t) codes. We also discuss the design of encoders and decoders for coding into the BGP, BGP(t) and TGP(t) constraints.
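The BGP constraint is straightforward to test by brute force over all index triples; a small checker (illustrative only):

```python
def satisfies_bgp(b):
    """True iff for all k, l, m with b[k] = b[l] = b[m] = 1 and
    k + l - m a valid index, b[k + l - m] = 1 as well."""
    n = len(b)
    ones = [i for i, bit in enumerate(b) if bit == 1]
    for k in ones:
        for l in ones:
            for m in ones:
                j = k + l - m
                if 0 <= j < n and b[j] != 1:
                    return False
    return True

print(satisfies_bgp([1, 1, 1, 1]))  # True
print(satisfies_bgp([1, 1, 0, 1]))  # False (k=1, l=1, m=0 forces index 2)
```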

Title: Iterative Timing Recovery

The past decade has seen the development of iteratively decodable error-control codes of unprecedented power, whose large coding gains enable reliable communication at very low signal-to-noise ratio (SNR). A by-product of this trend is that timing recovery must be performed at an SNR lower than ever before. Conventional timing recovery ignores the presence of error-control coding, and is thus doomed to fail when the SNR is low enough. This talk introduces iterative timing recovery, a method for implementing timing recovery in cooperation with iterative error-control decoding so as to approximate a more complicated receiver that jointly solves the timing recovery and decoding problems. Several methods of iterative timing recovery are presented, and an emphasis is placed on new problems. In particular, channels with synchronization errors are viewed from an information-theoretic perspective, and questions regarding their capacities and optimal codes are raised.

Title: Capacity and Beyond

In the first part of this presentation we give an overview of magnetic recording channels, specifically longitudinal and perpendicular systems. We show how the signal and noise models change with density as well as with different data patterns. We then give a simple method to estimate lower bounds on the information rates of these systems when they are modeled as a communication channel. We demonstrate the effectiveness of this information rate estimation technique on real data. In the second part of the presentation, we give a brief introduction to heat-assisted magnetic recording (HAMR) systems as a way to increase areal density in future recording systems. Finally, we conclude the presentation with an introduction to object-based storage systems as a system-level enabler for intelligent storage devices.

Title: Design of Modulation Encoders

We will survey the long history of methods developed for designing modulation encoders.

Title: Macro-Molecular Data Storage with Petabyte/cm3 Density, Highly Parallel Read/Write Operations, and Genuine 3D Storage Capability

Many of the traditional problems of disk and tape data storage can be overcome if data blocks are released from the confines of a disk (or tape) and allowed to float freely between read/write stations (i.e., heads) and permanent "parking spots." The heads and parking spots thus become fixed structures within an integrated chip, while the macro-molecular data blocks themselves become the (mobile) storage media. In this scheme, a large number of read/write heads could operate in parallel, the heads and parking spots would be constructed (layer upon layer) in a truly 3-dimensional fashion, and individual nanometer-sized molecules - strung together in a flexible macromolecular chain - will represent the 0's and 1's of binary information. We discuss the potential advantages of this alternative scheme for secondary data storage, and present results of experiments based on DNA molecules that travel within micro-fluidic chambers to demonstrate the feasibility of the concept.

Title: The Information Processing Mechanism of DNA and Efficient DNA Storage

Living beings are characterized by their unique information storage and processing abilities, and indisputably the most important carrier of biological information is the deoxyribonucleic acid (DNA) sequence. The theory of code analysis and storage has a firm theoretical foundation in discrete mathematics and telecommunications, but almost no attempt has been made so far to investigate the DNA code within this setting. Recent years have witnessed a great deal of research in computational molecular biology, trying to bridge the large gap between genetics and information theory. The goal of this talk is to propose how to approach problems in genetics from the perspective of coding theory and to narrow the currently existing rift from the opposite direction.

The talk will focus on two major issues related to important open problems in molecular biology and information theory. The first is concerned with investigating the characteristics of the DNA error-correcting mechanism and its possible connection to codes on graphs. Codes on graphs and iterative decoding represent recent conceptual advances in communication theory, and their close relationship to neural networks, self-organizing systems and spin glasses makes them universal models for a wide variety of phenomena. It seems plausible that the similarity between interactions of genes within regulatory systems, modeled by Boolean networks with sparse connectivity, and iterative decoding on graphs is not accidental; hence, this new development in coding theory could be used to provide partial information about the intricate and multifaceted error-correction mechanisms of the genome and their relationship to the gene regulatory network. The theory of structured Boolean networks, attractor analysis and graph sensitivity analysis will be used for inferring the gene network structure, its error-correcting potential, and its operational influence on disease development and classification. The second problem to be addressed in the talk is concerned with developing specialized methods for DNA storage. Since approximately twenty new DNA strands are sequenced every day, efficient compression of DNA codes has become a problem of great interest to genetic databases and to researchers trying to reconstruct the tree of life. Many known compression techniques, including the Ziv-Lempel, Burrows-Wheeler and grammar-based algorithms, have been proposed for this task. All of these algorithms show poor performance when applied to genetic sequences. This finding can be explained by the fact that the statistics of the DNA code differ sharply between coding and non-coding regions. For example, an overwhelming percentage of the DNA code has a fractal nature, and such long-term correlations are not captured by standard compression algorithms. In the talk, problems related to the statistical analysis of DNA sequences, applications of Tsallis entropy, DNA distance measures and their use in cross-reference compression will be discussed, and new research problems emerging in this area will be proposed.
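As a point of reference for DNA compression, the trivial baseline is fixed 2-bit-per-base packing over the alphabet {A, C, G, T}, which any worthwhile genomic compressor must beat; a minimal sketch:

```python
_CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
_BASE = 'ACGT'

def pack(seq):
    """Pack a DNA string into bytes at exactly 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        b = 0
        for ch in chunk:
            b = (b << 2) | _CODE[ch]
        out.append(b << 2 * (4 - len(chunk)))  # left-align a short final chunk
    return bytes(out)

def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASE[(byte >> shift) & 3])
    return ''.join(bases[:n])

seq = "ACGTTGCAACG"
assert unpack(pack(seq), len(seq)) == seq
print(len(seq), "bases ->", len(pack(seq)), "bytes")  # 11 bases -> 3 bytes
```

General-purpose text compressors often fail to get below this 2-bit bar on genomic data, which is what motivates the DNA-specific methods discussed in the talk.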

This is joint work with Bane Vasic.

Title: An Iterative Decoding Algorithm for Soft Decision Decoding of RS Codes and Its Applications

We present a soft-input soft-output (SISO) decoding algorithm for Reed-Solomon (RS) codes using their binary image representations. The novelty of the proposed decoding algorithm is in reducing a submatrix of the binary parity-check matrix to a sparse form before each decoding iteration. This submatrix is chosen such that it corresponds to the less reliable bits in the received word, and it is adapted from one iteration to another. The proposed algorithm provides significant gain over hard-decision decoding and compares favorably with other popular soft-decision decoding methods both in terms of error performance and complexity. More importantly, the iterative decoder naturally delivers soft outputs, which can be efficiently utilized in concatenated systems and channels with memory. The generic algorithm can also be readily extended to soft-decision decoding of any binary linear block code with a dense parity-check matrix, where most versions of iterative decoding algorithms do not perform well.
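The matrix-adaptation step can be sketched as Gaussian elimination over GF(2) that turns the parity-check columns at the least reliable bit positions into weight-one columns (the small matrix and the choice of "unreliable" positions below are illustrative, not from the talk):

```python
def adapt_parity_check(H, unreliable_cols):
    """Row-reduce H over GF(2) so that each chosen column ends up with a
    single 1, making the submatrix over the unreliable positions sparse."""
    H = [row[:] for row in H]
    used = set()  # rows already serving as pivots
    for c in unreliable_cols:
        pivot = next((i for i in range(len(H))
                      if i not in used and H[i][c] == 1), None)
        if pivot is None:
            continue  # column depends on earlier choices; skip it
        used.add(pivot)
        for i in range(len(H)):
            if i != pivot and H[i][c] == 1:
                H[i] = [a ^ b for a, b in zip(H[i], H[pivot])]
    return H

# (7,4) Hamming parity-check matrix; pretend bits 2, 4, 6 are least reliable.
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
Ha = adapt_parity_check(H, [2, 4, 6])
for c in (2, 4, 6):
    assert sum(row[c] for row in Ha) == 1  # each chosen column has weight 1
```

Because only row operations are used, the adapted matrix checks exactly the same code; the unreliable bits simply each appear in a single parity check, which keeps message passing from being overwhelmed by low-confidence positions.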

Title: ET Might Write, Not Radiate

It has long been understood that electromagnetic radiation - radio waves - can be used to communicate over interstellar distances. In contrast, sending physical artifacts seemed vastly wasteful of energy, and imagining human travel between the stars even more so. However, the key consideration in early work on interstellar communication was the perceived need for haste. If extraterrestrial civilizations existed within a few tens of light years, radio could be used for two-way communication on a timescale comparable to a human lifetime or the longevity of human institutions. Here we show that if the requirement of haste is relaxed, sending messages inscribed on some material can be strikingly more energy efficient and durable than communicating by electromagnetic waves. This result suggests that initial contact by extraterrestrial civilizations may be more likely to occur through physical artifacts - essentially messages in a bottle - than via electromagnetic communication.

This is joint work with G. Wright.

Title: Towards the Optimal Bit Aspect Ratio in Magnetic Recording

In this paper, we take an information-theoretic approach to obtaining optimal code rates for error-control codes on a magnetic storage channel approximated by the Lorentzian channel. Code rate optimality is in the sense of maximizing the information-theoretic user density along a track. To arrive at such results, we compute the achievable information rates for the Lorentzian channel as a function of SNR and channel density, and then use these information rate calculations to obtain optimal code rates and maximal linear user densities. We call such (hypothetical) optimal codes "Shannon codes." We then examine optimal code rates on a Lorentzian channel assuming low-density parity-check (LDPC) codes instead of Shannon codes. We employ as our tool extrinsic information transfer (EXIT) charts. We demonstrate that the optimal rates for LDPC codes coincide with those of Shannon codes and, more importantly, that LDPC codes are essentially capacity-achieving codes on the Lorentzian channel. Finally, we use the above results to estimate the optimal bit aspect ratio where optimality is in the sense of maximizing areal density.

Title: Information Rates for Two-Dimensional ISI Channels

One approach to achieving higher information storage density is the use of page-oriented data recording technologies, such as holographic memory. Instead of recording the data in one-dimensional tracks, these technologies store the data on two dimensional surfaces. A commonly used model for such a two-dimensional recording channel is the two-dimensional finite-state intersymbol-interference (ISI) channel with additive white Gaussian noise (AWGN).

As in the case of one-dimensional channels, the capacity of this two-dimensional channel is defined as the mutual information rate between the input and output, maximized over all input distributions. When the input is i.i.d., equiprobable, the mutual information rate is called the symmetric information rate (SIR). The capacity and the SIR provide useful measures of the storage density that can, in principle, be achieved with page-oriented technologies, and also serve as performance benchmarks for channel coding and detection methods.

Recently, several authors independently proposed a new Monte-Carlo approach to calculating a convergent sequence of lower bounds on these information rates for one-dimensional ISI channels. The method requires the simulation of an a-posteriori probability (APP) detector, matched to the information source and channel, on a long sample realization of the channel output. The forward recursion of the sum-product (BCJR) algorithm, applied to the combined source-channel trellis, can be used to reduce the overall computational complexity of the APP calculations. The method was further extended to evaluate the information rates of multi-track recording channels.
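For the one-dimensional case, the method amounts to estimating h(Y) as -(1/n) log2 p(y^n) via the scaled forward recursion and subtracting h(Y|X); a sketch for a two-tap channel (the dicode response 1 - D and all parameters are illustrative assumptions, not the talk's setup):

```python
import math
import random

def sir_estimate(n=20000, sigma=1.0, seed=1):
    """Monte Carlo SIR estimate for y_i = x_i - x_{i-1} + noise, with
    x_i i.i.d. equiprobable in {-1, +1} and noise ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x = [rng.choice((-1.0, 1.0)) for _ in range(n + 1)]
    y = [x[i] - x[i - 1] + rng.gauss(0.0, sigma) for i in range(1, n + 1)]

    def pdf(v, mu):  # Gaussian density of the channel output given its mean
        return math.exp(-(v - mu) ** 2 / (2 * sigma ** 2)) / (
            math.sqrt(2 * math.pi) * sigma)

    alpha = {-1.0: 0.5, 1.0: 0.5}  # forward state = previous input symbol
    log2_py = 0.0
    for yi in y:  # forward recursion with per-step scaling
        new = {-1.0: 0.0, 1.0: 0.0}
        for s, a in alpha.items():
            for xi in (-1.0, 1.0):
                new[xi] += a * 0.5 * pdf(yi, xi - s)
        c = new[-1.0] + new[1.0]
        log2_py += math.log2(c)          # accumulate log p(y^n)
        alpha = {k: v / c for k, v in new.items()}

    h_y = -log2_py / n                                    # estimate of h(Y)
    h_y_given_x = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)
    return h_y - h_y_given_x                              # bits per channel use

print(f"SIR estimate: {sir_estimate():.3f} bits/symbol")
```

The difficulty discussed next is precisely that this forward recursion has no direct counterpart over a two-dimensional output array.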

In this talk, we investigate the extension of this approach to two-dimensional ISI channels. The main problem is that there is no direct counterpart of the BCJR algorithm that can simplify the APP calculations for a large sample output array in the two-dimensional setting. To overcome this difficulty, we derive upper and lower bounds on the entropy rate of a large output array based upon conditional entropies of smaller output arrays. By adapting the one-dimensional Monte Carlo technique to the calculation of these conditional entropies, we are able to compute fairly tight upper and lower bounds on the SIR for two-dimensional ISI channels with small impulse response support. We also analyze the convergence of these bounds.

The computational complexity of these bounds grows exponentially with the size of the support region of the impulse response. We therefore introduce a modification of the upper bound that allows us to consider channels with somewhat larger ISI. This alternative upper bound permits a tradeoff between the tightness and the computational complexity of the bound.

Finally, we discuss the extension of these bounds to d-dimensional ISI channels, with d > 2.

This talk describes work done jointly with Jiangxin Chen.

Title: How Random is the Human Genome?

Now that the human genome is (mostly) sequenced, how do we know when some statistical fact about that random-looking string of 3 billion A's, C's, G's and T's is significant? The speaker will report on one approach, using some pleasing combinatorics, implemented with a group of scientists at Rockefeller University.

This is joint work with Andy DeWan, Chad Hayes, Josephine Hoh, Jurg Ott, Tony Parrado and Richard Sackler.


Document last modified on February 24, 2004.