DIMACS Mixer Series, October 22, 2002

Location: Telcordia Technologies, Morristown, NJ


Speaker: Gabriela Alexe, RUTCOR, Rutgers University

Title: Logical Analysis of Biomedical Data

Abstract: Logical Analysis of Data (LAD) is a method for building data-driven classification models, based on combinatorics, optimization, and Boolean logic. Recently developed algorithms and procedures made possible to extend it to large datasets from proteomics and genomics, as well as to numerical outcome problems. The basic LAD steps - feature selection, discretization, pattern detection, and knowledge aggregation - will be illustrated on two datasets from proteomics and biomaterial design.

Speaker: Eric Allender, Dept. of Computer Science, Rutgers University

Title: Power from Random Strings

Abstract: Recent progress in the technology of "derandomization" has provided theorems of the form "If problem X cannot be computed by small circuits, then good pseudorandom generators exist". That is, hardness can be used to eliminate randomness.

In this talk, we turn this paradigm around to obtain theorems of the form "If there are no pseudorandom generators relative to Y, then hard problems can be reduced to Y."

In particular, we obtain fundamentally new classes of complete sets for PSPACE and EXP, where these sets are defined in terms of space- and time-bounded Kolmogorov complexity. These are the first natural examples of complete sets for important complexity classes, that are provably not complete under more familiar notions of reducibility (such as polynomial-time or logspace reductions); the sets we present are complete under P/poly reductions (i.e., reductions computable by small circuits).

This is joint work with Harry Buhrman, Michal Koucky, Dieter van Melkebeek, and Detlef Ronneburger, and will appear in FOCS 2002.

Speaker: Michael Capalbo, postdoc, DIMACS-IAS

Title: Explicit Construction of Unique-Neighbor Expanders

Abstract: We define a unique-neighbor expander to be a bipartite graph $\Gamma$ with sides $X$ and $Y$, such that (1) $Y$ has significantly fewer vertices than $X$, andq (2) for every 'reasonably small' subset $S$ of $X$, there are are 'large' number of vertices in $Y$ that are adjacent to in $\Gamma$ to exactly one vertex in $S$. (We later define 'reasonably small' and 'large'). Unique-neighbor expanders have applications in communications networks, but the explicit construction of an infinite family of these graphs has been a long-standing open problem. Here we solve this open problem by presenting the first explicit constructions of an infinite family of bounded-degree 'unique-neighbor' expanders.

Speaker: Graham Cormode, Postdoc, DIMACS

Title: Comparing Data Streams Using Hamming Norms
Joint work with Mayur Datar, Piotr Indyk, and S. Muthukrishnan

Abstract: Massive data streams are now fundamental to many data processing applications. Such streams are rarely stored in traditional databases, and instead must be processed ``on the fly'' as they are produced. There is growing focus on manipulating data streams, and hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalises ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have multiple uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "l_0 sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.

Speaker: Subhash Khot, grad student, CS, Princeton

Title: Hardness of Coloring 3-Colorable 3-Uniform Hypergraphs

Abstract: We consider the problem of coloring a 3-colorable 3-uniform hypergraph. In the minimization version of this problem, given a 3-colorable 3-uniform hypergraph, one seeks an algorithm to color the hypergraph with as few colors as possible. We show that it is NP-hard to color a 3-colorable 3-uniform hypergraph with constantly many colors. In fact, we show a stronger result that it is NP-hard to distinguish whether a 3-uniform hypergraph with n vertices is 3-colorable or it contains no independent set of size \delta n for an arbitrarily small constant \delta > 0. In the maximization version of the problem, given a 3-colorable 3-uniform hypergraph, the goal is to color the vertices with 3 colors so as to maximize the number of non-monochromatic edges. We show that it is NP-hard to distinguish whether a 3-uniform hypergraph is 3-colorable or any coloring of the vertices with 3 colors has at most 8/9+\epsilon fraction of the edges non-monochromatic where \epsilon > 0 is an arbitrarily small constant. This result is tight since assigning a random color independently to every vertex makes 8/9 fraction of the edges non-monochromatic.

These results are obtained via a new construction of a probabilistically checkable proof system (PCP) for NP. We develop a new construction of the "PCP Outer Verifier". An important feature of this construction is "smoothening of the projection maps". Our techniques also lead to simpler proofs of Hastad's PCP constructions and several other constructions based on it.

Dinur, Regev and Smyth independently obtained a better result for hardness of 3-uniform hypergraph coloring. They showed that it is NP-hard to color a 2-colorable 3-uniform hypergraph with constantly many colors. In the "good case", the hypergraph they construct is 2-colorable and hence their result is stronger. In the "bad case" however, the hypergraph we construct has a stronger property, namely, it does not even contain an independent set of size \delta n.

Speaker: Amit Sahai, DIMACS member, CS, Princeton

Title: Software Obfuscation

Informally, an obfuscator O is an (efficient, probabilistic) "compiler" that takes as input a program (or circuit) P and produces a new program O(P) that has the same functionality as P yet is "unintelligible" in some sense. Obfuscators, if they exist, would have a wide variety of cryptographic and complexity-theoretic applications, ranging from software protection to homomorphic encryption to complexity-theoretic analogues of Rice's theorem. Most of these applications are based on an interpretation of the "unintelligibility" condition in obfuscation as meaning that O(P) is a "virtual black box," in the sense that anything one can efficiently compute given O(P), one could also efficiently compute given oracle access to P.

In this work, we initiate a theoretical investigation of obfuscation. Our main result is that, even under very weak formalizations of the above intuition, obfuscation is impossible. We prove this by constructing a family of functions F that are unobfuscatable in the following sense: there is a property pi : F -> {0,1} such that (a) given any program that computes a function f in F, the value pi(f) can be efficiently computed, yet (b) given oracle access to a (randomly selected) function f in F, no efficient algorithm can compute pi(f) much better than random guessing.

We extend our impossibility result in a number of ways, including even obfuscators that (a) are not necessarily computable in polynomial time, (b) only approximately preserve the functionality, and (c) only need to work for very restricted models of computation (TC_0). We also rule out several potential applications of obfuscators, by constructing "unobfuscatable" signature schemes, encryption schemes, and pseudorandom function families.

We will also discuss the many open problems that remain regarding software obfuscation.

Joint work with Boaz Barak, Oded Goldreich, Russell Impagliazzo, Steven Rudich, Salil Vadhan and Ke Yang

Speaker: Akshay Vashist, grad student, CS, Rutgers

Title: Validation of a supervised classification scheme by hidden structure analysis of training data and its application in bioinformatics for gene annotation

We have developed a novel method for machine learning classification focused on validation of accuracy estimates, through an iterative multi-stage process of learning to stratify training objects in an ordinal scale of their probability to produce incorrect predictions.

The basic idea of the method is to find "competitive" objects from different classes through cross-validation analysis, join them temporarily in one class, and, learn the classifier on this biased data, and, repeat the process until either the considered class will be clasified 100% correctly or it will disappear. Such iteration process induces a stratification on the class, dividing it into ordered parts according how well they are classified correctly.

We found that such multi-stage learning process is critically important when labels of classes in a traning data have noise (when expert provided labels have high uncertainity). Gene functional annotation problem is a good example of this situation. We have applied the developed method for such problem to classify Rice genome genes from chromosome 10 into three classes: (1) protein coding regions, (2) polyproteins, (3) transposable elements.

Speaker: Xuerong Yong, Postdoc, DIMACS

Title: New Techniques for Bounding the Channel Capacity of Read/Write Isolated Memory

Joint work with Mordecai J. Golin.

Abstract: A read/write isolated memory is a binary rewritable medium in which two consecutive locations can not both store 1's and also in which two consecutive locations can not both be modified during the same rewriting pass. Its channel capacity $C$, in bits per symbol per rewrite, was first considered by Cohn who proved that 0.509... \leq C \leq 0.560297.... Recently, these bounds were refined to 0.53500... \leq C \leq 0.55209.... The standard technique for deriving such bounds is to define an infinite sequence of transfer matrices for the constraint, calculate the largest eigenvalues of as many transfer matrices as possible and then further calculate certain functions of the eigenvalues. Evaluating eigenvalues of larger and larger transfer matrices will yield better and better bounds. In this paper we introduce compressed matrix techniques that permit us to achieve bounds guaranteed to be as good as the original bounds yielded by the transfer matrices but that are derived by calculating the eigenvalues of smaller matrices. This technique permits us to improve the calculated bounds to 0.5350150... \leq C \leq 0.5396225.... The compressed matrix techniques introduced here are not specific to this particular problem but can also be used to derive bounds for the capacity of various other two-dimensional constrained systems.

DIMACS Home Page
Contacting the Center
Document last modified on October 9, 2002