DIMACS Center, Rutgers University, Piscataway, NJ

**Organizers:****Mark Hansen**, Bell Laboratories, cocteau@stat.ucla.edu**Paul Vitanyi**, CWI and the University of Amsterdam, Paul.Vitanyi@cwi.nl**Bin Yu**, UC Berkeley, binyu@stat.berkeley.edu

Title: Articulation and Intelligibility

As discussed by George Miller in his 1951 book Language and Communication, speech intelligibility depends on a code in that it is incredibly robust to mangling distortions. Can we analyze speech intelligibility using the scientific method? Miller's analysis shows we can, as it depends critically on the use of statistics, information theory, and psychophysical methods. In fact, the science of intelligibility is a sience of the error analysis of human speech communication.

Title: Individual Sequence Properties of Compounded Wealth,

Portfolio Estimation, Option Pricing and Model Selection

Individual sequence properties have played a fundamental role in examination of data compression and prediction, in a manner that mimics some aspects of ensemble statistical properties, while not tied to statistical assumptions. Some developments, such as Cover's universal portfolios, show individual sequence properties for compounded wealth in gambling and stock market settings. At the heart of the analysis of compounded wealth with constant rebalanced portfolios is an information-theoretic characterization of the wealth for any number of investment periods, that takes its simplest form in an idealized gambling setting. Here we extend the approach to provide exact information-theoretic charaterization of compounded wealth with general portfolios of stocks and for the case of a portfolio of options on a stock. When there is a sufficiently complete set of options, the wealth identities we derive lay bare the role of choices of option prices and portfolio via their implications for what would otherwise be hidden opportunities to gamble on the state of the security. In particular these identities permit an evaluation of the merit of stock options. Part of this work is developed jointly with Jianfeng Yu. Further consequences include stock selection and portfolio estimation methods that are close counterparts to model selection by minimum description length, and universal portfolios for stocks and options, with performance exactly tied to universal gambling and universal data compression.

Title: Complexity and vulnerability analysis

An active network allows packets to contain a mixture of code (algorithm) and data. The ratio of code to data can vary as the packet travels through the network. Such networks can also be vulnerable to attack via the transport of virus or worm code. A mitigation of this problem has been attempted via the use of active network probes to detect vulnerabilities in an active network. A complexity estimate of active protocols being transported within the network by active packets is obtained. In addition components within the active network contain probe points through which bit-level I/O can be collected. Kolmogorov Complexity estimates based upon simple inverse compression ratios have used to estimate vulnerability. The intent has been to experiment with better complexity measures as the research continues. [See the DIMACS presentation on "A new universal two part code for estimation of string Kolmogorov complexity and algorithmic minimum sufficient statistic" (Scott Evans) which discusses one of the estimators in further detail.] Consider the complexity of bit-level input and output strings concatenated together. That is, observe an input sequence to an arbitrary process (i.e. a potentially vulnerable process) at the bit-level and concatenate with an output sequence at the bit-level. This input/output concatenation can be applied to entire systems or to components of a system. If there is low complexity in the I/O observations, then it is likely to be easy for an attacker to "understand" and usurp that component.

**Nick Chater**, University of Warwick

Title: Science, Simplicity and Embodied Cognition

It is argued that induction from data is not possible, unless there are strict constraints on the representation for that data. Specifically, given any formal method for induction (e.g., finding the shortest description length), there will representations of the data according to which that formal method will make opposing predictions. This is a generalization of the philosopher Nelson Goodman's arguments concerning the pathological predicate 'grue.'

What rules out 'crazy' representations, and ensures that, in practice, induction is possible? It is argued that the constraint is that these predicates must be 'implemented' in real physical hardware; and, crucially, this is the very same physical hardware from which the data to be understood has been generated. Roughly, a reliance on a common underlying computational machinery ensures that data and representation cannot be arbitrarily varied; and an heuristic argument using Kolmogorov complexity is given to indicate that, with this constraint, induction is possible after all. This viewpoint amounts to the idea that cognition (and computation generally) is crucially 'embodied' in the physical world, and is not purely abstract. Various rather radical implications of this viewpoint are considered.

**Ian Davidson and Ke Yin**, SUNY Albany

Title: Message length estimators, probabilistic sampling and optimal prediction

The Rissanen (MDL) and Wallace (MML) formulations of learning by compact encoding only provide a decision criterion to choose between two or more models, they do not provide any guidance on how to search through the model space. Typically, deterministic search techniques such as the expectation maximization (EM) algorithm have been used extensively with the MML/MDL principles to find the single shortest model. However, the probabilistic nature of the MML and MDL approaches makes Markov chain Monte Carlo (MCMC) sampling readily applicable. Sampling involves creating a stochastic process that visits each model in the model space with a chance equal to its posterior probability and has many benefits. We show that for MML estimators using mixture modeling that sampling can find shorter models than deterministic EM search. Samplers can be used to perform optimal Bayesian prediction (OBP), also known as Bayesian model averaging which involves making predictions by calculating the expectation of the predictor with respect to the posterior over all models. We show that for prediction, OBP can outperform even the shortest model and discuss the implications of basing predictions from a collection of models rather than the shortest model. Furthermore, since MML/MDL effectively discretizes the parameter space attaching probability estimates to each region this makes possible sampling across model spaces of varying dimension/complexity.

**A. P. Dawid**, University College, London

Title: Prequential Statistics and On-Line Learning

Many applied problems, e. g. weather forecasting, can be regarded as a sequential game between Forecaster and Nature, where data arrive in sequence and at each point Forecaster has to issue a forecast (perhaps probabilistic) for the next value --- which is then revealed by Nature. After a sequence of such plays, we can compare the forecasts and outcomes, in a variety of ways, to make an empirical assessment of Forecaster's success.

Over the years I have developed "Prequential Analysis", a general approach to theoretical and applied probability and statistics, grounded in this on-line learning paradigm. It can also be regarded as a generalisation of stochastic complexity. Prequential analysis provides a versatile and productive standpoint from which to reconsider and advacnce many fundamental conceptual and technical issues.

In this talk I will outline the prequential approach to Statistics and the game-theoretic principles which underlie it.

**Alex Eleftheriadis**, Columbia University

**Daby Sow**, IBM T. J. Watson Research Center

Title: Lossy Compression and Kolmogorov Complexity

Complexity Distortion Theory (CDT) is an extension of Kolmogorov Complexity Theory to lossy representation of information. It is motivated by the need to provide a unifying perspective on media representation. In contrast with Shannon's rate distortion theory, CDT relies on Kolmogorov deterministic measure of information. We compare CDT and RDT and show that despite their different natures, these two theories predict asymptotically the same results under stationary and ergodic assumptions. This closes the circle of representation models, from probabilistic models proposed by Shannon to deterministic models proposed by Kolmogorov.

**Scott Evans**, General Electric Research/RPI

Title: A new universal two part code for estimation of string Kolmogorov complexity and algorithmic minimum sufficient statistic

A new universal compression algorithm is introduced that provides an estimate of an Algorithmic Minimum Sufficient Statistic for a given binary string. This algorithm computes and encodes an estimated Minimum Descriptive Length (MDL) partition of symbols in a string for compression. Using Symbol Compression Ratio (SCR) as a heuristic this algorithm produces a two-part code for a string where the codebook or model estimates the algorithmic minimum sufficient statistic.

This is joint work with Gary J. Saulnier (RPI), and Stephen F. Bush (General Electric Research).

**Lance Fortnow**, NEC Laboratories America

Title: Computational Depth

String with very low Kolmogorov complexity do not have much information. Random strings have information in that they cannot be simply described. However it is easy to generate random strings with a probabilistic source so the information they give is not that useful.

We describe Computational Depth, a method to, in some sense, measure the amount of "useful information" in a string. We concentrate the talk on two applications of this notion:

- If SAT reduces to a sparse set it has small circuits. If SAT reduces to a random set it has small circuits. We define a notion of "Shallow Sets" based on computational depth that generalizes both sparse and random sets. We show that if any computable set reduces to a shallow set then it must have small circuits that compute it.
- We show that an algorithm that runs in "polynomial-time on average" exactly when it runs in time exponential in its depth on all inputs.

This talk describes joint work with Luis Antunes, Dieter van Melkebeek and V. Vinodchandran.

**Peter Gacs**, Boston University

Title: Uniform randomness test, over a general space

The algorithmic theory of randomness is well developed when the underlying space is the set of finite or infinite sequences and the underlying probability distribution is the uniform distribution or a computable distribution. These restrictions seem artificial. Some progress has been made to extend the theory to arbitrary Bernoulli distributions (by Martin-Lof), and to arbitrary distributions (by Levin). We recall the main ideas and problems of Levin's theory, and report further progress in the same framework.

**Andrew Gelman**, Columbia University

**Mark Hansen**, UCLA

Title: Data density and degrees of freedom in statistical models

A well-known graph in quantum thermodynamics shows how the temperature T of a gas increases as energy Q is added to it. When properly scaled, dQ/dT is a measure of the "degrees of freedom" of the system. Because of quantum effects, dQ/dT looks like an increasing staircase as a function of T, with more degrees of freedom activated at higher temperatures.

We suspect that similar patterns could be found in statistical models, where increasing data density allows further structure to be discovered, hence more degrees of freedom. In particular, we are studying regressions, splines, and constrained-parameter models such as arise in inverse problems; and using degrees-of-freedom measures based on AIC and DIC.

**Donald Geman**, Johns Hopkins University

Title: Hierarchical Designs for Pattern Recognition

It is unlikely that very complex problems in machine perception, such as full-scale scene interpretation, will yield directly to improved methods of statistical learning. Some organizational framework is needed to confront the small amount of data relative to the large number of possible explanations, and to make sure that intensive computation is restricted to genuinely ambiguous regions. As an example, we discuss the theoretical foundations of a "twenty questions" approach to pattern recognition. The object of analysis is the computational process itself rather than probability distributions (Bayesian inference) or decision boundaries (statistical learning). Under mild assumptions, optimal strategies exhibit a steady progression from broad scope (i.e., high invariance) coupled with low power to high power coupled with dedication to specific explanations. Such designs then provide a blueprint for efficient learning and processing.

**Tom Griffiths**, Stanford University

Title: Subjective randomness and cognitive complexity

Kolmogorov complexity provides a mathematical basis for the definition of randomness, and recent work in cognitive science has suggested that it may play an important role in understanding cognition (eg. Chater & Vitanyi, 2003). The major practical impediment to developing cognitive theories based on Kolmogorov complexity is incomputability, while the major theoretical impediment is that people are sensitive to a much more restricted set of regularities than those identified by Turing machines. I will show how two strategies for defining more restrictive measures of complexity, motivated by algorithmic probability and universal probability, can address both of these issues, as well as revealing the rational statistical basis of intuitive judgments about randomness and coincidences.

**Peter Grunwald**, CWI Amsterdam

Title: MDL and classification, revisited

The Minimum Description Length (MDL) Principle is a powerful method for model selection. The theoretical development of MDL has mostly centered around probabilistic modeling. Yet practical applications of MDL often involve models which are best viewed as predictors rather than probability distributions. The standard example is classification, one of the most popular applications of MDL ever since its inception. MDL has been applied to such non-probabilistic models in various ways. We review these approaches and show that, contrary to what is often thought, they can exhibit some rather problematic behaviour: on the theoretical side, it is not known whether the resulting procedures are consistent (none of the existing proof techniques can be applied). On the practical side, the methods can behave quite unreasonably for small data samples. We analyze the reasons for this undesirable behaviour and propose a radical, general and surprising solution to the problem.

**Kazuo Iwama**, Kyoto University

Title: Condensation of Boolean formulas

The following problem is considered: Given a Boolean formula f, generate another formula g such that: (i) If f is unsatisfiable then g is also unsatisfiable. (ii) If f is satisfiable then g is also satisfiable and furthermore g is ``easier'' than f. For the measure of this easiness, we use the {\it density\/} of a formula f which is defined as (the number of satisfying assignments)/2^n, where n is the number of Boolean variables of f. In this paper, we mainly consider the case that the input formula f is given as a 3-CNF formula and the output formula g may be any formula using Boolean AND, OR and negation. Two different approaches to this problem are presented: One is to obtain g by reducing the number of variables and the other by increasing the number of variables, both of which are based on existing SAT algorithms. Our performance evaluation shows that, a little surprisingly, better SAT algorithms do not always give us better density-condensation algorithms. This is a joint work with Youichi Hanatani and Takashi Horiyamapre.

**Adam Klivans**, Harvard University

Title: Learnability beyond AC^0

We consider a natural problem at the intersection of computational learning theory and computational complexity theory: what are the most powerful Boolean circuits we can learn in a computationally efficient manner? We give an algorithm to learn constant-depth polynomial-size circuits augmented with majority gates under the uniform distribution using random examples only. For circuits which contain a polylogarithmic number of majority gates the algorithm runs in quasipolynomial time. Furthermore, we can prove that under a suitable cryptographic assumption these are the most expressive circuits which will admit a non-trivial learning algorithm.

This is joint work with Rocco Servedio (Columbia University).

**Natalio Krasnogor**, University of Nottingham

Title: Using poynomial local search and Kolmogorov complexities to better understand evolutionary algorithms

We develop a syntax-only classification of evolutionary algorithms, particularly, the so called Memetic Algorithms(MAs). When ``syntactic sugar'' is added to our model, we are able to investigate the Polynomial Local Search (PLS) complexity of Memetic Algorithms. In this talk we enunciate the PLS-Completeness of not just one PLS problem but of whole classes of problems associated with MAs applied to the TSP. These classes of problems arise when a range of mutation, crossover and local search operators are considered. Our PLS-Completeness results shed light on the worst case behavior that can be expected of an MA that belongs to the classes of algorithms analyzed. We also mention some connections between the Polynomial Local Search Complexity theory and Kolmogorov Complexity theory in the context of evolutionary algorithms understanding.

**John Langford**, IBM, TJ Watson

Title: Data compression and learning

I'll discuss the way in which "every" PAC-style bound is related to the description length of the labels given the unlabeled data and I'll show a general theorem which implies that if the description length of the labels given the unlabeled data is small in any language, then a tight bound on the true error rate holds.

**Jan Lemeire**, Vrije Universiteit Brussel

Title: Complexity preserving functions

It is widely recognized that compression is useful and even necessary for inductive learning, where a short description will capture the regularities. We introduce complexity-preserving functions that preserve these regularities of the concept. They are based on the universal information distance [Bennett et al. '97] and define for an instance a set of elements sharing the same complexity type. This corresponds to the two-part code [Rissanen '89] of the MDL principle, when it is interpreted as the first term describing the set and the second term the element in the set [Rissanen '99]. We investigate its importance in inductive learning.

**Feng Liang**, Duke University

Title: Exact Minimax Density Estimation and MDL

For problems of model selection in linear regression, we determine the exact minimax universal data compression strategy for the minimum description length (MDL) criterion. The analysis also gives the best invariant and indeed minimax procedure for predictive density estimation in location families and in scale families, using Kullback-Leibler loss. The exact minimax rule is generalized Bayes using a uniform (Lebesgue measure) prior on the location (or log-scale) parameters, which is made proper by conditioning on an initial set of observations.

**Ron Meir**, Technion

Title: Data-dependent generalization bounds for Bayesian mixture algorithms

Bayesian approaches have been known for many years to possess desirable optimality properties even within a frequentist setting. Moreover, they have become increasingly popular in recent years with the advent of efficient numerical simulation algorithms and powerful computers. In this work we consider the special case of Bayesian mixture algorithms, and establish finite- sample data-dependent bounds for such methods. These results extend the work within the field of Computational Learning Theory, the main focus of which has been frequentist in nature, to a Bayesian setting. Unlike standard Bayesian approaches, the results remain valid whether or not the underlying model assumptions are correct. Moreover, they can be directly applied to other mixture based approaches such as Boosting and Bagging.

**Eado Meron**, Tel Aviv University

Title: Finite memory universal coding of individual binary sequences

We consider universal coding of individual binary sequences, with the constraint that the universal coder itself is a K-state time-invariant deterministic finite state (TI-DFS) machine. Actually we consider an equivalent problem of sequentially assigning a probability to the next outcome using a TI-DFS machine. We show that unlike the associated prediction problem, counters and finite window machines perform poorly for malicious sequences. We introduce a simple "exponential decaying memory" machine whose normalized code length is greater than the sequence empirical entropy by an O(K^(-2/3)) term. As a lower bound, we show that no K-state TI-DFS machine can achieve a redundancy less than O(K^(-4/5)) for all sequences.

This is joint work with Meir Feder(Tel Aviv University).

**Jorma Rissanen**, Helsinki Institute for Information Technology

Title: The MDL Principle with Distortion

Kolmogorov's structure function in the algorithmic theory of information seeks to find the smallest set that includes a given string as its model subject to the constraint that the Kolmogorov complexity of the set does not exceed a preselected bound. The result extracts the desired amount of properties in the string leaving the rest as noise.

An application of this idea to parametric probability models involves minimization of the code length for the data with a model whose code length, that depends on the number of parameters and their quantization, does not exceed a preselected bound. When the best model of the desired code length is projected back to the data as a "smooth" curve we get not only a means to model data with distortion but also a different approach to Shannon's rate-distortion theory.

**Serap Savari**, Bell Laboratories, Lucent Technologies

Title: Descriptions of words over a partially commutative alphabet

The concept of ordering of events is fundamental to the study of the dynamic behavior of a system. In a sequential process it is natural to use strings of symbols over some alphabet to specify the temporal ordering of events. The symbols may, for example, correspond to the states, commands, or messages in a computation. A recent prize-winning paper in the area of computing systems applies a lossless data compression algorithm known as Sequitur to the the sequence of events or signals determining the control ow or operations of a program's execution. The author of demonstrated that the grammar which is output from Sequitur can be exploited to identify performance tuning opportunities via heavily executed subsequences of operations.

The underlying premise in using lossless data compression for this application is the existence of a well-defined linear ordering of events in time. A partial ordering of events is a more accurate model for concurrent systems like multiprocessor configurations, distributed systems and communication networks which consist of a collection of distinct processes which communicate with one another or synchronize at times but are also partly autonomous. These complex systems permit independence of some events occurring in the individual processes while others must happen in a predetermined order.

**Clayton Scott**, Rice University

Title: Classification or regression trees

In this paper we challenge three of the underlying principles of CART, a well know approach to the construction of classification and regression trees. Our primary concern is with the penalization strategy employed to prune back an initial, overgrown tree. We reason, based on both intuitive and theoretical arguments, that the pruning rule for classification should be different from that used for regression (unlike CART). We also argue that growing a tree-structured partition that is specifically fitted to the data is unnecessary. Instead, our approach to tree modeling begins with a non-adapted (fixed) dyadic tree structure and partition, much like that underlying multiscale wavelet analysis. We show that dyadic trees provide sufficient flexibility, are easy to construct, and produce near-optimal results when properly pruned. Finally, we advocate the use of a negative log-likelihood measure of empirical risk. This is a more appropriate empirical risk for non-Gaussian regression problems, in contrast to the sum-of-squared errors criterion used in CART regression.

This is joint work with Rebecca Willett and Robert Nowak

**Ray Solomonoff**, Oxbridge Research

Title: A General System For Incremental Machine Learning

We will describe recent developments in a system for machine learning that we've been working on for some time. It is meant to be a "Scientist's Assistant" of great power and versatility in many areas of science and mathematics. It differs from other ambitious work in this area in that we are not so much interested in knowledge itself, as we are in how it is aquired - how machines may learn. To start off, the system will learn to solve two very general kinds of problems. Most, but perhaps not all problems in science and engineering are of these two kinds.

The first kind is Function Inversion. These are the P and NP problems of computational complexity theory. They include theorem proving, solution of equations, symbolic integration, etc.

The second kind of problem is Time Limited Optimization. Inductive inference of all kinds, surface reconstruction, and image restoration are a few examples of this kind of problem. Designing an automobile in 6 months satisfying certain specific

Title: Conditional Akaike Information for Mixed Effects Models

In this talk we show that for a mixed effects model where the focus is on the cluster-specific inference the commonly-used definition for AIC is not appropriate. We propose a new definition for the Akaike information to be used in such conditional inference, and we show that for a linear mixed effects model this definition leads to an Akaike information criterion (AIC) where the penalty for the random effects is related to the effective number of parameters, rho, proposed by Hodges and Sargent (Biometrika 2001); rho reflects an interim level of complexity between a fixed-effects model with no cluster effects, and a corresponding model with fixed cluster-specific effects. We compare the conditional AIC with the marginal AIC (in current standard use), and we argue that the latter is only appropriate when the inference is focused on the marginal, population-level parameters. We discuss the relationship of the conditional AIC with the deviance information criterion and other related work. A pharmacokinetics data application is used to illuminate the distinction between the two inference settings, and the usefulness of the conditional AIC.

**Yong Su**, Ohio State University

Title: Minimum description length and cognitive modeling

Minimum Description Length principle (MDL) has been proposed as a model selection criterion for cognitive modeling. In applying MDL 1996 criterion to the selection of cognitive models, one of the major obstacles is to calculate Fisher information. We present a general formula of Fisher information with multinomial or normal distribution. The resulting formula is attractive as it greatly simplifies the computation of the Fisher information without numerical expectation or second derivative of the likelihood function. We also illustrate the usage of the formula for the models of categorization, information integration, retention and psychophysics. In the further investigation of the newer MDL 2001 criterion, we verify that the two complexity measures of MDL 1996 criterion and MDL 2001 criterion are close to each other for a multinomial model. Finally, the application of MDL in cognitive psychology is illustrated in the selection of retention models. And the advantage of MDL is demonstrated by comparing the model selection performance of MDL and several other commonly used model selection criteria.

**Hayato Takahashi**, Tokyo Institute of Technology

Title: Redundancy of universal coding, Kolmogorov complexity, and Hausdorff dimension

We study asymptotic code-lengths of sequences generated by parametric models. We show a universal coding whose code-length is \\ \( -\log P_{\hat{\theta}} + \sum_{j=1}^k K(\mbox{description of }\theta^j\mbox{ up-to }(\log n)/2\mbox{ bit }\vert n)\)\\\(+O(\log\log n),\ P_\theta-a.e.,\) where \(\theta=(\theta^1,\cdots,\theta^k)\), \(\hat{\theta}\) is the maximum-likelihood estimator, \(K\) is prefix Kolmogorov complexity, \(k\) is the dimension of the parameter space, \(n\) is the sample size, and the base of \(\log\) is \(2\). Moreover we show a universal coding having the following property: For each real numbers \(h_j,~0\leq h_j\leq 1,~1\leq j\leq k\), there are parameter sets \(H_j\in R,~1\leq j\leq k\) such that if \(\theta^j\in H_j\), the code-length is given as \( -\log P_{\hat{\theta}} + \frac{1}{2}(\sum_{j=1}^k \dim H_j)\log n +o(\log n),\ P_\theta-a.e.,\) where \(\dim H_j\) is the Hausdorff dimension of \(H_j\). Our universal coding is considered to be a natural extension of Shannon code and MDL code, and smoothly connects them. The code-length gives a non-trivial upper bound of Kolmogorov complexity when the source is not computable.

(This work was done when the author was a member of The Institute of Statistical Mathematics.)

**Peter van der Helm**, Nijmegen Institute for Cognition and Information

Title: Incomputable randomness or computable regularity?

In the mathematical domain of algorithmic information theory (AIT), randomness is defined to mean so much as the absence of regularity. Accordingly, a simplest code of a symbol string is taken to be a code that squeezes out a maximum amount of regularity. This implies, in AIT, that simplest codes of symbol strings are incomputable, because one can never be sure that a maximum amount of regularity has been squeezed out. But, what is regularity? This question has been addressed in perceptual domain of structural information theory (SIT), resulting in a definition of regularity that seems relevant not only in visual perception but also in, for instance, molecular biology. This definition leads to a coding language in which, basically, only three kinds of regularity are taken into account. Within this perceptual coding theory, the set of symbol strings that may represent a given object is still incomputable, but simplest codes of given symbol strings are computable. That is, SIT's definition of regularity allows for so-called transparallel processing, which means that an exponential number of codes can be dealt with as if only one code were concerned.

**Paul Vitanyi**, CWI and the University of Amsterdam

Title: The Similarity Metric and Algorithmic Clustering of Music

A new class of metrics appropriate for measuring effective similarity relations between sequences, say one type of similarity per metric, is studied. We propose a new ``normalized information distance'', based on the noncomputable notion of Kolmogorov complexity, and show that it minorizes every metric in the class (that is, it is universal in that it discovers all effective similarities). We demonstrate that it too is a metric and takes values in [0,1]; hence it may be called the similarity metric. This is a theory foundation for a new general practical tool. We give two distinctive applications in widely divergent areas (the experiments by necessity use just computable approximations to the target notions). First, we computationally compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we give fully automatically computed language tree of 52 different language based on translated versions of the ``Universal Declaration of Human Rights''.

We present a fully automatic method for music classification, based only on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is completely general and can, without change, be used in different areas like linguistic classification and genomics. It is based on an ideal theory of the information content in individual objects (Kolmogorov complexity), information distance, and a universal similarity metric. Experiments show that the method distinguishes reasonably well between various musical genres and can even cluster pieces by composer. This generated some interest in the popular press, see http://www.cwi.nl/~paulv

**Volodya Vovk**

Title: Predictive complexity, randomness and information

Kolmogorov complexity can be interpreted as the minimum loss in the game of lossless compression (or probability forecasting with the logarithmic loss function). Once complexity is defined, we can define randomness (as maximal complexity, under given restrictions) and information (how much the complexity of an object decreases when another object is given). The game of lossless compression is only one of interesting on-line prediction games; examples of other games are the square-loss game, the absolute-loss game, the game of investing in a stock market, the investment game with short positions allowed, etc. The notion of Kolmogorov complexity can be extended to many on-line prediction games, leading to "predictive complexity". Similarly to Kolmogorov complexity, predictive complexity leads to the notions of predictive randomness and predictive information.

In this talk, for each of the three areas (predictive complexity, randomness, information) I will state a recent mathematical result: the criterion of existence of predictive complexity; the estimate of the number of non-random sequences; the limits of data-snooping. All these results concern binary games and were obtained in collaboration with Yuri Kalnishkan and Michael Vyugin.

**Marcelo Weinberger**, Information Theory Research Group, Hewlett-Packard Laboratories

Title: Universal discrete Denoising: Known Channel

A discrete denoising algorithm estimates the input sequence to a discrete memoryless channel (DMC) based on the observation of the entire output sequence. For the case in which the DMC is known and the goodness of the reconstruction is evaluated with a given single-letter fidelity criterion, we propose a discrete denoising algorithm that does not assume knowledge of statistical properties of the input sequence. Yet, the algorithm is universal in the sense of asymptotically performing as well as the optimum denoiser that knows the input sequence distribution, which is only assumed to be stationary and ergodic. Moreover, the algorithm is universal also in a semi-stochastic setting, in which the input is an individual sequence, and the randomness is due solely to the channel noise. The proposed denoising algorithm is practical, requiring a linear number of register-level operations and sub-linear working storage size relative to the input data length.

**Abraham Wyner and Dean Foster**, University of Pennsylvania

Title: On the lower limits of entropy estimation

In recent paper, Antos and Kontoyiannis considered the problem of estimating the entropy of a countably infinite discrete distribution from independent identically distributed observations. They left several open problems regarding the convergence rates of entropy estimates, which we consider here. Our first result, is that the plug-in estimate of entropy is as efficient as a match length estimator when applied to the class of memoryless sources on a countable alphabet with finite entropy and finite entropy ``variance". Our second result provides lower bounds on the convergence rate of any sequence of universal estimators of entropy over the same class of distributions. Finally, we consider an estimator based on match lengths that achieves this lower bound to first order. The surprising conclusion that follows is that a match-length estimator is first order optimal over the simplest class of distributions for which entropy estimation is non-trivial. We describe how this in turn implies that the Lempel-Ziv algorithm has an optimal convergence rate among the class of universal data compression algorithms over the arguably simplest class of non-trivial sources.

**Bin Yu**, UC Berkeley

Title: Boosting: convergence, consistency, and minimax results

Boosting is one of the most significant advances in machine learning for classification and regression. In its original version, the boosting estimator seeks to minimize empirically a loss function in a greedy fashion. The estimator of the target takes an additive function form and is built iteratively by applying a base learner to updated samples depending on the previous iterations. An unusual form of regularization, early stopping, is used to based CV or a test set.

In this talk, after an overview, boosting is analyzed in the early stopping framework. We present convergence and consistency results for general loss functions (Zhang and Yu, 2003), and minimax results for L2Boosting (Buhlmann, 2001, 2003) with smoothin spline base learners. These theoretical results shed lights on the following empirical observations on boosting: (a). It is better to use simple or weak base learners; (b). It is better to use small greedy search steps; (c). Ovefitting happens very slowly most of the time. Finally, we demonstrate that L2Boosting of component-wise smoothing splines is an effective model building tool for high dimesions. This is supported by experiments with real and simulated data sets.

**Peng Zhao**, University of California, Berkeley

Title: Some analysis of a predictive lossless coder for audio signals

Recently a coding method has been developed (Gerald Schuller, Bin Yu, Dawei Huang, and Bern Edler) which achieves leading compression ratios and a low lag for a wide variety of audio sources. To resolve the conflicts between the irrelevance and redundancy reduction, the method was separated into different stages. The irrelevance reduction stage consists of a psycho-acoustically controlled pre-filter. Then, to achieve an efficient reduction of redudancy, a new and novel lossless coder based on prediction was designed.

Although the prediction method has been tested broadly for this application, why and how it works has been quite a puzzle. I will try to describe its relationship with boosting, and the theory of competitive on-line algorithms which seem to provide an explanation of its strong performance. Also, I will show some empirical results and a comparison to a similar competitive on-line algorithm.

Previous: Participation

Next: Registration

Workshop Index

DIMACS Homepage

Contacting the Center

Document last modified on May 14, 2003.