DIMACS Mixer Series, September 23, 2004


Satyen Kale, Princeton University Computer Science

Title: $O(\sqrt{\log n})$ Approximation to SPARSEST CUT in $\tilde{O}(n^2)$ Time

We show how to compute $O(\sqrt{\log n})$-approximations to SPARSEST CUT and BALANCED SEPARATOR problems in $\tilde{O}(n^2)$ time, thus improving upon the recent algorithm of Arora, Rao and Vazirani (2004). Their algorithm uses semidefinite programming and required $\tilde{O}(n^{4.5})$ time. Our algorithm relies on efficiently finding expander flows in the graph and does not solve semidefinite programs. The existence of expander flows was also established by Arora, Rao, and Vazirani.

Pandurang Kamat, Rutgers University

Title: Content Integrity Service for Long-Term Digital Archives

In this talk I will outline the motivation behind such a service and present its design. The goal of the service is to prove that any information retrieved from the archive is authentic and has not been unintentionally or maliciously altered -- even after its bit representation in the archive has undergone one or more transformations. We use one-way hash functions and digital time-stamping procedures to unambiguously answer questions about the integrity, time of creation, and evolution history of data in the archive. I will also discuss the prototype of the service that was implemented over the summer. The prototype was designed as a service in the Content Services Framework that is part of HP's Digital Media Platform.

Martin Pal, DIMACs postdoc

Title: Boosted Sampling: Approximation Algorithms for Stochastic Optimization

Imagine a telecommunication company planning to build a network to serve a number of users. All would be nice and easy, if only the exact number, location, and demands of users was known.. except that often this is not the case. Thus the company faces a difficult decision: shall it go ahead and design a network based on incomplete information, or wait until more data becomes available? If it starts building early, the resulting network may not fit well with the actual usage pattern; if it waits, it may incur high cost because of the need to build quickly and under time pressure.

We consider a simple two-stage stochastic optimization model with recourse. In the first stage, building costs are low, but we do not know the demands -- we only know a distribution \pi of requirements that will be revealed in the second stage. In the second stage, the demands are revealed, and we may need to build additional links, that have become more expensive by a factor \sigma, to make sure that we satisfy the demands of all users. The goal is to minimize the expected cost of elements purchased in both stages.

We give a general technique to adapt approximation algorithms for several non-stochastic network design and other optimization problems to their stochastic versions by using a boosted random sample (by drawing \sigma times repeatedly) to construct our solution. This simple approach yields constant-factor approximation algorithms for stochastic variants of several problems including Steiner Tree, Steiner Network, Rent-or-Buy problems, Vertex Cover or Uncapacitated Facility Location.

This is joint work with Anupam Gupta, R. Ravi and Amitabh Sinha.

Val Pinciu, DIMACS visitor, Southern Connecticut State University

Title: Guards, Guarded Guards, and Connected Guards in Art Galleries

We consider several variations of the original Art Gallery Problem. Given an art gallery (a simple polygon with n sides), a guard set is a set of points G such that every point inside the polygon is visible to some point in G. A guard set where every guard is visible by at least some other guard is called guarded guard set. A guard set where the visibility graph is connected is called a connected guard set. We provide sharp bounds for the minimum number of guards, guarded guards, and connected guards necessary to cover a polygonal region. We also consider the traditional galleries where the walls are orthogonal.

Iryna Skrypnyk, University of Jyväskylä, Finland

Title: Combining Ensemble Techniques for Decomposition of Classification Heterogeneity

Construction of the predictive models of heterogeneous data is a topical problem in a present-day data mining research. Heterogeneous classification problems are characterized as having features unstable in their relevance for prediction across the set of instances. In natural domains instances from the same class often tend to group in the feature space. Instances which are close in the feature space share the same class label more often than not. For heterogeneous classification problems such tendency is exhibited quite often. This particular type of classification heterogeneity is called class heterogeneity.

The approach applied to construct predictive models of heterogeneous data is based on decomposition of heterogeneous classification problems onto subproblems. The ensemble technique combining local feature selection and class encoding is based on construction of local models for homogeneous regions of data. Those homogeneous regions can be approximated estimating data characteristics revealing the structure of heterogeneous data. It is argued that local feature interactions and independent local predictive ability of each feature are among such data characteristics. Feature merit measures originated from feature selection methods have been proposed to estimate those characteristics.

Class encoding has been applied in order to approximate grouping of instances in the case of class heterogeneity. In the decomposition scheme for ensemble generation developed class encoding is combined with local feature selection. Performance of one-per-class and pairwise class decomposition schemes for class encoding and correlation based feature subset selection has been studied experimentally within the developed combined ensemble technique. Various application aspects have been analyzed and different factors affecting performance of this technique have been revealed.

The preliminary results are encouraging and promote an important move towards developing solutions for different types of classification heterogeneity. The combined ensemble technique for heterogeneity decomposition can be extended for all variations of heterogeneity since the way to partition the set of instances for decomposition is determined.

Keywords: feature selection, classification, ensemble of learning models, data mining, knowledge discovery, machine learning