DIMACS TR: 2003-23
Feature Selection and Training Set Sampling for Ensemble
Learning on Heterogeneous Data
Authors: Iryna Skrypnyk, Tin Kam Ho
ABSTRACT
Heterogeneity is a frequent issue in contemporary large databases, and a
critical property that most data mining techniques must treat explicitly. In
this work we formalize the notion of data heterogeneity, consider its various
types, and study the case of feature space heterogeneity with ensemble
techniques. Ensembles, or multiple classifier systems, can produce more
accurate classifications than a single classifier. However, the mechanisms
behind the accuracy gain differ across ensemble techniques such as
bootstrapping and error-correcting coding. For instance, training set
sampling techniques are superior in reducing error bias and variance, while
feature space sampling (spacing) techniques are often better at increasing
the diversity of the component classifiers' predictions. When ensembles are
applied to heterogeneous data, these mechanisms work instead to promote
coverage of local homogeneous regions by different component classifiers,
while still maintaining the concept of a weak classifier. This matters
because such local homogeneous regions are very hard to elicit without prior
domain knowledge, and known techniques for finding homogeneous regions are
designed for specific cases of heterogeneity, such as contextual features.
We experiment with two representative ensemble techniques: Bagging for
training set sampling and the Random Subspace Method for spacing. Their
contributions to accuracy growth are studied on several synthetic data sets
modeling different cases of feature space heterogeneity, and the potential
benefits of combining spacing and sampling are considered.
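The contrast between the two sampling mechanisms named above can be made
concrete: Bagging resamples training examples (rows) with replacement, while
the Random Subspace Method samples feature indices (columns) without
replacement. The following is a minimal stdlib-only sketch of the two
sampling steps; the function names and the toy data are illustrative
assumptions, not part of the report.

```python
import random

def bootstrap_sample(X, y, rng):
    # Bagging: draw n training examples with replacement (row sampling).
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_subspace(X, k, rng):
    # Random Subspace Method: pick k feature indices without
    # replacement, then project every example onto them (column sampling).
    feats = rng.sample(range(len(X[0])), k)
    return [[row[j] for j in feats] for row in X], feats

# Toy data: 100 examples with 10 features each (purely illustrative).
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(100)]
y = [rng.randrange(2) for _ in range(100)]

Xb, yb = bootstrap_sample(X, y, rng)     # same feature space, resampled rows
Xs, feats = random_subspace(X, 5, rng)   # all rows, half of the features
```

Each component classifier in an ensemble would be trained on one such sample,
so the two methods diversify the ensemble along orthogonal axes: the examples
seen versus the features seen.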
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2003/2003-23.ps.gz