DIMACS TR: 2003-23

Feature Selection and Training Set Sampling for Ensemble Learning on Heterogeneous Data

Authors: Iryna Skrypnyk, Tin Kam Ho


Heterogeneity is a frequent issue in contemporary large databases and a critical property that most data mining techniques must treat explicitly. In this work we formalize the notion of data heterogeneity, consider its various types, and study the case of feature space heterogeneity with ensemble techniques. Ensembles, or multiple classifier systems, can produce more accurate classifications than a single classifier. However, the mechanisms behind the accuracy gains differ across ensemble techniques such as bootstrapping and error-correcting coding. For instance, training set sampling techniques are superior at reducing error bias and variance, while feature space sampling (spacing) techniques are often good at increasing the diversity of the component classifiers' predictions. Applying ensembles to heterogeneous data turns these mechanisms toward covering local homogeneous regions with different component classifiers, while still maintaining the concept of a weak classifier. This matters because such local homogeneous regions are very hard to elicit without prior domain knowledge, and the known techniques for finding homogeneous regions are designed for specific cases of heterogeneity, such as contextual features. We experiment with two representative ensemble techniques: Bagging for training set sampling and the Random Subspace Method for spacing. Their contributions to accuracy growth are studied on several synthetic data sets modeling different cases of feature space heterogeneity. The potential benefits of combining spacing and sampling are also considered.
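The two sampling mechanisms contrasted above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: Bagging resamples the training set with replacement, the Random Subspace Method draws a random subset of features, and the component classifiers' predictions are combined by plurality voting. All function names here are illustrative.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # Bagging: resample the training set with replacement, same size as the original.
    return [rng.choice(data) for _ in data]

def random_subspace(n_features, k, rng):
    # Random Subspace Method: select k of the n_features feature indices at random.
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    # Combine the component classifiers' predictions by plurality voting.
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
train = list(range(10))              # placeholder training examples
sample = bootstrap_sample(train, rng)  # one component classifier's training set
subspace = random_subspace(8, 4, rng)  # one component classifier's feature set
vote = majority_vote([1, 0, 1, 1])     # ensemble decision from four components
```

Each component classifier in a Bagging ensemble would be trained on its own `bootstrap_sample`; each component in a Random Subspace ensemble would see only the features in its own `random_subspace`.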

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2003/2003-23.ps.gz