DIMACS TR: 2004-40
Category-based feature extraction in supervised categorization of
Aviation Safety Report System documents
Authors: Yangzhe Xiao, Haym Hirsh, Casimir Kulikowski, Michael Littman and
Ilya Muchnik
ABSTRACT
In this study, we introduce novel feature extraction approaches
for text categorization, which are based on characteristic
descriptions of categories. A new category-based feature is
derived from the relationship between a document and such a
description. A document is then projected onto the category-based
coordinates in the new feature space. We evaluate two different
approaches for extracting category-based features. One is a
category-specific weighting, where the description of a category
consists of the relative discriminating power of each term with
respect to that category. The new feature for this category is
the sum of the terms of a document vector, weighted according to
this description. The other is classifier-based, where a
description is a learned classifier of a category and the new
feature for a document is the judgment of this classifier for
this document. We evaluate the new feature extraction methods on
Aviation Safety Report System documents using Support Vector
Machines. The category-based feature extraction methods give
results comparable to the best obtained using feature
selection with chi-square-based term ranking.
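
The two kinds of category description summarized above can be illustrated
with a minimal sketch. It is not the report's implementation, and it rests
on several assumptions: the chi-square statistic stands in for the
"relative discriminating power" of a term (the abstract does not say which
measure is used), a difference-of-centroids linear scorer stands in for the
learned classifier (the report uses Support Vector Machines), and all
function names and the toy data are hypothetical. In both cases a document
is projected onto one coordinate per category.

```python
from typing import List, Sequence

Vector = Sequence[float]  # term-frequency vector over a fixed vocabulary


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def chi_square_description(docs: List[Vector], in_cat: List[bool]) -> List[float]:
    """Category-specific weighting: one chi-square score per term.

    Assumption: chi-square is used as the discriminating power of a term.
    """
    n = len(docs)
    n_pos = sum(in_cat)
    n_neg = n - n_pos
    weights = []
    for t in range(len(docs[0])):
        a = sum(1 for x, y in zip(docs, in_cat) if x[t] > 0 and y)      # term present, in category
        b = sum(1 for x, y in zip(docs, in_cat) if x[t] > 0 and not y)  # term present, not in category
        c = n_pos - a                                                   # term absent, in category
        d = n_neg - b                                                   # term absent, not in category
        denom = (a + b) * (c + d) * n_pos * n_neg
        weights.append(n * (a * d - b * c) ** 2 / denom if denom else 0.0)
    return weights


def centroid_description(docs: List[Vector], in_cat: List[bool]) -> List[float]:
    """Classifier-based description: a linear scorer for one category.

    Assumption: mean(positive docs) - mean(negative docs) replaces the SVM.
    """
    pos = [x for x, y in zip(docs, in_cat) if y]
    neg = [x for x, y in zip(docs, in_cat) if not y]

    def mean(rows: List[Vector], t: int) -> float:
        return sum(r[t] for r in rows) / len(rows) if rows else 0.0

    return [mean(pos, t) - mean(neg, t) for t in range(len(docs[0]))]


def project(doc: Vector, descriptions: List[List[float]]) -> List[float]:
    """Map a document to one category-based feature per description."""
    return [dot(desc, doc) for desc in descriptions]


# Toy corpus: 4 documents over a 3-term vocabulary, 2 categories.
docs = [[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 0]]
membership = [
    [True, True, False, False],   # category 0
    [False, False, True, False],  # category 1
]

weighting = [chi_square_description(docs, m) for m in membership]
linear = [centroid_description(docs, m) for m in membership]

features_weighted = [project(x, weighting) for x in docs]
features_classifier = [project(x, linear) for x in docs]
```

Each row of features_weighted or features_classifier is a document in the
new feature space, with as many coordinates as there are categories. In a
setup like the one the abstract describes, these vectors would then be
passed to the final classifier in place of the original term vectors.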
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2004/2004-40.ps.gz