DIMACS TR: 2004-40

Category-based feature extraction in supervised categorization of Aviation Safety Report System documents

Authors: Yangzhe Xiao, Haym Hirsh, Casimir Kulikowski, Michael Littman and Ilya Muchnik

In this study, we introduce novel feature extraction approaches for text categorization that are based on characteristic descriptions of categories. A new category-based feature is derived from the relationship between a document and such a description, and each document is then projected onto the category-based coordinates of the new feature space. We evaluate two approaches for extracting category-based features. The first is category-specific weighting, in which the description of a category consists of the relative discriminating powers of all terms with respect to that category; the new feature for the category is the sum of the terms of a document vector, weighted according to this description. The second is classifier-based: the description is a classifier learned for the category, and the new feature for a document is that classifier's judgment of the document. We evaluate the new feature extraction methods on Aviation Safety Report System documents using Support Vector Machines. They give results comparable to the best ones obtained using feature selection with Chi-square-based term ranking.
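The category-specific weighting idea can be illustrated with a minimal sketch. The abstract does not say how the discriminating power of a term is computed, so the measure below (mean term value inside the category minus mean term value outside it) is only an assumption chosen for simplicity, and the names category_weights and project, the toy vocabulary, and the category labels are all hypothetical.

```python
from typing import Dict, List, Sequence, Set

Vector = List[float]


def category_weights(docs: Sequence[Vector],
                     labels: Sequence[Set[str]],
                     category: str) -> Vector:
    """Build a description of one category: one weight per term.

    ASSUMPTION: discriminating power is taken to be the mean term value
    in documents of the category minus the mean in all other documents.
    The report may well use a different measure.
    """
    n_terms = len(docs[0])
    inside = [d for d, l in zip(docs, labels) if category in l]
    outside = [d for d, l in zip(docs, labels) if category not in l]

    def mean(rows: Sequence[Vector], j: int) -> float:
        return sum(r[j] for r in rows) / len(rows) if rows else 0.0

    return [mean(inside, j) - mean(outside, j) for j in range(n_terms)]


def project(doc: Vector, descriptions: Dict[str, Vector]) -> Dict[str, float]:
    """Map a term vector to one feature per category (a weighted sum of its terms)."""
    return {
        cat: sum(w * x for w, x in zip(weights, doc))
        for cat, weights in descriptions.items()
    }


# Toy term-frequency vectors over a hypothetical 3-term vocabulary.
docs = [[2.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 3.0, 1.0], [0.0, 1.0, 0.0]]
labels = [{"weather"}, {"weather"}, {"mechanical"}, {"mechanical"}]

descriptions = {
    cat: category_weights(docs, labels, cat) for cat in ("weather", "mechanical")
}
features = project([1.0, 1.0, 0.0], descriptions)
print(features)
```

Each document is reduced from one coordinate per term to one coordinate per category, and the resulting low-dimensional vectors would then be passed to the final classifier. In the classifier-based variant, the per-category weighted sum would be replaced by the output of a classifier trained for that category.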

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2004/2004-40.ps.gz