DIMACS TR: 95-04
DNA Sequence Classification Using Compression-Based Induction
Authors: David Loewenstern, Haym Hirsh, Michiel Noordewier, Peter Yianilos
ABSTRACT
Inductive learning methods, such as neural networks and decision
trees, have become a popular approach to developing DNA sequence
identification tools. Such methods attempt to form models of a
collection of training data that can be used to predict future data
accurately. The common approach to using such methods on DNA sequence
identification problems forms models that depend on the {\em absolute
locations} of nucleotides and assume {\em independence} of consecutive
nucleotide locations. This paper describes a new class of learning
methods, called {\em compression-based induction} (CBI), that is
geared towards sequence learning problems such as those that arise
when learning DNA sequences. The central idea is to use text
compression techniques on DNA sequences as the means for generalizing
from sample sequences. The resulting methods form models that are
based on the more important {\em relative locations} of nucleotides
and on the {\em dependence} of consecutive locations. They also
provide a suitable framework for injecting biological domain knowledge
into the learning process. We present initial
explorations of a range of CBI methods that demonstrate the potential
of our methods for DNA sequence identification tasks.
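
As a rough illustration of the central idea, and not the specific CBI
methods of the report, the sketch below classifies a sequence by asking
which class's model would compress it into the fewest bits. It assumes a
fixed order-k context model with add-one smoothing as a stand-in for a
real text compressor; the function names (train_context_model,
code_length_bits, classify), the choice k=2, and the toy training
sequences are all hypothetical. Because each nucleotide is predicted from
the k nucleotides before it, the model depends on relative rather than
absolute locations and captures dependence between consecutive locations.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def train_context_model(sequences, k=2):
    """Count how often each nucleotide follows each length-k context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts


def code_length_bits(seq, counts, k=2):
    """Bits an ideal coder would spend on seq under the model.

    Uses add-one smoothing, so an unseen context costs log2(4) = 2 bits
    per nucleotide. The first k nucleotides are not charged.
    """
    bits = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})
        total = sum(ctx.values()) + len(ALPHABET)
        p = (ctx.get(seq[i], 0) + 1) / total
        bits -= math.log2(p)
    return bits


def classify(seq, models, k=2):
    """Return the label whose model gives seq the shortest code length."""
    return min(models, key=lambda label: code_length_bits(seq, models[label], k))


# Toy, made-up training data for two hypothetical classes.
models = {
    "repeat": train_context_model(["ACACACACACACACAC"]),
    "other": train_context_model(["GGGTTTGGGTTTGGGTTT"]),
}
print(classify("CACACACA", models))    # compare code lengths per class
print(classify("TTGGGTTTGG", models))
```

A sequence that shares local patterns with a class's training data gets
a short code under that class's model, wherever in the sequence those
patterns occur. A practical system would replace the fixed-order model
with a stronger compressor and could encode domain knowledge in the
choice of contexts.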
Paper available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1995/95-04.ps.gz