Given the pace at which new biological information is becoming available, the ability to automatically cluster, classify, and annotate it across the traditional boundaries of individual databases is increasingly critical (see Macauley, Wang, and Goodman [1998]). For instance, DNA sequences for promoter and enhancer regions, structural motifs in transcription factors, metabolic pathway databases, and gene expression analysis are all tightly interconnected, yet they tend to be studied in isolation. When this happens, clustering techniques are often less effective because, in the absence of additional constraints, they must deal directly with the high dimensionality of the solution space. Information from one domain can supply such constraints in another: functional clustering of protein sequences can help reduce the complexity of structural clustering, and vice versa. Analogously, functional clustering in the gene expression domain has been shown to significantly reduce the complexity of promoter region analysis.
We will consider high-dimensional combinatorial algorithms and probabilistic models, central to DM, for the analysis, clustering, and classification of complex data patterns; these methods can be used to integrate diverse information from many biological sources. Motivating approaches include the work of Roth, Hughes, Estep, and Church [1998], tying together mRNA monitoring and transcription factors; of Bystroff and Baker [1998], using 1D database information to improve 3D fold recognition; of Eckman et al. [1998] on large-scale diverse data; and of Karp and Paley [1996], integrating genomic data and metabolic pathway data.