DIMACS Theory Seminar


Locating Protein Coding Regions in Human DNA using Decision Trees


Steven Salzberg
Johns Hopkins University


Computer Science Building, 35 Olden Street, Room 302
Princeton University


1:30 - 2:30 PM
Thursday, April 6, 1995


Genes in eukaryotic DNA stretch across hundreds or thousands of base pairs, while the regions of those genes that code for proteins usually occupy only a small percentage of the sequence. Identifying the coding regions is vital to understanding mammalian genetics. Using the growing body of publicly available DNA sequences, researchers have begun experimenting with computational methods for distinguishing coding from non-coding regions, and several promising results have been reported. Existing methods have the greatest difficulty with short DNA sequences, for which the available statistics are quite limited. We describe a new approach to identifying coding regions in DNA, based on a randomized decision tree algorithm. This approach produces consistently higher accuracies than previous methods on short DNA subsequences, and the algorithm can easily be trained for DNA sequences of any length. The talk will review the gene identification problem as background before presenting details of the decision tree algorithm and the experiments on human DNA sequences.

A reception follows the talk at 2:30 in the Tea Room.

Host: Simon Kasif (kasif@cs.princeton.edu)

Document last modified on March 31, 1995