DIMACS Seminar on Math and CS in Biology


Locating Protein Coding Regions in Human DNA using Decision Trees


Steven Salzberg
Johns Hopkins University


431 CoRE Building, Busch Campus
Rutgers University


3:00 PM
Monday, April 10, 1995


Genes in eukaryotic DNA stretch across hundreds or thousands of base pairs, while the regions of those genes that code for proteins usually occupy only a small percentage of the sequence. Identifying the coding regions is vital to understanding mammalian genetics. Using the growing body of publicly available DNA sequences, researchers have begun experimenting with computational methods for distinguishing coding from non-coding regions, and several promising results have been reported. Existing methods have the greatest difficulty with short DNA sequences, for which the available statistics are quite limited. We describe here a new approach, based on a randomized decision tree algorithm, for identifying coding regions in DNA. This approach produces consistently higher accuracies than previous methods on short DNA subsequences, and the algorithm can easily be trained for DNA sequences of any length. The talk will review the gene identification problem as background before presenting details of the decision tree algorithm and the experiments on human DNA sequences.

Upcoming Talks:

April 17: Dr. Charles Cantor, Boston U. (distinguished lecture)
April 24: Dr. Richard Lipton, Princeton (seminar will be held at U. Penn)

Document last modified on April 6, 1995