Automatic Analysis and Annotation of Data from the Mycoplasma capricolum Project

P.M. Gillevet1, N. Maltsev2, R. Overbeek2 and T. Gaasterland2

1George Mason University, Fairfax, Virginia.
2Mathematics and Computer Science Division, Argonne National Laboratory

We have accumulated over a million raw bases of Mycoplasma capricolum sequence (1,039,095 bp), corresponding to a total of over a quarter of a million linear bases (267,686 bp). The 287 open reading frames (ORFs) longer than 30 amino acids found in the 381 contigs have been analyzed by a semi-automatic system (GenQuiz), and 215 (75%) have significant similarities to known proteins. Present sequence annotation approaches require skilled biologists to make key judgments both in the routine processing of sequence data and in the interpretation of the output from various analytical tools. We therefore developed a semi-automated system that analyzes sequence data with a number of analytical tools and parses the resulting outputs. This prototype system will be expanded with the ultimate goal of building an expert system to annotate the sequence data automatically. The sequence data were automatically submitted to a suite of available tools (including Blast, Fasta, Blocks, and Blitz). This process combines locally maintained tools with access to servers available over the network, and it is all achieved without manual intervention. The results from the tools are translated into Prolog facts asserting specific properties (such as similarities to known sequences and putative CDSs from tools like Genmark). The encoded output from the tools was then parsed using a semi-automated tool built in Prolog, and heuristic rules for correlating annotation facts were added. We re-analyzed the Mycoplasma capricolum dataset and identified 6 new similarities that were missed in the previous analysis. The implications of this work for the analysis of large-scale sequencing projects will be discussed.
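
As a rough illustration of the fact-encoding step described above, the following Python sketch turns already-parsed tool hits into Prolog-style facts and applies one simple correlation heuristic. It is not the authors' system: the `Hit` record, the `similarity/4` predicate, the identifiers, and the "at least two tools agree" rule are all hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One similarity hit, assumed to be already parsed from a tool's output."""
    orf: str      # hypothetical ORF identifier
    tool: str     # tool that reported the hit, e.g. "blast"
    subject: str  # identifier of the matched database entry
    score: float  # tool-specific score


def quote_atom(text: str) -> str:
    """Quote a string as a Prolog atom, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_fact(hit: Hit) -> str:
    """Render a hit as a Prolog fact using the hypothetical similarity/4 predicate."""
    return (
        f"similarity({quote_atom(hit.orf)}, {quote_atom(hit.tool)}, "
        f"{quote_atom(hit.subject)}, {hit.score})."
    )


def corroborated_orfs(hits: Iterable[Hit], min_tools: int = 2) -> List[str]:
    """Illustrative heuristic: keep ORFs with hits from at least min_tools distinct tools."""
    tools_by_orf = defaultdict(set)
    for hit in hits:
        tools_by_orf[hit.orf].add(hit.tool)
    return sorted(orf for orf, tools in tools_by_orf.items() if len(tools) >= min_tools)


if __name__ == "__main__":
    # Made-up example records, not data from the project.
    example_hits = [
        Hit("orf_001", "blast", "entry_a", 120.0),
        Hit("orf_001", "fasta", "entry_a", 95.5),
        Hit("orf_002", "blast", "entry_b", 60.0),
    ]
    for hit in example_hits:
        print(to_fact(hit))
    print(corroborated_orfs(example_hits))
```

In a Prolog-based setup such as the one described, a rule of this kind would more naturally be written as a Prolog clause over the asserted facts; the Python version only shows the shape of the idea.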

Document last modified on March 28, 2000.