DIMACS TR: 2004-33

Automatic screening for groups of orthologous genes in comparative genomics using multiple-component clustering

Authors: Akshay Vashist, Casimir A. Kulikowski, Ilya Muchnik

Understanding evolutionary relationships among genes from different organisms is a problem in modeling evolutionary history that also bears on practical problems of functional annotation of genes. We have developed an automatic method for discovering groups of gene sequences, present in different organisms, that are functionally related through evolution.

We have developed a new clustering method that builds clusters from multi-component data. In our case the data are a large set of genomes in which one has to find clusters that are groups of orthologous genes, focusing on hyper-inter-similarities among genes from different genomes more than on intra-similarities among genes from the same genome.

We have found that discovering these groups provides a "strong draft" of the complete picture of orthologous relations among genes in the complete genomes studied. Comparison of these groups with the well-known, semi-automatically extracted Clusters of Orthologous Groups, COG [http://www.ncbi.nlm.nih.gov/COG/], shows a strong correlation between the two systems of clusters. For instance, more than 85% of our clusters include genes from at least three different genomes, and each of these genes belongs to a COG. These studies demonstrate that the method can be applied to automatic screening of groups of orthologous genes when analyzing a large collection of genomes from different organisms.
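As an illustration of the kind of summary statistic quoted above, the following minimal sketch computes the fraction of clusters that span at least three genomes and whose genes all carry a COG assignment. It is not the authors' code: the function name `screened_fraction`, the dictionaries `genome_of` and `cog_of`, and the toy gene identifiers are all hypothetical, and reading "each of these genes belongs to COGs" as "every gene in the cluster has some COG assignment" is an assumption.

```python
from typing import Dict, List


def screened_fraction(
    clusters: List[List[str]],
    genome_of: Dict[str, str],
    cog_of: Dict[str, str],
    min_genomes: int = 3,
) -> float:
    """Fraction of clusters that span at least `min_genomes` distinct genomes
    and in which every gene has a COG assignment (hypothetical criterion)."""
    if not clusters:
        return 0.0
    hits = 0
    for genes in clusters:
        # Distinct genomes represented in this cluster.
        genomes = {genome_of[g] for g in genes}
        # True only if every gene in the cluster appears in the COG mapping.
        all_in_cog = all(g in cog_of for g in genes)
        if len(genomes) >= min_genomes and all_in_cog:
            hits += 1
    return hits / len(clusters)


# Toy data (invented for illustration): three clusters over genomes A, B, C.
toy_clusters = [["a1", "b1", "c1"], ["a2", "b2"], ["a3", "b3", "c3"]]
toy_genome_of = {g: g[0].upper() for cluster in toy_clusters for g in cluster}
# Gene "c3" deliberately has no COG assignment.
toy_cog_of = {g: "COG_X" for g in ["a1", "b1", "c1", "a2", "b2", "a3", "b3"]}

print(screened_fraction(toy_clusters, toy_genome_of, toy_cog_of))
```

In the toy data only the first cluster satisfies both conditions: the second spans just two genomes, and the third contains a gene without a COG assignment.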

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2004/2004-33.ps.gz
