We have developed a new clustering method that builds clusters from multi-component data. In our case the data are a large set of genomes, and the clusters to be found are groups of orthologous genes. The method emphasizes hyper-inter-similarities among genes from different genomes over intra-similarities among genes from the same genome.
We have found that discovering these groups provides a "strong draft"
of the complete picture of orthologous relations among genes in the
complete genomes studied. Comparison of these groups with the well-known
semi-automatically extracted clusters of orthologous groups, COG
[http://www.ncbi.nlm.nih.gov/COG/], shows strong correlation between these
two systems of clusters. For instance, more than 85% of our clusters include
genes from at least three different genomes, and each of these genes belongs to a COG.
These studies demonstrate that the method can be applied to the automatic
screening of groups of orthologous genes when analyzing a large collection
of genomes from different organisms.
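
The comparison statistic quoted above (the share of clusters that span at least three genomes and whose genes all belong to COGs) could be computed along the following lines. This is only a minimal sketch under stated assumptions, not the procedure used in the report: the function name `fraction_consistent`, the representation of a gene as a `(genome_id, gene_id)` pair, the `cog_of` lookup table, and the toy data are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representation: a gene is a (genome_id, gene_id) pair.
Gene = Tuple[str, str]


def fraction_consistent(
    clusters: Iterable[Set[Gene]],
    cog_of: Dict[Gene, str],
    min_genomes: int = 3,
) -> float:
    """Fraction of clusters that contain genes from at least `min_genomes`
    distinct genomes and whose genes are all assigned to some COG."""
    clusters = list(clusters)
    if not clusters:
        return 0.0
    hits = 0
    for cluster in clusters:
        # Distinct genomes represented in this cluster.
        genomes = {genome for genome, _ in cluster}
        # Every gene must have a COG assignment in the lookup table.
        all_in_cog = all(gene in cog_of for gene in cluster)
        if len(genomes) >= min_genomes and all_in_cog:
            hits += 1
    return hits / len(clusters)


# Toy data (invented for illustration; not taken from the report).
clusters = [
    {("A", "a1"), ("B", "b1"), ("C", "c1")},               # qualifies
    {("A", "a2"), ("B", "b2")},                            # only two genomes
    {("A", "a3"), ("B", "b3"), ("C", "c3")},               # c3 has no COG
    {("A", "a4"), ("B", "b4"), ("C", "c4"), ("D", "d4")},  # qualifies
]
cog_of = {
    ("A", "a1"): "COG_x", ("B", "b1"): "COG_x", ("C", "c1"): "COG_x",
    ("A", "a2"): "COG_y", ("B", "b2"): "COG_y",
    ("A", "a3"): "COG_z", ("B", "b3"): "COG_z",
    ("A", "a4"): "COG_w", ("B", "b4"): "COG_w",
    ("C", "c4"): "COG_w", ("D", "d4"): "COG_w",
}

print(fraction_consistent(clusters, cog_of))
```

On the toy data, two of the four clusters meet both conditions. The sketch only checks that each gene has some COG assignment; whether all genes of a cluster fall into the same COG is a stricter test that the abstract does not specify.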
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2004/2004-33.ps.gz