A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acids, and a dynamic-programming distance is calculated between each pair of segments. This space of segments is first embedded into Euclidean space with small metric distortion. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most up-to-date classifications based on functional and structural protein data. Motifs and domains such as the Zinc Finger, EF hand, Homeobox, EGF-like and others are identified automatically and correctly. A novel representation of protein families is introduced, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporter family.
The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items.
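
To make the first two steps concrete (segmentation and pairwise dynamic-programming distances), here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used in this work: the abstract does not specify the distance, so a unit-cost edit distance (Levenshtein) stands in for it; the windows are assumed to be consecutive and non-overlapping; and a trailing piece shorter than 50 residues is assumed to be dropped. The function names `split_into_segments`, `edit_distance` and `pairwise_distances` are hypothetical.

```python
from itertools import combinations

SEGMENT_LENGTH = 50  # segment size stated in the abstract


def split_into_segments(sequence, length=SEGMENT_LENGTH):
    """Cut a sequence into consecutive, non-overlapping windows.

    Assumption: a trailing piece shorter than `length` is dropped.
    """
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def edit_distance(a, b):
    """Unit-cost edit distance by dynamic programming (Levenshtein).

    Assumption: this is a stand-in for the unspecified distance in the
    abstract; a biological scoring scheme would replace the unit costs.
    """
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def pairwise_distances(segments):
    """Return the symmetric matrix of distances between all segments."""
    n = len(segments)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        distance = edit_distance(segments[i], segments[j])
        matrix[i][j] = matrix[j][i] = distance
    return matrix
```

A symmetric matrix of this kind is the sort of input that an embedding and clustering stage would consume. The same code works unchanged for any items that support a computable distance, which is the generality the preceding sentence refers to.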
(Joint work with Nathan Linial and Naftali Tishby of the Hebrew University Computer Science Department, and Michal Linial of the Hebrew University Life Sciences Department.)