We describe a method which enables us to search efficiently for maximal common substructures in a set of protein structures. In order to obtain a unique description of the secondary structure of a protein we model it as an undirected labeled graph such that the vertices represent helices or strands and the edges describe spatial adjacencies between the vertices. Connected components in these graphs are called "topology graphs".
In order to search for maximal common subgraphs in more than two topology graphs we must first compute all maximal common subgraphs in all graph pairs. We express all possibilities of compatible edge pairs in the considered topology graph pair in one graph [1], the so-called "product graph". Then, a maximal common substructure in both topology graphs corresponds to a maximal clique in the product graph.
For solving the NP-hard maximum-clique problem we enumerate all maximal cliques in product graphs. Our method is based on the algorithm of Bron & Kerbosch [2]. The efficiency of our technique is based on two optimizations. First, we recognize and eliminate equal search spaces. Second, we cut the search space and reduce the number of maximal cliques drastically by building up only those maximal cliques, which correspond exclusively to connected subgraphs in the participating topology graphs.
For large topology graphs, for instance with 32 and 55 vertices, respectively, and 36 and 74 edges, respectively, the product graph has 316 vertices and 40109 edges. We can enumerate the resulting 323 cliques for this example in a few seconds, whereas the original algorithm needs more than four days.
Searching for maximal connected subgraphs in more than two topology graphs we have to compute all maximal common subgraphs in all graph pairs. We store the appropriate edge sets representing maximal common substructures for every topology graph pair in ordered binary trees. By intersection of these edge sets we obtain the maximal common substructure in a set containing more than two topology graphs.