DIMACS Seminar on Math and CS in Biology


An Algorithm for Finding Maximal Common Subtopologies in Protein Structures


Ina Koch
Institute for Algorithms and Scientific Computing, SCAI
GMD - German National Research Center for Information Technology
D-53754 Sankt Augustin, Germany


Computer Science Building, Room 402
Princeton University


11:00 AM
Tuesday, March 26, 1996


For many applications, the determination of structural similarities is an essential part of analyzing chemical compounds and biochemical structures. In contrast to small chemical compounds, protein structures are more complex such that they are often described on different structural levels. The secondary and supersecondary structure level is of special interest, because structural motifs for spatially adjacent secondary structure elements are often responsible for the function of the protein without constituting already a domain structure itself.

We describe a method which enables us to search efficiently for maximal common substructures in a set of protein structures. In order to obtain a unique description of the secondary structure of a protein we model it as an undirected labeled graph such that the vertices represent helices or strands and the edges describe spatial adjacencies between the vertices. Connected components in these graphs are called "topology graphs".

In order to search for maximal common subgraphs in more than two topology graphs we must first compute all maximal common subgraphs in all graph pairs. We express all possibilities of compatible edge pairs in the considered topology graph pair in one graph [1], the so-called "product graph". Then, a maximal common substructure in both topology graphs corresponds to a maximal clique in the product graph.

For solving the NP-hard maximum-clique problem we enumerate all maximal cliques in product graphs. Our method is based on the algorithm of Bron & Kerbosch [2]. The efficiency of our technique is based on two optimizations. First, we recognize and eliminate equal search spaces. Second, we cut the search space and reduce the number of maximal cliques drastically by building up only those maximal cliques, which correspond exclusively to connected subgraphs in the participating topology graphs.

For large topology graphs, for instance with 32 and 55 vertices, respectively, and 36 and 74 edges, respectively, the product graph has 316 vertices and 40109 edges. We can enumerate the resulting 323 cliques for this example in a few seconds, whereas the original algorithm needs more than four days.

Searching for maximal connected subgraphs in more than two topology graphs we have to compute all maximal common subgraphs in all graph pairs. We store the appropriate edge sets representing maximal common substructures for every topology graph pair in ordered binary trees. By intersection of these edge sets we obtain the maximal common substructure in a set containing more than two topology graphs.

[1] Levi,G. 1972.Calcolo 9, 341--352.
[2] Bron,C. and Kerbosch,J. 1973.Commun.ACM 16, 575--577.

Document last modified on March 19, 1996