Small Molecule Subgraph Detector (SMSD) toolkit

Rahman, Syed Asad; Bashton, Matthew; Holliday, Gemma L; Schrader, Rainer; Thornton, Janet M

doi:10.1186/1758-2946-1-12

Software
Open access
Published: 10 August 2009

Small Molecule Subgraph Detector (SMSD) toolkit

Syed Asad Rahman¹,
Matthew Bashton¹,
Gemma L Holliday¹,
Rainer Schrader² &
…
Janet M Thornton¹

Journal of Cheminformatics volume 1, Article number: 12 (2009) Cite this article

20k Accesses
101 Citations
26 Altmetric
Metrics details

Abstract

Background

Finding one small molecule (query) in a large target library is a challenging task in computational chemistry. Although several heuristic approaches are available using fragment-based chemical similarity searches, they fail to identify exact atom-bond equivalence between the query and target molecules and thus cannot be applied to complex chemical similarity searches, such as searching a complete or partial metabolic pathway.

In this paper we present a new Maximum Common Subgraph (MCS) tool: SMSD (Small Molecule Subgraph Detector) to overcome the issues with current heuristic approaches to small molecule similarity searches. The MCS search implemented in SMSD incorporates chemical knowledge (atom type match with bond sensitive and insensitive information) while searching molecular similarity. We also propose a novel method by which solutions obtained by each MCS run can be ranked using chemical filters such as stereochemistry, bond energy, etc.

Results

In order to benchmark and test the tool, we performed a 50,000 pair-wise comparison between KEGG ligands and PDB HET Group atoms. In both cases the SMSD was shown to be more efficient than the widely used MCS module implemented in the Chemistry Development Kit (CDK) in generating MCS solutions from our test cases.

Conclusion

Presently this tool can be applied to various areas of bioinformatics and chemo-informatics for finding exhaustive MCS matches. For example, it can be used to analyse metabolic networks by mapping the atoms between reactants and products involved in reactions. It can also be used to detect the MCS/substructure searches in small molecules reported by metabolome experiments, as well as in the screening of drug-like compounds with similar substructures.

Thus, we present a robust tool that can be used for multiple applications, including the discovery of new drug molecules. This tool is freely available on http://www.ebi.ac.uk/thornton-srv/software/SMSD/

Background

The chemical similarity between two molecules, either at the sub- or superstructure level [1], and clustering of similar molecules [2, 3] are widely used to measure the diversity of the chemical space [4]. These methods are of vital importance, as they can be applied towards discovering new drug-like molecules [5–8]. The similarity between the sub- & superstructures can be measured by mathematical coefficients such as the Tanimoto similarity or Euclidean distance [9]. One of the most frequently used similarity search approaches is based on "structural descriptors" [1] and to date these have proved to be the most effective methods for measuring chemical similarity. Newer methods such as the Maximum Common Subgraph (MCS) have more recently come to the fore as they overcome nearly all of the challenges posed by fragment or descriptor based similarity searches [10]. The MCS methods benefit from much improved sensitivity as they can find atom-atom or atom-bond equivalence between query and target molecules [11]. It then becomes possible to follow a query substructure or moieties along metabolic pathways or reaction schemes.

One of the main issues with using the MCS is that it is NP-complete, i.e. no method can guarantee a perfect solution in polynomial time [12]. Many MCS algorithms have been proposed to overcome the complexity involved in finding chemically relevant substructures sharing the same pattern. The most widely used algorithm for the MCS problem of molecular graphs is the one designed by Bron and Kerbosch [13], which enumerates all cliques of the compatibility graph [14]. It applies a branch and bound technique to prune the search tree, which makes it very efficient. Some of these algorithms use heuristic approaches for arriving at non-optimal solutions [15–18] while others give optimal solutions [19]. Most of these algorithms find the MCS by converting a molecular association graph into a clique detection problem. An alternative MCS method has been proposed, without constructing an association graph, based on the backtracking algorithm [20, 21]. Cao et al. [22] proposed an MCS algorithm based on the backtracking VF algorithm [23, 24], with some heuristics for searching and predicting drug-like compounds. While a few of these algorithms [25]; Akutsu 2004 [16] are based on dynamic programming, others ([26] use heuristic algorithms to find subgraph isomorphism. Some of the above mentioned MCS algorithms utilize the labels of the vertices (atoms), the induced edges (bonds), and other important structural feature constraints. Interestingly almost all of them lack chemical knowledge to rank the MCS solutions based on information such as bond energies.

In this paper we present a chemically sensitive and robust tool, which uses a combination of various graph-matching algorithms for finding the MCS between small molecules.

This new tool can generate bond sensitive and bond insensitive MCS and it ranks the solution(s) according to minimal fragments, bond breaking energy and stereochemical matches. It can work with explicit hydrogen(s) though it performs faster when the hydrogen(s) are implicit (i.e. hydrogen(s) are only considered to be present where explicitly defined by the input molecule's structure). The tool's performance is benchmarked on a small virtual screening test case between KEGG [27] and PDB [28] molecules (50,000 pair-wise comparisons). It is difficult to compare the MCS algorithms implemented in various chemical tools, since most of them are commercial. Hence we have chosen the widely used MCS module implemented in an open source chemo-informatics library: the MCS algorithm in the Chemistry Development Kit (CDK) [29]. The new tool uses the CDK where efficient, but also provides solutions in cases where the MCS routine in the CDK fails to provide a solution.

The design principles of graph matching algorithms are all closely related to the kind of problem they address. Since the specificity and sensitivity of the chemical similarity/distance searches are a function of time and space, heuristic techniques [17] have evolved to map two molecules.

Chemical Graph Theory and MCS

Graph theory studies pair-wise relationships between objects. A molecular graph G = (V, E, l) consists of a set of vertices V(G) (i.e. atoms in a molecule), a set of edges E(G) (i.e. the bonds in a molecule) and l is a function that maps the union of V and E to natural numbers (maps atoms and bonds to their types). Molecular graphs are assumed to be simple, undirected graphs. A molecular graph G_mconsists of a set of vertices V(G_m) and a set of edges E(G_m). The vertices in G are connected by an edge if there exists an edge (v_i, v_j) ∈ E(G) connecting the vertices v_iand v_jin G such that v_i, v_j∈ V(G).

Two graphs G_q(query) and G_t(target) are said to be isomorphic if there is a one-to-one correspondence (mapping) f : V(G_q) → V(G_t) between their vertex sets which preserves the adjacency of vertices (i.e. two vertices u and v from G_qare adjacent if and only if f (u) and f (v) are adjacent in (G_t). The mapping itself is called an isomorphism. Two labeled graphs G_qand G_tare isomorphic if there is an isomorphism f between them preserving the labels i.e. l(v) = l(f (v)) for all v ∈ V(G_qand l(u, v) = l(f (u), f (v)) for all edges u, v ∈ E(G_q).

A clique (ω) in a molecular G_mcan be defined as a subset of vertices such that each pair of vertices is connected by an edge in the graph G_m. We call a subgraph ⊆ G_mcomplete if u, v ∈ E for all u, v ∈ G(V'). The subgraph is maximal if it is not a subgraph of a larger subgraph. A maximum clique is the largest maximal clique induced in G_m.

A maximal clique ω(G_m) in a graph G_mis a clique of a graph that is not a subgraph of a larger clique in G_mpossessing these criteria (i.e. bonds, atom types and their connectivity are preserved), hence this subgraph is referred to as a Maximum Common Subgraph (MCS).

The modular product of two graphs G_q(query) and G_t(target) is defined on the Cartesian product V(G_q) ⊗ V(G_t) with any two vertices u_q, v_qand u_t, v_tbeing adjacent in the modular graph. The modular product will be denoted by G_q⊗ G_t. In the case of molecular graphs (labelled graphs), the modular product of two graphs can be further restricted by constraints such as the vertex and edge labels satisfying certain compatibility criterion. Hence, in the case of molecular graphs, the modular product graph is often called compatibility graph (Figure 1).

Implementation

We propose a new Java based tool called Small Molecule Subgraph Detector (SMSD) to find the MCS in small molecules. The SMSD is distinguished from previous algorithms in that it uses a combination of various algorithms (summarised in appendix 1) to find the MCS and filters the results in a manner that is chemically relevant (Figure 2). In the following section we describe the underlying algorithm and the subroutines (Figure 3), which take account of the chemistry in the SMSD for the MCS. The SMSD calculates the MCS between two molecules by combining the power of the VF+ Lib, the MCS+, and the CDK based MCS algorithm. These algorithms are used automatically on a case-by-case basis, which is dependent on the molecules under consideration for the MCS search.

Revisiting the MCS routine in the CDK (Chemistry Development Kit)

The MCS algorithm as coded in the CDK [30] (CDKMCS) is based upon the common pattern recognition concept and the principle of Maximum Common Induced Subgraph (MCIS) [19]. In that approach, a given molecular graph G_mis reduced to a MCIS graph G'. This algorithm is designed in such a way that it selects the common features between two graphs using bit strings. Thus, a compatibility graph between two molecules can be traversed in a much faster way. In order to traverse the edges on the graph it uses an approach similar to the Bron and Kerbosch [13] algorithm.

The disadvantages of the MCS algorithm in the CDK are:

a) It may treat two chemically non-identical molecules as identical because it works on the MCIS principle, e.g. cyclopropane (CAS: 75-19-4) and isobutane (CAS: 75-28-5).

b) The runtime is high if two graphs are large with few dissimilar edges e.g. between aquacob(III)alamin (CAS: 13422-52-1) and ferroheme (CAS: 14875-96-8).

These challenges were resolved in the SMSD, by first using the atom and bond count filter to discriminate between two dissimilar structures (which the CDKMCS might consider similar) before performing the MCS search, and secondly, by using the VF+ Lib and MCS+ method developed herein.

MCS+

The MCS+ method has been designed using a combination of a compatibility graph based clique detection algorithm and a backtracking algorithm. Koch [31] presented an efficient algorithm called c-cliques based on the MCES (or maximum overlapping set (MOS)) concept that identifies cliques by pruning a given graph on the basis of certain constraints (c-edges and d-edges) before using the Bron Kerbosch algorithm to find the MCS. Leber [32] used this algorithm to find MCS between reactants and products in a reaction. The drawback to this algorithm is that it is not guaranteed to give the optimal solution (maximum clique), i.e. the c-cliques reported by this method might not contain the complete maximum match. Hence Leber [32] used McGregor's backtracking algorithm to further extend the clique when possible. Thus, the resulting clique will contain the complete maximum match. The performance of the MCS+ algorithm is robust (in terms of time) especially in cases where two graphs are not identical as the unconnected/dissimilar parts are restricted in the pre-processing module (c-edges &d-edges).

In the Koch approach, an edge in the compatibility graph is designated a c-edge if two adjacent vertices in the graphs G_qand G_tare connected and they share similar properties. In our tool, these properties can be described as a similar bond type. However, if an adjacent edge in the graph is not connected then it is termed as a d-edge. As the d-edges represent edges that are not connected to the atoms (vertices) being compared, they are removed from the search space (for the bond-insensitive algorithm). Here we consider as d-edges only those edges that do not share similar bond types (for the bond-sensitive algorithm). Thus, by performing this initial pre-processing we limit the search space of the MCS. For example in the case of cyclopropane and isobutane the compatibility graph (Figure 1) will generate 12 vertices and 36 edges (18 of which are c-edges). With an increase in the number of bonds and atoms, the search space also increases due to the combination of edges and vertices. For example the compatibility graph between aqua(III)alamin (CAS:13422-52-1) and ferroheme B (CAS:14875-96-8) will generate 2,220 vertices with 2,706 c-edges and 1,526 d-edges, thus reducing the MCS search space. From the c-edges, we can obtain c-cliques using a modified Bron Kerbosch algorithm.

We have replaced the Koch [31] algorithm with an improved c-cliques algorithm for reporting unique cliques (i.e. removing cliques from the search space that had already been reported) proposed by Cazals and Karande [33, 34]. We extended the algorithm to make it sensitive to bond type searches. A third improvement was introduced, which allowed only 10% of the reported d-edges (e.g. if the number of the d-edges are more than 20000 then only 2000 d-edges are selected) to find maximum c-cliques. This heuristic increases the speed of the algorithm by compromising on the size of the reported c-cliques, which are further extended by McGregor algorithm.

Interplay Between CDKMCS and MCS+

The proposed algorithm in the SMSD checks if two molecules are identical or dissimilar based on the atom count and bond count before performing the MCS search. If the two molecules have identical atom count and bond count or their bond count and atom count are both dissimilar, then the native MCS module of the CDK (CDKMCS) is used. Else, the MCS+ algorithm is used. Hence the CDKMCS approach is used in cases where two molecules are identical or are complete subgraphs of each another. However, if the CDKMCS method is unable to find an optimal solution in the given time, the algorithm automatically uses the MCS+ method. This often occurs in cases of subgraphs with matching ring structures and dissimilar peripherals, e.g. ATP, ADP.

VF+ library

VF algorithm is able to efficiently solve the graph isomorphism and subgraph isomorphism problems on Attributed Relational Graphs (ARG) [23, 24]. It works effectively on very large graphs, since its memory requirements are smaller than those of other algorithms of the same kind. This makes it an obvious choice for handling larger graphs where the clique finding algorithms need more memory and time (such as CDKMCS and MCSPlus). Since this algorithm is very fast when graphs represent exact isomorphism or one is a subgraph of another, it is an ideal choice for substructure searches in cheminformatics. The Java code for VF algorithm for computing substructures was adapted from the MX library available at http://metamolecular.com/mx. The CDK compatible version of the code is available on http://wiki.github.com/asad/VFLib.

This algorithm also generates approximate MCS if two graphs are not isomorphic. Hence this becomes an ideal choice for big graphs where these approximate MCS solutions can further be extended via McGregor algorithm to find MCS. This makes it very efficient for finding MCS without compromising on the speed and memory usage. Therefore we further extended the original code to compute substructure search and MCS between two molecules.

Introducing Chemical Knowledge into the MCS

The MCS search is based on two types of chemical constraints – a bond-sensitive MCS search and a bond-insensitive search. The bond sensitive module can be employed to screen compounds that mimic the substructure of the query molecule(s) based on the MCS. The bond-insensitive algorithm can be used for atom-atom mapping in a reaction where the focus is on bond changes, thus leading to a better understanding of the structural changes that emerge during a reaction. The runtime of the bond insensitive algorithm is longer than the bond-sensitive search as the former leads to an exponential increase in the search space. The SMSD software also contains several similarity (i.e. Tanimoto etc.) and distance descriptors (i.e. Euclidean etc.), which can be used as filters.

An MCS algorithm usually reports more than one solution when matching two molecules. Thus, even when the molecules are of identical size and chemical shape, there can be more than one feasible set of the MCS solutions (for example two benzene rings may have multiple MCS solutions but the solution can be filtered/constraint by chemical knowledge such as bond type).

We propose a chemically aware post-filtering (Figure 2) subroutine that deals with the ambiguous solutions arising from the MCS search. There are three filters applied:

a. Specific matching of the chemical functional groups, bond types (aromatic, non-aromatic, double, single etc.) and stereochemistry, e.g. phosphate, -SH, nitrite etc. are identified and matched.

b. The resulting solutions are sorted in ascending order of the total bond breaking energy (energy required to break the bonds between matched parts and unmatched parts) required by this MCS match (i.e. lowest energy is highest ranked).

c. The best set of solutions are chosen based on the above two steps and the solutions are then sorted in decreasing order according to the number of fragments generated if the matched part of the molecule is removed from the reference structure (i.e. if a three member ring is matched to a single ring structure then solutions which match the rings on the periphery are preferred over the central ring).

This leads to a set(s) of chemically relevant MCS solutions, keeping intact the chemical significance of the reported substructures.

SMSD (Small Molecule Subgraph Detector)

SMSD is a combination of various algorithms (i.e. CDMCS, MCS+ and VF+ Lib). The decision to use an algorithm is purely based on the complexity of the input molecule. For example molecules, which potentially can be a subgraph based on atoms are handled by VF+ Lib first. Molecules whose bond count and atoms count are not equal are first matched by CDKMCS. If the solution is not computed within a limited time (i.e. timeout occurs) then it's passed to the MCS+, which starts the search from the scratch (i.e. solutions from CDKMCS is ignored). If the d-edge count is more than 99,999 then VF+ Lib is used to find MCS, as this is very efficient in handling medium and large sized graphs. McGregor further extends solution generated by VF+ Lib where possible to find MCS. A turbo-matching option was introduced in the SMSD, to return the first MCS match.

The MCS solution(s) are then passed to the chemical filters, which ranks the solutions in a chemically meaningful way. Thus we get chemically relevant MCS solutions computed in polynomial time.

Results and Discussion

The MCS in the SMSD was benchmarked (Figure 4) against the standard MCS module available in the CDK [30]. A random test set of 50,000 comparisons between PDB HET molecules [28] and KEGG ligands [27, 35] was created. This list was filtered so that no comparison was repeated either with the same target and query, or vice versa. Thirty-eight examples in the test set did not find any similarity in either of the systems (CDKMCS and SMSD) due to unrecognised atom types, corresponding to heavy metal atoms in HET groups. This reduced the comparisons list to 49,962 jobs. The SMSD was able to compute all the jobs (except one which ran for more than an hour) while the native CDK-MCS failed to finish 24 jobs (~0.4%), mostly due to time-out or errors (jobs which ran more than 12 hours were automatically terminated by the machine).

SMSD versus CDK similarity validation

A molecular fingerprint is derived by enumerating on various structural features of chemical space [1, 36, 37]. The rationale behind such an approach is to find "globally" analogous features between the query and target molecules under less stringent constraints than the MCS. There are several heuristic fragment based chemical similarity search approaches, like the 2-D fingerprint method, which represents chemical structure as a vector in a high-dimensional space. These approaches are usually very fast and good for the pre-screening of small molecules. One of the major limitations of such approaches is that they are unable to identify "local" similarity between structures thereby increasing the false-negative rates (i.e. identifying cases with weaker similarities).

In order to measure and benchmark validity (Figure 5) of the reported MCS solutions, we compared the SMSD results (49,962 jobs) with that of the Fingerprint similarity generated using the CDK fingerprint module. The results were compared to each other by binning according to the frequency of the similarity reported. The similarity scores between each pair of comparison molecules were sorted into bins. The cumulative percentage of the overall dataset for each bin was also calculated (Figure 5). A graph-matching score defined as c/(a + b - c) where c is the size of the MCS and a, b are number of atoms in query and target respectively. A score of 0.49–0.50 was found to represent a good cut-off to use for graph matching using the SMSD, as 99% of all random matches occur below this threshold. Likewise, a similarity score of ~0.77 represents a good cut-off to use the SMSD and fingerprint matching using the CDK Fingerprints, as 99.9% of all random matches occur below this threshold. These cut-off scores (similarity) can be used to filter out molecules [38] which have a very high probability of being dissimilar. This analysis also highlights the fact that the similarity range covered by fingerprints is very different to that of the MCS search as reported earlier [39].

SMSD versus CDKMCS similarity

A second set of benchmarks (Figure 4) was performed between the CDKMCS and the SMSD on the subset of intersection (45,139 jobs that reported at least one match between query and target molecules) between successful cases reported by both the methods. A detailed analysis shows that in 43,711 (~96.83%) cases, the SMSD and the CDKMCS searches reported an MCS of similar size. In 1,361 (~3.01%) cases the SMSD reported an MCS of a longer length than the MCS reported by the CDKMCS and in only 67 (~0.14%) instances was it vice versa. The difference in the size of the matches is not due to time out. If a time out occurs in CDKMCS then the MCS+ starts to compute MCS from the beginning. Previously, the CDKMCS used to provide better matches (though very expensive in terms of time and memory) than MCS+, in a few cases. This was due to the fact that these molecules had a very high d-edge count; hence MCS+ use to switch into heuristic mode. We have now been able to resolve this issue using VF+ Lib, which can handle larger graphs efficiently. The analysis presented in this paper is performed without VF+ Lib so as to highlight the strength of the combination of MCS+ and CDKMCS. Some larger matches reported by CDKMCS are incorrect due to the MCIS vs MCES problem. For example in the case of cyclopropane and isobutene, CDKMCS reports 3 matched atoms whereas MCS+ reports only 2, which is correct. This highlights the strength of the SMSD and its ability to calculate the MCS in a reasonably acceptable time without clogging the comparison pipeline.

The MCS Execution Runtime

All these test cases were run on EBI farm nodes (processor speed varies across the nodes) with an average of 300 jobs processed in parallel. The overall execution time for all the processed jobs can be obtained by summing up the CPU time. This time does not have a direct, one to one correlation between the run times of jobs (e.g. Job "A" fired from the SMSD ran on a slower machine whereas same job "A" fired by the CDKMCS ran on a faster machine or vice versa). Hence this is just an indicative measure of the runtime performance between the algorithms.

The total time required by a single processor to run 49,961 test cases using the SMSD was ~21 hours whereas 49,938 test cases processed by the CDKMCS, took ~113 hours (excluding 25 jobs which were still running even after 6 hours). Hence the average execution time (in the completed job list) for each comparison by the SMSD is less than ~1.5 seconds and ~8.2 seconds by the CDKMCS. Thus overall, execution of the SMSD is far faster than the CDKMCS.

Conclusion

MCS have been previously applied to identify biochemical activity in the metabolic pathways by mapping atom-atom equivalences between reactants and products in a chemical reaction [15, 40–43]. The algorithms used for screening similar molecules are usually bond sensitive (i.e. discriminate between single and double bonds) whereas MCS similarity checks between reactant and product molecules in a reaction are often bond insensitive (i.e. atom-atom mapping is preferred over bond type matches). We propose the SMSD as a robust software for calculating substructure similarity using the MCS algorithm. The SMSD returns more MCS solutions than CDKMCS, as in many cases the latter fails to detect similarity due to timeout. The SMSD software encapsulates the power of various algorithms and generates chemically relevant MCS solutions. Another merit of this software is that the MCS solutions are ranked on the basis of their chemical relevance such as bond energy, fragment size etc. This is one of the first attempts at ranking the MCS solutions in a chemically meaningful manner.

Presently this tool can be applied to various areas of bioinformatics and chemo-informatics for finding exhaustive MCS matches. For example, it can be used to analyse metabolic networks by mapping the atoms between reactants and products involved in a reaction. It can also be used to detect the MCS/substructure searches in small molecules reported by metabolome experiments, as well as in the screening of drug-like compounds with similar substructures.

This can be used to determine whether or not a query was a substructure of the target, as this question does not require an exhaustive MCS match with all possible maximum cliques.

Availability and requirements

Project name: Small Molecule Subgraph Detector (SMSD)

Project home page: http://www.ebi.ac.uk/thornton-srv/software/SMSD/

Operating system(s): Platform independent (Windows, MAC, Linux/Unix)

Programming language: Java

Other requirements: Java 1.6 or higher

License: Creative Commons http://creativecommons.org/licenses/by/3.0/

Any restrictions to use by non-academics: NONE

Authors' informations

SAR is a Research Fellow in the JMT group.

MB is a Research Fellow in the JMT group.

GLH is a Research Fellow in the JMT group.

RS is a full time Professor at ZAIK, University of Cologne, Germany.

JMT is the Director, Research Programme Coordinator and Group Leader at EMBL-EBI, Cambridge, UK. She has been a Fellow of The Royal Society since 1999.

Appendix 1: SMSD Algorithm for Calculating MCS

PROCEDURE SMSD

INPUT: Given a query Gq and a target Gt graph

OUTPUT: the MCS mappings between the two graphs

IF atoms of query Gq is a subset of Gt or vice versa

THEN CALL VF+ (Gq, Gt)

ELSE IF the Gq bond count NOT EQUAL TO Gt && Gq bond

count NOT EQUAL TO Gt

THEN CALL CDK+ (Gq, Gt)

IF CDKMCS (Gq, Gt) EQUALS TO NULL

THEN CALL MCS+ (Gq, Gt)

END IF

IF MCS+ (Gq, Gt) EQUALS TO NULL

THEN CALL VF+ (Gq, Gt)

END IF

ELSE

CALL VF+ (Gq, Gt)

END IF

END PROCEDURE

PROCEDURE VF+ Lib

INPUT: Given a query Gq and a target Gt graph

OUTPUT: the mappings between the two graphs

IF node count of Gq > node count Gt

THEN CALL VF+ (Gt, Gq)

ELSE CALL VF+ (Gq, Gt)

END IF

IF the MCS size EQUALS Gq node count OR

MCS Size EQUALS Gt node Count

THEN return MCS

ELSE CALL McGregor (Gq, Gt, MCS)

END IF

END PROCEDURE

PROCEDURE MCS+

INPUT: Given a query Gq and a target Gt graph

OUTPUT: the mappings between the two graphs

Construct compatibility graph and mark c-edges and d-edges

IF heuristic flag False and d-edge count > 99,999

THEN return NULL

ELSE

Choose only 10% of the reported d-edges

ENDIF

CALL CazalsKarandeKoch (CompatibilityGraph, c-edges, d-edges)

IF the MCS size EQUALS Gq node count OR

MCS Size EQUALS Gt node Count THEN return MCS

ELSE CALL McGregor (Gq, Gt, MCS)

END IF

END PROCEDURE

PROCEDURE CDKMCS

INPUT: Given a query Gq and a target Gt graph

OUTPUT: the mappings between the two graphs

IF node count of Gq > node count Gt

THEN CALL UniversalIsomorphismTester (Gt, Gq)

ELSE CALL UniversalIsomorphismTester (Gq, Gt)

END IF

IF timeout is TRUE THEN return NULL

ELSE return MCS

END PROCEDURE

Abbreviations

MCS:: Maximum Common Subgraph
CDK:: Chemistry Development Kit
PDB:: Protein Data Bank
KEGG:: Kyoto Encyclopedia of Genes and Genomes
CDKMCS:: MCS in the Chemistry Development Kit
SMSD:: Small Molecule Subgraph Detector
MCS+:: Maximum Common Subgraph Plus
MCIS:: Maximum Common Induced Subgraph
MCES:: Maximum Common Edge Subgraph
G _m :: Molecular graph
G _q :: Query graph
G _t :: Target graph.

References

Gasteiger J, Engel T: Chemoinformatics: A Textbook. 2003, WILEY-VCH GmbH & Co
Chapter Google Scholar
Gardiner EJ, Gillet VJ, Willett P, Cosgrove DA: Representing clusters using a maximum common edge substructure algorithm applied to reduced graphs and molecular graphs. Journal of chemical information and modeling. 2007, 47 (2): 354-366. 10.1021/ci600444g.
Article CAS Google Scholar
Raymond JW, Blankley CJ, Willett P: Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures. J Mol Graph Model. 2003, 21 (5): 421-433. 10.1016/S1093-3263(02)00188-2.
Article CAS Google Scholar
Voigt JH, Bienfait B, Wang S, Nicklaus MC: Comparison of the NCI open database with seven large chemical structural databases. J Chem Inf Comput Sci. 2001, 41 (3): 702-712.
Article CAS Google Scholar
Li Y, Hao P, Zheng S, Tu K, Fan H, Zhu R, Ding G, Dong C, Wang C, Li X, et al: Gene expression module-based chemical function similarity search. Nucleic acids research. 2008, 36 (20): e137-10.1093/nar/gkn610.
Article Google Scholar
Barreiro G, Guimaraes CR, Tubert-Brohman I, Lyons TM, Tirado-Rives J, Jorgensen WL: Search for non-nucleoside inhibitors of HIV-1 reverse transcriptase using chemical similarity, molecular docking, and MM-GB/SA scoring. Journal of chemical information and modeling. 2007, 47 (6): 2416-2428. 10.1021/ci700271z.
Article CAS Google Scholar
Sheridan RP, Kearsley SK: Why do we need so many chemical similarity search methods?. Drug Discov Today. 2002, 7 (17): 903-911. 10.1016/S1359-6446(02)02411-X.
Article Google Scholar
Gillet VJ, Willett P, Bradshaw J: Similarity searching using reduced graphs. J Chem Inf Comput Sci. 2003, 43 (2): 338-345.
Article CAS Google Scholar
Willett P: Chemoinformatics – similarity and diversity in chemical libraries. Curr Opin Biotechnol. 2000, 11 (1): 85-88. 10.1016/S0958-1669(99)00059-2.
Article CAS Google Scholar
Cuissart B, Touffet F, Cremilleux B, Bureau R, Rault S: The maximum common substructure as a molecular depiction in a supervised classification context: experiments in quantitative structure/biodegradability relationships. J Chem Inf Comput Sci. 2002, 42 (5): 1043-1052.
Article CAS Google Scholar
Cerruela García G, Luque Ruiz I, Gómez-Nieto MA: Step-by-step calculation of all maximum common substructures through a constraint satisfaction based algorithm. Journal of Chemical Information and Computer Sciences. 2004, 44: 30-41.
Google Scholar
Garey MR, Johnson DS: Computers and Intractability; A Guide to the Theory of NP-Completeness. 1990, W. H. Freeman & Co
Google Scholar
Bron C, Kerbosch J: Finding all the cliques in an undirected graph. Communication of the ACM. 1973, 16: 189-201. 10.1145/362342.362367.
Article Google Scholar
Samudrala R, Moult J: A graph-theoretic algorithm for comparative modeling of protein structure. Journal of molecular biology. 1998, 279 (1): 287-302. 10.1006/jmbi.1998.1689.
Article CAS Google Scholar
Hattori M, Okuno Y, Goto S, Kanehisa M: Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. Journal of the American Chemical Society. 2003, 125 (39): 11853-11865. 10.1021/ja036030u.
Article CAS Google Scholar
Akutsu T: Efficient extraction of mapping rules of atoms from enzymatic reaction data. J Comput Biol. 2004, 11 (2–3): 449-462. 10.1089/1066527041410337.
Article CAS Google Scholar
Raymond JW, Gardiner EJ, Willett P: Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J Chem Inf Comput Sci. 2002, 42 (2): 305-316.
Article CAS Google Scholar
Raymond JW, Willett P: Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of computer-aided molecular design. 2002, 16 (7): 521-533. 10.1023/A:1021271615909.
Article CAS Google Scholar
Tonnelier C, Jauffret P, Hanser T, Kaufmann G: Machine learning of generic reactions: 3. an efficient algorithm for maximal common substructure determination. Tetrahedron Computer Methodology. 1990, 3 (6): 351-358. 10.1016/0898-5529(90)90061-C.
Article CAS Google Scholar
McGregor JJ: Backtrack search algorithms and the maximal common sub-graph problem. Software Practice and Exprerience. 1982, 12: 23-24. 10.1002/spe.4380120103.
Article Google Scholar
McGregor JJ, Willet P: Use of a maximal common subgraph algorithm in the automatic identification of the ostensible bond changes occurring in chemical reactions. J Chem Inf Comput Sci. 1981, 21: 137-140. [http://pubs.acs.org/doi/abs/10.1021/ci00031a005]
Article CAS Google Scholar
Cao Y, Jiang T, Girke T: A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics (Oxford, England). 2008, 24 (13): i366-374. 10.1093/bioinformatics/btn186.
Article CAS Google Scholar
Cordella LP, Foggia P, Sansone C, Vento M: An Improved Algorithm for Matching Large Graphs. 2008, 8-[http://amalfi.dis.unina.it/graph/db/papers/vf-algorithm.pdf]
Google Scholar
Cordella LP, Foggia P, Sansone C, Vento M: A (sub)graph isomorphism algorithm for matching large graphs. IEEE T Pattern Anal. 2004, 26: 1367-1372. 10.1109/TPAMI.2004.75.
Article Google Scholar
Wong AKC, Akinniyi FA: An algorithm for the largest common subgraph isomorphism using the implicit net. Proc. 1983 Int. Conf. Syst., Man, and Cybern. 1983, 197-201.
Google Scholar
Ullmann JR: An Algorithm for Subgraph Isomorphism. J ACM. 1976, 23 (1): 31-42. 10.1145/321921.321925.
Article Google Scholar
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic acids research. 2006, D354-357. 10.1093/nar/gkj102. 34 Database
Berman H, Henrick K, Nakamura H, Markley JL: The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic acids research. 2007, 35: D301-303. 10.1093/nar/gkl971.
Article CAS Google Scholar
Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL: Recent developments of the chemistry development kit (CDK) – an open-source java library for chemo- and bioinformatics. Curr Pharm Des. 2006, 12 (17): 2111-2120. 10.2174/138161206777585274.
Article CAS Google Scholar
Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci. 2003, 43 (2): 493-500.
Article CAS Google Scholar
Koch I: Enumerating all connected maximal common subgraphs in two graphs. Theoretical Computer Science. 2001, 250 (1–2): 1-30. 10.1016/S0304-3975(00)00286-3.
Article Google Scholar
Leber M: Kodierung enzymatischer Reaktionen. 2008, University of Cologne
Google Scholar
Cazals F, Karande C: A note on the problem of reporting maximal cliques. Theoretical Computer Science. 2008, 407 (1–3): 564-568. 10.1016/j.tcs.2008.05.010.
Article Google Scholar
Cazals F, Karande C: An algorithm for reporting maximal c-cliques. Theoretical Computer Science. 2005, 349 (3): 484-490. 10.1016/j.tcs.2005.09.038.
Article Google Scholar
Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al: KEGG for linking genomes to life and the environment. Nucleic acids research. 2008, D480-484. 36 Database
Holliday JD, Salim N, Whittle M, Willett P: Analysis and display of the size dependence of chemical similarity coefficients. J Chem Inf Comput Sci. 2003, 43 (3): 819-828.
Article CAS Google Scholar
Chen J, Holliday J, Bradshaw J: A machine learning approach to weighting schemes in the data fusion of similarity coefficients. Journal of chemical information and modeling. 2009, 49 (2): 185-194. 10.1021/ci800292d.
Article CAS Google Scholar
Godden JW, Xue L, Bajorath J: Combinatorial preferences affect molecular similarity/diversity calculations using binary fingerprints and Tanimoto coefficients. J Chem Inf Comput Sci. 2000, 40 (1): 163-166.
Article CAS Google Scholar
Raymond JW, Willett P: Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases. Journal of computer-aided molecular design. 2002, 16 (1): 59-71. 10.1023/A:1016387816342.
Article CAS Google Scholar
Arita M: In silico atomic tracing by substrate-product relationships in Escherichia coli intermediary metabolism. Genome Res. 2003, 13 (11): 2455-2466. 10.1101/gr.1212003.
Article CAS Google Scholar
Arita M: The metabolic world of Escherichia coli is not small. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101 (6): 1543-1547. 10.1073/pnas.0306458101.
Article CAS Google Scholar
Birkemeyer C, Luedemann A, Wagner C, Erban A, Kopka J: Metabolome analysis: the potential of in vivo labeling with stable isotopes for metabolite profiling. Trends in biotechnology. 2005, 23 (1): 28-33. 10.1016/j.tibtech.2004.12.001.
Article CAS Google Scholar
Mu F, Williams RF, Unkefer CJ, Unkefer PJ, Faeder JR, Hlavacek WS: Carbon-fate maps for metabolic reactions. Bioinformatics (Oxford, England). 2007, 23 (23): 3193-3199. 10.1093/bioinformatics/btm498.
Article CAS Google Scholar

Download references

Acknowledgements

Authors would like to thank Dr. M. Leber and Prof. Dr. D. Schomburg for the initial discussions on the MCS algorithm. Dr. M. Leber provided C/C++ version of the c-cliques MCS library (from his PhD work), which was referred to during the software development. Authors would also like to thank Dr. R. Laskowski, for his comments on the manuscript and Dr. L. Baldacci for his feedback on the tool. We would like to thank the Chemistry Development Kit (CDK) developers who were prompt in addressing our queries. MB was supported as part of the mid-west Structural Genomic Consortium (SGC) by NIH grant (GM62414), US DOE under contract (W-31-109-ENG38). yED http://www.yworks.com was used to sketch the flowcharts. JMT, GLH, SAR would like to acknowledge EMBL, Wellcome Trust, IBM etc for support and funding.

Author information

Authors and Affiliations

EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Syed Asad Rahman, Matthew Bashton, Gemma L Holliday & Janet M Thornton
Zentrum für Angewandte Informatik (ZAIK), Universität zu Köln, Weyertal 80, D-50931, Köln, Germany
Rainer Schrader

Authors

Syed Asad Rahman
View author publications
You can also search for this author in PubMed Google Scholar
Matthew Bashton
View author publications
You can also search for this author in PubMed Google Scholar
Gemma L Holliday
View author publications
You can also search for this author in PubMed Google Scholar
Rainer Schrader
View author publications
You can also search for this author in PubMed Google Scholar
Janet M Thornton
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Syed Asad Rahman.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SAR was involved in designing the algorithm, coding the software and manuscript preparation. He was also involved in testing and benchmarking the tool. MB was involved in testing and benchmarking the tool. GLH was involved in testing the chemical relevance of the tool and in the process of the manuscript preparation. RS was involved in the supervision of the algorithm development. JMT was involved in the overall supervision of the project, manuscript preparation, intellectual inputs and guidance.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Rahman, S.A., Bashton, M., Holliday, G.L. et al. Small Molecule Subgraph Detector (SMSD) toolkit. J Cheminform 1, 12 (2009). https://doi.org/10.1186/1758-2946-1-12

Download citation

Received: 11 May 2009
Accepted: 10 August 2009
Published: 10 August 2009
DOI: https://doi.org/10.1186/1758-2946-1-12

Small Molecule Subgraph Detector (SMSD) toolkit

Abstract

Background

Results

Conclusion

Background

Chemical Graph Theory and MCS

Implementation

Revisiting the MCS routine in the CDK (Chemistry Development Kit)

MCS+

Interplay Between CDKMCS and MCS+

VF+ library

Introducing Chemical Knowledge into the MCS

SMSD (Small Molecule Subgraph Detector)

Results and Discussion

SMSD versus CDK similarity validation

SMSD versus CDKMCS similarity

The MCS Execution Runtime

Conclusion

Availability and requirements

Authors' informations

Appendix 1: SMSD Algorithm for Calculating MCS

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Journal of Cheminformatics

Contact us