Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2
Center for Bioinformatics, University of Hamburg, Bundestraße 43, 20146 Hamburg, Germany
Journal of Cheminformatics 2012, 4:13 doi:10.1186/1758-2946-4-13Published: 31 July 2012
Searching for substructures in molecules belongs to the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software. The underlying algorithms, used over several decades, are designed for the application to general graphs. Applied on molecular graphs, little effort has been spend on characterizing their performance. Therefore, it is not clear how current substructure search algorithms behave on such special graphs. One of the main reasons why such an evaluation was not performed in the past was the absence of appropriate data sets.
In this paper, we present a systematic evaluation of Ullmann’s and the VF2 subgraph isomorphism algorithms on molecular data. The benchmark set consists of a collection of 1235 SMARTS substructure expressions and selected molecules from the ZINC database. The benchmark evaluates substructures search times for complete database scans as well as individual substructure-molecule pairs. In detail, we focus on the influence of substructure formulation and size, the impact of molecule size, and the ability of both algorithms to be used on multiple cores.
The results show a clear superiority of the VF2 algorithm in all test scenarios. In general, both algorithms solve most instances in less than one millisecond, which we consider to be acceptable. Still, in direct comparison, the VF2 is most often several folds faster than Ullmann’s algorithm. Additionally, Ullmann’s algorithm shows a surprising number of run time outliers.