Clustering of compound libraries using 2D binary fingerprints is a fundamental task in chemoinformatics, and various methods have been described to solve it. These methods can roughly be grouped into deterministic and non-deterministic approaches, with two key characteristics distinguishing them. First, deterministic approaches have a higher algorithmic complexity, whereas non-deterministic methods often try to overcome this drawback by using heuristics to save time and memory. Second, deterministic clustering algorithms, especially agglomerative hierarchical techniques, have been shown to yield good results and often perform better than non-deterministic approaches. As a consequence, small to medium-sized libraries with up to 1 million compounds are regularly clustered using deterministic techniques, whereas libraries comprising millions of compounds are mostly clustered using heuristics such as k-means.
Here, we present a deterministic approach for clustering huge compound libraries based on all pairwise compound similarities. For this purpose, we use an extremely fast and flexible algorithm for similarity calculations, which we developed to be purely CPU-based and which therefore needs no specialized hardware. Using this similarity method, we implemented a workflow with the following steps. First, we create a set of unique input fingerprints by filtering out duplicates; the duplicates are stored and, at the end of the workflow, mapped back onto the clusters of their representatives. Second, we calculate all pairwise similarities and construct a similarity network, applying a fixed Tanimoto threshold to select the edges to be inserted into the network. Third, the connected subgraphs are extracted from this similarity network and forwarded to the last step. Finally, connected subgraphs exceeding a predefined size are hierarchically clustered.
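The first three steps of this workflow can be illustrated with a minimal sketch. It is not the implementation described here: it assumes fingerprints are held as Python integers, uses a plain quadratic loop rather than the fast similarity algorithm, and treats a pair as an edge when its Tanimoto coefficient is at or above the threshold (whether the comparison is inclusive is an assumption). The names `tanimoto` and `connected_components` are hypothetical, and the final hierarchical clustering of large subgraphs is omitted.

```python
from collections import defaultdict
from itertools import combinations


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two binary fingerprints stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen for this sketch: two empty fingerprints
    return bin(a & b).count("1") / union


def connected_components(fingerprints, threshold):
    """Return the connected subgraphs as sorted lists of input indices."""
    # Step 1: collapse duplicates, remembering which inputs share a fingerprint.
    members = defaultdict(list)
    for idx, fp in enumerate(fingerprints):
        members[fp].append(idx)
    unique = list(members)

    # Union-find structure over the unique fingerprints.
    parent = list(range(len(unique)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 2: all pairwise similarities; pairs at or above the threshold
    # (an assumption of this sketch) become edges of the similarity network.
    for i, j in combinations(range(len(unique)), 2):
        if tanimoto(unique[i], unique[j]) >= threshold:
            parent[find(i)] = find(j)

    # Step 3: collect the connected subgraphs and map duplicates back
    # onto the subgraph of their representative.
    components = defaultdict(list)
    for i, fp in enumerate(unique):
        components[find(i)].extend(members[fp])
    return sorted(sorted(c) for c in components.values())
```

For example, `connected_components([0b1111, 0b1110, 0b1111, 0b110000, 0b100000], 0.5)` groups the first three inputs (two of which are duplicates) into one subgraph and the last two into another. Merging components with union-find while the edges are generated avoids storing the full edge list, which matters because the number of pairs grows quadratically with library size.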
As a result, we show that our algorithm for similarity calculation is competitive with recently published CPU-based methods and can perform up to 380 million Tanimoto calculations per second on a current desktop computer. This efficiency allows our workflow to process medium to large libraries on current desktop computers within minutes. Finally, to demonstrate the power of our clustering workflow, we processed the commercially available chemical space comprising about 17 million compounds. The entire clustering workflow took 63 hours to complete on a compute server using 64 cores and 100 GB of main memory.
J Chem Inf Model 1994, 34:1094-1102.
J Chem Inf Model 2012, 52:1757-1768.