Abstract
Background
The decomposition of a chemical graph is a convenient approach to encode information of the corresponding organic compound. While several commercial toolkits exist to encode molecules as socalled fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options such as search depth, distance cutoffs, atom and pharmacophore typing. Furthermore, it provides the functionality to combine, to compare, or to export the fingerprints into several formats.
Results
We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depthfirst search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the allshortest path fingerprint that only includes the subset of shortest paths from the full set of paths of the depthfirst search fingerprint. As an application of jCompoundMapper, we provide a commandline executable binary. We measured the conversion speed and number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the largescale Ames toxicity benchmark using a largescale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC performance of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a nonlinear support vector machine using stateoftheart descriptors. On the Sutherland QSAR data set, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al.
Conclusions
jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and exporting options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LPGL software license, the commandline interface, and the exporters should be useful for many applications in cheminformatics like benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similaritybased data mining.
Graphical abstract
Background
The decomposition of a chemical graph into a list of features is a convenient way to assess the similarity between chemical compounds by comparing the resulting lists of features. Such representations are also called chemical fingerprints [1]. These encodings are important for data mining applications like similaritybased machine learning approaches or similarity searches [2].
The goal of this work is to introduce an open source molecular fingerprinting library for data mining purposes which provides exact definitions of its fingerprinting algorithms. The algorithms can be parametrized with various options to adapt the encodings, for example, by applying a custom labeling function or by altering the search depth parameter. Additionally, the library can be used as a basis for new implementations. It is based on the Chemistry Development Kit [3], which also provides several fingerprints in its API. However, there are several differences. The first aim of jCompoundMapper is to focus on the exact definition of its encodings, which is crucial to describe the features in data mining experiments. The second aim is to provide the functionality to export the fingerprints or pairwise similarity matrices to formats of popular machine learning toolboxes. A label or property of an input compound to be trained by a machine learning algorithm can be included.
Most fingerprint algorithms rely on either the geometrical or the topological distance between the atoms of a structure. The topological information is stored in the allshortest path matrix, which encodes the minimum topological distance between two atoms (vertices) by the shortest path using the bonds (edges). Organic compounds are usually weakly connected because the number of covalent bonds (vertex degree) of an organic molecule is limited. In contrast, the geometry of a structure can be interpreted as a fully connected graph. The complexity of both approaches can reduced by limiting the search depth for topological fingerprints or by introducing a distance cutoff for geometrical fingerprints.
jCompoundMapper offers a variety of topological (e.g. radial atom environments [4], extended connectivity fingerprints [5], depthfirst search fingerprints [6], or autocorrelation vectors [7]) and geometrical (e.g. twopoint and threepoint encodings [8,9] or geometrical atom environments [10]) fingerprints. If applicable, it allows for a parameterization of an encoding, such as the search depth, the distance cutoff, the geometrical scaling factor, the atom typing scheme, or the hash space.
After the feature generation step, the list of features can be mapped to a vectorial format. One possibility is to encode a set of features as a hashed fingerprint. Here, a unique identifier of a feature is used to initialize a pseudo random number generator which produces numbers in , where h is the maximum size of the hash space. Thus, the dimensionality of the original feature space can be considerably reduced. For example, the Fingal fingerprint [11], uses the cyclic redundancy check algorithm to generate seeds for the hashing of chemical graph patterns. For an introduction into hashed fingerprints, please refer to the review by Brown [1]. Another strategy reserves fixed bit positions in a vector for specific feature types, like patterns obtained at a certain parameter (such as depth or distance) with a limited number of possible combinations. The definition of the CATS2D [7] vector is an example for this approach.
jCompoundMapper supports native formats of common open source machine learning libraries. The exporters can be used to write feature maps to commaseparated format, LIBSVM [12] format (sparse and matrix), and WEKA ARFF [13]. Therefore, various data mining libraries can be directly applied on the output files. Furthermore, the library provides efficient data structures to compare sets of features in the case that the computation of a similarity matrix is required.
The quality of the encodings was compared on QSAR and toxicity benchmark problems in the results section. First, we conducted experiments using the support vector regression of LIBSVM on the wellknown Sutherland QSAR benchmark set [14]. Second, we used a large Ames toxicity classification benchmark [15] and LIBLINEAR [16] to evaluate the performance using binary hashed sparse fingerprints. On the Sutherland data sets, the averaged squared correlation of the allshortestpath and the atom triplet fingerprint was at least 5% better on ACHE than the best encoding given by Sutherland et al. On BZR and DHFR, the allshortest path fingerprint achieved a squared correlation of 0.57 and 0.76 respectively. The performance was comparable on two data sets. On the remaining three data sets, the best encoding was more than 5% worse than the results of the best encoding published by Sutherland et al. On the Ames toxicity data set, the implementation of the extended connectivity fingerprint achieved an AUC ROC performance of 0.87, which is comparable to the performance by a nonlinear support vector machine trained on stateoftheart descriptors. Nevertheless, the goal was not an exhaustive comparison but to show that the implementations are able to obtain similar results when compared against literature results. jCompoundMapper features a commandline interface but can also be used as a Java API. It depends solely on open source libraries and is licensed under the LPGL. The source code and an executable is available at Sourceforge.
The library originated from various implementations of literature fingerprints and descriptors used in comparison studies. The encodings were employed either as part of a new approach or as a reference method [1721].
To sum up, jCompoundMapper is an open source library for the encoding of chemical graphs as fingerprints. It can be used from the commandline interface or as a Java API. Hence, a further use in applications, like in KNIME http://www.knime.org webcite nodes, is possible. The overall performance of the fingerprints in machine learning experiments indicates that structuredbased models of reasonable quality can be obtained.
Methods
Prerequisites
Notation
The binned geometrical distance matrix G_{ij }encodes the spatial distances between two heavy atoms. The topological distance matrix T_{ij }encodes the shortest topological distance between atoms i and j. The labeling function types an atom according to a specific labeling scheme. The maximum distance allowed between two atoms (geometrical or topological) d defines a distance cutoff for features, all features with g_{ij }> d or t_{ij }> d are omitted. A labeled path p is a sequence of atoms connected by bonds , where bond b_{i }connects with . p_{ij }denotes a path connecting the ith atom with the jth atom. The depth d for topological patterns is the maximum number of bonds allowed for connecting the first atom with the last atom. Analogously to the definition of topological paths, a geometrical pattern must consist of different atoms, i.e. for two atoms a_{i}, a_{j }it holds that i ≠ j. Finally, a ⊕ b is defined as the concatenation of alphanumerical string symbols separated by an unique delimiter. In the following, we assume a hydrogendepleted molecular graph C with n atoms.
An encoding algorithm F has the form
where an encoding algorithm F maps some compound C to a set of features X. m depends, with the exception of fixedvector fingerprints, on C. A feature f has an unique and a string representation f.nom. f.id does not necessarily depend on f.nom. However, in most cases it is convenient to use a hash code of the string representation of a feature.
Fundamental Matrices
The geometrical distance matrix G_{ij }is computed as the matrix of binned Euclidean distances in Ångstrom between the threedimensional coordinates of all atom coordinates a_{i}.c ∈ ℝ^{3}, multiplied by a scaling factor s ∈ ℝ^{+}. The scaling factor s influences the resolution of the geometry and should be chosen according to the size of the compounds. The entry i, j (g_{ij}) in the matrix is calculated as follows
The computation time for the geometrical distance matrix is quadratic. The binning of the realvalued geometrical distance is important to produce discrete features f_{x}, f_{y}, which can be compared by the Dirac function
The topological distance matrix is defined as T_{ij}. The element i, j contains the shortest path between the ith and the jth atom (t_{ij}). t_{ij }is computed by the FloydWarshall algorithm. Therefore, the computation time for the matrix is O(n^{3}).
Feature Extraction
The molecular similarity is based on the numerical identifiers f_{x}.id of a feature x. Two features are regarded as equal if f_{x}.id = f_{y}.id. In all implementations the features of a compound C are distinguishable by recurrence, which means that we include a feature if the id of a feature is different from the previously extracted features. If a feature with the same id is generated again, the count for the feature is incremented. All atom pair encodings are extracted by regarding the upper half of the distance matrix only. For each atom pair, the string representation is generated in both reading directions. Only the version with the greater hash code is included in the final set of descriptors.
The modified depthfirst search applied in this library generates all possible paths originating from a root atom. Therefore, the feature space can be approximated by an mary tree and is therefore O(nm^{d}), where n is the number of heavy atoms and m the number of children in an mary tree, d is the depth of the tree. In organic compounds, every atom has at most 4 neighbors (m = 4  1 because one of the neighboring bonds has already been visited). Thus, the hypothetical worst case has a complexity of O(n3^{d}) at a search depth of d. If we assume an average branching factor α, which is slightly above 1 for organic compounds [6], the depthfirst search has a complexity of O(nα^{d}). The average branching factor depends on the average degree of a vertex, which is about 2 in organic molecules. We define DFS(a_{i}, d) as the set of all possible paths originating from a root atom a_{i }with a depth up to d.
For some of the definitions, we defined a can function that maps a set of features to a single canonical pattern. In an implementation this function can be realized by first sorting the patterns, which is possible if a natural order can be defined on the features. Then, the list of sorted patterns can be merged to a single canonical representation.
Atom Types and Pharmacophore Types
jCompoundMapper applies the standard atom types and ring detection algorithms implemented in the CDK. There are various typing schemes for small druglike compounds described in the literature. In the current version, jCompoundMapper features the following typing schemes as labeling function :
1. Element symbol (e.g. C, O, N, ...)
2. CDK atom types (e.g. C.sp2, O.minus, N.amine, ...)
3. Element plus the number of neighboring heavy atoms (e.g. C.2, O.1, N.2, ...)
4. Element plus ring type plus the number of neighboring heavy atoms (e.g. C.r.2, C.a.2, O.1, N.2, ...) where r is an arbitrary ring, and a is an aromatic system. If a_{i }is not contained in a ring, no ring type is set. The precedence is a > r.
5. DaylightInvariants (plus optional ring flag) have the following properties, separated by a dot: Atomic number, number of heavy atom neighbors, valency minus the number of connected hydrogens, atomic mass, atomic charge, number of connected hydrogens, and a flag if the atom is member of at least one ring. (e.g. 6.2.3.12.0.1.1 for a carbon in a benzole ring)
The following listing of potential pharmacophore points (PPPs) was published by Renner et al. [7] for the CATS autocorrelation descriptors. If PPP atom types are needed, this list is parsed and matched with the structure using the CDK SMARTS matcher or specially implemented graph searches.
1. Hydrogenbond donor (D): [#6H] oxygen atom of an OHgroup; [#7 H,#7H2] nitrogen atom of an NH or NH_{2 }group
2. Hydrogenbond acceptor (A): oxygen atom [#6]; [#7H0] nitrogen atom not adjacent to a hydrogen atom
3. Positive (P): [*+] atom with a positive charge; [#7H2] nitrogen atom of an NH_{2 }group
4. Negative (N): [*] atom with a negative charge; [C&$(C(= O)#8H1), P&$(P(= O)O), S&$(S(= O)O)] carbon, sulfur or phosphorus atom of a COOH, SOOH, or POOH group (SMARTS replaced by a direct graph search)
5. Lipophilic (L): [Cl, Br, I] chlorine, bromine, or iodine atom; [S;D2;$(S(C)(C))] sulfur; atom adjacent to exactly two carbon atoms; sulfur atom adjacent to only carbon atoms (SMARTS replaced by a direct graph search)
Encodings
Topological Fingerprints
All encodings described in the following section rely on the d parameter which constrains the maximum topological distance allowed between two atoms a_{i}, a_{j }in a feature.
AllPath Encoding (DFS)
Allpath encodings are paths generated by a graph traversal with a modified depthfirst search as proposed by Ralaivola et al. [6]. The linear fragments are obtained by iterating over all atoms in a molecular graph and performing an exhaustive search up to a predefined depth d. To generate an unique representation for each path, a temporary path object is generated and mapped to two alphanumerical string representations by generating the original and reverse string representation of the corresponding path object. The version with the higher lexicographical order is stored.
AllShortest Path Encoding (ASP)
The ASP encoding equals the DFS encoding with the exception that only the paths from an atom are stored that have shortest distances from the root atom to the last atom contained in the path, which leads to a sparser representation. During the depthfirst search, all paths are removed from the temporary set of depthfirst search paths that do not fulfill this constraint. To incorporate this information the T_{ij }matrix is required. Let p_{ij} be the length (number of bonds between the ith and the jth atom in a path) of a path between atoms i, j, then the set of features F encoding a compound C is
Thus, the allshortest path encoding is a subset of the paths contained in the DFS fingerprint. It is similar to topological atom pair approaches [8] with the exception that allshortest paths between two atoms are explicitly stored. Borgwardt et al. proposed a graph kernel based on the set of allshortest paths [22], however, only the vertex pairs and their shortestpath distances were included in this work. The explicit generation of paths is necessary because the FloydWarshall algorithm computes only the shortest distances between two vertices.
Topological Atom Pairs (AP2D)
This encoding contains atom types and the shortest path distance information between all pairs of atoms. It can be directly extracted from T_{ij }by converting the pattern where i ≠ j to a string feature. The features are canonicalized by generating the patterns from both reading directions and storing the version with the higher lexicographical order.
For 2point patterns this can be easily conducted by regarding the upper half of the distance matrix T_{ij }only (i.e. i > j). O(n^{2}) is needed for the generation of features and O(n^{3}) for the computation of T_{ij}.
Thus, the total computation time is cubic.
Topological Atom Triplets (AT2D)
This encoding extends the AP2D encoding by a further atom. The set of patterns consists of atom triplets and the topological distance to the next atom in the feature where i ≠ j ≠ k and t_{ij}, t_{jk}, t_{ki }≤ d. The total computation time for the features is cubic because all possible combinations of heavy atoms a_{i}, a_{j}, a_{k }have to be considered in the worst case.
Topological Autocorrelation Keys (CATS2D)
The CATS2D descriptors encode the pairwise topological relationships of PPP patterns in a molecular graph by a vector of fixed size. The approach was described by Schneider et al. [23], a list of PPP patterns was presented by Renner et al. [7]. The PPPs are defined in the subsection ("Atom Types and Pharmacophore Types"). The combination of all points leads to 15 possible pairs. The pairs are mapped to a key with a fixed dimensionality. The position of a feature in the key is determined by the index for the corresponding pattern pair of PPPs plus the topological distance. For example, this means the bit 76 in the CATS vector with d = 9 belongs to the PPP pair DN, and contains the number of pairs with distance 6. In the original publication of the topological CATS descriptors, d = 9 was used as the distance cutoff for the topological search, resulting in a vector with 150 dimensions. In our implementation, the search depth can be adjusted by altering the d parameter. The resulting vector has (d + 1) · 15 dimensions. The complete list of possible PPP pairs in the vector (block index in parentheses) is: AA (0), AD (1), AL (2), AN (3), AP (4); DD (5); DL (6); DN (7); DP (8); LL (9), LN (10), LP (11), NN (12), NP (13), PP (14). Let F (C) = X be a decomposition into all valid PPP pairs. Then, the CATS2D vector is
where offset(p) returns the predefined start index for the pattern p, d_{2D}(p) returns the topological distance between two atoms in the PPP pair, and φ_{p}(X) counts the number of occurrences of a pattern p in X.
Pharmacophore Pair and Triplet Encodings (PHAP2PT2D, PHAP3PT2D)
The PHAP2PT2D encoding is computed similarly to the AP2D. However, instead of atom types, the information of all PPPs of an atom is used to generate the fingerprint. Thus, we have to iterate over all PPPs of an atom. Analogously, the PHAP3PT2D encoding is computed, which uses three points. To keep the notation simple, let P_{i }denote the set of valid PPPs for the ith atom. Then, the set of valid 2point pharmacophores is
and the set of valid 3point pharmacophores is defined as
where t_{ij}, t_{jk}, t_{ki }≤ d. Actually, there are three additional inner loops over all valid pharmacophore points at atoms i, j, k. The complexity of these inner loops is theoretically 5^{3 }because the cardinality of the set of PPPs is 5. However, this complexity is further reduced because the meaning of some PPP definitions is contradicting for some combinations, such as "atom is positively charged" and "atom is negatively charged". The overall complexity is O(n^{3}) because of the constant computation time of the inner loops.
SHED Key (SHED)
The SHED Keys are closely related to the pharmacophore atom pair based encodings, with the following major differences: First, the number of dimensions is fixed, second the entries do not describe a count but the entropy of the respective atom pair descriptor [24]. The implementation differs slightly from the original implementation because it utilizes the PPP definitions as described by Renner et al. [7]. The distribution is analyzed for all possible combinations of PPPs. From that distribution of pairwise features the Shannon entropy is calculated as the descriptor in the corresponding PPP pair dimension of the SHED Key. The Shannon entropy of a PPP pair PPP_{i }is defined as
where PPP_{i}(l) denotes the ith PPP pair that is separated by a topological distance l. If pattern i was not found, the value of the ith dimension is set to 0. The distribution of a PPP pair is calculated by regarding the different distances 1, 2, ...,l, ..., d. The resulting vector has 15 realvalued entries.
Extended Connectivity Fingerprints (ECFP)
We implemented a variant of the ECFP as described by Rogers and Hahn [5]. Each ECFP feature represents a circular substructure around a center atom. The algorithm starts with the initial atom identifier of the center atom and grows a circular substructure around this atom throughout a defined number of iterations (search depth). For each round, the current extended version of the feature is added to the final set of features. In contrast to other radial fingerprints, the bonding information is included. Therefore, a feature can be extracted, for example, as canonical SMILES.
The current implementation of the ECFP in jCompoundMapper differs slightly from the original implementation. In the original algorithm, the identifiers of the alpha atoms of a center atom are used to calculate an updated identifier for the center atom. The algorithm only includes the alpha atoms of a center atom in each iteration and thus the connectivity information is completely discarded between the layers. However, the identifier of a center atom implicitly contains information from further and further away of the center atom in each iteration because the atom identifiers of the previous iteration are used. We explicitly model the growing substructure by using the initial atom identifiers in each iteration and keeping the connectivity information between the layers. After an iteration, new possible attachment points for a specific circular substructure are kept in memory and those attachment points are extended in the next iteration.
Topological Molprintlike fingerprints (RAD2D)
This encoding was proposed by Bender et al. [4,25] and describes the radial environment by the atoms with the topological distance 1, 2, ..., l, ..., d rather than the full paths containing bonds. A shell s(a_{i})_{l }in our implementation contains the canonically sorted set of topological neighbors of atom a_{i }at a distance t_{ij }= l. Additionally, we include the concatenation of all shells ≤ 1, 2, ..., l, ..., d as additional features. Therefore, a resulting set of features contains n · d features.
Local Path Environments (LSTAR)
This fingerprint is a radial fingerprint similar to RAD2D. The major difference is that all paths up to depth d are stored in a shell. First, the tree of all paths originating from an atom a_{i }is generated. Then, all paths of a certain length are assigned to a shell s(a_{i})_{d }containing the paths originating from root atom a_{i }of length d. This is equal to a canonical representation of DFS(a_{i, }d) in a single canonical feature. The paths in a shell are sorted in lexicographical order to be comparable. The resulting fingerprint contains all shells ≤ 1, 2, ..., l, ..., d. The major difference to the Molprintlike fingerprints is that the bond information is still included.
Geometrical Fingerprints
All geometrical encodings support the d parameter which defines the distance cutoff between two atoms. Another important parameter is the scaling factor s, as described at the beginning of this section.
Geometrical Atom Pairs and Atom Triplets (AP3D, AT3D)
These encodings are implemented similarly as their topological pendants AP2D and AT2D. The only difference is that G_{ij }is used for the distance information. Thus, the geometrical twopoint atom pair encoding (AP2D) is defined as
where i ≠ j and g_{ij }≤ d.
For the threepoint relationships AT3D, we have
where i ≠ j ≠ k and g_{ij}, g_{jk}, g_{ki }≤ d.
This is a standard encoding implemented in several toolkits; a kernel based on such patterns was published by Mahé et al [9].
Geometrical CATS fingerprints (CATS3D)
Our implementation differs from the description of the original CATS3D [7], which uses the Molecular Operating Environment (MOE, Chemical Computing Group, http://www.chemcomp.com/ webcite) patterns to depict surface features of a molecule. The version implemented in jCompoundMapper uses the PPP definitions which were also used in the implementation of the CATS2D vector. Again, let F(C) = X be the set of all valid PPP pairs and φ_{p}(X) a function which counts a pattern p ∈ X. Then the CATS3D vector is
where d_{3D}(p) returns the geometrical distance of the two atoms, which equals g_{ij }between any atoms i, j contained in a feature.
Geometrical pharmacophore fingerprints (PHAP2PT3 D, PHAP3PT3D)
These fingerprints are derived from their topological variants PHAP2PT2D and PHAP3PT2D by replacing the T_{ij }matrix by G_{ij}. Let P_{i }denote the set of valid PPP for the ith atom then P_{i }⊕ g_{ij }⊕ P_{j }is a valid twopoint pharmacophores and P_{i }⊕ g_{ij }⊕ P_{j }⊕ g_{jk }⊕ P_{k }⊕ g_{ki }is a valid threepoint pharmacophore, where g_{ij}, g_{jk}, g_{ki }≤ d.
The set of valid 2point pharmacophores is defined as
and the set of valid 3point pharmacophores is defined as
Again, P_{i }denotes the set of valid PPPs for the ith atom.
Geometrical Molprintlike fingerprints (RAD3D)
These encodings are the geometrical pendant of the topological RAD2D encoding. Similar to the RAD2D encoding, the atoms with g_{ij }= l at a binned geometrical distance are added to a shell descriptor. For each value in 1, 2, ..., l, ..., d a pattern containing all shells up to distance l in a canonical order is created. Therefore, the encoding contains n · d entries.
Example of encodings
Comparison Table 1 gives a direct tabular comparison of the features extracted by the encodings together with their count or value. The features are generated from the compound presented in Figure 1.
Table 1. Examples of Encodings
Figure 1. Topology and Geometry of Oxaceprol. The geometry and topology of Oxaceprol. Pharmacophore types shown in the 3 D structure are 1 = [L], 3 = [A  [#7H0]], 6 = [D  [OH], A  [O]],8 = [L], 9 = [A  [O]],10 = [N], 11 = [A  [O]], 12 = [D  [OH], A  [O]]. The geometry and topology of this compound is the basis for the exemplary fingerprints shown in Table 1. Note that multiple PPPs can be assigned to an atom: In this case atoms 6 and 12 have two valid PPPs.
Implementation
Thirdparty libraries
The underlying chemical expert system is the Chemistry Development Kit (CDK) [3,26] in its current development version 1.35. It provides the basic functionality for parsing, typing, and graph algorithms for molecular data. For the commandline interface we employed the Apache Commons commandline parser 1.2 http://commons.apache.org/cli/ webcite. The access via the API or the binary using the commandline interface enables the user to utilize the library for batch processing. The language level is Java 1.6.
Additional functionality
Import and Export of Data
The valid input format is MDL SD format with attached hydrogens for the commandline tool. The CDK molecule objects can be processed using the API.
There are exporters for various formats of popular machine learning toolboxes. The ARFF format is the native WEKA [13] format, the support vector machine libraries LIBSVM and LIBLINEAR are supported by their sparse hashed format and precomputed matrix format. Alternatively, there are several commaseparated formats which support hashed or string features, which can be imported into toolboxes like R or MATLAB.
jCompoundMapper includes a buffered random access reader for parsing the input files. Thus, it can read files of the maximum size supported by the Java runtime environment. The memory requirements are low if the encodings are exported sequentially (such as the sparse LIBSVM format) because only a single encoding has to be stored at a time. If the computation of a similarity matrix is required, all encodings are kept in memory to ensure a fast computation of similarities. For this reason, a matrix computation requires additional memory for large data sets.
The label or class for learning tasks is read from the SD property and is integrated into the specific output format. As for the ARFF format, a nominal or numeric class label is created, depending on the distribution of labels in the input format. The user may overwrite the default threshold for the number of classes (currently, this is set to five).
Hashing
All decomposition algorithms F(C) return the full list of features (the encoding). A feature f has an integer identifier f.id ∈ {0, 1, .... 2^{32}} which allows the efficient use in hash based collections. Therefore, it is possible to operate on the full set of descriptors and to generate hashed fingerprints. Hashing is useful to generate binary vectors of a fixed size. The hash function H
is used to project the set of features to a binary vector of the dimension h. The size depends on the expected number of features (see Table 1). H generates the hashed bit of a pattern depending on the numerical seed f.id assigned to each feature. In most cases, the seed equals a hash code of the string representation. The hashing step is also useful to obtain nominal features for fast comparisons. A nominal feature is a feature f with a finite set of possible values like f ∈ {red, green, blue} or, convenient for chemical graphs, f ∈ {pattern included, pattern not included}.
Similarity Matrices
jCompoundMapper offers a FeatureMap data structure, which stores the features in a wrapped hash map. Different metrics are defined on this data structure, such as MinMax and Tanimoto [27]. Thus, it is possible to compute distance matrices within seconds on an average desktop computer. Now, we assume two mappings F (C_{A}) = A and F (C_{B}) = B. Further, let φ_{p}(X) count the number of occurrences of pattern p ∈ X.
The MinMax similarity is defined as
The Tanimoto similarity can be used instead if only the occurrence of a pattern is taken into account
The feature maps permit to compute similarity matrices with jCompoundMapper on the full set of features of a compound C, without introducing noise by hashing. Nevertheless, it is also possible to generate hashed binary fingerprint objects of any of the encodings.
Results and Discussion
Computation Times Benchmarks
Table 2 presents the performance of the different encodings as implemented in jCompoundMapper. The computation time for the atombased approaches varies from the atom pair encodings which can be computed with 332339 molecules per second to the depth first search based encodings which have a performance of 68136 molecules per second. The encodings relying on the typing using the PPP SMARTS definitions are significantly slower with about 78 processed molecules per second. The conversion time includes reading, typing, and feature map creation. As benchmark data set, we chose the publicly available ChemDB random background data set published in the virtual screening study by Nasr et al. [28]. This data set comprises 175,000 random compounds from ChemDB [29].
Table 2. Conversion Time
Machine Learning Performance
A major application of molecular encodings are structurebased machine learning and data mining methods. The aim of such applications is either the prediction of molecular properties or the ranking of compounds according to a trained model. In the following experiments, we wanted to assess the quality of the encodings implemented in jCompoundMapper for several established regression and classification benchmark problems. The encodings were used with the default parameters as given in Table 2. The compounds were prepared using CORINA [30] for initial coordinates and were refined using Schrödinger MacroModel [31] with the OPLS 2005 force field.
QSAR Regression Problems with LIBSVM
LIBSVM [12] is a library for support vector machines. For the experiments on the regression benchmarks, we decided to train ϵsupport vector regression on the benchmark compilation of eight pIC50 QSAR problems published by Sutherland et al. [14]. The Gram matrices were precomputed by the MinMax similarity, which is also a valid kernel function. We conducted these experiments to find out whether there are significant differences between the performances of the different encodings.
We evaluated the nested crossvalidation performance of the different encodings and compared the outcome of the experiments against results from the literature. The parameters C and ϵ of the support vector machine were selected in a nested crossvalidation. In the experiments, a 10fold crossvalidation was repeated 20 times using an initial seed value. Therefore, the values represent the mean and the standard deviation, computed over 200 models. Based on these statistics, the corrected resampled ttest of Bouckaert and Frank [32] can be applied. The results are summarized in Table 3. With the exception of THERM and THR, the performance of the best encodings was at least comparable to the mean squared error values for a sophisticated graph kernel given by Fechner et al. [33] on the same benchmarks.
Table 3. Nested Crossvalidation MSE Performance on the Sutherland Data Sets
An analogue setup was used to compute nested leaveoneout crossvalidation results to compare the predictive performance of the support vector machine in combination with the jCompoundMapper encodings to literature results. Again, we optimized the parameters C and ϵ (we used a 10fold crossvalidation repeated 2 times to select the best parameter combination in the inner loop) and trained a model for each of the n leaveoneout sets and predicted the external sample for each model.
Table 4 and 5 summarize the results of the nested leaveone out crossvalidation according to the mean squared error and Pearson's correlation coefficient. Sutherland et al. used several descriptorbased approaches to model the activity of the benchmark set (presented in Table 4 and 5) using partial least squares (PLS) [14]. Compared against the results (squared correlation coefficient) given for the best descriptor approach presented in this study, the jCompoundMapper encodings are competitive. The findings are summarized in Table 6 which shows a similar performance in two cases, a better performance in three cases, and a worse performance in three cases.
Table 4. Nested Leaveoneout MSE Performance on Sutherland Data Sets
Table 5. Nested Leaveoneout Correlation Performance on Sutherland Data Sets
Table 6. Comparison with Literature Results on Sutherland Data Sets
Classification of Toxic Compounds with LIBLINEAR
Another increasingly important task is to build models on large data sets of chemicals. A machine that can cope with such a setup is LIBLINEAR [16], a linear largescale support vector machine. The large Ames data set was published by Hansen et al. [15] and contains 6512 compounds and their measured toxicity in an Ames test. We skipped the encodings based on the PPP typer because several compounds do not have any pharmacophore point according to the PPP definition. The results were obtained by tuning the C parameter in log_{2 }∈ {8, 7, ..., 2} within a 2fold crossvalidation on the training set and evaluating the model performance on five defined splits as described in [15].
Table 7 shows the AUC ROC results for the large Ames toxicity benchmark set. The ECFP and LSTAR encodings achieved the best results, comparable to the results of several supervised classifiers presented by Hansen et al. [15] using dragonX [34] descriptors. Hansen et al. applied knearest neighbor, support vector machines with a radial basis function, Gaussian processes, and random decision forests to build models on dragonX descriptors for this problem. The approaches were evaluated on the same defined splits. The performance of LIBLINEAR with the best encodings is comparable to the best approaches Gaussian processes and support vector machines. Even the worst performing encodings (AP2D, AP3D, and DFS) were competitive to the knearest neighbor classifier on dragonX descriptors.
Table 7. Comparison with Literature Results on the Ames Toxicity Benchmark
Java API and Commandline Interface
Java API Usage Example
The core of the library is a Java API. The API enables to process chemical information from an abstract level, similar to a workflow tool. The example given in Appendix 1 reads an MDL SD file, converts the compounds to feature maps and calculates all pairwise similarities.
CommandLine Interface Example
The following section gives an example of using the binary executable version of jCompoundMapper. As a case in point, this version can be used in shell scripts. Calling the commandline tool using h gives an overview of possible parameters (see Figure 2).
Figure 2. Commandline Interface. The binary can be accessed via a commandline interface, which allows for scripting.
Using the defaults (or via ff 0), jCompoundMapper generates a hashed LIBSVM output format using the depthfirst search encoding with element plus neighbor count atom types.
In the following, we process the training and the known test set from the environmental toxicity challenge http://www.cadaster.eu/node/65 webcite) which were converted to MDL SD format. The label (MDL property) to be learned is log(IGC501). Using these settings, the structures of the training set were mapped to hashed fingerprints with the default settings.
java jar jCMapperCLI.jar f challenge_train.sdf l "log(IGC501)"
After the computation, an overall statistic is plotted showing e.g. the average number of features in the fingerprints. In the next step, we map the test file to the same representation. Bits in the test file were set in exactly the same positions in the vector because the random numbers are generated by using the seed value defined by the features.
java jar jCMapperCLI.jar f challenge_test_known.sdf l "log(IGC501)"
In the next step, a crossvalidation is conducted by using the precompiled binary distribution of LIBSVM that can be downloaded from the LIBSVM homepage. The parameters are set as follows: t 0 sets the linear kernel (dot product), s 3 sets ∈ regression, and c 2 sets the error weight to 2. The file for training was produced in the previous step.
svmtrain t 0 s 3 v 10 c 2 challenge_train.DFS.LIBSVM_SPARSE
LIBSVM produces no model in crossvalidation mode. However, the LIBSVM crossvalidations statistics showed that the model has an MSE of 0.32 and an Q^{2 }of 0.71, indicating a reasonable parametrization.
Cross Validation Mean squared error = 0.324891
Cross Validation Squared correlation coefficient = 0.712412
Finally, the model is trained by omitting the crossvalidation flag v.
svmtrain t 0 s 3 c 2 challenge_train.DFS.LIBSVM_SPARSE
This step produces a separate model file, which can be used to predict the external test set that was generated during the second step. This is conducted by calling svmpredict.
svmpredict challenge_test_known.DFS.LIBSVM_SPARSE challenge_train.DFS.LIBSVM_SPARSE.model result
The results are printed by LIBSVM highlighting that the performance on the external test set is MSE = 0.29 and R^{2 }= 0.74. The result on the known test of the environmental toxicity prediction challenge would be in the top ranks of the competition. The prediction values can be obtained by opening the LIBSVM output file result in an editor.
Mean squared error = 0.291011 (regression)
Squared correlation coefficient = 0.742283 (regression)
Conclusions
jCompoundMapper is an open source library for molecular fingerprinting with a focus on machine learning and data mining applications. A commandline interface exists for the user who is not familiar with programming, which allows a simple usage from the shell or the application in scripts. The architecture provides the functionality to derive fingerprints from existing ones or to integrate own encodings. In contrast to closed source fingerprinting toolkits, a scientist knows exactly how the fingerprint is computed (like the labeling function, distance cutoffs) and can even inspect the source code of the generation routine. We compared the performance using linear and nonlinear support vector machines on standard machine learning benchmarks in the research field. The results show that the machine learning performance using the encodings with default parameters is already close to more sophisticated stateoftheart descriptors. The binary version provides a commandline interface allowing for the generation of models from the shell with open source software such as LIBSVM or WEKA in reasonable time on average desktop computers. The library itself uses only functionality of open source software licensed under the LGPL. Therefore, the library can be used in any project compatible with the CDK. Further projects with the library, such as a KNIME node wrapping jCompoundMapper, are planned.
Availability
The following files are available for download from http://jcompoundmapper.sourceforge.net/ webcite
1. External library, which can be integrated as Java jar library file
2. External library, including sources
3. Binary commandline tool (requires a Java runtime environment) and a short tutorial with a prepared data set
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
GH wrote most of the code and the manuscript. LR implemented the Molprintlike fingerprints and the extended connectivity fingerprint, helped to design the library, and participated in writing the manuscript. AJ implemented an initial version of the pharmacophore typer and the CATS2D descriptors. NF tested some of the encodings in experiments and helped to develop the atom typing schemes. AZ supervised the study and participated in the discussion of the results. All authors read and approved the final manuscript.
Appendix
Appendix 1  Usage of the API
Example of using the API: Read molecules, map the compounds to encodings, and compute a similarity matrix.
1 // read sd file
2 RandomAccessMDLReader reader = new RandomAccessMDLReader(new File (" molecules . sdf " ) );
3
4 // convert all compounds in the data set to feature maps
5 final ArrayList <FeatureMap> featureMaps = new ArrayList <FeatureMap >( );
6 final FingerPrinter finger printer = new Encoding2DAllShortestPaths ( ) ;
7 for (int i = 0; i < reader. getSize ( ); i++) {
8 ArrayList <IFeature > rawFeatures = finger printer . getFingerprint (reader. getMol (i));
9 featureMaps . add (new FeatureMap (rawFeatures ) ) ;
10 }
11
12 //compute all pairwise distances by feature maps
13 final int dim = featureMaps . size ( ) ; double [ ] [ ] matrix = new double [ dim ] [ dim ] ;
14 IDistanceMeasure similarity = new DistanceTanimoto ( ) ;
15 for ( int i = 0; i < dim; i++) {
16 for ( int j = i ; j < dim ; j++) {
17 matrix [ i ] [ j ] = similarity . getSimilarity (featureMaps . get ( i ), featureMaps . get ( j ));
18 }
19 }
References

Brown N: Chemoinformatics  An Introduction for Computer Scientists.
ACM Comput Surv 2009, 41:8:18:38. Publisher Full Text

Willett P, Barnard JM, Downs GM: Chemical Similarity Searching.

Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): an opensource Java library for Chemo and Bioinformatics.
J Chem Inf Comput Sci 2003, 43(2):493500. PubMed Abstract  Publisher Full Text

Bender A, Mussa HY, Glen RC, Reiling S: Similarity Searching of Chemical Databases Using Atom Environment Descriptors (MOLPRINT 2D): Evaluation of Performance.
J Chem Inf Comput Sci 2004, 44(5):17081718. PubMed Abstract  Publisher Full Text

Rogers D, Hahn M: Extendedconnectivity fingerprints.
J Chem Inf Model 2010, 50(5):742754. PubMed Abstract  Publisher Full Text

Ralaivola L, Swamidass SJ, Saigo H, Baldi P: Graph kernels for chemical informatics.
Neural Networks 2005, 18(8):10931110. PubMed Abstract  Publisher Full Text

Renner S, Fechner U, Schneider G: Alignmentfree Pharmacophore Patterns  A Correlation Vector Approach. In Pharmacophores and Pharmacophore Searches, Pharmacophores and Pharmacophore Searches. Edited by Langer T, Hoffmann R. Weinheim: WileyVCH; 2006:4979.

Carhart RE, Smith DH, Venkataraghavan R: Atom Pairs as Features in StructureActivity Studies: Definition and Applications.

Mahé P, Ralaivola L, Stoven V, Vert JP: The Pharmacophore Kernel for Virtual Screening with Support Vector Machines.
J Chem Inf Model 2006, 46(5):20032014. PubMed Abstract  Publisher Full Text

Bender A, Mussa HY, Gill GS, Glen RC: Molecular Surface Point Environments for Virtual Screening and the Elucidation of Binding Patterns (MOLPRINT 3D).
J Med Chem 2004, 47(26):65696583. PubMed Abstract  Publisher Full Text

Brown N, McKay B, Gasteiger J: Fingal: A Novel Approach to Geometric Fingerprinting and a Comparative Study of Its Application to 3DQSAR Modelling.
QSAR Comb Sci 2005, 24:480484. Publisher Full Text

Chang CC, Lin CJ: [http://www.csie.ntu.edu.tw/~cjlin/libsvm] webcite

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA Data Mining Software: An Update.
SIGKDD Explorations 2009, 11:1018. Publisher Full Text

Sutherland JJ, O'Brien LA, Weaver DF: A Comparison of Methods for Modeling Quantitative StructureActivity Relationships.
J Med Chem 2004, 47(22):55415554. PubMed Abstract  Publisher Full Text

Hansen K, Mika S, Schroeter T, Sutter A, ter Laak A, StegerHartmann T, Heinrich N, Müller KR: Benchmark Data Set for in Silico Prediction of Ames Mutagenicity.
J Chem Inf Model 2009, 49(9):20772081. PubMed Abstract  Publisher Full Text

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ: LIBLINEAR: A Library for Large Linear Classification.

Fechner N, Jahn A, Hinselmann G, Zell A: Estimation of the applicability domain of kernelbased machine learning models for virtual screening.
J Cheminf 2010, 2:2. BioMed Central Full Text

Hinselmann G, Fechner N, Jahn A, Eckert M, Zell A: Graph kernels for chemical compounds using topological and threedimensional local atom pair environments.
Neurocomputing 2010, 74:219229. Publisher Full Text

Hinselmann G, Jahn A, Fechner N, Zell A: Chronic Rat Toxicity Prediction of Chemical Compounds Using Kernel Machines. In Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics: 7th European Conference (EvoBio 2009). Volume 5483. Tübingen, Germany: Springer; 2009::2536. Publisher Full Text

Jahn A, Hinselmann G, Fechner N, Zell A: Optimal Assignment Methods for LigandBased Virtual Screening.
J Cheminf 2009, 1:14. BioMed Central Full Text

Jahn A, Hinselmann G, Fechner N, Henneges C, Zell A: Probabilistic Modeling of Conformational Space for 3D Machine Learning Approaches.
Molecular Informatics 2010, 29(5):441455. Publisher Full Text

Borgwardt KM, Ong CS, Schönauer S, Vishwanathan SVN, Smola AJ, Kriegel HP: Protein function prediction via graph kernels.
Bioinformatics 2005, 21:4756. Publisher Full Text

Schneider G, Neidhart W, Giller T, Schmid G: ScaffoldHopping by Topological Pharmacophore Search: A Contribution to Virtual Screening.
Angew Chem., Int Ed 1999, 38(19):28942896. Publisher Full Text

GregoriPuigjané E, Mestres J: SHED: Shannon Entropy Descriptors from Topological Feature Distributions.
J Chem Inf Model 2006, 46(4):16151622. PubMed Abstract  Publisher Full Text

Bender A, Mussa HY, Glen RC: Screening for Dihydrofolate Reductase Inhibitors Using MOLPRINT 2D, a Fast FragmentBased Method Employing the Naive Bayesian Classifier: Limitations of the Descriptor and the Importance of Balanced Chemistry in Training and Test Sets.
J Biomol Screen 2005, 10(7):658666. PubMed Abstract  Publisher Full Text

Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen E: Recent developments of the chemistry development kit (CDK)  an opensource java library for chemo and bioinformatics.
Curr Pharm Des 2006, 12(17):21112120. PubMed Abstract  Publisher Full Text

Swamidass SJ, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P: Kernels for small molecules and the prediction of mutagenicity, toxicity and anticancer activity.
Bioinformatics 2005, 21:359368. Publisher Full Text

Nasr R, Swamidass SJ, Baldi P: Large scale study of multiplemolecule queries.
J Cheminf 2009, 1:7. BioMed Central Full Text

Chen J, Swamidass SJ, Dou Y, Baldi P: ChemDB: a public database of small molecules and related chemoinformatics resources.
Bioinformatics 2005, 21:41334139. PubMed Abstract  Publisher Full Text

Gasteiger J, Rudolph C, Sadowski J: Automatic Generation of 3DAtomic Coordinates for Organic Molecules.
Tetrahedron Comput Methodol 1992, 3:537547. Publisher Full Text

Schrödinger LLC: Schrödinger MacroModel 9.6. Schrödinger, LLC, New York, NY; 2008.

Bouckaert RR, Frank E: Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms. In Advances in Knowledge Discovery and Data Mining  Proceedings of 8th PacificAsia Conference, PAKDD 2004. Volume 3056. Edited by Dai H, Srikant R, Zhang C. Springer; 2004::312.

Fechner N, Jahn A, Hinselmann G, Zell A: Atomic Local Neighborhood Flexibility Incorporation into a Structured Similarity Measure for QSAR.
J Chem Inf Model 2009, 49(3):549560. PubMed Abstract  Publisher Full Text

Talete srl, Milano, Italy: [http://www.talete.mi.it/] webcite
dragonX 1.4 for Linux (Molecular Descriptor Calculation Software).