This article is part of the supplement: 5th German Conference on Cheminformatics: 23. CIC-Workshop
Quantifying model errors using similarity to training data
Journal of Cheminformatics 2010, 2(Suppl 1):O7 doi:10.1186/1758-2946-2-S1-O7
Published: 4 May 2010
When making a prediction with a statistical model, it is not sufficient to know that the model is "good" in the sense that it makes accurate predictions on test data. Another relevant question is: how good is the model for the specific sample whose properties we wish to predict? Stated another way: is the sample within or outside the model's domain of applicability, or, more generally, to what degree does a test compound lie within that domain? Numerous studies have sought appropriate measures to address this question [1-4]. Here we focus on a derivative question: can we determine an applicability domain measure suitable for deriving quantitative error bars, that is, error bars which accurately reflect the expected error when making predictions for specified values of the domain measure? Such a measure could then be used to indicate the confidence in a given prediction (i.e. the likely error in the prediction, given the degree to which the test compound lies within the model's domain of applicability).

Ideally, we wish such a measure to be simple to calculate and to understand, to apply to models of all types (including classification and regression models, for both molecular and non-molecular data), and to be free of adjustable parameters. Consistent with recent work by others [5,6], the measures we have seen that best meet these criteria are distances to individual samples in the training data. We describe our attempts to construct a recipe for deriving quantitative error bars from these distances.
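To make the idea of distance-based error bars concrete, the following is a minimal illustrative sketch and not the recipe developed in this work, whose details are not given here. It assumes Euclidean distance to the nearest training sample as the domain measure and estimates the expected error as the root-mean-square error of held-out predictions falling in the same distance bin. The function names (`nearest_distance`, `calibrate`, `error_bar`) and the `n_bins` argument are hypothetical; note that `n_bins` is itself an adjustable parameter, which the criteria above would ideally avoid.

```python
import math


def nearest_distance(sample, training_set):
    """Euclidean distance from `sample` to its closest training sample."""
    return min(math.dist(sample, t) for t in training_set)


def calibrate(distances, errors, n_bins=3):
    """Bin held-out predictions by distance (assumed scheme: equal-count bins).

    Returns a list of (upper_distance_edge, rmse) pairs, ordered by distance.
    """
    pairs = sorted(zip(distances, errors))
    size = math.ceil(len(pairs) / n_bins)
    table = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        rmse = math.sqrt(sum(e * e for _, e in chunk) / len(chunk))
        table.append((chunk[-1][0], rmse))
    return table


def error_bar(distance, table):
    """Look up the RMSE of the first bin whose upper edge covers `distance`.

    Distances beyond the last edge fall back to the last (farthest) bin.
    """
    for upper_edge, rmse in table:
        if distance <= upper_edge:
            return rmse
    return table[-1][1]


# Made-up held-out data: distance to nearest training sample, signed error.
held_out_distances = [0.1, 0.2, 0.5, 0.6, 1.0, 1.2]
held_out_errors = [0.1, -0.1, 0.3, -0.3, 0.5, -0.5]

table = calibrate(held_out_distances, held_out_errors, n_bins=3)
print(error_bar(0.4, table))
```

In this toy data the error grows with distance, so a new sample at distance 0.4 is assigned the error bar of the middle bin. Whether such a lookup yields quantitatively accurate error bars on real data is exactly the question raised above.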