Score function of violations and best cutpoint to identify druggable molecules and associated disease targets (invited paper)


Hudson, I. L.; Leemaqz, S. Y.; Abell, A. D.


Predicting druggability and prioritising disease-modifying targets is critical in drug discovery. Expanding the spectrum of disease-relevant targets amenable to pharmacological manipulation is vital to reducing morbidity and mortality. We test a druggability rule based on 10 molecular parameters (a score counting violations, denoted score10), which uses cutpoints for each molecular parameter derived from mixture clustering discriminant analysis (MC/DA) (Hudson et al., 2014). A total of 1279 small molecules from the DrugBank cheminformatics database (Knox et al., 2011), which combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with drug disease target information, were analysed and shown to be aligned with 173 targets. The score10 function comprised the 4 traditional parameters of the rule of five (Ro5) (Lipinski, 2016), plus 5 extra parameters (polar surface area (PSA) and the numbers of rotatable bonds, rings, halogens, and N and O atoms), with an extra candidate measure of lipophilicity, log D (the distribution coefficient). Log D was suggested by Bhal et al. (2007) as a possibly preferable predictor of permeation (Zafar, Hudson et al., 2016, 2013) to Lipinski’s traditional partition coefficient, log P. Multivariate skew normal (SN) (Lee and McLachlan, 2013) and Gaussian (MN) mixture clustering identified 5 molecule groups based on the 10 predictors, or 9 predictors when the number of halogen atoms was omitted. The MN clusters were highly differentiable, with 3 of the 5 clusters classified as poor druggable candidates; the SN clusters behaved similarly. Logistic regression was used to determine the best cutpoint, C, for the total number of violations, score10 (< C versus ≥ C, for C = 3, 4 or 5), using predictor models containing the molecule’s Ro5 status (a Ro5-compliant molecule is druggable by Lipinski’s rule), oral status, and poor versus good druggability grouping based on the clustering.
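As a rough illustration of how a violation-count score such as score10 can be computed and partitioned at a cutpoint, consider the minimal sketch below. It is not the authors' implementation. The descriptor names, the `score10` and `is_high_violator` helpers, the two example molecules and all numeric limits are assumptions made for illustration: the first four limits are the standard Lipinski Ro5 values, the rest are placeholders, and none are the MC/DA cutpoints of Hudson et al. (2014).

```python
# Illustrative upper limits per descriptor. These are NOT the MC/DA cutpoints
# from Hudson et al. (2014); they are assumptions used only for this sketch.
ILLUSTRATIVE_LIMITS = {
    "mol_weight": 500.0,  # standard Lipinski Ro5 limit
    "log_p": 5.0,         # standard Lipinski Ro5 limit
    "h_donors": 5,        # standard Lipinski Ro5 limit
    "h_acceptors": 10,    # standard Lipinski Ro5 limit
    "psa": 140.0,         # placeholder
    "rot_bonds": 10,      # placeholder
    "rings": 4,           # placeholder
    "halogens": 3,        # placeholder
    "n_o_atoms": 10,      # placeholder
    "log_d": 5.0,         # placeholder
}


def score10(descriptors, limits=ILLUSTRATIVE_LIMITS):
    """Count the descriptors whose value exceeds its upper limit."""
    return sum(1 for name, limit in limits.items() if descriptors[name] > limit)


def is_high_violator(descriptors, cutpoint=5, limits=ILLUSTRATIVE_LIMITS):
    """Partition at the cutpoint: True when score10 >= cutpoint."""
    return score10(descriptors, limits) >= cutpoint


# Two hypothetical molecules (values invented for illustration).
large_molecule = {
    "mol_weight": 720.0, "log_p": 6.2, "h_donors": 6, "h_acceptors": 12,
    "psa": 180.0, "rot_bonds": 12, "rings": 3, "halogens": 0,
    "n_o_atoms": 14, "log_d": 4.1,
}
small_molecule = {
    "mol_weight": 180.0, "log_p": 1.2, "h_donors": 1, "h_acceptors": 4,
    "psa": 63.6, "rot_bonds": 3, "rings": 1, "halogens": 0,
    "n_o_atoms": 4, "log_d": -1.0,
}

print(score10(large_molecule), is_high_violator(large_molecule))
print(score10(small_molecule), is_high_violator(small_molecule))
```

Passing the limits as a dictionary keeps the score independent of any particular set of cutpoints, so the same function could be evaluated with per-descriptor cutpoints from a clustering analysis and with C = 3, 4 or 5.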
We studied the performance of a support vector machine (SVM) and recursive partitioning (RP), based on the 10 molecular descriptors, in classifying compounds with high or low violation scores (defined by our optimal cutpoint, C). RP was applied to find simple hierarchical rules to separate the high-score violators (≥ C) from the low (< C). pROC analyses (Robin et al., 2011) and logit analyses showed that a cutpoint of 5 is best for partitioning chemo-space. For either partition of the score10 function, logistic models with the MN10 cluster predictor were superior to those with the SN10 cluster predictor. The best model was obtained for a cutpoint of 5 (AIC = 1403.79) and established that molecules with 5 or more violations tended to be non-oral candidates (p < 0.00001), MN10 poor (p < 0.00001) and Ro5 violators (p < 0.00001), with a significant oral by cluster interaction (p < 0.03). The SVM classifier of the score10 partition (cutpoint of 5) gave a Matthews correlation coefficient (also denoted C in the results that follow) of C = 0.887. pROC analyses gave a high area under the curve (AUC) of 98.7%, with 95% CI (98.2%-99.3%), and sensitivity (r) and specificity (s) of 0.961 and 0.924, respectively, for the training set. For the validation set, the SVM gave an AUC of 98.1%, 95% CI (97%-99.2%), r = 0.927, s = 0.983 and likewise a high C = 0.818. The RP classification gave similar but slightly lower AUC and C values than the SVM. Specifically, the RP classifier for the score10 partition yielded, for the training set, an AUC of 95.1% with 95% CI (93.8%-96.4%), r = 0.918, s = 0.936 and C = 0.845; for the validation set, it yielded an AUC of 95.3% with 95% CI (93.1%-97.5%), r = 0.924, s = 0.886 and C = 0.809. The RP rules for separating the high-score violators (≥ 5) from the low (< 5) confirmed the value of log D’s inclusion in the scoring function and supported the original MC/DA cutpoints established for each molecular descriptor (Hudson et al., 2014).
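The sensitivity, specificity and Matthews coefficient reported above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas. The `confusion_metrics` function name and the example counts are assumptions chosen for illustration; they are not the confusion matrices behind the reported values.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, Matthews coefficient).

    Assumes both classes are present, i.e. tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts, with "high violator" (score10 >= 5) as the positive class.
r, s, c = confusion_metrics(tp=90, fp=20, tn=80, fn=10)
print(f"sensitivity={r:.3f} specificity={s:.3f} Matthews={c:.3f}")
```

Unlike sensitivity or specificity taken alone, the Matthews coefficient uses all four cells, which makes it a useful single summary when the high- and low-violator groups differ in size.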
Our work illustrated that an SVM used in combination with simple molecular descriptors can provide a reliable assessment of the partition defined by our simple scoring function of violation counts. Moreover, molecules with a score10 of 5 or more violations were shown to be associated with specific disease targets, namely Anti-Bacterial, Antineoplastic, Antihypertensive and Anti-allergic, within which most of the drugs have a non-oral delivery mode. Target drugs with a median score10 < 5 were Adrenergic, Dietary, Analgesics, Anti-infective, Anesthetics, Adjuvants, Anti-convulsants, Antimetabolites and Antidepressants, all of which, except Dietary and Anesthetics, were non-oral.

Publication type

Conference paper


22nd International Congress on Modelling and Simulation. Modelling and Simulation Society of Australia and New Zealand, December 2017, pp. 487-493


Modelling and Simulation Society of Australia and New Zealand Inc.




Copyright © 2017 MSSANZ.