CHAPTER 1
WHAT ARE OUR MODELS REALLY TELLING US? A PRACTICAL TUTORIAL ON AVOIDING COMMON MISTAKES WHEN BUILDING PREDICTIVE MODELS
W. PATRICK WALTERS
1.1 INTRODUCTION
Predictive models have become a common part of modern-day drug discovery [1]. Models are used to predict a range of key parameters including:
- Physical properties such as aqueous solubility or octanol/water partition coefficients [2–4]
- Off-target activities such as CYP or hERG inhibition [5–7]
- Binding geometry and affinity of small molecules in protein targets [8].
When building these models, it is essential that the cheminformatics practitioner be aware of factors that could potentially mislead and confuse those using the models. In this chapter, we will focus on some common traps and pitfalls and discuss strategies for realistic evaluation of models.
We will consider a few important, and often overlooked, issues in the model-building process.
- How does the dynamic range of the data being modeled impact the apparent performance of the model?
- How does experimental error impact the apparent predictivity of a model?
- How can we determine whether a model is applicable to a new dataset?
- How should we compare the performance of regression models?
The chapter will take a tutorial format. We will analyze some commonly used datasets and use this analysis to make a few points about the process of building and evaluating predictive models. One of the most important aspects of scientific investigation is reproducibility. As such, all of the analyses discussed in this chapter were performed using readily available, open source software. This makes it possible for the reader to follow along, carry out the analyses, and experiment with the datasets. All of the code used to perform the analyses is available in the listings section at the end of the chapter. The datasets and scripts used in this chapter can also be downloaded from the author’s website https://github.com/PatWalters/cheminformaticsbook. It is hoped that these scripts will kindle an appreciation for aspects of the model-building process and will provide the basis for further exploration.
The software tools required for the analyses are
The Python programming language – http://www.python.org
The RDKit cheminformatics programming library – http://www.rdkit.org
The R statistics program – http://www.r-project.org
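Before running the scripts, it can be useful to confirm that the required Python packages are importable. The short sketch below is not part of the chapter's scripts; it simply checks for the RDKit (whose Python import name is rdkit) using only the standard library:

```python
import importlib.util

# Report whether each module the chapter's Python scripts rely on
# can be imported; "rdkit" is the RDKit's Python import name.
for name in ("rdkit",):
    spec = importlib.util.find_spec(name)
    print(f"{name}: {'found' if spec is not None else 'MISSING'}")
```

If a module is reported as MISSING, install it before attempting to run the chapter's Python scripts.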
Python scripts can be run by executing the command
python script_name.py (Unix and OS-X)
python.exe script_name.py (Windows)
where script_name.py is the name of the script to run.
R scripts can be run by executing the following two commands within the R console.
setwd("directory_path")
source("script.R")
In the aforementioned commands, "directory_path" is the full path to the directory (folder) containing the scripts and data, and "script.R" is the name of the script to execute. The R scripts used in this chapter utilize a number of libraries that are not included as part of the base R distribution. These libraries can be easily installed by typing the command
source("install_libraries.R")
in the R console. Since these libraries are being downloaded from the Internet, it is necessary for your computer to be connected to the Internet when executing the aforementioned command.
Those unfamiliar with Python or R are urged to consult references associated with those languages [9–12]. We now live in a data rich world where every cheminformatics practitioner should possess at least rudimentary programming skills.
1.2 PRELIMINARIES
In order to better understand some of the nuances associated with the construction and evaluation of predictive models, it is useful to consider actual examples. In this chapter, we will examine a number of datasets containing measured values for aqueous solubility and use these datasets to build and evaluate predictive models. Solubility in water or buffer is an important parameter in drug discovery [13]. Poorly soluble compounds tend to have poor pharmacokinetics and can precipitate or cause other problems in assays. As such, the prediction of aqueous solubility has been an area of high interest in the pharmaceutical industry. Over the last 15 years, numerous papers have been published on methods for predicting aqueous solubility [2, 3, 14]. Although many papers have been published and commercial software for predicting aqueous solubility has been released, reliable solubility prediction remains a challenge.
The challenges in developing models for predicting solubility can arise from a number of experimental factors. The aqueous solubility of a compound can vary depending on a number of factors including:
- Temperature at which the solubility measurement is performed
- Purity of the compound
- Crystal form—different polymorphs of the same compound can have vastly different solubilities.
In addition to confounding experimental factors, a number of published solubility models are somewhat misleading due to a lack of proper computational controls. While we sometimes have limited control over the experimental data used to build models, we have complete control over the way models are evaluated and should always employ appropriate means of evaluating our models. In subsequent sections, we will use solubility datasets to examine some of these control strategies.
1.3 DATASETS
In this chapter, we will consider three different, publicly available, solubility datasets.
The Huuskonen Dataset This set of 1274 experimental solubility values (Log S) was one of the first large solubility datasets published [15, 16] and has subsequently been used in a number of other publications [14, 17]. The data in this set was extracted from the AQUASOL [18, 19] database, compiled by the Yalkowsky group at the University of Arizona, and the PHYSPROP [20] database, compiled by the Syracuse Research Corporation.
The JCIM Dataset This is a set of 94 experimental solubility values that were released as the training set for a “blind challenge” published in 2008 [21]. All of the solubility values reported in this paper were measured by a single group under a consistent set of conditions. The objective of this challenge was for groups to use a consistently measured set of solubility values to build a model that could subsequently be used to predict the solubility of a set of test compounds. Results of the challenge were reported in a subsequent paper in 2009 [22].
The PubChem Dataset A randomly selected subset of 1000 measured solubility values drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND) by the Sanford-Burnham Medical Research Institute and deposited in the PubChem database (AID 1996) [23]. This dataset is composed primarily of screening compounds from the NIH Molecular Libraries initiative and can be considered representative of the types of compounds typically found in early stage drug discovery programs. Values in this dataset were reported with a qualifier (“<”, “=”, or “>”) to indicate whether the values were below, within, or above the limits of detection for the assay. Only values within the limits of detection (designated by “=”) were selected for the subset used in this analysis.
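The qualifier filtering described above can be sketched in a few lines of Python. The record layout below is a stand-in for illustration, not the actual PubChem file format:

```python
# Each record pairs a qualifier with a solubility value in µg/ml.
# These tuples are made-up illustrations, not actual PubChem data.
records = [("<", 0.05), ("=", 12.5), (">", 250.0), ("=", 3.1)]

# Keep only values measured within the assay's limits of detection
in_range = [value for qualifier, value in records if qualifier == "="]
print(in_range)  # [12.5, 3.1]
```

The same filter applies regardless of how the records are read in (CSV, SD file tags, etc.); only the rows whose qualifier is “=” carry a quantitative measurement.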
In order to compare predictions with these three datasets, we first need to format the data in a consistent fashion. We begin by formatting all of the data as Log S, the log of the molar solubility of the compounds. Data in the PubChem and JCIM datasets were originally reported in µg/ml, so the data was transformed to Log S using the formula
Log S = log10((solubility in µg/ml)/(1000.0 × MW))
where log10 is the base 10 logarithm and MW is the molecular weight in g/mol.
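The conversion follows from the units: µg/ml is equivalent to mg/L, so dividing by 1000 gives g/L, and dividing by the molecular weight gives mol/L. A minimal implementation (the numbers in the example call are illustrative only, not a measured value for any particular compound):

```python
import math

def log_s(solubility_ug_per_ml, mol_wt):
    """Convert a solubility in µg/ml to Log S, the log10 molar solubility."""
    # µg/ml is the same as mg/L; dividing by 1000 gives g/L,
    # and dividing by the molecular weight (g/mol) gives mol/L
    return math.log10(solubility_ug_per_ml / (1000.0 * mol_wt))

# Illustrative numbers only: 3000 µg/ml for a compound of MW 180.16
print(round(log_s(3000.0, 180.16), 2))  # -1.78
```

In practice, the molecular weight would be computed from the structure (e.g., with the RDKit) rather than supplied by hand.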
1.3.1 Exploring Datasets
One of the first things to consider in evaluating a new dataset is the range and distribution of the values reported. An excellent tool for visualizing data distributions is the boxplot [24, 25]. The “box” at the center of the boxplot spans the middle 50% of the data, while the “whiskers” extend to the most extreme values not flagged as outliers. Outliers, commonly defined as points lying more than 1.5 times the interquartile range beyond the box, are drawn as circles. More information on boxplots can be found on the Wikipedia page [26] and references therein. The anatomy of a boxplot is detailed in Figure 1.1.
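The quantities a boxplot draws can be computed directly. The sketch below uses only the Python standard library and the common Tukey convention of 1.5 × IQR whisker fences; note that plotting packages may use a slightly different quartile method:

```python
import statistics

def boxplot_stats(values):
    """Return (q1, median, q3, whisker_low, whisker_high, outliers)."""
    vals = sorted(values)
    q1, median, q3 = statistics.quantiles(vals, n=4)  # quartiles
    iqr = q3 - q1                                     # interquartile range
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers reach the most extreme data points inside the fences
    whisker_low = min(v for v in vals if v >= lo_fence)
    whisker_high = max(v for v in vals if v <= hi_fence)
    # Points beyond the fences are drawn individually as outliers
    outliers = [v for v in vals if v < lo_fence or v > hi_fence]
    return q1, median, q3, whisker_low, whisker_high, outliers

print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Running this on the small example list flags the value 100 as an outlier while the whiskers stop at the most extreme in-range points, mirroring what Figure 1.1 depicts graphically.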
Figure 1.2 shows a boxpl...