Chemoinformatics for Drug Discovery
About this book

Chemoinformatics strategies to improve drug discovery results

With contributions from leading researchers in academia and the pharmaceutical industry as well as experts from the software industry, this book explains how chemoinformatics enhances drug discovery and pharmaceutical research efforts, describing what works and what doesn't. Strong emphasis is put on tested and proven practical applications, with plenty of case studies detailing the development and implementation of chemoinformatics methods to support successful drug discovery efforts. Many of these case studies depict groundbreaking collaborations between academia and the pharmaceutical industry.

Chemoinformatics for Drug Discovery is logically organized, offering readers a solid base in methods and models and advancing to drug discovery applications and the design of chemoinformatics infrastructures. The book features 15 chapters, including:

  • What are our models really telling us? A practical tutorial on avoiding common mistakes when building predictive models
  • Exploration of structure-activity relationships and transfer of key elements in lead optimization
  • Collaborations between academia and pharma
  • Applications of chemoinformatics in pharmaceutical research—experiences at large international pharmaceutical companies
  • Lessons learned from 30 years of developing successful integrated chemoinformatic systems

Throughout the book, the authors present chemoinformatics strategies and methods that have been proven to work in pharmaceutical research, offering insights culled from their own investigations. Each chapter is extensively referenced with citations to original research reports and reviews.

Integrating chemistry, computer science, and drug discovery, Chemoinformatics for Drug Discovery encapsulates the field as it stands today and opens the door to further advances.

Chemoinformatics for Drug Discovery, by Jürgen Bajorath, is available in PDF and ePUB formats and is catalogued under Physical Sciences & Physical & Theoretical Chemistry.


CHAPTER 1

WHAT ARE OUR MODELS REALLY TELLING US? A PRACTICAL TUTORIAL ON AVOIDING COMMON MISTAKES WHEN BUILDING PREDICTIVE MODELS

W. PATRICK WALTERS

1.1 INTRODUCTION

Predictive models have become a common part of modern-day drug discovery [1]. Models are used to predict a range of key parameters including:
  • Physical properties such as aqueous solubility or octanol/water partition coefficients [2–4]
  • Off-target activities such as CYP or hERG inhibition [5–7]
  • Binding geometry and affinity of small molecules in protein targets [8].
When building these models, it is essential that the cheminformatics practitioner be aware of factors that could potentially mislead and confuse those using the models. In this chapter, we will focus on some common traps and pitfalls and discuss strategies for realistic evaluation of models.
We will consider a few important, and often overlooked, issues in the model-building process.
  • How does the dynamic range of the data being modeled impact the apparent performance of the model?
  • How does experimental error impact the apparent predictivity of a model?
  • How can we determine whether a model is applicable to a new dataset?
  • How should we compare the performance of regression models?
The chapter will take a tutorial format. We will analyze some commonly used datasets and use this analysis to make a few points about the process of building and evaluating predictive models. One of the most important aspects of scientific investigation is reproducibility. As such, all of the analyses discussed in this chapter were performed using readily available, open source software. This makes it possible for the reader to follow along, carry out the analyses, and experiment with the datasets. All of the code used to perform the analyses is available in the listings section at the end of the chapter. The datasets and scripts used in this chapter can also be downloaded from the author’s website https://github.com/PatWalters/cheminformaticsbook. It is hoped that these scripts will kindle an appreciation for aspects of the model-building process and will provide the basis for further exploration.
The software tools required for the analyses are:
  • The Python programming language – http://www.python.org
  • The RDKit cheminformatics programming library – http://www.rdkit.org
  • The R statistics program – http://www.r-project.org
Python scripts can be run by executing the command
python script_name.py (Unix and OS-X)
python.exe script_name.py (Windows)
where script_name.py is the name of the script to run.
R scripts can be run by executing the following two commands within the R console.
setwd("directory_path")
source("script.R")
In the aforementioned commands, “directory_path” is the full path to the directory (folder) containing the scripts and data, and “script.R” is the name of the script to execute. The R scripts used in this chapter utilize a number of libraries that are not included as part of the base R distribution. These libraries can be easily installed by typing the command
source("install_libraries.R")
in the R console. Since these libraries are being downloaded from the Internet, it is necessary for your computer to be connected to the Internet when executing the aforementioned command.
Those unfamiliar with Python or R are urged to consult references associated with those languages [9–12]. We now live in a data rich world where every cheminformatics practitioner should possess at least rudimentary programming skills.

1.2 PRELIMINARIES

In order to better understand some of the nuances associated with the construction and evaluation of predictive models, it is useful to consider actual examples. In this chapter, we will examine a number of datasets containing measured values for aqueous solubility and use these datasets to build and evaluate predictive models. Solubility in water or buffer is an important parameter in drug discovery [13]. Poorly soluble compounds tend to have poor pharmacokinetics and can precipitate or cause other problems in assays. As such, the prediction of aqueous solubility has been an area of high interest in the pharmaceutical industry. Over the last 15 years, numerous papers have been published on methods for predicting aqueous solubility [2, 3, 14]. Although many papers have been published and commercial software for predicting aqueous solubility has been released, reliable solubility prediction remains a challenge.
The challenges in developing models for predicting solubility can arise from a number of experimental factors. The aqueous solubility of a compound can vary depending on a number of factors including:
  • Temperature at which the solubility measurement is performed
  • Purity of the compound
  • Crystal form—different polymorphs of the same compound can have vastly different solubilities.
In addition to confounding experimental factors, a number of published solubility models are somewhat misleading due to a lack of proper computational controls. While we sometimes have limited control over the experimental data used to build models, we have complete control over the way models are evaluated and should always employ appropriate means of evaluating our models. In subsequent sections, we will use solubility datasets to examine some of these control strategies.

1.3 DATASETS

In this chapter, we will consider three different, publicly available, solubility datasets.
The Huuskonen Dataset This set of 1274 experimental solubility values (Log S) was one of the first large solubility datasets published [15, 16] and has subsequently been used in a number of other publications [14, 17]. The data in this set was extracted from the AQUASOL [18, 19] database, compiled by the Yalkowsky group at the University of Arizona and the PHYSPROP [20] database, compiled by the Syracuse Research Corporation.
The JCIM Dataset This is a set of 94 experimental solubility values that were published as the training set for a “blind challenge” published in 2008 [21]. All of the solubility values reported in this paper were measured by a single group under a consistent set of conditions. The objective of this challenge was for groups to use a consistently measured set of solubility values to build a model that could subsequently be used to predict the solubility of a set of test compounds. Results of the challenge were reported in a subsequent paper in 2009 [22].
The PubChem Dataset A randomly selected subset of 1000 measured solubility values, drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND) by the Sanford-Burnham Medical Research Institute and deposited in the PubChem database (AID 1996) [23]. This dataset is composed primarily of screening compounds from the NIH Molecular Libraries initiative and can be considered representative of the types of compounds typically found in early-stage drug discovery programs. Values in this dataset were reported with a qualifier ("<", "=", or ">") to indicate whether the values were below, within, or above the limit of detection for the assay. Only values within the limit of detection (designated by "=") were selected for the subset used in this analysis.
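The qualifier filtering described above can be sketched in a few lines of Python. The record layout and compound identifiers below are hypothetical, purely for illustration; the actual PubChem export has its own column names and format.

```python
# Keep only assay records measured within the detection limits,
# i.e. those reported with the "=" qualifier.
# The record layout (id, qualifier, value in ug/ml) is hypothetical.
records = [
    ("cmpd-001", "=", 12.5),   # within range: keep
    ("cmpd-002", "<", 0.1),    # below detection limit: discard
    ("cmpd-003", ">", 500.0),  # above detection limit: discard
    ("cmpd-004", "=", 47.3),   # within range: keep
]

in_range = [(cid, value) for cid, qualifier, value in records
            if qualifier == "="]
print(in_range)  # [('cmpd-001', 12.5), ('cmpd-004', 47.3)]
```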
In order to compare predictions with these three datasets, we first need to format the data in a consistent fashion. We begin by formatting all of the data as Log S, the log of the molar solubility of the compounds. Data in the PubChem and JCIM datasets were originally reported in µg/ml, so the data was transformed to Log S using the formula
LogS = log10((solubility in µg/ml)/(1000.0 * MW))
where log10 is the base 10 logarithm and MW is the molecular weight in g/mol.
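The conversion above can be written as a small helper function. This is a minimal sketch in plain Python (no RDKit required), with the molecular weight supplied by the caller:

```python
import math

def ug_ml_to_log_s(solubility_ug_ml, mol_wt):
    """Convert a solubility in ug/ml to Log S (log10 of molar solubility).

    solubility_ug_ml : measured solubility in ug/ml
    mol_wt           : molecular weight in g/mol
    """
    # ug/ml and mg/L are the same unit, so dividing by 1000 gives g/L;
    # dividing g/L by the molecular weight gives mol/L (molar solubility)
    return math.log10(solubility_ug_ml / (1000.0 * mol_wt))

# Sanity check: a 1 M solution of a compound with MW 200 g/mol
# contains 200 g/L = 200,000 ug/ml, so Log S should be 0
print(ug_ml_to_log_s(200_000, 200.0))  # 0.0
```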

1.3.1 Exploring Datasets

One of the first things to consider in evaluating a new dataset is the range and distribution of values reported. An excellent tool for visualizing data distributions is the boxplot [24, 25]. The “box” at the center of the boxplot shows the range covered by the middle 50% of the data, while the “whiskers” show the maximum and minimum values (discounting the presence of outliers). Outliers in the boxplot are drawn as circles. More information on boxplots can be found on the Wikipedia page [26] and references therein. The anatomy of a boxplot is detailed in Figure 1.1.
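The quantities drawn in a boxplot can also be computed directly. The sketch below uses only the Python standard library and follows the common Tukey convention of placing the whiskers at the most extreme data points within 1.5 × IQR of the box; note that quartile conventions vary slightly between statistics packages.

```python
import statistics

def boxplot_summary(values):
    """Five-number summary underlying a Tukey-style boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1                                       # interquartile range
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme points inside the fences;
    # anything beyond the fences is drawn as an outlier (circle)
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": min(inside), "whisker_high": max(inside),
            "outliers": outliers}

# A distribution with one clear outlier
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary["outliers"])  # [100]
```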
Figure 1.2 shows a boxpl...

Table of contents

  1. COVER
  2. TITLE PAGE
  3. COPYRIGHT PAGE
  4. PREFACE
  5. CONTRIBUTORS
  6. CHAPTER 1: WHAT ARE OUR MODELS REALLY TELLING US? A PRACTICAL TUTORIAL ON AVOIDING COMMON MISTAKES WHEN BUILDING PREDICTIVE MODELS
  7. CHAPTER 2: THE CHALLENGE OF CREATIVITY IN DRUG DESIGN
  8. CHAPTER 3: A ROUGH SET THEORY APPROACH TO THE ANALYSIS OF GENE EXPRESSION PROFILES
  9. CHAPTER 4: BIMODAL PARTIAL LEAST-SQUARES APPROACH AND ITS APPLICATION TO CHEMOGENOMICS STUDIES FOR MOLECULAR DESIGN
  10. CHAPTER 5: STABILITY IN MOLECULAR FINGERPRINT COMPARISON
  11. CHAPTER 6: CRITICAL ASSESSMENT OF VIRTUAL SCREENING FOR HIT IDENTIFICATION
  12. CHAPTER 7: CHEMOMETRIC APPLICATIONS OF NAÏVE BAYESIAN MODELS IN DRUG DISCOVERY: BEYOND COMPOUND RANKING
  13. CHAPTER 8: CHEMOINFORMATICS IN LEAD OPTIMIZATION
  14. CHAPTER 9: USING CHEMOINFORMATICS TOOLS TO ANALYZE CHEMICAL ARRAYS IN LEAD OPTIMIZATION
  15. CHAPTER 10: EXPLORATION OF STRUCTURE–ACTIVITY RELATIONSHIPS (SARs) AND TRANSFER OF KEY ELEMENTS IN LEAD OPTIMIZATION
  16. CHAPTER 11: DEVELOPMENT AND APPLICATIONS OF GLOBAL ADMET MODELS: IN SILICO PREDICTION OF HUMAN MICROSOMAL LABILITY
  17. CHAPTER 12: CHEMOINFORMATICS AND BEYOND: MOVING FROM SIMPLE MODELS TO COMPLEX RELATIONSHIPS IN PHARMACEUTICAL COMPUTATIONAL TOXICOLOGY
  18. CHAPTER 13: APPLICATIONS OF CHEMINFORMATICS IN PHARMACEUTICAL RESEARCH: EXPERIENCES AT BOEHRINGER INGELHEIM IN GERMANY
  19. CHAPTER 14: LESSONS LEARNED FROM 30 YEARS OF DEVELOPING SUCCESSFUL INTEGRATED CHEMINFORMATIC SYSTEMS
  20. CHAPTER 15: MOLECULAR SIMILARITY ANALYSIS
  21. SUPPLEMENTAL IMAGES
  22. INDEX