II
Emerging Big Data Applications
CHAPTER 5
Matrix Factorization for Drug–Target Interaction Prediction
Yong Liu, Min Wu, and Xiao-Li Li
Institute of Infocomm Research (I2R)
A*STAR, Singapore
Peilin Zhao
Artificial Intelligence Department
Ant Financial Services Group, China
CONTENTS
5.1 Introduction
5.2 Related Work
5.2.1 Classification-Based Methods
5.2.2 Matrix Factorization-Based Methods
5.3 Neighborhood Regularized Logistic Matrix Factorization
5.3.1 Problem Formalization
5.3.2 Logistic Matrix Factorization
5.3.3 Neighborhood Regularization
5.3.4 Combined Model
5.3.5 Neighborhood Smoothing
5.4 Experimental Results
5.4.1 Experimental Settings
5.4.2 Performance Comparison
5.4.3 Neighborhood Benefits
5.4.4 Parameter Sensitivity Analysis
5.4.5 Predicting Novel Interactions
5.5 Conclusions
References
5.1 INTRODUCTION
Drug discovery is one of the primary objectives of the pharmaceutical sciences, an interdisciplinary research field that draws on biology, chemistry, physics, statistics, and other fundamental sciences. In the drug discovery process, the prediction of drug–target interactions (DTIs) is an important step that aims to identify potential new drugs, or new targets for existing drugs. It can therefore help guide experimental validation and reduce costs. In recent years, DTI prediction has attracted considerable research attention, and numerous algorithms have been proposed [1, 2]. Existing methods predict DTIs from the small number of experimentally validated interactions recorded in existing databases, for example, ChEMBL [3], DrugBank [4], KEGG DRUG [5], and SuperTarget [6]. Previous studies have shown that a fraction of new interactions between drugs and targets can be predicted from the experimentally validated DTIs, and that computational methods for identifying DTIs can significantly improve the efficiency of drug discovery.
In general, traditional methods developed for DTI prediction can be categorized into two main groups: docking simulation approaches and ligand-based approaches [7–9]. Docking simulation approaches predict potential DTIs from the structural information of the target proteins. However, docking simulation is very time-consuming, and the structural information may not be available for some protein families, for example, the G-protein coupled receptors (GPCRs). In ligand-based approaches, potential DTIs are predicted by comparing a candidate ligand with the known ligands of the target proteins. Such approaches may not perform well for targets with only a small number of known ligands.
Recently, the rapid development of machine learning techniques has provided effective and efficient ways to predict DTIs. An intuitive idea is to formulate DTI prediction as a binary classification problem, where the drug–target pairs are treated as instances, and the chemical structures of drugs and the amino acid subsequences of targets are treated as features. Classical classification methods [e.g., support vector machines (SVM) and regularized least squares (RLS)] can then be used for DTI prediction [10–16]. Essentially, the DTI prediction problem is also a recommendation task that aims to suggest a list of potential DTIs. Therefore, another line of research for DTI prediction is the application of recommendation technologies, especially matrix factorization-based approaches [17–20]. Matrix factorization methods aim to map both drugs and targets into a shared latent space of low dimensionality and to model the DTIs using combinations of the latent representations of drugs and targets.
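The binary classification formulation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each drug and each target is already encoded as a fixed-length numeric feature vector; the function name build_pair_instances and the toy feature values are hypothetical and are not taken from any of the cited methods.

```python
import numpy as np

def build_pair_instances(drug_feats, target_feats, Y):
    """Turn an interaction matrix into a pairwise classification dataset.

    drug_feats:   (n_drugs, p) array of drug descriptors (assumed given)
    target_feats: (n_targets, q) array of target descriptors (assumed given)
    Y:            (n_drugs, n_targets) binary interaction matrix
    Returns X of shape (n_drugs * n_targets, p + q) and labels y.
    """
    n_drugs, n_targets = Y.shape
    # Enumerate every (drug, target) index pair in row-major order.
    d_idx, t_idx = np.meshgrid(np.arange(n_drugs), np.arange(n_targets), indexing="ij")
    d_idx, t_idx = d_idx.ravel(), t_idx.ravel()
    # Each instance is the drug features concatenated with the target features.
    X = np.hstack([drug_feats[d_idx], target_feats[t_idx]])
    y = Y[d_idx, t_idx]
    return X, y

# Toy data: 2 drugs with 3 features each, 3 targets with 2 features each.
drug_feats = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])
target_feats = np.array([[0.2, 0.8],
                         [0.5, 0.5],
                         [0.9, 0.1]])
Y = np.array([[1, 0, 0],
              [0, 1, 0]])

X, y = build_pair_instances(drug_feats, target_feats, Y)
```

The resulting X and y could then be passed to any standard binary classifier. Note that the number of instances grows as the product of the number of drugs and the number of targets, which is one practical motivation for the latent-space view taken by matrix factorization methods.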
In this chapter, we introduce a DTI prediction approach, named neighborhood regularized logistic matrix factorization (NRLMF), which focuses on predicting the probability that a drug would interact with a target [21]. Specifically, the properties of a drug and of a target are each represented by a vector in a shared low-dimensional latent space. For each drug–target pair, the interaction probability is modeled by a logistic function of the drug-specific and target-specific latent vectors. This is different from the kernelized Bayesian matrix factorization (KBMF) method [17] that predicts the interaction probability using a standard normal cumulative distribution function of the drug-specific and target-specific latent vectors [22]. In NRLMF, an observed interacting drug–target pair (i.e., a positive observation) is treated as c (c ≥ 1) positive examples, while an unknown pair (i.e., a negative observation) is treated as a single negative example. As such, NRLMF assigns higher importance to positive observations than to negative ones. The reason is that the positive observations are biologically validated and thus usually more trustworthy, whereas the negative observations could contain potential DTIs and are thus unreliable. This differs from previous matrix factorization-based DTI prediction methods [17–19], which treat the interacting and unknown pairs equally.
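The weighting idea above can be sketched in a few lines. The sketch assumes that the logistic function is applied to the inner product of the drug and target latent vectors, and that counting a positive pair as c examples amounts to multiplying its log-likelihood term by c; both are illustrative assumptions at this point in the chapter, and the function names are hypothetical. Regularization and the neighborhood terms are omitted.

```python
import numpy as np

def interaction_prob(U, V):
    """Logistic function of the inner products of drug and target latent vectors.

    U: (n_drugs, r) drug latent vectors; V: (n_targets, r) target latent vectors.
    """
    scores = U @ V.T
    return 1.0 / (1.0 + np.exp(-scores))

def weighted_log_likelihood(Y, U, V, c=5.0):
    """Log-likelihood in which each known interaction counts as c positive examples."""
    P = interaction_prob(U, V)
    eps = 1e-12  # guards against log(0)
    positive_term = c * Y * np.log(P + eps)
    negative_term = (1.0 - Y) * np.log(1.0 - P + eps)
    return float(np.sum(positive_term + negative_term))

rng = np.random.default_rng(0)
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])            # 2 drugs x 3 targets, toy data
U = 0.1 * rng.standard_normal((2, 4))      # latent dimensionality r = 4 (arbitrary)
V = 0.1 * rng.standard_normal((3, 4))

P = interaction_prob(U, V)
ll_1 = weighted_log_likelihood(Y, U, V, c=1.0)  # positives and unknowns weighted equally
ll_5 = weighted_log_likelihood(Y, U, V, c=5.0)  # each positive counted five times
```

With c = 1 the objective treats interacting and unknown pairs equally; increasing c makes a poorly fitted known interaction cost more than a poorly fitted unknown pair, which reflects the greater trust placed in validated interactions.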
Furthermore, NRLMF also studies the local structure of the interaction data to improve the DTI prediction accuracy by exploiting the neighborhood influences of the most similar drugs and the most similar targets. In particular, NRLMF imposes individual regularization constraints on the latent representations of a drug and its nearest neighbors, that is, the drugs most similar to the given drug. Similar neighborhood regularization constraints are also imposed on the latent representations of targets. Note that this neighborhood regularization method is different from previous approaches that exploit the drug similarities and target similarities using kernels [12, 13, 15, 23] or by factorizing the similarity matrices [19]. Moreover, the proposed approach only considers nearest neighbors instead of all similar neighbo...