1.1 The Groundwork of Machine Learning for Disease Modeling
Recent years have witnessed dramatic growth in the volume and complexity of data available to biomedical research scientists (Bui & Van Horn, 2017). Deriving actionable insights that can improve diagnosis and treatment from these large, heterogeneous datasets presents a formidable challenge for the community. Machine learning (ML) techniques have become a popular tool for the analysis of biomedical datasets, albeit not one yet widely used in medicine (Deo, 2015). Indeed, while the application of machine learning to disease classification holds considerable promise, it also faces unique obstacles arising from the nature of the data and from stakeholder expectations. A comprehensive review of machine learning algorithms and their applications in disease classification would be an ambitious task, and one we will not attempt here. Rather, this perspective will provide an accessible, high-level introduction to machine learning for disease classification: the mechanics of some popular algorithms, the challenges and pitfalls this field confronts, and some examples and insights from the recent literature.
It is important to realize that machine learning is not some kind of mathematical alchemy that can transform irreproducible data into golden insights. ML is subject to the same "garbage in, garbage out" limitation that applies elsewhere in modeling; hence, good data curation is key to success (Beam & Kohane, 2018; Rajkomar et al., 2019). Nor is machine learning a new field of study; modern deep learning algorithms, for example, are an extension of the perceptron models first proposed in the 1950s and 1960s (Schmidhuber, 2015). Instead, it is probably best to view machine learning as an extension of statistical modeling techniques, with the main difference that while statistics seeks to make inferences about populations, machine learning tries to find patterns that provide predictive power and generalize to new data (Bzdok et al., 2018).
Machine learning problems can be broadly classified into three paradigms. Unsupervised learning techniques such as clustering and dimension reduction seek structure in unlabeled data and hence are often useful for data exploration or hypothesis generation. Unsupervised learning would, for example, be well suited to seeking subsets of a patient population that share many common features. In a study of this type, subgroups identified via clustering could then be further studied to determine whether they respond differently to a treatment. Supervised learning techniques, by contrast, learn to predict an output variable from an input vector, matrix, or higher-dimensional array (tensor). In disease classification, a common task for a supervised learning algorithm might be to determine whether an MRI or histology image indicates the presence or absence of disease; in this example, the output variable to be predicted would be the category, while the input would be the pixel values of the image. Finally, in the reinforcement learning paradigm, the algorithm is provided with a set of choices and is offered "rewards" when its choices lead to better outcomes. Although unsupervised learning and clustering are often powerful tools for the analysis of multi-omics data, in this review we will focus on supervised learning as the task most directly relevant to disease classification.
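To make the clustering idea concrete, the following is a minimal sketch of k-means (Lloyd's algorithm), the kind of unsupervised method one might use to seek patient subgroups. The "patients" and their two features are entirely synthetic and chosen only for illustration; a real study would use an established library and many more features.

```python
# Minimal k-means sketch: grouping synthetic "patients" by two features.
# Feature values and starting centroids are invented for illustration.

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recenter."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign each point to its nearest centroid (squared Euclidean distance)
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # move each centroid to the mean of its assigned points
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Six synthetic patients forming two loose groups
patients = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
            (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centroids, clusters = kmeans(patients, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(len(clusters[0]), len(clusters[1]))  # 3 3
```

In a real analysis, the two recovered subgroups would then be examined for clinically meaningful differences, as described above.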
1.2 The "Big Brother" of Predictions: Supervised Learning
Supervised learning problems are those where an output or label y, sometimes called the "ground truth", must be correctly predicted based on an input. The output y for disease classification is typically a category, but may in some instances be a real-valued or complex number (regression), or a ranked category (ordinal regression). In some cases, the output may also be a real-valued vector (e.g., prediction of the dihedral angles in a protein from its amino acid sequence). Successfully applying supervised learning to disease classification requires selecting and defining the right prediction problem. This may involve careful consideration of the labels to be predicted, the structure of existing workflows and pipelines, and the availability of relevant data.
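The input-to-label setup can be sketched with one of the simplest possible supervised classifiers, a one-nearest-neighbour rule. The feature vectors and labels below are synthetic and purely illustrative; the point is only the shape of the problem: known (x, y) pairs are used to predict y for a new x.

```python
# Toy supervised classifier: 1-nearest-neighbour on hand-made feature vectors.
# Features and labels are synthetic, chosen only to illustrate the (x, y) setup.

train_x = [(0.2, 0.1), (0.3, 0.2), (0.9, 0.8), (0.8, 0.9)]
train_y = ["healthy", "healthy", "disease", "disease"]

def predict(x):
    # return the label of the closest training example (squared Euclidean distance)
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
    return train_y[dists.index(min(dists))]

print(predict((0.85, 0.85)))  # disease
print(predict((0.25, 0.15)))  # healthy
```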
It is of course crucial that the data be labeled correctly, and this can represent a challenge for biomedical applications in general. International Classification of Disease (ICD) codes, for example, are often used to indicate diagnoses in electronic health records (EHRs). Errors in ICD code assignment are, however, not infrequent (estimates of the accuracy of ICD coding vary widely) and may arise from multiple sources (O'Malley et al., 2005). Physicians sometimes, for example, use abbreviations in their notes whose meaning may be ambiguous to the medical coder responsible for selecting and entering an appropriate diagnosis code (Sheppard et al., 2008).
Another, more subtle problem can arise when there is a mismatch between the categories used to label the data and the ultimate objective of the study or of its stakeholders. This problem is perhaps best illustrated with an example. Cancers that share the same tissue of origin exhibit a striking level of genetic diversity, both within a single patient and across patients (Mroz & Rocco, 2017). It is well established that different cancer subtypes exhibit different prognoses and may respond differently to the same treatment; indeed, many drug discovery efforts have focused on the development of drugs that target cancers with specific genetic features (Haque et al., 2012; Yersal, 2014). Breast cancers, for example, have been divided into five "intrinsic subtypes" (Howlader et al., 2018). More complex classification schemes and a variety of other risk markers have been proposed, since substantial diversity in genetic and transcriptomic profiles and in outcomes is observed within subtypes (Bayani et al., 2017; Dawson et al., 2013; Curtis et al., 2012; Russnes et al., 2017). If a model is to be trained to classify breast cancers based on genetic, transcriptomic, and/or other information, the categorization chosen should clearly be appropriate for the ultimate goal of the study: in other words, the labels that are generated should be clinically useful.
When defining metrics for predictive models, it is equally important to take existing workflows into consideration. Ideally, a model should be chosen to solve a problem in a way that takes advantage of the strengths and minimizes the weaknesses of existing pipelines. Steiner et al. (2018), for example, found that their deep learning algorithm exhibited a reduced false-negative rate for the identification of breast cancer metastases in lymph nodes when compared with human pathologists. The algorithm, however, also exhibited an increased false-positive rate, especially when the acquired images were out of focus. To overcome this problem, they designed a machine learning-assisted pipeline in which the deep learning algorithm color-highlighted regions of interest for review by the pathologist, with different colors indicating different levels of confidence. This pipeline significantly improved both accuracy and speed compared to identification performed by unaided pathologists, thereby improving rather than reinventing the existing workflow.
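The general idea of routing model confidence into a human review step can be sketched in a few lines. This is not the published system: the thresholds, colors, and region names below are invented for illustration, and a real pipeline would calibrate such thresholds against validation data.

```python
# Hedged sketch of a confidence-tiered review queue, loosely inspired by the
# assisted-review idea described above. Thresholds and colors are invented.

def review_color(p_metastasis):
    """Map a model confidence score to a highlight color for the pathologist."""
    if p_metastasis >= 0.9:
        return "red"     # high confidence: review first
    if p_metastasis >= 0.5:
        return "orange"  # uncertain: flag for careful review
    return "none"        # low confidence: no highlight

# Hypothetical per-region model scores from a slide
scores = {"region_1": 0.95, "region_2": 0.60, "region_3": 0.10}
highlights = {region: review_color(p) for region, p in scores.items()}
print(highlights)  # {'region_1': 'red', 'region_2': 'orange', 'region_3': 'none'}
```

The design point is that the model does not replace the pathologist's decision; it only prioritizes their attention.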
In addition to these considerations, the model should require only data that will be readily available at the time the prediction is to be made (Chen et al., 2019). In some cases, accurate diagnosis may require time-consuming lab tests whose results will seldom be available at the time of admission. A model that relies on such late-arriving information may be severely limited in scope, while a model that can provide an accurate diagnosis without it may in such instances offer a key advantage. In 2019, for example, Yelin et al. used patient data from over 700,000 urinary tract infections (UTIs) to build a gradient boosted trees model and a logistic regression model that predict antibiotic resistance category solely from patient history (Yelin et al., 2019). Their models were able to significantly outperform physicians and dramatically reduce the rate of incorrect prescriptions (i.e., situations where a patient is prescribed an antibiotic to which their infection is resistant). Since only patient history data is required, their approach can suggest an antibiotic at the time of admission without waiting for antibiotic susceptibility testing results, which may take several days or more (Van Camp et al., 2020).
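As a schematic illustration of one of the model classes mentioned above, the following fits a tiny logistic regression by gradient descent. The two "patient history" features and all labels are synthetic and invented for this sketch; this is in no way a reconstruction of the published model, only a demonstration of predicting a binary resistance label from history-style features.

```python
import math

# Minimal logistic-regression sketch: predict a binary "resistant" label from
# two hypothetical history features (prior resistant infection, recent
# antibiotic use). All data are synthetic; this is not the published model.

data = [  # ((prior_resistant, recent_antibiotic), resistant?)
    ((1, 1), 1), ((1, 0), 1), ((0, 1), 1),
    ((0, 0), 0), ((0, 0), 0), ((1, 1), 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):  # plain stochastic gradient descent on the log-loss
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = p - y  # gradient of log-loss w.r.t. the linear score
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err

print(round(sigmoid(w[0] + w[1] + b)))  # 1: history suggests resistance
print(round(sigmoid(b)))                # 0: no risk factors in history
```

A real system would of course use a mature library, regularization, and proper evaluation on held-out data.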
Availability of relevant data is a key challenge in developing machine learning models for disease classification. Healthcare datasets are in general both highly heterogeneous and highly fragmented. A wide variety of EHR systems are marketed; there is little standardization across systems and software packages, so pooling data acquired on different systems is inherently challenging (DeMartino & Larsen, 2013; Miller, 2011). EHR systems are often designed to prioritize the needs of medical billers and the insurance payors with whom they communicate, so the data is seldom formatted in a manner conducive to the needs of researchers or even physicians, many of whom report dissatisfaction with their healthcare system's EHR software (Agrawal & Prabakaran, 2020; Gawande, 2018). Physicians frequently record their observations in the form of free-text notes that cannot easily be translated into encoded input suitable for modeling purposes (DeMartino & Larsen, 2013). Furthermore, the pooling and sharing of data between healthcare providers and across different sources are substantially hindered by patient privacy and regulatory concerns (Agrawal et al., 2020).
Ultimately, these issues combine to ensure that assembling and pre-processing healthcare datasets for predictive modeling may incur substantial effort and expense. Even once such datasets have been assembled, they may appear large and yet contain data for a wide array of conditions, so that only a handful of datapoints relevant to a particular disorder or outcome of interest appear in the dataset. Adibuzzaman et al., for example, report their experience with the Medical Information Mart for Intensive Care (MIMIC-III) from Beth Israel Deaconess Hospital. This superficially large dataset contains data for some 50,000 patient encounters; yet if a researcher interested in drug-drug interactions were to query it for patients on antidepressants who were also taking an antihistamine, they would retrieve a mere 44 datapoints (Adibuzzaman et al., 2017). Finally, most healthcare datasets contain missing values, such that key information available for some patients is unavailable for others (Allen et al., 2014).
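One common, if crude, response to the missing-value problem is imputation, for example filling a missing measurement with the mean of the observed values. The sketch below uses an invented "creatinine" column on synthetic records; real pipelines typically need more careful strategies (e.g., modeling why values are missing).

```python
# Simple mean-imputation sketch for missing values (None marks "missing").
# Column names and values are hypothetical, synthetic patient records.

rows = [
    {"age": 54, "creatinine": 1.1},
    {"age": 61, "creatinine": None},  # missing lab value
    {"age": 47, "creatinine": 0.9},
]

def impute_mean(rows, column):
    """Replace missing entries in `column` with the mean of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{column: r[column] if r[column] is not None else mean})
            for r in rows]

filled = impute_mean(rows, "creatinine")
print(filled[1]["creatinine"])  # 1.0
```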
For all these reasons, organizing healthcare data to improve access for biomedical...