Chapter 1
Introduction
Haibo He
Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, it becomes critical to advance fundamental research on the Big Data challenge so that raw data can be transformed into useful information that supports decision-making processes. Although existing machine-learning and data-mining techniques have shown great success in many real-world applications, learning from imbalanced data is a relatively new challenge. This book is dedicated to state-of-the-art research on imbalanced learning, with broader discussions of imbalanced learning foundations, algorithms, databases, assessment metrics, and applications. In this chapter, we provide an introduction to the problem formulation, a brief summary of the major categories of imbalanced learning methods, and an overview of the challenges and opportunities in this field. This chapter lays the structural foundation of the book and directs readers to the topics discussed in subsequent chapters.
1.1 Problem Formulation
We start with the definition of imbalanced learning in this chapter to lay the foundation for further discussions in the book. Specifically, we define imbalanced learning as the learning process for data representation and information extraction in the presence of severe data distribution skews, with the goal of developing effective decision boundaries to support the decision-making process. The learning process could involve supervised learning, unsupervised learning, semi-supervised learning, or a combination of two or all of them. Imbalanced learning can also arise in regression, classification, or clustering tasks. In this chapter, we provide a brief introduction to the problem formulation, research methods, and challenges and opportunities in this field. This chapter is based on a recent comprehensive survey and critical review of imbalanced learning presented in [1], and interested readers may refer to that survey paper for more details regarding imbalanced learning.
Imbalanced learning not only presents significant new challenges to the data research community but also raises many critical questions in real-world data-intensive applications, ranging from civilian applications such as financial and biomedical data analysis to security- and defense-related applications such as surveillance and military data analysis [1]. This increased interest in imbalanced learning is reflected in the significant recent growth in the number of publications in this field, as well as in the organization of dedicated workshops, conferences, symposiums, and special issues [2-4].
To start with a simple example of imbalanced learning, let us consider a popular case study in biomedical data analysis [1]. Consider the "Mammography Data Set," a collection of images acquired from a series of mammography examinations performed on a set of distinct patients [5-7]. For such a dataset, the natural classes that arise are "Positive" and "Negative," for images representative of "cancerous" and "healthy" patients, respectively. From experience, one would expect the number of noncancerous patients to greatly exceed the number of cancerous patients; indeed, this dataset contains 10,923 "Negative" (majority class) and 260 "Positive" (minority class) samples. Ideally, we would like a classifier that provides a balanced degree of predictive accuracy for both the minority and majority classes on the dataset. However, with many standard learning algorithms, we find that classifiers tend to provide a severely imbalanced degree of accuracy, with the majority class having close to 100% accuracy and the minority class having accuracies of only 0-10%; see, for instance, [5, 7]. Suppose a classifier achieves 5% accuracy on the minority class of the mammography dataset. Analytically, this means that 247 minority samples are misclassified as majority samples (i.e., 247 cancerous patients are diagnosed as noncancerous). In the medical domain, the ramifications of such a consequence can be overwhelmingly costly, more so than classifying a noncancerous patient as cancerous [8]. Furthermore, this suggests that the conventional evaluation practice of using singular assessment criteria, such as the overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. In an extreme case, if a given dataset includes 1% minority class examples and 99% majority class examples, a naive approach of classifying every example as a majority class example would achieve an accuracy of 99%. Taken at face value, 99% accuracy across the entire dataset appears superb; however, this figure fails to reflect the fact that none of the minority examples are identified, when in many situations those minority examples are of much greater interest. This clearly demonstrates the need to revisit the assessment metrics for imbalanced learning, which is discussed in Chapter 8.
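To make the arithmetic above concrete, the following short sketch (plain Python; the classifier behavior is hypothetical, but the class counts match the mammography example) computes the overall accuracy and minority-class recall of a classifier that attains 5% accuracy on the minority class:

```python
# Minimal illustration of why overall accuracy is a poor single metric on
# imbalanced data. Class counts mirror the mammography dataset described above:
# 10,923 negative (majority) and 260 positive (minority) samples.

n_negative = 10_923   # majority class ("healthy")
n_positive = 260      # minority class ("cancerous")

# Hypothetical classifier: 100% accuracy on the majority class and
# 5% accuracy (recall) on the minority class, as in the example above.
true_negatives = n_negative                      # all negatives correct
true_positives = round(0.05 * n_positive)        # 13 positives detected
false_negatives = n_positive - true_positives    # 247 cancerous patients missed

overall_accuracy = (true_negatives + true_positives) / (n_negative + n_positive)
minority_recall = true_positives / n_positive

print(f"Overall accuracy:    {overall_accuracy:.3f}")   # ~0.978, looks excellent
print(f"Minority recall:     {minority_recall:.3f}")     # 0.050, clinically unacceptable
print(f"Missed cancer cases: {false_negatives}")         # 247
```

Despite missing 247 of the 260 cancerous cases, this hypothetical classifier still reports close to 98% overall accuracy, which is precisely the failure mode that motivates the alternative assessment metrics discussed in Chapter 8.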
1.2 State-of-the-Art Research
Given the new challenges posed by imbalanced learning, extensive efforts and significant progress have been made in the community to tackle this problem. In this section, we provide a brief summary of the major categories of approaches for imbalanced learning. Our goal is to highlight some of the major research methodologies while directing readers to the chapters of this book that present the latest research developments in each category. Furthermore, a comprehensive summary and critical review of various types of imbalanced learning techniques can also be found in a recent survey [1].
1.2.1 Sampling Methods
Sampling methods seem to be the dominant type of approach in the community, as they tackle imbalanced learning in a straightforward manner. In general, the use of sampling methods in imbalanced learning consists of modifying an imbalanced dataset by some mechanism in order to provide a balanced distribution. Representative work in this area includes random oversampling [9], random undersampling [10], synthetic sampling with data generation [5, 11-13], cluster-based sampling methods [14], and the integration of sampling and boosting [6, 15, 16].
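As a minimal, self-contained sketch of the two simplest mechanisms above (plain Python; the function names and toy data are illustrative rather than taken from the cited implementations), random oversampling replicates minority examples and random undersampling discards majority examples until the class sizes match:

```python
import random

def random_oversample(majority, minority, seed=0):
    """Replicate randomly chosen minority examples until the classes are balanced."""
    rng = random.Random(seed)
    needed = len(majority) - len(minority)
    replicated = [rng.choice(minority) for _ in range(needed)]
    return majority, minority + replicated

def random_undersample(majority, minority, seed=0):
    """Discard randomly chosen majority examples until the classes are balanced."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Illustrative data: 1,000 majority and 50 minority examples (feature vectors).
majority = [[float(i)] for i in range(1000)]
minority = [[float(i)] for i in range(50)]

maj, mino = random_oversample(majority, minority)
print(len(maj), len(mino))   # 1000 1000

maj, mino = random_undersample(majority, minority)
print(len(maj), len(mino))   # 50 50
```

Oversampling preserves all available information but replicates minority points exactly (risking overfitting), while undersampling discards potentially useful majority data; this trade-off motivates the more informed methods discussed next.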
The key aspect of sampling methods is the mechanism used to sample the original dataset. Under different assumptions and with different objective considerations, various approaches have been proposed. For instance, the mechanism of random oversampling follows naturally from its description: it replicates a randomly selected set of examples from the minority class. On the basis of such simple sampling techniques, many informed sampling methods have been proposed, such as the EasyEnsemble and BalanceCascade algorithms [17]. Synthetic sampling with data generation techniques has also attracted much attention. For example, the synthetic minority oversampling technique (SMOTE) algorithm creates artificial data based on the feature space similarities between existing minority examples [5]. Adaptive sampling methods have also been proposed, such as the borderline-SMOTE [11] and adaptive synthetic (ADASYN) sampling [12] algorithms. Sampling strategies have also been integrated with ensemble learning techniques, as in SMOTEBoost [15], RAMOBoost [18], and DataBoost-IM [6]. Data-cleaning techniques, such as Tomek links [19], have been effectively applied to remove the overlapping examples introduced by sampling methods for imbalanced learning. Representative work in this area includes the one-sided selection (OSS) method [13] and the neighborhood cleaning rule (NCL) [20].
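The sketch below illustrates the SMOTE-style idea of generating a synthetic example on the line segment between a minority example and one of its nearest minority neighbors. It is a simplified illustration under stated assumptions (NumPy, Euclidean nearest neighbors, uniform interpolation factor), not the exact algorithm of [5]:

```python
import numpy as np

def smote_like_oversample(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority examples by interpolating between a minority
    example and one of its k nearest minority neighbors (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Euclidean distances to all minority examples; skip index 0 (x itself).
        d = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        x_nn = X_min[rng.choice(neighbors)]
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(x + gap * (x_nn - x))   # point on the segment x -> x_nn
    return np.array(synthetic)

# Illustrative minority class: 20 two-dimensional examples.
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_like_oversample(X_min, n_synthetic=30)
print(X_new.shape)  # (30, 2)
```

Because the synthetic points lie between existing minority examples rather than duplicating them, this style of oversampling broadens the minority region of the feature space; adaptive variants such as borderline-SMOTE and ADASYN differ mainly in which minority examples are selected for interpolation.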
1.2.2 Cost-Sensitive Methods
Cost-sensitive learning methods target the problem of imbalanced learning by using different cost matrices that describe the cost of misclassifying any particular data example [21, 22]. Past research indicates that there is a strong connection between cost-sensitive learning and imbalanced learning [4, 23, 24]. In general, there are three categories of approaches to implementing cost-sensitive learning for imbalanced data. The first class of techniques applies misclassification costs to the dataset as a form of dataspace weighting (the translation theorem [25]); these techniques are essentially cost-sensitive bootstrap sampling approaches in which misclassification costs are used to select the best training distribution. The second class applies cost-minimizing techniques to the combination schemes of ensemble methods (the MetaCost framework [26]); this class consists of various meta-techniques, such as the AdaC1, AdaC2, and AdaC3 methods [27] and AdaCost [28]. The third class of techniques incorporates cost-sensitive functions or features directly into classification paradigms to essentially "fit" the cost-sensitive framework into these classifiers, such as cost-sensitive decision trees [21, 24], cost-sensitive neural networks [29, 30], cost-sensitive Bayesian classifiers [31, 32], and cost-sensitive support vector machines (SVMs) [33-35].
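To make the role of the cost matrix concrete, one common way to use it at decision time is to predict the class that minimizes expected cost. The sketch below is only an illustration of this general idea, not any specific method cited above; the 10:1 cost ratio is an assumption, and the posterior probabilities could come from any probabilistic classifier:

```python
import numpy as np

# Hypothetical cost matrix: cost[i, j] is the cost of predicting class j when
# the true class is i. Missing a minority (positive) case is assumed to be
# 10x more costly than a false alarm; correct decisions cost nothing.
cost = np.array([[0.0, 1.0],     # true negative, false positive
                 [10.0, 0.0]])   # false negative, true positive

def min_expected_cost_decision(class_probs, cost):
    """Pick, for each sample, the class that minimizes expected misclassification cost.

    class_probs: array of shape (n_samples, n_classes) holding estimated
    posterior probabilities from any probabilistic classifier."""
    expected_cost = class_probs @ cost    # shape (n_samples, n_classes)
    return expected_cost.argmin(axis=1)

# A sample with only a 20% estimated probability of being positive is still
# labeled positive, because missed positives are assumed to be very costly.
probs = np.array([[0.8, 0.2]])
print(min_expected_cost_decision(probs, cost))  # [1]
```

Shifting the decision rule in this way changes the effective decision threshold without rebalancing the data, which is why cost-sensitive learning and sampling-based rebalancing are so closely related.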
1.2.3 Kernel-Based Learning Methods
There have been many studies that integrate kernel-based learning methods, most notably support vector machines (SVMs), with the sampling and cost-sensitive techniques discussed above to tackle imbalanced learning.