Imbalanced Learning

Foundations, Algorithms, and Applications
About this book

The first book of its kind to review the current status and future directions of imbalanced learning, an exciting new branch of machine learning and data mining.

Imbalanced learning focuses on how an intelligent system can learn when it is provided with imbalanced data. Solving imbalanced learning problems is critical in numerous data-intensive networked systems, including surveillance, security, Internet, finance, biomedical, defense, and more. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation.

The first comprehensive look at this new branch of machine learning, this book offers a critical review of the problem of imbalanced learning, covering the state of the art in techniques, principles, and real-world applications. Featuring contributions from experts in both academia and industry, Imbalanced Learning: Foundations, Algorithms, and Applications provides chapter coverage on:

  • Foundations of Imbalanced Learning
  • Imbalanced Datasets: From Sampling to Classifiers
  • Ensemble Methods for Class Imbalance Learning
  • Class Imbalance Learning Methods for Support Vector Machines
  • Class Imbalance and Active Learning
  • Nonstationary Stream Data Learning with Imbalanced Class Distribution
  • Assessment Metrics for Imbalanced Learning

Imbalanced Learning: Foundations, Algorithms, and Applications will help scientists and engineers learn how to tackle the problem of learning from imbalanced datasets, and gain insight into current developments in the field as well as future research directions.


Chapter 1
Introduction
Haibo He
Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, it becomes critical to advance fundamental research on the Big Data challenge so that vast amounts of raw data can support decision-making processes. Although existing machine-learning and data-mining techniques have shown great success in many real-world applications, learning from imbalanced data is a relatively new challenge. This book is dedicated to state-of-the-art research on imbalanced learning, with broader discussions of imbalanced learning foundations, algorithms, databases, assessment metrics, and applications. In this chapter, we provide an introduction to the problem formulation, a brief summary of the major categories of imbalanced learning methods, and an overview of the challenges and opportunities in this field. This chapter lays the structural foundation of the book and directs readers to the topics discussed in subsequent chapters.

1.1 Problem Formulation

We start with the definition of imbalanced learning to lay the foundation for further discussions in the book. Specifically, we define imbalanced learning as the learning process for data representation and information extraction under severe data distribution skews, with the goal of developing effective decision boundaries to support the decision-making process. The learning process could involve supervised learning, unsupervised learning, semi-supervised learning, or a combination of them, and imbalanced learning can arise in classification, regression, or clustering tasks. In this chapter, we provide a brief introduction to the problem formulation, research methods, and challenges and opportunities in this field. This chapter is based on a recent comprehensive survey and critical review of imbalanced learning presented in [1]; interested readers can refer to that survey paper for more details regarding imbalanced learning.
Imbalanced learning not only presents significant new challenges to the data research community but also raises many critical questions in real-world data-intensive applications, ranging from civilian applications such as financial and biomedical data analysis to security- and defense-related applications such as surveillance and military data analysis [1]. The growing interest in imbalanced learning is reflected in the sharp recent increase in the number of publications in this field, as well as in the organization of dedicated workshops, conferences, symposia, and special issues [2-4].
To start with a simple example of imbalanced learning, let us consider a popular case study in biomedical data analysis [1]. Consider the "Mammography Data Set," a collection of images acquired from a series of mammography examinations performed on a set of distinct patients [5-7]. For such a dataset, the natural classes that arise are "Positive" or "Negative" for an image representative of a "cancerous" or "healthy" patient, respectively. From experience, one would expect the number of noncancerous patients to greatly exceed the number of cancerous patients; indeed, this dataset contains 10,923 "Negative" (majority class) and 260 "Positive" (minority class) samples. Ideally, we would like a classifier that provides a balanced degree of predictive accuracy for both the minority and majority classes on the dataset. However, with many standard learning algorithms, we find that classifiers tend to provide a severely imbalanced degree of accuracy, with the majority class having close to 100% accuracy and the minority class having accuracies of only 0-10%; see, for instance, [5, 7]. Suppose a classifier achieves 5% accuracy on the minority class of the mammography dataset. Analytically, this would mean that 247 of the 260 minority samples are misclassified as majority samples (i.e., 247 cancerous patients are diagnosed as noncancerous). In the medical domain, the ramifications of such a consequence can be overwhelmingly costly, more so than classifying a noncancerous patient as cancerous [8]. Furthermore, this also suggests that the conventional evaluation practice of using a singular assessment criterion, such as the overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. In an extreme case, if a given dataset includes 1% minority class examples and 99% majority class examples, the naive approach of classifying every example as a majority class example would achieve an accuracy of 99%.
Taken at face value, 99% accuracy across the entire dataset appears superb; however, by the same token, this description fails to reflect the fact that none of the minority examples are identified, when in many situations, those minority examples are of much more interest. This clearly demonstrates the need to revisit the assessment metrics for imbalanced learning, which is discussed in Chapter 8.
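The accuracy paradox just described is easy to reproduce numerically. The sketch below (plain Python; `evaluate` is an illustrative helper, not from the book) scores a classifier that labels every sample as the majority class, using the class sizes quoted above for the mammography dataset:

```python
def evaluate(y_true, y_pred, positive=1):
    """Return overall accuracy and recall on the positive (minority) class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    minority_recall = sum(t == p for t, p in positives) / len(positives)
    return accuracy, minority_recall

# Class sizes from the mammography dataset described above:
# 10,923 negative (majority) and 260 positive (minority) samples.
y_true = [0] * 10923 + [1] * 260
y_naive = [0] * len(y_true)  # classify everything as the majority class

accuracy, minority_recall = evaluate(y_true, y_naive)
print(f"accuracy = {accuracy:.4f}, minority recall = {minority_recall:.4f}")
# accuracy is about 0.977 even though not a single cancerous case is detected
```

The near-98% overall accuracy is exactly the misleading "singular assessment criterion" the text warns about: the minority-class recall, the number that matters clinically, is zero.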

1.2 State-of-the-Art Research

Given the new challenges facing imbalanced learning, extensive efforts and significant progress have been made in the community to tackle this problem. In this section, we provide a brief summary of the major categories of approaches for imbalanced learning. Our goal is to highlight the major research methodologies while directing readers to the chapters in this book that present the latest developments in each category. A comprehensive summary and critical review of the various types of imbalanced learning techniques can also be found in a recent survey [1].

1.2.1 Sampling Methods

Sampling methods appear to be the dominant type of approach in the community, as they tackle imbalanced learning in a straightforward manner. In general, the use of sampling methods in imbalanced learning consists of modifying an imbalanced dataset by some mechanism in order to provide a balanced distribution. Representative work in this area includes random oversampling [9], random undersampling [10], synthetic sampling with data generation [5, 11-13], cluster-based sampling methods [14], and the integration of sampling and boosting [6, 15, 16].
The key aspect of sampling methods is the mechanism used to sample the original dataset. Under different assumptions and with different objective considerations, various approaches have been proposed. For instance, the mechanism of random oversampling follows naturally from its description: replicating a randomly selected set of examples from the minority class. On the basis of such simple sampling techniques, many informed sampling methods have been proposed, such as the EasyEnsemble and BalanceCascade algorithms [17]. Synthetic sampling with data generation techniques has also attracted much attention. For example, the synthetic minority oversampling technique (SMOTE) algorithm creates artificial data based on feature-space similarities between existing minority examples [5]. Adaptive sampling methods have also been proposed, such as the borderline-SMOTE [11] and adaptive synthetic (ADASYN) sampling [12] algorithms. Sampling strategies have also been integrated with ensemble learning techniques by the community, such as in SMOTEBoost [15], RAMOBoost [18], and DataBoost-IM [6]. Data-cleaning techniques, such as Tomek links [19], have been effectively applied to remove the class overlap introduced by sampling methods for imbalanced learning. Representative work in this area includes the one-sided selection (OSS) method [13] and the neighborhood cleaning rule (NCL) [20].
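The feature-space interpolation at the heart of SMOTE can be sketched in a few lines. The snippet below is a simplified illustration of the idea, not the published algorithm from [5]; `smote_like_sample` and its `k` parameter are illustrative names, and ties, distance metrics, and per-neighbor bookkeeping are all glossed over:

```python
import random

def smote_like_sample(minority, k=2):
    """Create one synthetic minority example, SMOTE-style:
    pick a random minority point, find its k nearest minority
    neighbours, and interpolate toward one of them."""
    x = random.choice(minority)
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
    )[:k]
    n = random.choice(neighbours)
    gap = random.random()  # random position along the segment from x to n
    return tuple(a + gap * (b - a) for a, b in zip(x, n))

# Toy minority class: four points at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
synthetic = smote_like_sample(minority)
```

Because each synthetic point lies on a line segment between two real minority examples, the generated data stays inside the region the minority class already occupies, which is what distinguishes synthetic sampling from plain replication.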

1.2.2 Cost-Sensitive Methods

Cost-sensitive learning methods target the problem of imbalanced learning by using different cost matrices that describe the costs of misclassifying any particular data example [21, 22]. Past research indicates that there is a strong connection between cost-sensitive learning and imbalanced learning [4, 23, 24]. In general, there are three categories of approaches to implementing cost-sensitive learning for imbalanced data. The first class of techniques applies misclassification costs to the dataset as a form of dataspace weighting (translation theorem [25]); these techniques are essentially cost-sensitive bootstrap sampling approaches in which misclassification costs are used to select the best training distribution. The second class applies cost-minimizing techniques to the combination schemes of ensemble methods (MetaCost framework [26]); this class consists of various meta-techniques, such as the AdaC1, AdaC2, and AdaC3 methods [27] and AdaCost [28]. The third class of techniques incorporates cost-sensitive functions or features directly into classification paradigms to essentially "fit" the cost-sensitive framework into these classifiers, such as cost-sensitive decision trees [21, 24], cost-sensitive neural networks [29, 30], cost-sensitive Bayesian classifiers [31, 32], and cost-sensitive support vector machines (SVMs) [33-35].
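The first class of techniques, cost-proportionate dataspace weighting, can be sketched as a weighted bootstrap: each example's chance of being drawn into the training set is proportional to the cost of misclassifying it. This is a toy sketch of the idea, not the exact procedure from [25]; the function name and the 10:1 cost ratio are purely illustrative:

```python
import random

def cost_proportionate_bootstrap(X, y, cost, n, seed=0):
    """Resample (X, y) so that each example's selection probability
    is proportional to its misclassification cost -- the dataspace-
    weighting view of cost-sensitive learning."""
    rng = random.Random(seed)
    weights = [cost[label] for label in y]
    idx = rng.choices(range(len(y)), weights=weights, k=n)
    return [X[i] for i in idx], [y[i] for i in idx]

# 90 majority (label 0) and 10 minority (label 1) examples; we assume
# misclassifying a minority example costs 10x more than a majority one.
X = list(range(100))
y = [1] * 10 + [0] * 90
Xs, ys = cost_proportionate_bootstrap(X, y, cost={0: 1, 1: 10}, n=1000)
minority_share = ys.count(1) / len(ys)  # expected near 100/190, about 0.53
```

The resampled set presents the minority class to any standard learner at roughly even odds, so an ordinary accuracy-minimizing classifier trained on it implicitly minimizes the expected misclassification cost on the original distribution.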

1.2.3 Kernel-Based Learning Methods

There have been many studies that integrate ke...

Table of contents

  1. Cover
  2. Preface
  3. Contributors
  4. Chapter 1: Introduction
  5. Chapter 2: Foundations of Imbalanced Learning
  6. Chapter 3: Imbalanced Datasets: From Sampling to Classifiers
  7. Chapter 4: Ensemble Methods for Class Imbalance Learning
  8. Chapter 5: Class Imbalance Learning Methods for Support Vector Machines
  9. Chapter 6: Class Imbalance and Active Learning
  10. Chapter 7: Nonstationary Stream Data Learning with Imbalanced Class Distribution
  11. Chapter 8: Assessment Metrics for Imbalanced Learning
  12. Index

Imbalanced Learning by Haibo He and Yunqian Ma is available in PDF and ePUB format (Technology & Engineering: Electrical Engineering & Telecommunications).