Data Mining

Concepts, Models, Methods, and Algorithms

About this book

Presents the latest techniques for analyzing and extracting information from large amounts of data in high-dimensional data spaces

The revised and updated third edition of Data Mining presents, in one volume, a systematic approach to the analysis of large data sets that integrates results from disciplines such as statistics, artificial intelligence, databases, pattern recognition, and computer visualization. Advances in deep learning technology have opened an entirely new spectrum of applications. The author, a noted expert on the topic, explains the basic concepts, models, and methodologies that have been developed in recent years.

This new edition introduces and expands on many topics and provides revised sections on software tools and data mining applications. Additional changes include an updated list of references for further study and an extended list of problems and questions that relate to each chapter. This third edition presents new and expanded information that:

•    Explores big data and cloud computing

•    Examines deep learning

•    Includes information on convolutional neural networks (CNN)

•    Covers reinforcement learning

•    Covers semi-supervised learning and S3VM

•    Reviews model evaluation for unbalanced data

Written for graduate students in computer science, computer engineers, and computer information systems professionals, the updated third edition of Data Mining continues to provide an essential guide to the basic principles of the technology and the most recent developments in the field.

Data Mining is written by Mehmed Kantardzic and belongs to the subject area Computer Science & Data Mining.

1 DATA‐MINING CONCEPTS

Chapter Objectives

  • Understand the need for analyses of large, complex, information‐rich data sets.
  • Identify the goals and primary tasks of the data‐mining process.
  • Describe the roots of data‐mining technology.
  • Recognize the iterative character of a data‐mining process and specify its basic steps.
  • Explain the influence of data quality on a data‐mining process.
  • Establish the relation between data warehousing and data mining.
  • Discuss concepts of big data and data science.

1.1 INTRODUCTION

Modern science and engineering are based on using first‐principle models to describe physical, biological, and social systems. Such an approach starts with a basic scientific model, such as Newton's laws of motion or Maxwell's equations in electromagnetism, and then builds various applications in mechanical engineering or electrical engineering upon it. In this approach, experimental data are used to verify the underlying first‐principle models and to estimate some of the parameters that are difficult or sometimes impossible to measure directly. However, in many domains the underlying first principles are unknown, or the systems under study are too complex to be mathematically formalized. With the growing use of computers, a great amount of data is being generated by such systems. In the absence of first‐principle models, such readily available data can be used to derive models by estimating useful relationships between a system's variables (i.e., unknown input–output dependencies). Thus there is currently a paradigm shift from classical modeling and analyses based on first principles to developing models and the corresponding analyses directly from data.
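As an illustration of estimating an input–output dependency directly from data, the following minimal sketch fits a straight line to a handful of observations by ordinary least squares. The data values, the function name `fit_line`, and the choice of a linear model are assumptions made for this example only; they do not come from the book, and real systems usually call for richer models.

```python
from statistics import mean


def fit_line(xs, ys):
    """Estimate slope and intercept of y = slope * x + intercept
    by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical observations of one input and one output of some system.
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
outputs = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(inputs, outputs)


def predict(x):
    """Use the estimated dependency to predict the output for a new input."""
    return slope * x + intercept


print(f"estimated model: y = {slope:.2f} * x + {intercept:.2f}")
print(f"predicted output for x = 6: {predict(6.0):.2f}")
```

No physical law is supplied here: the relationship between input and output is recovered from the observations alone, which is the data-driven style of modeling the paragraph describes.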
We have gradually grown accustomed to the fact that there are tremendous volumes of data filling our computers, networks, and lives. Government agencies, scientific institutions, and businesses have all dedicated enormous resources to collecting and storing data. In reality, only a small fraction of these data will ever be used because, in many cases, the volumes are simply too large to manage or the data structures themselves are too complicated to be analyzed effectively. How could this happen? The primary reason is that the original effort to create a data set is often focused on issues such as storage efficiency; it does not include a plan for how the data will eventually be used and analyzed.
The need to understand large, complex, information‐rich data sets is common to virtually all fields of business, science, and engineering. In the business world, corporate and customer data are becoming recognized as a strategic asset. The ability to extract useful knowledge hidden in these data and to act on that knowledge is becoming increasingly important in today's competitive world. The entire process of applying a computer‐based methodology, including new techniques, for discovering knowledge from data is called data mining.
Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an “interesting” outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.
In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. Therefore, it is possible to put data‐mining activities into one of two categories:
  1. Predictive data mining, which produces the model of the system described by the given data set, or
  2. Descriptive data mining, which produces new, nontrivial information based on the available data set.
On the predictive end of the spectrum, the goal of data mining is to produce a model, expressed as executable code, which can be used to perform classification, prediction, estimation, or other similar tasks. On the descriptive end of the spectrum, the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets. The relative importance of prediction and description for particular data‐mining applications can vary considerably. The goals of prediction and description are achieved by using data‐mining techniques, explained later in this book, for the following primary data‐mining tasks:
  1. Classification—Discovery of a predictive learning function that classifies a data item into one of several predefined classes.
  2. Regression—Discovery of a predictive learning function, which maps a data item to a real‐value prediction variable.
  3. Clustering—A common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.
  4. Summarization—An additional descriptive task that involves methods for finding a compact description for a set (or subset) of data.
  5. Dependency modeling—Finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set.
  6. Change and deviation detection—Discovering the most significant changes in the data set.
A more formal approach, with graphical interpretation of data‐mining tasks for complex and large data sets and illustrative examples, is given in Chapter 4. The introductory classifications and definitions above are intended only to give the reader a sense of the wide spectrum of problems and tasks that may be solved using data‐mining technology.
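To make the difference between a predictive and a descriptive task concrete, the sketch below contrasts classification with clustering on the same four two-dimensional points. The nearest-centroid classifier and the plain k-means procedure are stand-ins chosen for brevity, not the methods the book develops later, and all data values and names (`train_nearest_centroid`, `classify`, `k_means`) are hypothetical.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(mean(axis) for axis in zip(*points))


# Classification (predictive): the classes are predefined by labels.
def train_nearest_centroid(labeled):
    """labeled: list of (point, class_label). Returns {label: centroid}."""
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in by_label.items()}


def classify(model, point):
    """Assign the predefined class whose centroid is closest to the point."""
    return min(model, key=lambda label: dist(model[label], point))


# Clustering (descriptive): no labels are given; groups are discovered.
def k_means(points, k, iterations=10):
    """Plain k-means, seeded with the first k points so runs are repeatable."""
    centers = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(centers[i], p))
            groups[nearest].append(p)
        # Keep the previous center if a group ends up empty.
        centers = [centroid(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups


# Hypothetical data: the same points, with and without class labels.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labeled = list(zip(points, ["low", "low", "high", "high"]))

model = train_nearest_centroid(labeled)
print(classify(model, (2.0, 1.0)))   # class predicted for a new data item

centers, groups = k_means(points, k=2)
print(groups)                        # groups found without any labels
```

The classifier needs the labels up front and returns one of them for a new item, whereas the clustering routine receives only the raw points and proposes the categories itself.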
The success of a data‐mining engagement depends largely on the amount of energy, knowledge, and creativity that the designer puts into it. In essence, data mining is like solving a puzzle. The individual pieces of the puzzle are not complex structures in and of themselves. Taken as a collective whole, however, they can constitute very elaborate systems. As you try to unravel these systems, you will probably get frustrated, start forcing parts together, and generally become annoyed at the entire process; but once you know how to work with the pieces, you realize that it was not really that hard in the first place. The same analogy can be applied to data mining. In the beginning, the designers of the data‐mining process probably do not know much about the data sources; if they did, they would most likely not be interested in performing data mining. Individually, the data seem simple, complete, and explainable. But collectively, they take on a whole new appearance that is intimidating and difficult to comprehend, like the puzzle. Therefore, being an analyst and designer in a data‐mining process requires, besides thorough professional knowledge, creative thinking and a willingness to see problems in a different light.
Data mining is one of the fastest growing fields in the computer industry. Once a small interest area within computer science and statistics, it has quickly expanded into a field of its own. One of the greatest strengths of data mining is reflected in its wide range of methodologies and techniques that can be applied to a host of problem sets. Since data mining is a natural activity to be performed on large data sets, one of the largest target markets is the entire data‐warehousing, data‐mart, and decision‐support community, encompassing professionals from such industries as retail, manufacturing, telecommunications, healthcare, insurance, and transportation. In the business community, data mining can be used to discover new purchasing trends, plan investment strategies, and detect unauthorized expenditures in the accounting system. It can improve marketing campaigns, and the outcomes can be used to provide customers with more focused support and attention. Data‐mining techniques can be applied to problems of business process reengineering, in which the goal is to understand interactions and relationships among business practices and organizations.
Many law enforcement and special investigative units, whose mission is to identify fraudulent activities and discover crime trends, have also used data mining successfully. For example, these methodologies can aid analysts in the identification of critical behavior patterns, the communication interactions of narcotics organizations, the monetary transactions of money laundering and insider trading operations, the movements of serial killers, and the targeting of smugglers at border crossings. Data‐mining techniques have also been employed by people in the intelligence community who maintain many large data sources as a part of the activities relating to matters of national security. Appendix B of the book gives a brief overview of typical commercial applications of data‐mining technology today. Despite a considerable level of over‐hype and strategic misuse, data mining has not only persevered but also matured and adapted for practical use in the business world.

1.2 DATA‐MINING ROOTS

Looking at how different authors describe data mining, it is clear that we are far from a universal agreement on the definition of data mining or even what constitutes data mining. Is data mining a form of statistics enriched with learning theory, or is it a revolutionary new concept? In our view, most data‐mining...

Table of contents

  1. Cover
  2. Table of Contents
  3. PREFACE
  4. PREFACE TO THE SECOND EDITION
  5. PREFACE TO THE FIRST EDITION
  6. 1 DATA‐MINING CONCEPTS
  7. 2 PREPARING THE DATA
  8. 3 DATA REDUCTION
  9. 4 LEARNING FROM DATA
  10. 5 STATISTICAL METHODS
  11. 6 DECISION TREES AND DECISION RULES
  12. 7 ARTIFICIAL NEURAL NETWORKS
  13. 8 ENSEMBLE LEARNING
  14. 9 CLUSTER ANALYSIS
  15. 10 ASSOCIATION RULES
  16. 11 WEB MINING AND TEXT MINING
  17. 12 ADVANCES IN DATA MINING
  18. 13 GENETIC ALGORITHMS
  19. 14 FUZZY SETS AND FUZZY LOGIC
  20. 15 VISUALIZATION METHODS
  21. APPENDIX A: INFORMATION ON DATA MINING
  22. APPENDIX B: DATA‐MINING APPLICATIONS
  23. BIBLIOGRAPHY
  24. INDEX
  25. End User License Agreement