Chapter 1
Introduction
Abstract
This chapter includes four sections. Section 1 describes the motivation for data mining, its objectives and scope, the classification of data mining systems, and major issues in data mining for geosciences, indicating how underground data and the methods for processing them differ from those of other fields. Section 2 introduces databases, data warehouses, and data banks, which are the data systems usable by data mining. Section 3 discusses linear and nonlinear algorithms, error analysis of calculation results, differences between regression and classification algorithms, the nonlinearity of a studied problem, and the solution accuracy of a studied problem, all of which are shared by the regression and classification algorithms introduced in Chapters 2 through 6 and Chapter 10; for the latter two topics, five ranks are presented. Section 4 introduces the functions, flowchart, and data preprocessing of data mining systems and summarizes the algorithms and case studies in this book. Finally, 10 exercises are provided.
Keywords
data mining; data mining system; geosciences particularities; regression algorithms; classification algorithms; linear algorithms; nonlinear algorithms; error analysis; nonlinearity ranking; solution accuracy ranking
Outline
1.1. Introduction to Data Mining
1.1.1. Motivity of Data Mining
1.1.2. Objectives and Scope of Data Mining
1.1.2.1. Generalization
1.1.2.2. Association
1.1.2.3. Classification and Clustering
1.1.2.4. Prediction
1.1.2.5. Deviation
1.1.3. Classification of Data Mining Systems
1.1.3.1. To Classify According to the Mined DB Type
1.1.3.2. To Classify According to the Mined Knowledge Type
1.1.3.3. To Classify According to the Available Techniques Type
1.1.3.4. To Classify According to the Application
1.1.4. Major Issues in Data Mining for Geosciences
1.2. Data Systems Usable by Data Mining
1.2.1. Databases
1.2.1.1. Database Types
1.2.1.2. Data Properties
1.2.1.3. Development Phases
1.2.1.4. Commonly Used Databases
1.2.2. Data Warehousing
1.2.2.1. Data Storage
1.2.2.2. Construction Step
1.2.3. Data Banks
1.3. Commonly Used Regression and Classification Algorithms
1.3.1. Linear and Nonlinear Algorithms
1.3.2. Error Analysis of Calculation Results
1.3.3. Differences between Regression and Classification Algorithms
1.3.4. Nonlinearity of a Studied Problem
1.3.5. Solution Accuracy of Studied Problem
1.4. Data Mining System
1.4.1. System Functions
1.4.2. System Flowcharts
1.4.3. Data Preprocessing
1.4.3.1. Data Cleaning
1.4.3.2. Data Integration
1.4.3.3. Data Transformation
1.4.3.4. Data Reduction
1.4.4. Summary of Algorithms and Case Studies
Exercises
References
In the early 21st century, data mining (DM) was predicted to be “one of the most revolutionary developments of the next decade” and was chosen as one of 10 emerging technologies that will change the world (Hand et al., 2001; Larose, 2005; Larose, 2006). Indeed, over the past 20 years the field of DM has seen enormous success, both in broad-ranging application achievements and in scientific progress and understanding. DM is the computerized process of extracting previously unknown, important, and actionable information and knowledge from a database (DB). This knowledge can then be used to make crucial decisions by leveraging the individual’s intuition and experience to objectively generate opportunities that might otherwise go undiscovered. DM is therefore also called knowledge discovery in databases (KDD). It has been widely used in some fields of business and science (Hand et al., 2001; Tan et al., 2005; Witten and Frank, 2005; Han and Kamber, 2006; Soman et al., 2006), but the application of DM to geosciences is still in its initial stage (Wong, 2003; Zangl and Hannerer, 2003; Aminzadeh, 2005; Mohaghegh, 2005; Shi, 2011). This is because geoscientific data differ from data in other fields: the data types are miscellaneous, the quantities are huge, the measuring precision varies, and the data mining results carry many uncertainties.
As many DBs for geosciences have been established, including data banks, data warehouses, and so on, the question of how to extract new and important information and knowledge from large amounts of data becomes urgent once a data bank has been constructed. Facing such large amounts of geoscientific data, people can use the DB management system for conventional applications (such as query, search, and simple statistical analysis) but cannot obtain the knowledge inherent in the data, and so fall into the predicament of “rich data but poor knowledge.” The only solution is to develop DM techniques for geoscientific databases.
We need to stress here that attribute and variable, as used in this book, mean the same thing; the term attribute is used when the data are discussed in terms of datalogy, whereas variable is used when they are discussed in terms of mathematics. Both are called parameters when they relate to applications, so the three terms are interchangeable. Each comes in two types: the continuous or real type, in which the sample values are real numbers that generally differ from one another, and the discrete or integer type, in which the sample values are integers such as 1, 2, 3, and so on. Continuous and discrete are terms of datalogy, as in continuous attribute, discrete attribute, continuous variable, and discrete variable, whereas real type and integer type are terms related to software, as in real attribute, integer attribute, real variable, and integer variable.
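The distinction between the continuous (real) type and the discrete (integer) type can be illustrated with a minimal Python sketch. The function name attribute_type and the sample values are hypothetical, invented here for illustration; they do not come from the book's case studies.

```python
from typing import Sequence, Union

Number = Union[int, float]


def attribute_type(values: Sequence[Number]) -> str:
    """Return 'discrete' if every sample value is a whole number, else 'continuous'.

    Hypothetical helper: it judges the type from the sample values alone.
    """
    if not values:
        raise ValueError("need at least one sample value")
    is_whole = all(float(v).is_integer() for v in values)
    return "discrete" if is_whole else "continuous"


# Hypothetical sample values, invented for illustration only.
porosity = [12.4, 9.8, 15.1, 11.0]  # real-valued measurements
rock_class = [1, 2, 3, 2]           # integer class labels

porosity_type = attribute_type(porosity)
rock_class_type = attribute_type(rock_class)
print(porosity_type, rock_class_type)
```

A check of this kind is only a heuristic: a real-valued attribute whose samples all happen to be whole numbers would be reported as discrete, so in practice the type is better declared from knowledge of the attribute than inferred from the data.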
1.1 INTRODUCTION TO DATA MINING
1.1.1 Motivity of Data Mining
As its name implies, data mining involves digging useful information out of a large amount of data. With the ever wider application of computers, large amounts of data pile up each year. It is possible to mine “gold” from these data by applying DM techniques.
We are living in an era in which telecommunications, computers, and network technology are changing human beings and society. However, while large amounts of information bring convenience to people, they also introduce many problems. For example, it is hard to digest excessive amounts of information, to distinguish true information from false, to ensure information safety, and to deal with inconsistent forms of information.
On the other hand, with the rapid development of DB techniques and the wide application of DB management systems, the amount of data that people accumulate keeps growing. A great deal of important information is hidden in these growing data. It is our hope to analyze this information at a higher level so as to make better use of the data.
Current DB systems can efficiently record, query, and compute statistics on data, but they cannot discover the relationships and rules that exist in the data and cannot predict future trends from the available data.
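The contrast between conventional DB functions and DM can be sketched in a few lines of Python. This is an illustrative sketch only: the records, the variable names, and the choice of a least-squares line as the "mined" relationship are assumptions made here, not a method or data taken from this book.

```python
from statistics import mean

# Hypothetical records (depth in m, porosity in %), invented for illustration.
records = [(1000.0, 20.0), (1500.0, 17.5), (2000.0, 15.0), (2500.0, 12.5)]

# Conventional DB-style use: a query and a simple statistic.
deep_records = [r for r in records if r[0] > 1800.0]
avg_porosity = mean(p for _, p in records)


def fit_line(points):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# DM-style use: discover a relationship in the data, then predict with it.
a, b = fit_line(records)
predicted_porosity = a + b * 3000.0  # prediction at a depth not in the records

print(len(deep_records), avg_porosity, a, b, predicted_porosity)
```

The query and the average only restate what is already stored, whereas the fitted line expresses a rule relating the two attributes and can be evaluated at depths that have no record.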
The phenomenon of rich data but poor knowledge results from the lack of effective means to mine the hidden knowledge in the data. Facing this challenge, DM techniques have been introduced and appear to be vital. The prediction of DM is the next hotpoint technique following netw...