Preprocessing
Preprocessing refers to the manipulation of raw data to prepare it for further analysis. It involves cleaning, transforming, and organizing data to make it easier to work with. Preprocessing is a crucial step in data analysis as it helps to ensure that the data is accurate and reliable.
Written by Perlego with AI assistance
4 Key excerpts on "Preprocessing"
- eBook - ePub
- Mehdi Ghayoumi (Author)
- 2023 (Publication Date)
- Chapman and Hall/CRC (Publisher)
2 Data Preprocessing
DOI: 10.1201/9781003281344-2
This chapter discusses data processing approaches and includes the following:
- Data Preprocessing
- Data Cleaning
- Data Transformation
- Balancing Data
- Data Augmentation
- Data Reduction
- Dataset Partitioning
- Data Preparation Steps
- Data Preprocessing Examples
- Data Preprocessing Issues
- Data Preprocessing Implementation Tips
2.1 Preface
Data preparation is a crucial step in creating a machine learning model. It involves a series of operations performed on raw data to transform it into a format that can be readily processed by machines, thereby enhancing the reliability of the output and results. Raw data in any format, including video, audio, and text, is typically unsuitable for direct machine processing due to the presence of noise, errors, or missing information. Furthermore, the data may contain sensitive information or be insufficient for the machine learning model’s requirements. Data preprocessing steps rectify these issues, structuring and cleaning the data, which improves the processes of machine learning modeling, data science, and data mining. Here are some common tasks involved in data preprocessing (a short code sketch follows the list):
- Data cleaning: This involves removing irrelevant, duplicate, or inaccurate data points from the dataset.
- Data transformation: This involves converting the raw data into a more useful format. For example, categorical data might be transformed into numerical data, or text might be converted into a format that machine learning algorithms can process.
- Data normalization: This involves scaling the data so that it falls within a specific range. Normalization ensures that different features have equal weight in the analysis.
- Data reduction:
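To make the first three tasks concrete, here is a minimal pandas sketch. It is not taken from the excerpt above; the column names and values are invented for illustration. It cleans duplicate and missing rows, encodes a categorical feature numerically, and min-max normalizes the numeric columns:

```python
import pandas as pd

# Toy dataset with invented columns: a duplicate row, a missing value,
# a categorical feature, and unscaled numeric features.
df = pd.DataFrame({
    "age":    [25.0, 25.0, 31.0, None, 52.0],
    "city":   ["NY", "NY", "LA", "SF", "LA"],
    "income": [48000, 48000, 72000, 65000, 91000],
})

# Data cleaning: remove duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Data transformation: encode the categorical column as numeric dummies.
df = pd.get_dummies(df, columns=["city"])

# Data normalization: min-max scale the numeric columns into [0, 1]
# so that different features carry equal weight.
for col in ["age", "income"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
```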
- eBook - ePub
- Jiawei Han, Micheline Kamber, Jian Pei (Authors)
- 2011 (Publication Date)
- Morgan Kaufmann (Publisher)
3.1 Data Preprocessing: An Overview
This section presents an overview of data preprocessing. Section 3.1.1 illustrates the many elements defining data quality. This provides the incentive behind data preprocessing. Section 3.1.2 outlines the major tasks in data preprocessing.
3.1.1 Data Quality: Why Preprocess the Data?
Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability. Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s data with respect to your branch’s sales. You immediately set out to perform this task. You carefully inspect the company’s database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units_sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent
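A quick sketch of how one might audit a table for the three problems this excerpt names (incomplete, noisy, inconsistent). The records below are invented; only the attribute names item, price, and units_sold come from the excerpt:

```python
import pandas as pd

# Invented sales records echoing the AllElectronics scenario; the
# attribute names (item, price, units_sold) follow the excerpt.
sales = pd.DataFrame({
    "item":       ["TV", "camera", None, "TV"],
    "price":      [499.0, 299.0, 149.0, -499.0],
    "units_sold": [3, 1, 7, 2],
})

# Incomplete: count missing attribute values per column.
print(sales.isna().sum())

# Inaccurate/noisy: flag values that deviate from the expected,
# e.g. a negative price is almost certainly an entry error.
print(sales[sales["price"] < 0])

# Inconsistent: the same item recorded with conflicting prices.
print(sales.groupby("item")["price"].nunique())
```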
- eBook - ePub
- Soumen Chakrabarti, Richard E. Neapolitan, Dorian Pyle, Mamdouh Refaat, Markus Schneider, Toby J. Teorey, Ian H. Witten, Earl Cox, Eibe Frank, Ralf Hartmut Güting, Jiawei Han, Xia Jiang, Micheline Kamber, Sam S. Lightstone, Thomas P. Nadeau (Authors)
- 2008 (Publication Date)
- Morgan Kaufmann (Publisher)
Chapter 3. Data Preprocessing
Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?
There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined or the time required for the actual mining.
In Section 3.1 of this chapter, we introduce the basic concepts of data preprocessing. Section 3.2 presents descriptive data summarization, which serves as a foundation for data preprocessing. Descriptive data summarization helps us study the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning and data integration. The methods for data preprocessing are organized into the following categories: data cleaning (Section 3.3), data integration and transformation (Section 3.4), and data reduction (Section 3.5). Concept hierarchies can be used in an alternative form of data reduction where we replace low-level data (such as raw values for age) with higher-level concepts (such as youth, middle-aged, or senior). This form of data reduction is the topic of Section 3.6.
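The concept-hierarchy idea at the end of this excerpt is easy to demonstrate. A minimal sketch, assuming invented ages and illustrative bin edges (only the labels youth, middle-aged, and senior come from the excerpt):

```python
import pandas as pd

# Invented raw ages; the bin edges are illustrative assumptions.
ages = pd.Series([13, 22, 35, 48, 61, 74])

# Replace low-level values with higher-level concepts, as the excerpt
# describes for concept-hierarchy-based data reduction.
labels = pd.cut(ages, bins=[0, 24, 59, 120],
                labels=["youth", "middle-aged", "senior"])
print(labels)
```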
- eBook - PDF
Discovering Knowledge in Data
An Introduction to Data Mining
- Daniel T. Larose, Chantal D. Larose (Authors)
- 2014 (Publication Date)
- Wiley (Publisher)
Here in this chapter, we examine the next two phases of the CRISP-DM standard process, data understanding and data preparation. We will show how to evaluate the quality of the data, clean the raw data, deal with missing data, and perform transformations on certain variables. All of Chapter 3, Exploratory Data Analysis, is devoted to this very important aspect of the data understanding phase. The heart of any data mining project is the modeling phase, which we begin examining in Chapter 4.
2.1 WHY DO WE NEED TO PREPROCESS THE DATA?
Much of the raw data contained in databases is unpreprocessed, incomplete, and noisy. For example, the databases may contain:
- Fields that are obsolete or redundant,
- Missing values,
- Outliers,
- Data in a form not suitable for the data mining models,
- Values not consistent with policy or common sense.
In order to be useful for data mining purposes, the databases need to undergo preprocessing, in the form of data cleaning and data transformation. Data mining often deals with data that have not been looked at for years, so that much of the data contain field values that have expired, are no longer relevant, or are simply missing. The overriding objective is to minimize GIGO: to minimize the Garbage that gets Into our model, so that we can minimize the amount of Garbage that our models give Out. Depending on the data set, data preprocessing alone can account for 10–60% of all the time and effort for the entire data mining process. In this chapter, we shall examine several ways to preprocess the data for further analysis downstream.
2.2 DATA CLEANING
To illustrate the need for cleaning up the data, let us take a look at some of the kinds of errors that could creep into even a tiny data set, such as that in Table 2.1. Let us discuss, attribute by attribute, some of the problems that have found their way into the data set in Table 2.1.
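Two of the problems this excerpt lists, missing values and outliers, are straightforward to screen for. A minimal sketch using the common 1.5 × IQR rule of thumb; neither the rule nor the data comes from this chapter:

```python
import pandas as pd

# Invented measurements: one obvious outlier and one missing value.
values = pd.Series([10.2, 9.8, 10.5, 10.1, 97.0, None])

# Missing values: locate them before any modeling.
print(values[values.isna()])

# Outliers: flag points beyond 1.5 * IQR from the quartiles
# (a common rule of thumb, not specific to this chapter).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])
```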