eBook - ePub

Data Mining Applications with R

Name: Data Mining Applications with R
Author: Yanchang Zhao, Yonghua Cen

514 pages
English
ePUB (mobile-friendly)
Available on iOS and Android

About this book

Data Mining Applications with R is a great resource for researchers and professionals to understand the wide use of R, a free software environment for statistical computing and graphics, in solving different problems in industry. R is widely used in leveraging data mining techniques across many different industries, including government, finance, insurance, medicine, scientific research and more. This book presents 15 different real-world case studies illustrating various techniques in rapidly growing areas. It is an ideal companion for data mining researchers in academia and industry looking for ways to turn this versatile software into a powerful analytic tool.

R code, data, and color figures for the book are provided at the RDataMining.com website.

Helps data miners learn to use R in their specific area of work and see how R can be applied in different industries
Presents various case studies in real-world applications, which will help readers apply the techniques in their own work
Provides code examples and sample data so that readers can learn the techniques by running the code themselves

Information

Publisher: Academic Press
Year: 2013
ISBN: 9780124115200
Subjects: Computer Science; Programming Languages

Chapter 1

Power Grid Data Analysis with R and Hadoop

Ryan Hafen, Tara Gibson, Kerstin Kleese van Dam and Terence Critchlow, Pacific Northwest National Laboratory, Richland, Washington, USA

Abstract

In this chapter, we use the R and Hadoop Integrated Programming Environment (RHIPE) as a flexible, scalable environment for analyzing multiterabyte data sets being produced by a phasor measurement unit sensor network on the electrical power grid. RHIPE enables exploratory data analysis on the entire data set, allowing us to develop both data cleaning and event classification methods that reflect event characteristics as represented by the actual data instead of relying on theoretical models. We describe several of the data cleaning filters that we have developed as well as one approach we have used for event detection. To ensure the generality of this chapter, we focus on the techniques we are using for our data analysis and example code that demonstrates how these techniques are used within the RHIPE package, instead of the domain-specific details of the data or events that we are extracting.

Keywords

R; Hadoop; RHIPE; Large data; Data cleaning; Event detection; Power grid

1.1 Introduction

This chapter presents an approach to analysis of large-scale time series sensor data collected from the electric power grid. This discussion is driven by our analysis of a real-world data set and, as such, does not provide a comprehensive exposition of either the tools used or the breadth of analysis appropriate for general time series data. Instead, we hope that this section provides the reader with sufficient information, motivation, and resources to address their own analysis challenges.

Our approach to data analysis is based on exploratory data analysis techniques. In particular, we perform an analysis over the entire data set to identify sequences of interest, use a small number of those sequences to develop an analysis algorithm that identifies the relevant pattern, and then run that algorithm over the entire data set to identify all instances of the target pattern. Our initial data set is a relatively modest 2 TB, comprising just over 53 billion records generated by a distributed sensor network. Each record represents several sensor measurements at a specific location at a specific time. The sensors are geographically distributed, but each resides at a fixed, known location. Measurements are taken 30 times per second and synchronized using a global clock, enabling a precise reconstruction of events. Because all of the sensors record the status of the same tightly connected network, there should be a high correlation between all readings.
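To give a feel for the scale described above, the following back-of-the-envelope R sketch works only from the figures quoted in the text (about 2 TB, about 53 billion records, 30 measurements per second). The decimal definition of a terabyte and all variable names are assumptions made for illustration, so the results are rough estimates rather than properties of the actual data set.

```r
# Figures quoted in the text; 1 TB is taken as 10^12 bytes (an assumption).
total_bytes   <- 2e12
total_records <- 53e9
rate_hz       <- 30   # records per second at one sensor location

# Approximate average storage footprint of a single record, in bytes.
bytes_per_record <- total_bytes / total_records

# Records produced at one sensor location in a full day.
records_per_sensor_day <- rate_hz * 60 * 60 * 24

# Approximate number of "sensor-days" of recording in the data set.
sensor_days <- total_records / records_per_sensor_day

round(bytes_per_record, 1)
records_per_sensor_day
round(sensor_days)
```

At roughly 2.6 million records per location per day, even a modest number of sensors quickly produces more data than a single R session can hold in memory.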

Given the size of our data set, simply running R on a desktop machine is not an option. To provide the required scalability, we use an analysis package called RHIPE (pronounced ree-pay) (RHIPE, 2012). RHIPE, short for the R and Hadoop Integrated Programming Environment, provides an R interface to Hadoop. This interface hides much of the complexity of running parallel analyses, including many of the traditional Hadoop management tasks. Further, by providing access to all of the standard R functions, RHIPE allows the analyst to focus on the analysis and on code development, even when exploring large data sets. A brief introduction to both the Hadoop programming paradigm, also known as the MapReduce paradigm, and RHIPE is provided in Section 1.3. We assume that readers already have a working knowledge of R.
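For readers new to the MapReduce paradigm mentioned above, the following is a minimal base R sketch of its three steps (map, shuffle, reduce). It deliberately does not use RHIPE or Hadoop, and it runs serially on a tiny synthetic data set; the column names `sensor` and `freq` and the helper `map_chunk` are hypothetical names chosen for illustration.

```r
# Toy records: (sensor id, frequency reading). Synthetic data for illustration.
set.seed(1)
records <- data.frame(
  sensor = rep(c("A", "B", "C"), each = 4),
  freq   = round(60 + rnorm(12, sd = 0.01), 3)
)

# Split the input into chunks, much as a distributed file system
# stores a large file as separate blocks.
chunks <- split(records, rep(1:3, length.out = nrow(records)))

# Map: each chunk independently emits (key, value) pairs. Here the key is
# the sensor id and the value is a partial sum and count for that chunk.
map_chunk <- function(chunk) {
  lapply(split(chunk$freq, chunk$sensor),
         function(v) c(sum = sum(v), n = length(v)))
}
mapped <- lapply(chunks, map_chunk)

# Shuffle: gather all values that were emitted under the same key.
keys <- unique(unlist(lapply(mapped, names)))
shuffled <- lapply(setNames(keys, keys), function(k) {
  Filter(Negate(is.null), lapply(mapped, function(m) m[[k]]))
})

# Reduce: combine the partial results for each key into a per-sensor mean.
reduced <- vapply(shuffled, function(vals) {
  total <- Reduce(`+`, vals)
  unname(total["sum"] / total["n"])
}, numeric(1))

reduced
```

Emitting partial sums and counts, rather than partial means, keeps the reduce step correct even when chunks hold different numbers of readings per sensor.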

As with many sensor data sets, there are a large number of erroneous records in the data, so a significant focus of our work has been on identifying and filtering these records. Identifying bad records requires a variety of analysis techniques including summary statistics, distribution checking, autocorrelation detection, and repeated value distribution characterization, all of which are discovered or verified by exploratory data analysis. Once the data set has been cleaned, meaningful events can be extracted. For example, events that result in a network partition or isolation of part of the network are extremely interesting to power engineers.
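As a small illustration of two of the checks named above, the sketch below applies a range check and a repeated-value check to a synthetic signal in base R. It is not the authors' filter code: the data, the 59 to 61 bounds, and the `max_run` threshold are all hypothetical values chosen for the example.

```r
# Synthetic frequency signal sampled at 30 Hz (illustrative data only).
set.seed(42)
freq <- 60 + cumsum(rnorm(300, sd = 0.001))

# Inject two kinds of bad data: a stuck sensor and an implausible value.
freq[101:160] <- freq[100]   # readings 100..160 are now identical
freq[250]     <- 0           # physically implausible reading

# Filter 1: range check informed by summary statistics
# (the bounds 59 and 61 are hypothetical).
out_of_range <- freq < 59 | freq > 61

# Filter 2: flag every reading that belongs to a run of identical values
# longer than a hypothetical threshold.
max_run <- 10
runs  <- rle(freq)
stuck <- rep(runs$lengths > max_run, times = runs$lengths)

# Drop every reading flagged by either filter.
bad   <- out_of_range | stuck
clean <- freq[!bad]

summary(clean)
sum(bad)
```

In practice, a threshold such as `max_run` would be chosen by examining the distribution of run lengths across the whole data set, which is the kind of question exploratory analysis is meant to answer.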

The core of this chapter is the presentation of several example algorithms to manage, explore, clean, and apply basic feature extraction routines over our data set. These examples are generalized versions of the code we use in our analysis. Section 1.3.3.2.2 describes these examples in detail, complete with sample code. Our hope is that this approach will provide the reader with a greater understanding of how to proceed when unique modifications to standard algorithms are warranted, which in our experience occurs quite frequently.

Before we dive into the analysis, however, we begin with an overview of the power grid, which is our application domain.

1.2 A Brief Overview of the Power Grid

The U.S. national power grid, also known as “the electrical grid” or simply “the grid,” was named the greatest engineering achievement of the twentieth century by the U.S. National Academy of Engineering (Wulf, 2000). Although many of us take for granted the flow of electricity when we flip a switch or plug in our chargers, it takes a large and complex infrastructure to reliably support our dependence on energy.

First built over 100 years ago, the grid at its core connects power producers and consumers through a complex network of transmission and distribution lines that reaches almost every building in the country. Power producers use a variety of generator technologies, from coal to natural gas to nuclear and hydro, to create electricity. There are hundreds of large and small generation facilities spread across the country. Power is transferred from the generation facility to the transmission network, which moves it to where it is needed. The transmission network consists of high-voltage lines that connect the generators to distribution points. The network is designed with redundancy, which allows power to flow to most locations even when there is a break in a line or a generator goes down unexpectedly. At specific distribution points, the voltage is decreased and the power is then transferred to the consumer. The distribution networks are disconnected from each other.

The US grid has been divided into three smaller grids: the western interconnection, the eastern interconnection, and the Texas int...