eBook - ePub

Data Mining Applications with R

Name: Data Mining Applications with R
Author: Yanchang Zhao, Yonghua Cen

514 pages
English

About this book

Data Mining Applications with R is a great resource for researchers and professionals to understand the wide use of R, a free software environment for statistical computing and graphics, in solving different problems in industry. R is widely used in leveraging data mining techniques across many different industries, including government, finance, insurance, medicine, scientific research and more. This book presents 15 different real-world case studies illustrating various techniques in rapidly growing areas. It is an ideal companion for data mining researchers in academia and industry looking for ways to turn this versatile software into a powerful analytic tool.

R code, data, and color figures for the book are provided at the RDataMining.com website.

Helps data miners to learn to use R in their specific area of work and see how R can apply in different industries
Presents various case studies in real-world applications, which will help readers to apply the techniques in their work
Provides code examples and sample data for readers to easily learn the techniques by running the code by themselves

Information

Publisher: Academic Press
Year: 2013
ISBN: 9780124115200
Subject: Computer Science
Subtopic: Programming Languages

Chapter 1

Power Grid Data Analysis with R and Hadoop

Ryan Hafen, Tara Gibson, Kerstin Kleese van Dam and Terence Critchlow, Pacific Northwest National Laboratory, Richland, Washington, USA

Abstract

In this chapter, we use the R and Hadoop Integrated Programming Environment (RHIPE) as a flexible, scalable environment for analyzing multiterabyte data sets being produced by a phasor measurement unit sensor network on the electrical power grid. RHIPE enables exploratory data analysis on the entire data set, allowing us to develop both data cleaning and event classification methods that reflect event characteristics as represented by the actual data instead of relying on theoretical models. We describe several of the data cleaning filters that we have developed as well as one approach we have used for event detection. To ensure the generality of this chapter, we focus on the techniques we are using for our data analysis and example code that demonstrates how these techniques are used within the RHIPE package, instead of the domain-specific details of the data or events that we are extracting.

Keywords

R; Hadoop; RHIPE; Large data; Data cleaning; Event detection; Power grid

1.1 Introduction

This chapter presents an approach to analysis of large-scale time series sensor data collected from the electric power grid. This discussion is driven by our analysis of a real-world data set and, as such, does not provide a comprehensive exposition of either the tools used or the breadth of analysis appropriate for general time series data. Instead, we hope that this section provides the reader with sufficient information, motivation, and resources to address their own analysis challenges.

Our approach to data analysis is based on exploratory data analysis techniques. In particular, we perform an analysis over the entire data set to identify sequences of interest, use a small number of those sequences to develop an analysis algorithm that identifies the relevant pattern, and then run that algorithm over the entire data set to identify all instances of the target pattern. Our initial data set is a relatively modest 2 TB, comprising just over 53 billion records generated from a distributed sensor network. Each record represents several sensor measurements at a specific location at a specific time. Sensors are geographically distributed, but each resides in a fixed, known location. Measurements are taken 30 times per second and synchronized using a global clock, enabling a precise reconstruction of events. Because all of the sensors are recording the status of the same, tightly connected network, there should be a high correlation between all readings.
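To give a feel for the scale these figures imply, the following back-of-the-envelope sketch works through the arithmetic in R. It uses only the numbers quoted above; the assumptions that 2 TB means 2e12 bytes and that sensors report without gaps are ours, and the variable names are hypothetical.

```r
# Figures quoted in the text (both are approximate).
n_records       <- 53e9   # "just over 53 billion records"
samples_per_sec <- 30     # measurements per second

# Assumption: 2 TB is taken as 2e12 bytes of stored data.
data_bytes <- 2e12

# Records produced by one location in one day at 30 samples per second.
records_per_location_day <- samples_per_sec * 60 * 60 * 24

# Rough number of location-days covered, assuming no gaps in reporting.
location_days <- n_records / records_per_location_day

# Rough average storage per record.
bytes_per_record <- data_bytes / n_records

print(records_per_location_day)
print(round(location_days))
print(round(bytes_per_record, 1))
```

Under these assumptions, one location produces about 2.6 million records per day, so the data set corresponds to roughly twenty thousand location-days at a few tens of bytes per record.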

Given the size of our data set, simply running R on a desktop machine is not an option. To provide the required scalability, we use an analysis package called RHIPE (pronounced ree-pay) (RHIPE, 2012). RHIPE, short for the R and Hadoop Integrated Programming Environment, provides an R interface to Hadoop. This interface hides much of the complexity of running parallel analyses, including many of the traditional Hadoop management tasks. Further, by providing access to all of the standard R functions, RHIPE allows the analyst to focus on analysis and code development, even when exploring large data sets. A brief introduction to the Hadoop programming paradigm, also known as the MapReduce paradigm, and to RHIPE is provided in Section 1.3. We assume that readers already have a working knowledge of R.
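For readers new to the MapReduce paradigm, the sketch below imitates its three stages (map, shuffle, reduce) in a single R process using base R only. It does not use the RHIPE API and is not the authors' code; the toy data, the column names `sensor` and `freq`, and the helper names `map_fn` and `reduce_fn` are all hypothetical.

```r
# Toy records: one row per (sensor, time) reading. All names are hypothetical.
set.seed(1)
records <- data.frame(
  sensor = rep(c("A", "B", "C"), each = 100),
  freq   = c(rnorm(100, 60, 0.01), rnorm(100, 60, 0.02), rnorm(100, 60, 0.01))
)

# Simulate input splits: Hadoop would hand each block to a separate map task.
blocks <- split(records, rep(1:6, length.out = nrow(records)))

# Map: emit one (key, value) pair per sensor found in the block.
# The value is a partial summary that can be merged later.
map_fn <- function(block) {
  lapply(split(block$freq, block$sensor), function(x) {
    c(n = length(x), sum = sum(x))
  })
}

# Shuffle: collect every value emitted under the same key.
emitted <- unlist(lapply(unname(blocks), map_fn), recursive = FALSE)
grouped <- split(unname(emitted), names(emitted))

# Reduce: merge the partial summaries for each key, then derive the mean.
reduce_fn <- function(values) {
  total <- Reduce(`+`, values)
  c(n = total[["n"]], mean = total[["sum"]] / total[["n"]])
}
result <- lapply(grouped, reduce_fn)

print(result)
```

The map step emits counts and sums rather than means because partial counts and sums can be merged in any order, which is what lets the work be spread across many independent tasks.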

As with many sensor data sets, there are a large number of erroneous records in the data, so a significant focus of our work has been on identifying and filtering these records. Identifying bad records requires a variety of analysis techniques including summary statistics, distribution checking, autocorrelation detection, and repeated value distribution characterization, all of which are discovered or verified by exploratory data analysis. Once the data set has been cleaned, meaningful events can be extracted. For example, events that result in a network partition or isolation of part of the network are extremely interesting to power engineers.
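As an illustration of one of the checks named above, repeated value characterization, the following minimal sketch uses base R's `rle()` to find suspiciously long runs of identical readings. The simulated data, the injected fault, and the cutoff `max_run` are assumptions made for illustration; this is not the filter the authors developed.

```r
# Simulated 30 Hz frequency readings (values in Hz); purely illustrative data.
set.seed(42)
freq <- round(60 + rnorm(300, sd = 0.005), 3)

# Inject a hypothetical "stuck sensor" fault: 45 identical consecutive readings.
freq[101:145] <- 59.9

# Run-length encode the series: each run is a value plus how often it repeats.
runs <- rle(freq)

# Distribution of run lengths; an isolated, very long run hints at stuck values.
run_length_table <- table(runs$lengths)

# Flag runs longer than a chosen cutoff (assumption: 30 samples = 1 second).
max_run <- 30
bad_run <- runs$lengths > max_run

# Expand the per-run flag back to one flag per record.
is_stuck <- rep(bad_run, times = runs$lengths)

# Keep only the records that are not part of a flagged run.
clean <- freq[!is_stuck]

print(run_length_table)
print(which(is_stuck))
```

In practice the cutoff would be chosen by inspecting the run-length distribution over the whole data set, since short runs of equal values occur naturally when readings are recorded at limited precision.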

The core of this chapter is the presentation of several example algorithms to manage, explore, clean, and apply basic feature extraction routines over our data set. These examples are generalized versions of the code we use in our analysis. Section 1.3.3.2.2 describes these examples in detail, complete with sample code. Our hope is that this approach will provide the reader with a greater understanding of how to proceed when unique modifications to standard algorithms are warranted, which in our experience occurs quite frequently.

Before we dive into the analysis, however, we begin with an overview of the power grid, which is our application domain.

1.2 A Brief Overview of the Power Grid

The U.S. national power grid, also known as “the electrical grid” or simply “the grid,” was named the greatest engineering achievement of the twentieth century by the U.S. National Academy of Engineering (Wulf, 2000). Although many of us take for granted the flow of electricity when we flip a switch or plug in our chargers, it takes a large and complex infrastructure to reliably support our dependence on energy.

At its core, the grid, built over 100 years ago, connects power producers and consumers through a complex network of transmission and distribution lines that reaches almost every building in the country. Power producers use a variety of generator technologies, from coal to natural gas to nuclear and hydro, to create electricity. There are hundreds of large and small generation facilities spread across the country. Power is transferred from the generation facility to the transmission network, which moves it to where it is needed. The transmission network consists of high-voltage lines that connect the generators to distribution points. The network is designed with redundancy, which allows power to flow to most locations even when there is a break in a line or a generator goes down unexpectedly. At specific distribution points, the voltage is decreased and the power is then transferred to the consumer. The distribution networks are disconnected from each other.

The US grid has been divided into three smaller grids: the western interconnection, the eastern interconnection, and the Texas int...