eBook - ePub

Learning Cascading

Name: Learning Cascading
Author: Michael Covert, Victoria Loewengart

276 pages
English

Information

Publisher: Packt Publishing
Year: 2015
ISBN: 9781785288913
Topics: Computer Science, Programming in Java

Learning Cascading

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. The Big Data Core Technology Stack

Reviewing Hadoop

Hadoop architecture

HDFS – the Hadoop Distributed File System

The NameNode

The secondary NameNode

DataNodes

MapReduce execution framework

The JobTracker

The TaskTracker

Hadoop jobs

Distributed cache

Counters

YARN – MapReduce version 2

A simple MapReduce job

Beyond MapReduce

The Cascading framework

The execution graph and flow planner

How Cascading produces MapReduce jobs

Summary

2. Cascading Basics in Detail

Understanding common Cascading themes

Data flows as processes

Understanding how Cascading represents records

Using tuples and defining fields

Using a Fields object, named field groups, and selectors

Data typing and coercion

Defining schemes

Schemes in detail

TupleEntry

Understanding how Cascading controls data flow

Using pipes

Creating and chaining

Pipe operations

Each

Splitting

GroupBy and sorting

Every

Merging and joining

The Merge pipe

The join pipes – CoGroup and HashJoin

CoGroup

HashJoin

Default output selectors

Using taps

Flow

FlowConnector

Cascades

Local and Hadoop modes

Common errors

Putting it all together

Summary

3. Understanding Custom Operations

Understanding operations

Operations and fields

The Operation class and interface hierarchy

The basic operation lifecycle

Contexts

FlowProcess

OperationCall<Context>

An operation processing sequence and its methods

Operation types

Each operations

Filters

Filter calling sequence

Built-in filters

Function

Function calling sequence

Built-in functions

Every operations

Aggregator

Aggregator calling sequence

Built-in aggregators

Buffers

Buffer calling sequence

Built-in buffers

Assertions

ValueAssertion calling sequence

GroupAssertion calling sequence

AssertionLevel

Using assertions

Built-in assertions

A note about implementing BaseOperation methods

Summary

4. Creating Custom Operations

Writing custom operations

Writing a filter

Writing a function

Writing an aggregator

Writing a custom assertion

Writing a buffer

Identifying common use cases for custom operations

Putting it all together

Summary

5. Code Reuse and Integration

Creating and using subassemblies

Built-in subassemblies

Creating a new custom subassembly

Using custom subassemblies

Using cascades

Building a complex workflow using cascades

Skipping a flow in a cascade

Intermediate file management

Dynamically controlling flows

Instrumentation and counters

Using counters to control flow

Using existing MapReduce jobs

Using fluent programming techniques

The FlowDef fluent interface

Integrating external components

Flow and cascade events

Using external JAR files

Using Cascading as insulation from big data migrations and upgrades

Summary

6. Testing a Cascading Application

Debugging a Cascading application

Getting your environment ready for debugging

Using Cascading local mode debugging

Setting up Eclipse

Remote debugging

Using assertions

The Debug() filter

Managing exceptions with traps

Checkpoints

Managing bad data

Viewing flow sequencing using DOT files

Testing strategies

Unit testing and JUnit

Mocking

Integration testing

Load and performance testing

Summary

7. Optimizing the Performance of a Cascading Application

Optimizing performance

Optimizing Cascading

Optimizing Hadoop

A note about the effective use of checkpoints

Summary

8. Creating a Real-world Application in Cascading

Project description – Business Intelligence case study on monitoring the competition

Project scope – understanding requirements

Understanding the project domain – text analytics and natural language processing (NLP)

Conducting a simple named entity extraction

Defining the project – the Cascading development methodology

Project roles and responsibilities

Conducting data analysis

Performing functional decomposition

Designing the process and components

Creating and integrating the operations

Creating and using subassemblies

Building the workflow

Building flows

Managing the context

Building the cascade

Designing the test plan

Performing a unit test

Performing an integration test

Performing a cluster test

Performing a full load test

Refining and adjusting

Software packaging and delivery to the cluster

Next steps

Summary

9. Planning for Future Growth

Finding online resources

Using other Cascading tools

Lingual

Pattern

Driven

Fluid

Load

Multitool

Support for other languages

Hortonworks

Custom taps

Cascading serializers

Java open source mock frameworks

Summary

A. Downloadable Software

Contents

Installing and using

Index

Learning Cascading

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: May 2015

Production reference: 1250515

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-891-3

www.packtpub.com

Credits

Authors

Michael Covert

Victoria Loewengart

Reviewers

Allen Driskill

Bernie French

Supreet Oberoi

Commissioning Editor

Veena Pagare

Acquisition Editor

Vivek Anantharaman

Content Development Editor

Prachi Bisht

Technical Editors

Ruchi Desai

Ankita Thakur

Copy Editors

Sonia Michelle Cheema

Ameesha Green

Project Coordinator

Shipra Chawhan

Proofreaders

Stephen Copestake

Safis Editing

Indexer

Monica Ajmera Mehta

Graphics

Disha Haria

Production Coordinator

Nilesh R. Mohite

Cover Work

Nilesh R. Mohite

Foreword

The Cascading project was started in 2007 to complete the promise that Apache Hadoop was indirectly making to people like me—that we can dramatically simplify data-oriented application development and deployment. This can be done not only from a tools perspective, but more importantly, from an organizational perspective. Take a thousand machines and make them look like one: one storage layer and a computing layer. This promise means I would never have to ask our IT group for another storage array, more disk space, or another overpriced box to manage. My team and I could just throw our data and applications at the cluster and move on...
