Apache Flume: Distributed Log Collection for Hadoop
eBook - ePub

Apache Flume: Distributed Log Collection for Hadoop

Steve Hoffman

  1. 108 pages
  2. English
  3. ePUB (mobile friendly)


About This Book

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems inherent in streaming data and logs into HDFS, and how Flume resolves them. It explains Flume's generalized architecture, including moving data to and from databases and NoSQL-ish data stores, as well as optimizing performance, and includes real-world scenarios of Flume implementation.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete process of installing Flume and compiling it from source.

It introduces channels and channel selectors, then covers each architectural component (sources, channels, sinks, channel processors, sink groups, and so on) in detail, along with its various implementations and configuration options, so you can customize Flume to your specific needs. It also offers pointers on writing custom implementations of your own.
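As a taste of the configuration style these components share, a channel selector is declared in the agent's properties file alongside the source it serves. The sketch below shows a multiplexing selector routing events by a header value; the agent, source, channel, and header names (a1, r1, c1, c2, datacenter) are placeholders for illustration, not examples from the book:

```properties
# Hypothetical agent "a1" with one source feeding two channels
a1.sources = r1
a1.channels = c1 c2

# Route each event by the value of its "datacenter" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = datacenter
a1.sources.r1.selector.mapping.NYC = c1
a1.sources.r1.selector.mapping.LAX = c2
a1.sources.r1.selector.default = c1
```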

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.
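A complete single agent of the kind described above can be sketched in one short properties file. This mirrors the style of "Hello World" example the quick-start chapter builds toward, but the component names (a1, r1, c1, k1) and the port number are arbitrary choices for this sketch:

```properties
# One agent named "a1" with a single source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: turns lines received on localhost:44444 into events
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory

# Logger sink: writes events to the agent's log output
a1.sinks.k1.type = logger

# Wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Assuming Flume's bin directory is on your PATH and the file is saved as hello.conf, such an agent would be started with something like `flume-ng agent --conf conf --conf-file hello.conf --name a1 -Dflume.root.logger=INFO,console`.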

Approach

A starter guide that covers Apache Flume in detail.

Who this book is for

Apache Flume: Distributed Log Collection for Hadoop is intended for people responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.


Information

Year: 2013
ISBN: 9781782167914
Edition: 1

Apache Flume: Distributed Log Collection for Hadoop


Table of Contents

Apache Flume: Distributed Log Collection for Hadoop
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Overview and Architecture
Flume 0.9
Flume 1.X (Flume-NG)
The problem with HDFS and streaming data/logs
Sources, channels, and sinks
Flume events
Interceptors, channel selectors, and sink processors
Tiered data collection (multiple flows and/or agents)
Summary
2. Flume Quick Start
Downloading Flume
Flume in Hadoop distributions
Flume configuration file overview
Starting up with "Hello World"
Summary
3. Channels
Memory channel
File channel
Summary
4. Sinks and Sink Processors
HDFS sink
Path and filename
File rotation
Compression codecs
Event serializers
Text output
Text with headers
Apache Avro
File type
Sequence file
Data stream
Compressed stream
Timeouts and workers
Sink groups
Load balancing
Failover
Summary
5. Sources and Channel Selectors
The problem with using tail
The exec source
The spooling directory source
Syslog sources
The syslog UDP source
The syslog TCP source
The multiport syslog TCP source
Channel selectors
Replicating
Multiplexing
Summary
6. Interceptors, ETL, and Routing
Interceptors
Timestamp
Host
Static
Regular expression filtering
Regular expression extractor
Custom interceptors
Tiering data flows
Avro Source/Sink
Command-line Avro
Log4J Appender
The Load Balancing Log4J Appender
Routing
Summary
7. Monitoring Flume
Monitoring the agent process
Monit
Nagios
Monitoring performance metrics
Ganglia
The internal HTTP server
Custom monitoring hooks
Summary
8. There Is No Spoon – The Realities of Real-time Distributed Data Collection
Transport time versus log time
Time zones are evil
Capacity planning
Considerations for multiple data centers
Compliance and data expiry
Summary
Index

Apache Flume: Distributed Log Collection for Hadoop

Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Production Reference: 1090713
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-791-4
www.packtpub.com
Cover Image by Abhishek Pandey

Credits

Author
Steve Hoffman
Reviewers
Subash D'Souza
Stefan Will
Acquisition Editor
Kunal Parikh
Commissioning Editor
Sharvari Tawde
Technical Editors
Jalasha D'costa
Mausam Kothari
Project Coordinator
Sherin Padayatty
Proofreader
Aaron Nash
Indexer
Monica Ajmera Mehta
Graphics
Valentina D'silva
Abhinash Sahu
Production Coordinator
Kirtee Shingan
Cover Work
Kirtee Shingan

About the Author

Steve Hoffman has 30 years of software development experience and holds a B.S. in computer engineering from the University of Illinois Urbana-Champaign and an M.S. in computer science from DePaul University. He is currently a Principal Engineer at Orbitz Worldwide.
More information on Steve can be found at http://bit.ly/bacoboy or on Twitter @bacoboy.
This is Steve's first book.
