eBook - ePub

Apache Flume: Distributed Log Collection for Hadoop

ISBN: 9781782167914

Steve Hoffman

108 pages
English
ePUB (mobile friendly)

About this book

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems with HDFS and streaming data/logs, and how Flume can resolve them. The book explains the generalized architecture of Flume, including moving data to and from databases and NoSQL-ish data stores, as well as optimizing performance. It also includes real-world scenarios of Flume implementations.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It also introduces channels and channel selectors. For each architectural component (sources, channels, sinks, channel processors, sink groups, and so on), the various implementations are covered in detail along with their configuration options, so you can customize Flume to your specific needs. Pointers on writing custom implementations are included as well.

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

Approach

A starter guide that covers Apache Flume in detail.

Who this book is for

Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.

Publisher: Packt Publishing
Year: 2013
Edition:
eBook ISBN: 9781782167914
Topic: Computer Science
Subtopic: Data Warehousing

Apache Flume: Distributed Log Collection for Hadoop

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?

Free Access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Errata

Piracy

Questions

1. Overview and Architecture

Flume 0.9

Flume 1.X (Flume-NG)

The problem with HDFS and streaming data/logs

Sources, channels, and sinks

Flume events

Interceptors, channel selectors, and sink processors

Tiered data collection (multiple flows and/or agents)

Summary

2. Flume Quick Start

Downloading Flume

Flume in Hadoop distributions

Flume configuration file overview

Starting up with "Hello World"

Summary

3. Channels

Memory channel

File channel

Summary

4. Sinks and Sink Processors

HDFS sink

Path and filename

File rotation

Compression codecs

Event serializers

Text output

Text with headers

Apache Avro

File type

Sequence file

Data stream

Compressed stream

Timeouts and workers

Sink groups

Load balancing

Failover

Summary

5. Sources and Channel Selectors

The problem with using tail

The exec source

The spooling directory source

Syslog sources

The syslog UDP source

The syslog TCP source

The multiport syslog TCP source

Channel selectors

Replicating

Multiplexing

Summary

6. Interceptors, ETL, and Routing

Interceptors

Timestamp

Host

Static

Regular expression filtering

Regular expression extractor

Custom interceptors

Tiering data flows

Avro Source/Sink

Command-line Avro

Log4J Appender

The Load Balancing Log4J Appender

Routing

Summary

7. Monitoring Flume

Monitoring the agent process

Monit

Nagios

Monitoring performance metrics

Ganglia

The internal HTTP server

Custom monitoring hooks

Summary

8. There Is No Spoon – The Realities of Real-time Distributed Data Collection

Transport time versus log time

Time zones are evil

Capacity planning

Considerations for multiple data centers

Compliance and data expiry

Summary

Index

Apache Flume: Distributed Log Collection for Hadoop

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013

Production Reference: 1090713

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78216-791-4

www.packtpub.com

Cover Image by Abhishek Pandey

Credits

Author

Steve Hoffman

Reviewers

Subash D'Souza

Stefan Will

Acquisition Editor

Kunal Parikh

Commissioning Editor

Sharvari Tawde

Technical Editors

Jalasha D'costa

Mausam Kothari

Project Coordinator

Sherin Padayatty

Proofreader

Aaron Nash

Indexer

Monica Ajmera Mehta

Graphics

Valentina D'silva

Abhinash Sahu

Production Coordinator

Kirtee Shingan

Cover Work

Kirtee Shingan

About the Author

Steve Hoffman has 30 years of software development experience and holds a B.S. in computer engineering from the University of Illinois Urbana-Champaign and an M.S. in computer science from DePaul University. He is currently a Principal Engineer at Orbitz Worldwide.

More information on Steve can be found at http://bit.ly/bacoboy or on Twitter @bacoboy.

This is Steve's first book.
