Real-Time Big Data Analytics
Table of Contents
Real-Time Big Data Analytics
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introducing the Big Data Technology Landscape and Analytics Platform
Big Data – a phenomenon
The Big Data dimensional paradigm
The Big Data ecosystem
The Big Data infrastructure
Components of the Big Data ecosystem
The Big Data analytics architecture
Building business solutions
Dataset processing
Solution implementation
Presentation
Distributed batch processing
Batch processing in distributed mode
Push code to data
Distributed databases (NoSQL)
Advantages of NoSQL databases
Choosing a NoSQL database
Real-time processing
The telecoms or cellular arena
Transportation and logistics
The connected vehicle
The financial sector
Summary
2. Getting Acquainted with Storm
An overview of Storm
The journey of Storm
Storm abstractions
Streams
Topology
Spouts
Bolts
Tasks
Workers
Storm architecture and its components
A Zookeeper cluster
A Storm cluster
How and when to use Storm
Storm internals
Storm parallelism
Storm internal message processing
Summary
3. Processing Data with Storm
Storm input sources
Meet Kafka
Getting to know more about Kafka
Other sources for input to Storm
A file as an input source
A socket as an input source
Kafka as an input source
Reliability of data processing
The concept of anchoring and reliability
The Storm acking framework
Storm simple patterns
Joins
Batching
Storm persistence
Storm's JDBC persistence framework
Summary
4. Introduction to Trident and Optimizing Storm Performance
Working with Trident
Transactions
Trident topology
Trident tuples
Trident spout
Trident operations
Merging and joining
Filter
Function
Aggregation
Grouping
State maintenance
Understanding LMAX
Memory and cache
Ring buffer – the heart of the disruptor
Producers
Consumers
Storm internode communication
ZeroMQ
Storm ZeroMQ configurations
Netty
Understanding the Storm UI
Storm UI landing page
Topology home page
Optimizing Storm performance
Summary
5. Getting Acquainted with Kinesis
Architectural overview of Kinesis
Benefits and use cases of Amazon Kinesis
High-level architecture
Components of Kinesis
Creating a Kinesis streaming service
Access to AWS Kinesis
Configuring the development environment
Creating Kinesis streams
Creating Kinesis stream producers
Creating Kinesis stream consumers
Generating and consuming crime alerts
Summary
6. Getting Acquainted with Spark
An overview of Spark
Batch data processing
Real-time data processing
Apache Spark – a one-stop solution
When to use Spark – practical use cases
The architecture of Spark
High-level architecture
Spark extensions/libraries
Spark packaging structure and core APIs
The Spark execution model – master-worker view
Resilient distributed datasets (RDD)
RDD – by definition
Fault tolerance
Storage
Persistence
Shuffling
Writing and executing our first Spark program
Hardware requirements
Installation of the basic software
Spark
Java
Scala
Eclipse
Configuring the Spark cluster
Coding a Spark job in Scala
Coding a Spark job in Java
Troubleshooting – tips and tricks
Port numbers used by Spark
Classpath issues – class not found exception
Other common exceptions
Summary
7. Programming with RDDs
Understanding Spark transformations and actions
RDD APIs
RDD transformation operations
RDD action operations
Programming Spark transformations and actions
Handling persistence in Spark
Summary
8. SQL Query Engine for Spark – Spark SQL
The architecture of Spark SQL
The emergence of Spark SQL
The components of Spark SQL
The DataFrame API
DataFrames and RDD
User-defined functions
DataFrames and SQL
The Catalyst optimizer
SQL and Hive contexts
Coding our first Spark SQL job
Coding a Spark SQL job in Scala
Coding a Spark SQL job in Java
Converting RDDs to DataFrames
Automated process
The manual process
Working with Parquet
Persisting Parquet data in HDFS
Partitioning and schema evolution or merging
Partitioning
Schema evolution/merging
Working with Hive tables
Performance tuning and best practices
Partitioning and parallelism
Serialization
Caching
Memory tuning
Summary
9. Analysis of Streaming Data Using Spark Streaming
High-level architecture
The components of Spark Streaming
The packaging structure of Spark Streaming
Spark Streaming APIs
Spark Streaming operations
Coding our first Spark Streaming job
Creating a stream producer
Writing our Spark Streaming job in Scala
Writing our Spark Streaming job in Java
Executing our Spark Streaming job
Querying streaming data in real time
The high-level architecture of our job
Coding the crime producer
Coding the stream consumer and transformer
Executing the SQL Streaming Crime Analyzer
Deployment and monitoring
Cluster managers for Spark Streaming
Executing Spark Streaming applications on Yarn
Executing Spark Streaming applications on Apache Mesos
Monitoring Spark Streaming applications
Summary
10. Introducing Lambda Architecture
What is Lambda Architecture
The need for Lambda Architecture
Layers/components of Lambda Architecture
The technology matrix for Lambda Architecture
Realization of Lambda Architecture
High-level architecture
Configuring Apache Cassandra and Spark
Coding the custom producer
Coding the real-time layer
Coding the batch layer
Coding the serving layer
Executing all the layers
Summary
Index
Real-Time Big Data Analytics
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2016
Production reference: 1230216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-140-9
www.packtpub.com
Authors
Sumit Gupta
Shilpi Saxena
Reviewer
Pethuru Raj
Commissioning Editor
Akram Hussain
Acquisition Editor
Larissa Pinto
Content Development Editor
Shweta Pant
Technical Editors
Taabish Khan
Madhunikita Sunil Chindarkar
Copy Editors
Roshni Banerjee
Yesha Gangani
Rashmi Sawant
Project Coordinator
Kinjal Bari
Proofreader
Safis Editing
Indexer
Tejal Daruwale Soni
Graphics
Kirk D'Penha
Disha Haria
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph