Real-Time Big Data Analytics
eBook - ePub

  1. 326 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Design, process, and analyze large sets of complex data in real time

About This Book

  • Get acquainted with transformations and database-level interactions, and ensure the reliability of messages processed using Storm
  • Implement strategies to solve the challenges of real-time data processing
  • Load datasets, build queries, and make recommendations using Spark SQL

Who This Book Is For

If you are a Big Data architect, developer, or a programmer who wants to develop applications/frameworks to implement real-time analytics using open source technologies, then this book is for you.

What You Will Learn

  • Explore big data technologies and frameworks
  • Work through practical challenges and use cases of real-time analytics versus batch analytics
  • Develop real-world use cases for processing and analyzing data in real time using the programming paradigm of Apache Storm
  • Handle and process real-time transactional data
  • Optimize and tune Apache Storm for varied workloads and production deployments
  • Process and stream data with Amazon Kinesis and Elastic MapReduce
  • Perform interactive and exploratory data analytics using Spark SQL
  • Develop common enterprise architectures/applications for real-time and batch analytics

In Detail

Enterprises have been striving to deal with the challenges of data arriving in real time or near real time.

Although there are technologies such as Storm and Spark (and many more) that solve the challenges of real-time data, using the appropriate technology/framework for the right business use case is the key to success. This book provides you with the skills required to quickly design, implement and deploy your real-time analytics using real-world examples of big data use cases.

From the beginning of the book, we will cover the basics of varied real-time data processing frameworks and technologies. We will discuss and explain the differences between batch and real-time processing in detail, and will also explore the techniques and programming concepts using Apache Storm.

Moving on, we'll familiarize you with Amazon Kinesis for real-time data processing in the cloud. We will further develop your understanding of real-time analytics through a comprehensive review of Apache Spark, along with the high-level architecture and the building blocks of a Spark program.

You will learn how to transform your data, get output from transformations, and persist your results using Spark RDDs, and how to work with Spark through its Spark SQL interface.

At the end of this book, we will introduce Spark Streaming, the streaming library of Spark, and will walk you through the emerging Lambda Architecture (LA), which provides a hybrid platform for big data processing by combining real-time and precomputed batch data to provide a near real-time view of incoming data.
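The idea of combining a precomputed batch view with a real-time view can be sketched in a few lines. This is a minimal illustration in plain Python, not the book's implementation (the table of contents indicates the book builds its layers with Spark and Cassandra). The function names and the sample event labels are hypothetical and chosen only for this sketch.

```python
from collections import Counter

def build_batch_view(historical_events):
    """Batch layer (hypothetical helper): precompute counts over all stored events."""
    return Counter(historical_events)

def update_realtime_view(realtime_view, event):
    """Speed layer (hypothetical helper): incrementally count events
    that arrived after the last batch run."""
    realtime_view[event] += 1
    return realtime_view

def query(batch_view, realtime_view, key):
    """Serving layer (hypothetical helper): merge the precomputed and
    incremental counts to answer a query with near real-time freshness."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

# Sample data is made up for illustration.
batch_view = build_batch_view(["theft", "fraud", "theft"])
realtime_view = Counter()
for event in ["theft", "assault"]:
    update_realtime_view(realtime_view, event)

print(query(batch_view, realtime_view, "theft"))  # 3
```

In this sketch the batch view would be rebuilt periodically from the full dataset, at which point the real-time view can be reset, so that errors in the incremental path do not accumulate.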

Style and approach

This is an easy-to-follow, step-by-step tutorial, filled with practical examples of basic and advanced features.

Each topic is explained sequentially and supported by real-world examples and executable code snippets.


Table of Contents

Real-Time Big Data Analytics
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introducing the Big Data Technology Landscape and Analytics Platform
Big Data – a phenomenon
The Big Data dimensional paradigm
The Big Data ecosystem
The Big Data infrastructure
Components of the Big Data ecosystem
The Big Data analytics architecture
Building business solutions
Dataset processing
Solution implementation
Presentation
Distributed batch processing
Batch processing in distributed mode
Push code to data
Distributed databases (NoSQL)
Advantages of NoSQL databases
Choosing a NoSQL database
Real-time processing
The telecoms or cellular arena
Transportation and logistics
The connected vehicle
The financial sector
Summary
2. Getting Acquainted with Storm
An overview of Storm
The journey of Storm
Storm abstractions
Streams
Topology
Spouts
Bolts
Tasks
Workers
Storm architecture and its components
A Zookeeper cluster
A Storm cluster
How and when to use Storm
Storm internals
Storm parallelism
Storm internal message processing
Summary
3. Processing Data with Storm
Storm input sources
Meet Kafka
Getting to know more about Kafka
Other sources for input to Storm
A file as an input source
A socket as an input source
Kafka as an input source
Reliability of data processing
The concept of anchoring and reliability
The Storm acking framework
Storm simple patterns
Joins
Batching
Storm persistence
Storm's JDBC persistence framework
Summary
4. Introduction to Trident and Optimizing Storm Performance
Working with Trident
Transactions
Trident topology
Trident tuples
Trident spout
Trident operations
Merging and joining
Filter
Function
Aggregation
Grouping
State maintenance
Understanding LMAX
Memory and cache
Ring buffer – the heart of the disruptor
Producers
Consumers
Storm internode communication
ZeroMQ
Storm ZeroMQ configurations
Netty
Understanding the Storm UI
Storm UI landing page
Topology home page
Optimizing Storm performance
Summary
5. Getting Acquainted with Kinesis
Architectural overview of Kinesis
Benefits and use cases of Amazon Kinesis
High-level architecture
Components of Kinesis
Creating a Kinesis streaming service
Access to AWS Kinesis
Configuring the development environment
Creating Kinesis streams
Creating Kinesis stream producers
Creating Kinesis stream consumers
Generating and consuming crime alerts
Summary
6. Getting Acquainted with Spark
An overview of Spark
Batch data processing
Real-time data processing
Apache Spark – a one-stop solution
When to use Spark – practical use cases
The architecture of Spark
High-level architecture
Spark extensions/libraries
Spark packaging structure and core APIs
The Spark execution model – master-worker view
Resilient distributed datasets (RDD)
RDD – by definition
Fault tolerance
Storage
Persistence
Shuffling
Writing and executing our first Spark program
Hardware requirements
Installation of the basic software
Spark
Java
Scala
Eclipse
Configuring the Spark cluster
Coding a Spark job in Scala
Coding a Spark job in Java
Troubleshooting – tips and tricks
Port numbers used by Spark
Classpath issues – class not found exception
Other common exceptions
Summary
7. Programming with RDDs
Understanding Spark transformations and actions
RDD APIs
RDD transformation operations
RDD action operations
Programming Spark transformations and actions
Handling persistence in Spark
Summary
8. SQL Query Engine for Spark – Spark SQL
The architecture of Spark SQL
The emergence of Spark SQL
The components of Spark SQL
The DataFrame API
DataFrames and RDD
User-defined functions
DataFrames and SQL
The Catalyst optimizer
SQL and Hive contexts
Coding our first Spark SQL job
Coding a Spark SQL job in Scala
Coding a Spark SQL job in Java
Converting RDDs to DataFrames
Automated process
The manual process
Working with Parquet
Persisting Parquet data in HDFS
Partitioning and schema evolution or merging
Partitioning
Schema evolution/merging
Working with Hive tables
Performance tuning and best practices
Partitioning and parallelism
Serialization
Caching
Memory tuning
Summary
9. Analysis of Streaming Data Using Spark Streaming
High-level architecture
The components of Spark Streaming
The packaging structure of Spark Streaming
Spark Streaming APIs
Spark Streaming operations
Coding our first Spark Streaming job
Creating a stream producer
Writing our Spark Streaming job in Scala
Writing our Spark Streaming job in Java
Executing our Spark Streaming job
Querying streaming data in real time
The high-level architecture of our job
Coding the crime producer
Coding the stream consumer and transformer
Executing the SQL Streaming Crime Analyzer
Deployment and monitoring
Cluster managers for Spark Streaming
Executing Spark Streaming applications on Yarn
Executing Spark Streaming applications on Apache Mesos
Monitoring Spark Streaming applications
Summary
10. Introducing Lambda Architecture
What is Lambda Architecture
The need for Lambda Architecture
Layers/components of Lambda Architecture
The technology matrix for Lambda Architecture
Realization of Lambda Architecture
High-level architecture
Configuring Apache Cassandra and Spark
Coding the custom producer
Coding the real-time layer
Coding the batch layer
Coding the serving layer
Executing all the layers
Summary
Index

Real-Time Big Data Analytics

Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2016
Production reference: 1230216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-140-9
www.packtpub.com

Credits

Authors
Sumit Gupta
Shilpi Saxena
Reviewer
Pethuru Raj
Commissioning Editor
Akram Hussain
Acquisition Editor
Larissa Pinto
Content Development Editor
Shweta Pant
Technical Editors
Taabish Khan
Madhunikita Sunil Chindarkar
Copy Editors
Roshni Banerjee
Yesha Gangani
Rashmi Sawant
Project Coordinator
Kinjal Bari
Proofreader
Safis Editing
Indexer
Tejal Daruwale Soni
Graphics
Kirk D'Penha
Disha Haria
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph
