eBook - ePub

Real-Time Big Data Analytics

Name: Real-Time Big Data Analytics
ISBN: 9781784391409

Sumit Gupta, Shilpi Saxena

326 pages
English
ePUB (mobile friendly)

About this book

Design, process, and analyze large sets of complex data in real time

About This Book

Get acquainted with transformations and database-level interactions, and ensure the reliability of messages processed using Storm
Implement strategies to solve the challenges of real-time data processing
Load datasets, build queries, and make recommendations using Spark SQL

Who This Book Is For

If you are a Big Data architect, developer, or a programmer who wants to develop applications/frameworks to implement real-time analytics using open source technologies, then this book is for you.

What You Will Learn

Explore big data technologies and frameworks
Work through practical challenges and use cases of real-time analytics versus batch analytics
Develop real-world use cases for processing and analyzing data in real time using the programming paradigm of Apache Storm
Handle and process real-time transactional data
Optimize and tune Apache Storm for varied workloads and production deployments
Process and stream data with Amazon Kinesis and Elastic MapReduce
Perform interactive and exploratory data analytics using Spark SQL
Develop common enterprise architectures/applications for real-time and batch analytics

In Detail

Enterprises have long struggled with the challenges of data arriving in real time or near real time.

Although there are technologies such as Storm and Spark (and many more) that solve the challenges of real-time data, using the appropriate technology/framework for the right business use case is the key to success. This book provides you with the skills required to quickly design, implement and deploy your real-time analytics using real-world examples of big data use cases.

The book begins with the basics of various real-time data processing frameworks and technologies. We discuss the differences between batch and real-time processing in detail, and explore the techniques and programming concepts of Apache Storm.

Moving on, we familiarize you with Amazon Kinesis for real-time data processing in the cloud. We then develop your understanding of real-time analytics through a comprehensive review of Apache Spark, including its high-level architecture and the building blocks of a Spark program.

You will learn how to transform your data, get output from transformations, and persist your results using Spark RDDs, and how to work with Spark through its Spark SQL interface.

Toward the end of the book, we introduce Spark Streaming, the streaming library of Spark, and walk you through the emerging Lambda Architecture (LA), a hybrid platform for big data processing that combines real-time and precomputed batch data to provide a near real-time view of incoming data.
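As a rough illustration of how a Lambda Architecture combines a precomputed batch view with a real-time view, here is a minimal in-memory sketch in plain Python. It is not the book's implementation (the table of contents indicates that the book builds its layers with Apache Spark and Apache Cassandra); the function and class names (`build_batch_view`, `SpeedLayer`, `query`) and the sample event types are hypothetical and chosen only for this example.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer (sketch): recompute counts from the full event log."""
    return Counter(event["type"] for event in master_events)


class SpeedLayer:
    """Speed layer (sketch): count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        # Incrementally update the real-time view as each event arrives.
        self.realtime_view[event["type"]] += 1

    def reset(self):
        # Called once a new batch view has absorbed these events.
        self.realtime_view.clear()


def query(batch_view, realtime_view, event_type):
    """Serving layer (sketch): merge the precomputed and incremental views."""
    return batch_view[event_type] + realtime_view[event_type]


# Hypothetical sample data: events already covered by the last batch run.
historical = [{"type": "theft"}, {"type": "theft"}, {"type": "fraud"}]
batch_view = build_batch_view(historical)

# Events that arrived after the batch run are handled by the speed layer.
speed = SpeedLayer()
for event in [{"type": "theft"}, {"type": "assault"}]:
    speed.on_event(event)

theft_total = query(batch_view, speed.realtime_view, "theft")
assault_total = query(batch_view, speed.realtime_view, "assault")
print(theft_total, assault_total)  # prints: 3 1
```

The batch view is accurate but lags behind incoming data, while the real-time view covers only the events seen since the last batch run. Merging the two at query time is what gives the near real-time picture described above.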

Style and approach

This is an easy-to-follow, step-by-step tutorial filled with practical examples of basic and advanced features.

Each topic is explained sequentially and supported by real-world examples and executable code snippets.

Information

Publisher: Packt Publishing
Year: 2016
eBook ISBN: 9781784391409
Topic: Computer Science
Subtopic: Data Processing

Real-Time Big Data Analytics

Credits

About the Authors

About the Reviewer

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Introducing the Big Data Technology Landscape and Analytics Platform

Big Data – a phenomenon

The Big Data dimensional paradigm

The Big Data ecosystem

The Big Data infrastructure

Components of the Big Data ecosystem

The Big Data analytics architecture

Building business solutions

Dataset processing

Solution implementation

Presentation

Distributed batch processing

Batch processing in distributed mode

Push code to data

Distributed databases (NoSQL)

Advantages of NoSQL databases

Choosing a NoSQL database

Real-time processing

The telecoms or cellular arena

Transportation and logistics

The connected vehicle

The financial sector

Summary

2. Getting Acquainted with Storm

An overview of Storm

The journey of Storm

Storm abstractions

Streams

Topology

Spouts

Bolts

Tasks

Workers

Storm architecture and its components

A Zookeeper cluster

A Storm cluster

How and when to use Storm

Storm internals

Storm parallelism

Storm internal message processing

Summary

3. Processing Data with Storm

Storm input sources

Meet Kafka

Getting to know more about Kafka

Other sources for input to Storm

A file as an input source

A socket as an input source

Kafka as an input source

Reliability of data processing

The concept of anchoring and reliability

The Storm acking framework

Storm simple patterns

Joins

Batching

Storm persistence

Storm's JDBC persistence framework

Summary

4. Introduction to Trident and Optimizing Storm Performance

Working with Trident

Transactions

Trident topology

Trident tuples

Trident spout

Trident operations

Merging and joining

Filter

Function

Aggregation

Grouping

State maintenance

Understanding LMAX

Memory and cache

Ring buffer – the heart of the disruptor

Producers

Consumers

Storm internode communication

ZeroMQ

Storm ZeroMQ configurations

Netty

Understanding the Storm UI

Storm UI landing page

Topology home page

Optimizing Storm performance

Summary

5. Getting Acquainted with Kinesis

Architectural overview of Kinesis

Benefits and use cases of Amazon Kinesis

High-level architecture

Components of Kinesis

Creating a Kinesis streaming service

Access to AWS Kinesis

Configuring the development environment

Creating Kinesis streams

Creating Kinesis stream producers

Creating Kinesis stream consumers

Generating and consuming crime alerts

Summary

6. Getting Acquainted with Spark

An overview of Spark

Batch data processing

Real-time data processing

Apache Spark – a one-stop solution

When to use Spark – practical use cases

The architecture of Spark

High-level architecture

Spark extensions/libraries

Spark packaging structure and core APIs

The Spark execution model – master-worker view

Resilient distributed datasets (RDD)

RDD – by definition

Fault tolerance

Storage

Persistence

Shuffling

Writing and executing our first Spark program

Hardware requirements

Installation of the basic software

Spark

Java

Scala

Eclipse

Configuring the Spark cluster

Coding a Spark job in Scala

Coding a Spark job in Java

Troubleshooting – tips and tricks

Port numbers used by Spark

Classpath issues – class not found exception

Other common exceptions

Summary

7. Programming with RDDs

Understanding Spark transformations and actions

RDD APIs

RDD transformation operations

RDD action operations

Programming Spark transformations and actions

Handling persistence in Spark

Summary

8. SQL Query Engine for Spark – Spark SQL

The architecture of Spark SQL

The emergence of Spark SQL

The components of Spark SQL

The DataFrame API

DataFrames and RDD

User-defined functions

DataFrames and SQL

The Catalyst optimizer

SQL and Hive contexts

Coding our first Spark SQL job

Coding a Spark SQL job in Scala

Coding a Spark SQL job in Java

Converting RDDs to DataFrames

Automated process

The manual process

Working with Parquet

Persisting Parquet data in HDFS

Partitioning and schema evolution or merging

Partitioning

Schema evolution/merging

Working with Hive tables

Performance tuning and best practices

Partitioning and parallelism

Serialization

Caching

Memory tuning

Summary

9. Analysis of Streaming Data Using Spark Streaming

High-level architecture

The components of Spark Streaming

The packaging structure of Spark Streaming

Spark Streaming APIs

Spark Streaming operations

Coding our first Spark Streaming job

Creating a stream producer

Writing our Spark Streaming job in Scala

Writing our Spark Streaming job in Java

Executing our Spark Streaming job

Querying streaming data in real time

The high-level architecture of our job

Coding the crime producer

Coding the stream consumer and transformer

Executing the SQL Streaming Crime Analyzer

Deployment and monitoring

Cluster managers for Spark Streaming

Executing Spark Streaming applications on Yarn

Executing Spark Streaming applications on Apache Mesos

Monitoring Spark Streaming applications

Summary

10. Introducing Lambda Architecture

What is Lambda Architecture

The need for Lambda Architecture

Layers/components of Lambda Architecture

The technology matrix for Lambda Architecture

Realization of Lambda Architecture

High-level architecture

Configuring Apache Cassandra and Spark

Coding the custom producer

Coding the real-time layer

Coding the batch layer

Coding the serving layer

Executing all the layers

Summary

Index

Real-Time Big Data Analytics

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2016

Production reference: 1230216

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-140-9

www.packtpub.com

Credits

Authors

Sumit Gupta

Shilpi Saxena

Reviewer

Pethuru Raj

Commissioning Editor

Akram Hussain

Acquisition Editor

Larissa Pinto

Content Development Editor

Shweta Pant

Technical Editors

Taabish Khan

Madhunikita Sunil Chindarkar

Copy Editors

Roshni Banerjee

Yesha Gangani

Rashmi Sawant

Project Coordinator

Kinjal Bari

Proofreader

Safis Editing

Indexer

Tejal Daruwale Soni

Graphics

Kirk D'Penha

Disha Haria

Production Coordinator

Manu Joseph

Cover Work

Manu Joseph
