Mastering Apache Storm

Name: Mastering Apache Storm
ISBN: 9781787120402

Ankit Jain

249 pages
English

About this book

Master the intricacies of Apache Storm and develop real-time stream processing applications with ease

About This Book

- Exploit the various real-time processing functionalities offered by Apache Storm such as parallelism, data partitioning, and more
- Integrate Storm with other Big Data technologies like Hadoop, HBase, and Apache Kafka
- An easy-to-understand guide to effortlessly create distributed applications with Storm

Who This Book Is For

If you are a Java developer who wants to enter into the world of real-time stream processing applications using Apache Storm, then this book is for you. No previous experience in Storm is required as this book starts from the basics. After finishing this book, you will be able to develop not-so-complex Storm applications.

What You Will Learn

- Understand the core concepts of Apache Storm and real-time processing
- Follow the steps to deploy multiple nodes of Storm Cluster
- Create Trident topologies to support various message-processing semantics
- Make your cluster sharing effective using Storm scheduling
- Integrate Apache Storm with other Big Data technologies such as Hadoop, HBase, Kafka, and more
- Monitor the health of your Storm cluster

In Detail

Apache Storm is a real-time Big Data processing framework that processes large amounts of data reliably, guaranteeing that every message will be processed. Storm allows you to scale your data as it grows, making it an excellent platform to solve your big data problems. This extensive guide will help you understand right from the basics to the advanced topics of Storm.

The book begins with a detailed introduction to real-time processing and where Storm fits in to solve these problems. You'll get an understanding of deploying Storm on clusters by writing a basic Storm Hello World example. Next we'll introduce you to Trident and you'll get a clear understanding of how you can develop and deploy a trident topology. We cover topics such as monitoring, Storm Parallelism, scheduler and log processing, in a very easy to understand manner. You will also learn how to integrate Storm with other well-known Big Data technologies such as HBase, Redis, Kafka, and Hadoop to realize the full potential of Storm.

With real-world examples and clear explanations, this book will ensure you will have a thorough mastery of Apache Storm. You will be able to use this knowledge to develop efficient, distributed real-time applications to cater to your business needs.

Style and approach

This easy-to-follow guide is full of examples and real-world applications to help you get an in-depth understanding of Apache Storm. This book covers the basics thoroughly and also delves into the intermediate and slightly advanced concepts of application development with Apache Storm.

Information

Publisher: Packt Publishing
Year: 2017
eBook ISBN: 9781787120402
Topic: Computer Science
Subtopic: Data Mining

Storm Deployment, Topology Development, and Topology Options

In this chapter, we start by deploying Storm on a multi-node cluster (three Storm nodes and three ZooKeeper nodes). This chapter is important because it focuses on how to set up a production Storm cluster and why the Storm Supervisor, Nimbus, and ZooKeeper all need to be highly available (Storm uses ZooKeeper to store the metadata of the cluster, the topologies, and so on).

The following are the key points that we are going to cover in this chapter:

Deployment of the Storm cluster
Program and deploy the word count example
Different options of the Storm UI: kill, activate, deactivate, and rebalance
Walkthrough of the Storm UI
Dynamic log level settings
Validating the Nimbus high availability

Storm prerequisites

You should have the JDK and a ZooKeeper ensemble installed before starting the deployment of the Storm cluster.

Installing Java SDK 7

Perform the following steps to install the Java SDK 7 on your machine. You can also go with JDK 1.8:

Download the Java SDK 7 RPM from Oracle's site (http://www.oracle.com/technetwork/java/javase/downloads/index.html).
Install the Java jdk-7u<version>-linux-x64.rpm file on your CentOS machine using the following command:

sudo rpm -ivh jdk-7u<version>-linux-x64.rpm

Add the following environment variable in the ~/.bashrc file:

export JAVA_HOME=/usr/java/jdk<version>

Add the JDK's bin directory to the PATH environment variable in the ~/.bashrc file:

export PATH=$JAVA_HOME/bin:$PATH

Run the following command to reload the bashrc file on the current login terminal:

source ~/.bashrc

Check the Java installation as follows:

java -version

The output of the preceding command is as follows:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
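
As a quick sanity check after the steps above, the following sketch tests whether a JDK bin directory appears in a PATH-style list. The helper name path_contains is hypothetical, and the two example variables hold placeholder values chosen for illustration; they are not paths taken from the text.

```shell
# Hypothetical helper: succeeds when directory $1 is an entry
# of the colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Placeholder values for illustration only (assumptions).
JAVA_HOME_EXAMPLE=/usr/java/example-jdk
PATH_EXAMPLE="$JAVA_HOME_EXAMPLE/bin:/usr/bin:/bin"

if path_contains "$JAVA_HOME_EXAMPLE/bin" "$PATH_EXAMPLE"; then
  echo "JDK bin directory is on PATH"
else
  echo "JDK bin directory is missing from PATH"
fi
```

On a real machine, the same check would be run as path_contains "$JAVA_HOME/bin" "$PATH" after reloading ~/.bashrc.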

Deployment of the ZooKeeper cluster

In any distributed application, various processes need to coordinate with each other and share configuration information. ZooKeeper is an application that provides these services in a reliable manner. Being a distributed application, Storm also uses a ZooKeeper cluster to coordinate its processes. All of the state associated with the cluster and with the tasks submitted to Storm is stored in ZooKeeper. This section describes how you can set up a ZooKeeper cluster. We will deploy a ZooKeeper ensemble of three nodes, which can tolerate one node failure.

[Figure: deployment diagram of the three-node ZooKeeper ensemble]

In the ZooKeeper ensemble, one node in the cluster acts as the leader, while the rest are followers. If the leader node of the ZooKeeper cluster dies, an election takes place among the remaining live nodes, and a new leader is elected. All write requests coming from clients are forwarded to the leader node, while the follower nodes only handle the read requests. For the same reason, we can't increase the write performance of the ZooKeeper ensemble by adding nodes, because all write operations go through the leader node.

It is advised to run an odd number of ZooKeeper nodes, because the ZooKeeper cluster keeps working as long as a majority of the nodes are running (that is, the number of live nodes is greater than n/2, where n is the number of deployed nodes). A cluster of four ZooKeeper nodes needs three live nodes (3 > 4/2), so it can handle only one node failure, which is no better than a three-node cluster (2 > 3/2). A cluster of five nodes also needs three live nodes (3 > 5/2), so it can handle two node failures.
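
The majority rule can be checked with a few lines of shell arithmetic. This is a minimal sketch; the function names quorum_size and tolerated_failures are hypothetical and are not part of ZooKeeper.

```shell
# Smallest number of live nodes that is strictly greater than n/2.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while a majority stays alive.
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
  echo "nodes=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

For three, four, and five nodes this reports one, one, and two tolerated failures, which is why an even-sized ensemble adds a node without adding fault tolerance.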

Steps 1 to 4 need to be performed on each node to deploy the ZooKeeper ensemble:

Download the latest stable ZooKeeper release from the ZooKeeper site (http://zookeeper.apache.org/releases.html). At the time of writing, the latest version is ZooKeeper 3.4.6.
Once you have downloaded the latest version, unzip it. Now, we set up the ZK_HOME environment variable to make the setup easier.
Point the ZK_HOME environment variable to the unzipped directory. Create the configuration file, zoo.cfg, in the $ZK_HOME/conf directory using the following commands:

cd $ZK_HOME/conf
touch zoo.cfg

Add the following properties to the zoo.cfg file:

tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

Here, zoo1, zoo2, and zoo3 are the IP addresses of the ZooKeeper nodes. The following are the definitions for each of the properties:

- tickTime: This is the basic unit of time in milliseconds used by ZooKeeper. It is used to send heartbeats, and the minimum session timeout will be twice the tickTime value.
- dataDir: This is the directory in which ZooKeeper stores the snapshots of its in-memory database and the transaction log.
- clientPort: This is the port used to listen to client connections.
- initLimit: This is the number of ticks (intervals of tickTime) that followers are allowed to take to connect and sync to the leader node.
- syncLimit: This is the number of tickTime values that a follower can take to sync with the leader node. If the sync does not happen within this time, the follower will be dropped from the ensemble.
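
Because initLimit and syncLimit are expressed in ticks, it can help to convert them into milliseconds. The sketch below does that for the values in the zoo.cfg shown earlier; the shell variable names are illustrative and are not ZooKeeper settings.

```shell
# Values copied from the zoo.cfg in this section.
tick_time_ms=2000
init_limit=5
sync_limit=2

# Minimum session timeout is twice the tickTime value.
min_session_timeout_ms=$(( 2 * tick_time_ms ))
# Time a follower may take to connect and sync to the leader.
init_window_ms=$(( init_limit * tick_time_ms ))
# Time a follower may take to sync with the leader before it is dropped.
sync_window_ms=$(( sync_limit * tick_time_ms ))

echo "minimum session timeout: ${min_session_timeout_ms} ms"
echo "initial connect and sync window: ${init_window_ms} ms"
echo "sync window: ${sync_window_ms} ms"
```

With these settings, the minimum session timeout is 4 seconds, a follower has 10 seconds for its initial connect and sync, and 4 seconds for later syncs.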

The last three lines of the server.id=host:port:port format specify that there are three nodes in the ensemble. In an ensemble, each ZooKeeper node must have a unique ID number between 1 and 255. This ID is defined by creating a file named myid in the dataDir directory of each node. For example, the node with the ID 1 (server.1=zoo1:2888:3888) will have a myid file at directory /var/zookeeper with text 1 insid...

Table of contents

Title Page
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
Real-Time Processing and Storm Introduction
Storm Deployment, Topology Development, and Topology Options
Storm Parallelism and Data Partitioning
Trident Introduction
Trident Topology and Uses
Storm Scheduler
Monitoring of Storm Cluster
Integration of Storm and Kafka
Storm and Hadoop Integration
Storm Integration with Redis, Elasticsearch, and HBase
Apache Log Processing with Storm
Twitter Tweet Collection and Machine Learning
