eBook - ePub

Big Data and Hadoop- Learn by Example

Name: Big Data and Hadoop- Learn by Example
ISBN: 9789386551993

Learn by example

Mayank Bhushan,

English
ePUB (mobile friendly)

About this book

A Practical Guide to learn HIVE, PIG, SQOOP

Key Features

Overview Of Big Data
Basics of Hadoop
Hadoop Distributed File System
HBase, MapReduce
HIVE: The Data Warehouse Of Hadoop
PIG: The Higher Level Programming Environment
SQOOP: Importing Data From Heterogeneous Sources
Flume, Oozie, Zookeeper & Big Data Stream Mining.

Description
This book covers one of the latest trends in the IT industry, 'Big Data and Hadoop'. It explains how big 'Big Data' is and why everybody is trying to implement it in their IT projects.
It draws on research work on various topics and combines theoretical and practical approaches; each component of the architecture is described along with current industry trends.
Big Data and Hadoop, taken together, are a new skill by industry standards. Readers get a compact book that reflects industry experience and can serve as a reference.

What You Will Learn

Big Data, NoSQL Data Management
Basics of Hadoop, its Installation
MapReduce Applications
Hadoop Related Tools

Who This Book is For
This book promises to be a very good starting point for beginners and an asset to advanced users too. Difficult Big Data and Hadoop concepts are presented in an easy, practical way so that students can understand them efficiently.

Table of Contents
1. Big Data-Introduction and Demand
2. NoSQL Data Management
3. Basics of Hadoop
4. Hadoop Installation (Step-by-Step)
5. MapReduce Applications
6. Hadoop Related Tools-I (HBase & Cassandra)
7. Hadoop Related Tools-II (PigLatin & HiveQL)
8. Practical & Research based Topics
9. Appendix: Hadoop Commands
10. Chapter wise Questions
11. Previous Year Question Paper

About the Author
Mayank Bhushan is an Assistant Professor at ABES Engineering College, Ghaziabad, with more than eight years of teaching experience. He completed his graduate and postgraduate studies in Computer Science & Engineering. He holds global certifications in Big Data Analytics and Salesforce cloud computing, and a certification on the Linux platform from the Indian Institute of Technology Kharagpur.

Publisher: BPB Publications
Year: 2018
Topic: Computer Science
eBook ISBN: 9789386551993
Subtopic: Databases
Index: Computer Science

CHAPTER 1 Big Data-Introduction and Demand

“…Data is useless without the skill to analyse it.”

-Jeanne Harris, senior executive at Accenture Institute for High Performance

“Taking a hunch you have about the world and pursuing it in a structural, mathematical way to understand something new about the world.”

-Hilary Mason, American data scientist and the founder of technology start-up Fast Forward Labs

1.1 Big Data

Today we are all surrounded by large volumes of data. We humans are ourselves a source of big data: we are surrounded by devices and generate data every minute.

“I spend most of my time assuming the world is not ready for the technology revolution that will be happening to them soon,”

-Eric Schmidt, Executive Chairman, Google

As a matter of fact, if we compare the present with the past, we find that we now create as much information in just two days as we did in all the years up to 2003. That means we create five exabytes of data every two days.

The real problem is the data that users generate continuously. When it comes to analysis, both storing and analysing that data are challenges.

“The real issue is user-generated content,”

Schmidt

Mostly, this helps Google analyse the data and sell data analytics to companies that require it. We produce data through our mobile phones alone, since we are logged in from the moment we buy the device:

Maps: collect data about our travel.
Apps: gather information about our mood swings and record the activities we spend most of our time on.
E-commerce sites: collect information about our requirements and show us what we are likely to buy.
Emails: produce data about our requirements from our conversations, since all conversations generally pass through the companies that own the mail services.

During the past few decades, technologies such as remote sensing, geographical information systems, and global positioning systems have transformed the way the distribution of human population across the world is studied and modelled. Population mapping, however, still depends on the surveys carried out by large organizations. As a result, spatially detailed changes across scales of days, weeks, or months, or even year to year, are difficult to assess, which limits the application of human population maps in situations where timely data is needed, such as disasters, conflicts, or epidemics. With data being collected every day by mobile network providers across the world, the prospect of being able to map contemporary and changing human population distributions over relatively short intervals exists, paving the way for new applications and a near real-time understanding of the patterns and processes in human science.

Some of the facts related to exponential data production are:

Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet. At this point, predicted data production will be 44 times greater than that in 2009.
In 2012, 2.5 quintillion bytes of data were generated daily, and 90% of current data worldwide originated in the past two years.
Facebook alone stores, accesses, and analyses 30+ PB of user-generated data.
In 2008, Google was processing 20,000 TB of data daily.
Walmart processes over 1 million customer transactions every hour, generating an estimated 2.5 PB or more of data.
More than 5 billion people worldwide call, text, tweet, and browse on mobile devices.
The number of e-mail accounts created worldwide is expected to increase from 3.3 billion in 2012 to over 4.3 billion by late 2016 at an average annual rate of 6% over the next four years. In 2012, a total of 89 billion e-mails were sent and received daily, and this value is expected to increase at an average annual rate of 13% over the next four years to exceed 143 billion by the end of 2016.
Boston.com reported that in 2013, approximately 507 billion e-mails were sent daily. Currently, an e-mail is sent every 3.5 × 10⁻⁷ seconds. Thus, the volume of data increases every second because of rapid data generation.
By 2020, enterprise data is expected to total 40 ZB, as per International Data Corporation.
The New York Stock Exchange generates about one terabyte of new trade data per day.

Based on this estimation, business-to-consumer (B2C) and internet-business-to-business (B2B) transactions will amount to 450 billion per day.
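The growth rates quoted in the e-mail figures above are compound annual rates. As a rough check, the sketch below recomputes the rate implied by the daily e-mail volumes (89 billion in 2012, 143 billion by the end of 2016). The function name is illustrative and not from the book.

```python
def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


# Daily e-mail volume in billions, taken from the list above.
email_rate = compound_annual_growth_rate(start=89, end=143, years=4)
print(f"Implied annual growth in daily e-mails: {email_rate:.1%}")
```

The implied rate comes out close to the 13% quoted in the text.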

These facts are sufficient to show that the world is generating a large amount of data, much of it unstructured. That situation calls for innovation and new thinking that can provide solutions to these issues.

Big data is the concept used to deal with the current scenario: it handles both unstructured and structured data in ways that go beyond the traditional approach.

Unit	Symbol	Size
Bit	b	1 bit
Nibble/Nybble	Nibble	4 bits
Byte	B	8 bits
KiloByte	KB	1024 B
MegaByte	MB	1024 KB
GigaByte	GB	1024 MB
TeraByte	TB	1024 GB
PetaByte	PB	1024 TB
ExaByte	EB	1024 PB
ZettaByte	ZB	1024 EB
YottaByte	YB	1024 ZB
SabiByte	SB	1024 YB
JobiByte	JB	1024 SB

Table 1.1: Introduction of data

Table 1.1 lists units of data in increasing order of size from top to bottom. Today, data of any type can be stored and processed.
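The factor-of-1024 steps in Table 1.1 can be sketched as a small conversion routine. This is a minimal illustration, not code from the book; the function name is hypothetical, and the unit list stops at YB, the largest widely recognised unit in the table.

```python
# Units from Table 1.1, each 1024 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop when the value fits in this unit, or when no larger unit is left.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(human_readable(1536))           # 1.50 KB
print(human_readable(5 * 1024 ** 6))  # 5.00 EB
```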

1.1.1 Characteristics of Big Data

Big data is data that forces us to think beyond the traditional database system. Such data may be structured or unstructured and comes in huge volumes; it requires faster movement, storage, and processing than conventional database techniques can deliver. These requirements demand tools that perform fast, meaningful operations that are difficult for any traditional database tool. The properties of big data point to a next-generation way of handling the situation and give organizations an easy, efficient way to handle their data. All around us, a large number of devices continuously generate data at an exponentially increasing rate, and people are immersing themselves in social networking. This mix of unstructured and structured data creates challenges in storing and processing data.

Every day the world creates 2.5 quintillion bytes of data; 90% of the data in the world today was created in the last two years alone. It comes from sensors, videos, posts, Twitter, WhatsApp, Facebook, and many other digital sites with many users.

Big data vs. traditional database techniques

Traditional Database “Schema on Write”	Big Data “Schema on Read”
A schema must be created before data is loaded into the database.	Data is first copied to HDFS as it is; any transformation is applied afterwards.
An explicit load operation transforms the data into the database's structure.	Only the required columns are extracted when operations are performed.
It relies on scale-up (a bigger server) to handle growth.	It relies on scale-out (more machines), so capacity can be added at any time.
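The contrast in the table above can be sketched in a few lines. This is a minimal illustration under stated assumptions: an in-memory SQLite table stands in for a traditional database, an in-memory text buffer stands in for raw files copied to HDFS, and the `sales` data and function name are made up for the example.

```python
import csv
import io
import sqlite3

# Schema on write: the table structure must exist before any row is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT NOT NULL, qty INTEGER NOT NULL, city TEXT)")
conn.execute("INSERT INTO sales VALUES (?, ?, ?)", ("pen", 3, "Delhi"))
total_written = conn.execute("SELECT SUM(qty) FROM sales").fetchone()[0]

# Schema on read: raw lines are stored untouched, even when their shape varies
# (the last line carries an extra field that no schema was declared for).
raw_store = io.StringIO("pen,3,Delhi\nbook,2,Mumbai\nink,5,Delhi,extra-field\n")


def read_quantities(raw):
    """Apply structure only at read time, extracting just the column needed."""
    for row in csv.reader(raw):
        yield int(row[1])  # the second field is interpreted as the quantity


total_read = sum(read_quantities(raw_store))
print(total_written, total_read)
```

The irregular third line would be rejected by the fixed three-column table, while the read-time parser simply ignores the field it does not need.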

Three V's define the characteristics of big data clearly.

Fig. 1.1: 3 V's of Big Data

Fig. 1.1 shows the three initial V's on which big data depends. Volume refers to any large amount of data that needs to be stored for analytics. Because data is increasing exponentially, processing of up to yottabytes (YB) of data may become possible, and companies can now plan for it with a solution in hand. The volume of data keeps growing: consultants predict that the amount of information in the world could reach 25 ZB in 2020, an exponential rate of growth.

A text file may be a few kilobytes, a sound file a few megabytes, and a full-length movie a few gigabytes. More sources of data are being added on a continuous basis. For any company today, data is generated not only by its employees but also by its machines, such as CCTV cameras, punching machines, and smart sensors.

More sources of data, each producing larger data, combine to increase the volume that needs to be analysed. If we look around, a gigabyte of storage on commodity systems now costs next to nothing; soon terabytes will be the norm.

Velocity refers to the speed at which data arrives, which is increasing exponentially. Data is being created and integrated at an ever faster pace. We have moved from batch to real-time business.

Initially the trend was to analyse data in batches because the volume was large: data was submitted to a server and the user waited for processing to finish, so results were inevitably delayed. Newer sources include many different types of machine-generated data, which big data tools can handle easily. Data is now processed on the server in real time, in a continuous fashion; how quickly output is delivered also depends on delays at the sources emitting the data.

There is no guarantee that data arrives in bulk; at times it may trickle in slowly. Big data techniques offer an easy and accurate way to handle this variation in the pace of data flow.
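The difference between batch and real-time processing described above can be sketched with a running average. This is a simplified illustration, not a description of any particular big data tool; the function names and sensor readings are made up.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: wait until every reading has arrived, then compute once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated average as each reading arrives."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


sensor_readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor values

batch_result = batch_average(sensor_readings)
stream_results = list(streaming_average(sensor_readings))
print(batch_result)    # 25.0
print(stream_results)  # [10.0, 15.0, 20.0, 25.0]
```

The streaming version produces a usable result after every reading, regardless of how quickly or slowly the source delivers the next one, while the batch version produces nothing until the whole data set has been submitted.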

Variety refers to the different types of input from which information must be extracted. Some 80% of the world's data is said to be unstructured, whereas traditional data-handling techniques cater mainly to structured data. Text (SMS), photos, audio, video, web, GPS data, sensor data, relational databases, documents, PDF, Flash, etc. are all data that keeps flowing and must be brought under control to be stored and processed. Services such as Facebook and e-mail have no control over the input a user may provide. The variety of data sources continues to increase. It includes:

Internet data (e.g., click stream, social media, and social networking links)
Primary research (e.g., surveys, experiments, observations)
Secondary research (e.g., competitive and marketplace data, industry reports, consumer data, business data)
Location data (e.g., mobile device data, geospatial data, GPS)
Image data (e.g., video, satellite image, surveillance)
Supply chain data (e.g., EDI, vendor catalogues and pricing, quality information)
Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)

Fig. 1.2: Additional V's

Two additional V's (Fig. 1.2) are useful for drawing the user's attention to further characteristics of big data. We can all see the messiness of data around us, such as Twitter hashtags and smileys mixed with text. Such data is very difficult to handle when it needs to be mined. Big data makes it easy to store. A hashtag (#) on Twitter is used to categorize a topic so that meaningful or required data can be fetched at extraction time and users can continue to trust it. Nowadays, every company wants its s...
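The hashtag-based categorisation mentioned above can be sketched as follows. This is a minimal illustration using Python's standard `re` module; the sample posts and the function name are made up, and real social media text would need more careful handling.

```python
import re
from collections import defaultdict

# A hashtag is taken here to be '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def group_by_hashtag(posts):
    """Map each hashtag (lower-cased) to the posts that mention it."""
    topics = defaultdict(list)
    for post in posts:
        for tag in HASHTAG.findall(post):
            topics[tag.lower()].append(post)
    return dict(topics)


posts = [
    "Loving the new phone :) #BigData #hadoop",
    "Cluster is down again #Hadoop",
    "No tags here",
]
topics = group_by_hashtag(posts)
print(sorted(topics))         # ['bigdata', 'hadoop']
print(len(topics["hadoop"]))  # 2
```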
