Big Data and Hadoop- Learn by Example
Learn by example

Mayank Bhushan

About This Book

A Practical Guide to Learning Hive, Pig and Sqoop

Key Features

  • Overview Of Big Data
  • Basics of Hadoop
  • Hadoop Distributed File System
  • HBase, MapReduce
  • Hive: The Data Warehouse of Hadoop
  • Pig: The Higher-Level Programming Environment
  • Sqoop: Importing Data from Heterogeneous Sources
  • Flume, Oozie, ZooKeeper and Big Data Stream Mining


Description
This book covers one of the latest trends in the IT industry, Big Data and Hadoop. It explains how big 'Big Data' really is and why everybody is trying to bring it into their IT projects.
It draws on research across various topics and takes both a theoretical and a practical approach; each component of the architecture is described along with current industry trends.
Taken together, Big Data and Hadoop are a new skill by industry standards. Readers get a compact book informed by industry experience that can also serve as a reference.

What You Will Learn

  • Big Data and NoSQL Data Management
  • Basics of Hadoop and its Installation
  • MapReduce Applications
  • Hadoop Related Tools

Who This Book is For
This book promises to be a very good starting point for beginners and an asset to advanced users too. Difficult Big Data and Hadoop concepts are presented in an easy, practical way so that students can understand them efficiently.

Table of Contents
  1. Big Data-Introduction and Demand
  2. NoSQL Data Management
  3. Basics of Hadoop
  4. Hadoop Installation (Step-by-Step)
  5. MapReduce Applications
  6. Hadoop Related Tools-I (HBase & Cassandra)
  7. Hadoop Related Tools-II (PigLatin & HiveQL)
  8. Practical & Research-based Topics
  9. Appendix: Hadoop Commands
  10. Chapter-wise Questions
  11. Previous Year Question Paper

About the Author
Mayank Bhushan is an Assistant Professor at ABES Engineering College, Ghaziabad, with more than eight years of teaching experience. He completed his graduate and postgraduate studies in Computer Science & Engineering. He holds global certifications in Big Data Analytics and Salesforce cloud computing, and a certification on the Linux platform from the Indian Institute of Technology Kharagpur.


Information

Year
2018
ISBN
9789386551993

CHAPTER 1

Big Data-Introduction and Demand

“…Data is useless without the skill to analyse it.”
-Jeanne Harris, senior executive at Accenture Institute for High Performance
“Taking a hunch you have about the world and pursuing it in a structural, mathematical way to understand something new about the world.”
-Hilary Mason, American data scientist and founder of the technology start-up Fast Forward Labs

1.1 Big Data

Today we are all surrounded by vast amounts of data. Each of us is also a source of big data: we are surrounded by devices and generate data every minute.
“I spend most of my time assuming the world is not ready for the technology revolution that will be happening to them soon,”
Eric Schmidt, Executive Chairman, Google
In fact, comparing the present with the past, we now create as much information in just two days as we did in all the time up to 2003. That is about five exabytes of data every two days.
The real problem is the data that users generate continuously. When it comes to analysis, both storing and analysing that data are challenges.
“The real issue is user-generated content,”
Schmidt
This mostly helps Google, which analyses the data and sells analytics to the companies that need it. We produce data through our mobile phones alone, since we are logged in from the time we buy the device:
  1. Maps: collect data about our travel.
  2. Apps: gather information about our moods and record the activities we spend most of our time on.
  3. E-commerce sites: collect information about our needs and show us what we are likely to buy.
  4. Emails: produce data about our needs based on our conversations, since conversations generally pass through the companies that own the mail services.
During the past few decades, technologies such as remote sensing, geographic information systems and global positioning systems have transformed the way the distribution of human population across the world is mapped. Population data still has to be mapped from surveys, such as those performed by large organizations. As a result, spatially detailed changes across scales of days, weeks or months, or even from year to year, are difficult to assess, which limits the application of human population maps in situations where timely data is needed, such as disasters, conflicts or epidemics. With information being collected daily by mobile network providers across the planet, the prospect now exists of mapping current and ever-changing human population distributions over comparatively short intervals, paving the way for new applications and a near real-time understanding of the patterns and processes of human geography.
Some of the facts related to exponential data production are:
  1. Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet. At this point, predicted data production will be 44 times greater than that in 2009.
  2. In 2012, 2.5 quintillion bytes of data were generated daily, and 90% of current data worldwide originated in the past two years.
  3. Facebook alone stores, accesses, and analyses 30+ PB of user-generated data.
  4. In 2008, Google was processing 20,000 TB of data daily.
  5. Walmart processes over 1 million customer transactions, generating an estimated 2.5 PB or more of data.
  6. More than 5 billion people worldwide call, text, tweet, and browse on mobile devices.
  7. The number of e-mail accounts worldwide is expected to increase from 3.3 billion in 2012 to over 4.3 billion by late 2016, an average annual rate of 6% over the next four years. In 2012, a total of 89 billion e-mails were sent and received daily, and this value is expected to increase at an average annual rate of 13% over the next four years to exceed 143 billion by the end of 2016.
  8. Boston.com reported that in 2013, approximately 507 billion e-mails were sent daily. Currently, an e-mail is sent every 3.5 × 10^-7 seconds. Thus, the volume of data increases every second because of rapid data generation.
  9. By 2020, enterprise data is expected to total 40 ZB, as per International Data Corporation.
  10. The New York Stock Exchange generates about one terabyte of new trade data.
Based on these estimates, business-to-consumer (B2C) and internet business-to-business (B2B) transactions will amount to 450 billion per day.
These facts are sufficient to show that the world is generating a large amount of data, much of it unstructured. That situation calls for innovation and new thinking to solve the resulting problems.
Big data is the approach used to deal with this situation: a concept for handling structured and unstructured data in ways that go beyond the traditional ones.
Unit             Symbol   Size
Bit              b        1 bit
Nibble/Nybble    Nibble   4 bits
Byte             B        8 bits
KiloByte         KB       1024 B
MegaByte         MB       1024 KB
GigaByte         GB       1024 MB
TeraByte         TB       1024 GB
PetaByte         PB       1024 TB
ExaByte          EB       1024 PB
ZettaByte        ZB       1024 EB
YottaByte        YB       1024 ZB
SabiByte         SB       1024 YB
JobiByte         JB       1024 SB
Table 1.1: Units of data
Table 1.1 lists the units of data from the smallest to the largest. Today, data of any type can be stored and processed.
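To make the units in Table 1.1 concrete, the following Python sketch converts a raw byte count into the largest convenient unit, using the table's factor of 1024 between neighbouring units. The function name `humanize` and the decision to stop at YB are assumptions made for this illustration; they do not come from the book.

```python
# Unit symbols from Table 1.1, byte upwards. Stopping at YB is an
# assumption of this sketch.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop when the value fits in this unit, or when no larger unit is left.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024  # move up one row of the table


if __name__ == "__main__":
    print(humanize(1024))        # one kilobyte
    print(humanize(1024 ** 3))   # one gigabyte
    print(humanize(2.5e18))      # the "2.5 quintillion bytes" quoted in the chapter
```

With the 1024-based units of the table, 2.5 quintillion bytes comes out slightly above 2 EB rather than exactly 2.5 EB, because each step divides by 1024 and not by 1000.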

1.1.1 Characteristics of Big Data

Big data is data that forces us to think beyond the traditional database system. It may be structured or unstructured and arrives in huge volumes, so it requires faster movement, storage and processing than conventional database techniques can offer. These requirements call for tools that work quickly and meaningfully, which is difficult with traditional database tools. The properties of Big data offer organizations a next-generation, easy and efficient way to handle data. All around us, many devices are continuously generating data at an exponentially increasing rate, and people are immersing themselves in social networking. This mix of unstructured and structured data makes storing and processing a challenge.
Every day the world creates 2.5 quintillion bytes of data, and 90% of the data in the world today was created in the last two years alone. Its sources include sensors, videos, posts, Twitter, WhatsApp, Facebook and many other digital sites used by large numbers of people.

Big Data vs. Traditional Database Techniques

Traditional Database (“Schema on Write”)                              | Big Data (“Schema on Read”)
A schema must be created before data is loaded into the database.    | Data is first copied to HDFS; transformation is applied afterwards, when needed.
An explicit load operation transforms the data for the database.     | Only the required columns are extracted to perform operations.
Scales up: capacity grows by enhancing the server itself.             | Scales out: capacity can be added at any time.
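The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for the traditional database, a list of JSON strings stands in for raw files copied to HDFS, and the names `visits`, `raw_lines` and `read_columns` are hypothetical.

```python
import json
import sqlite3

# Schema on write: the table definition must exist before any row is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user TEXT NOT NULL, page TEXT NOT NULL)")
conn.execute("INSERT INTO visits VALUES (?, ?)", ("asha", "/home"))

# A record that does not match the declared schema is rejected at load time.
try:
    conn.execute("INSERT INTO visits VALUES (?, ?, ?)", ("ravi", "/cart", "mobile"))
    rejected = False
except sqlite3.Error:
    rejected = True

rows_written = conn.execute("SELECT user, page FROM visits").fetchall()

# Schema on read: raw records are stored as they arrive, whatever their shape.
# (A plain list stands in for files on HDFS in this sketch.)
raw_lines = [
    '{"user": "asha", "page": "/home", "device": "mobile"}',
    '{"user": "ravi", "page": "/cart"}',
    '{"user": "meena", "mood": "happy"}',
]


def read_columns(lines, columns):
    """Apply a schema at read time, extracting only the requested columns."""
    result = []
    for line in lines:
        record = json.loads(line)
        # Missing fields become None instead of blocking the load.
        result.append(tuple(record.get(col) for col in columns))
    return result


rows_read = read_columns(raw_lines, ["user", "page"])

if __name__ == "__main__":
    print("schema on write:", rows_written, "| mismatched row rejected:", rejected)
    print("schema on read: ", rows_read)
```

The design point is where the cost is paid: the database validates structure once, at load time, while the read-time approach accepts anything and leaves each query to decide which columns matter.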
Three V's define its characteristics clearly: volume, velocity and variety.
Fig. 1.1: 3 V's of Big Data
Fig. 1.1 shows the three initial V's on which big data depends. Volume refers to any large amount of data that needs storage for analytics. Because data is growing exponentially, processing of up to yottabytes (YB) of data may become possible, and companies can now plan for it with a workable solution. The volume of data keeps growing: experts predict that the amount of information in the world will reach 25 ZB in 2020, an exponential rate of increase.
A text article may be a few kilobytes, a sound file a few megabytes, and a full-length movie a few gigabytes. More sources of data are being added continuously. For any company, data is now generated not only by its employees but also by its machines, such as CCTV cameras, punching machines and smart sensors.
More sources, each producing larger data, combine to increase the volume that has to be analysed. Today a gigabyte of storage costs next to nothing on commodity systems. Soon, gigabytes will give way to terabytes.
Velocity refers to the speed at which data arrives, which is itself increasing exponentially. Data is created and integrated at an ever faster pace. We have moved from a batch business to a real-time business.
Initially the trend was to analyse data in batches because the amount of data was large: data was submitted to a server and the user waited for processing, so results were inevitably delayed. New sources mean that machines now produce different types of data, which Big data tools can handle easily. Data is now processed on the server in real time, in a continuous fashion; the delivery of output also depends on delays at the sources emitting the data.
There is no guarantee that data arrives at a machine in bulk; at times it may arrive slowly. Big data techniques offer an easy and accurate way to handle this variation in the pace of data flow.
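The difference between batch and continuous processing described above can be sketched as follows. This is a minimal Python illustration, not a Hadoop or streaming-framework API: the function names and the sample `sensor_feed` values are made up for the example.

```python
from typing import Iterable, Iterator, Sequence


def batch_average(readings: Sequence[float]) -> float:
    """Batch style: wait for the full data set, then compute one result."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # a result is available without waiting for the rest


# Hypothetical sensor values; in practice they would arrive at an uneven pace.
sensor_feed = [10.0, 20.0, 30.0, 40.0]
running = list(streaming_average(iter(sensor_feed)))

if __name__ == "__main__":
    print("batch result:     ", batch_average(sensor_feed))
    print("streaming results:", running)
```

The streaming version keeps only a count and a running total, so it copes equally well with data that arrives in bulk or in a trickle, and its last value matches the batch result.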
Variety refers to the different types of input from which information must be extracted. It is often said that 80% of the world's data is unstructured, while traditional data-handling techniques offer few options for it. Text (SMS), photos, audio, video, web pages, GPS data, sensor data, relational databases, documents, PDF, Flash and so on are all flowing in and must be stored and processed. Services such as Facebook and email have no control over the input that any user may provide. The variety of data sources continues to increase. It includes:
  • Internet data (e.g., click stream, social media, and social networking links)
  • Primary research (e.g., surveys, experiments, observations)
  • Secondary research (e.g., competitive and marketplace data, industry reports, consumer data, business data)
  • Location data (e.g., mobile device data, geospatial data, GPS)
  • Image data (e.g., video, satellite image, surveillance)
  • Supply chain data (e.g., EDI, vendor catalogues and pricing, quality information)
  • Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
Fig. 1.2: Additional V's
There are two additional V's (Fig. 1.2) that help draw attention to further characteristics of Big data. Messy data is all around us: Twitter hashtags, smileys mixed with text, and so on. Such data is difficult to handle when it has to be mined. Big data makes it easy to store. The hashtag (#) on Twitter is used to categorise topics so that meaningful or required data can be fetched at extraction time and users can continue to trust it. Nowadays, every company wants its s...


Citation styles for Big Data and Hadoop- Learn by Example

APA 6 Citation

Bhushan, M. (2020). Big Data and Hadoop ([edition unavailable]). BPB Publications. Retrieved from https://www.perlego.com/book/2028263/big-data-and-hadoop-learn-by-example-pdf (Original work published 2020)

Chicago Citation

Bhushan, Mayank. (2020) 2020. Big Data and Hadoop. [Edition unavailable]. BPB Publications. https://www.perlego.com/book/2028263/big-data-and-hadoop-learn-by-example-pdf.

Harvard Citation

Bhushan, M. (2020) Big Data and Hadoop. [edition unavailable]. BPB Publications. Available at: https://www.perlego.com/book/2028263/big-data-and-hadoop-learn-by-example-pdf (Accessed: 15 October 2022).

MLA 7 Citation

Bhushan, Mayank. Big Data and Hadoop. [edition unavailable]. BPB Publications, 2020. Web. 15 Oct. 2022.