Big Data with Hadoop MapReduce
eBook - ePub

Big Data with Hadoop MapReduce

A Classroom Approach

  1. 406 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

The authors provide an understanding of big data and MapReduce by clearly presenting the basic terminologies and concepts. They have employed over 100 illustrations and many worked-out examples to convey the concepts and methods used in big data, the inner workings of MapReduce, and single node/multi-node installation on physical/virtual machines. This book covers almost all the necessary information on Hadoop MapReduce for most online certification exams. Upon completing this book, readers will find it easy to understand other big data processing tools such as Spark, Storm, etc.

Ultimately, readers will be able to:

• understand what big data is and the factors that are involved

• understand the inner workings of MapReduce, which is essential for certification exams

• learn the features and weaknesses of MapReduce

• set up Hadoop clusters with 100s of physical/virtual machines

• create a virtual machine in AWS

• write MapReduce with Eclipse in a simple way

• understand other big data processing tools and their applications

Authors: Rathinaraja Jeyaraj, Ganeshkumar Pugalendhi, Anand Paul

CHAPTER 1

Big Data

A journey of a thousand miles begins with a single step.
—Lao Tzu

INTRODUCTION

Big data has dramatically changed the way businesses, management, and research sectors operate. It is considered an emerging fourth scientific paradigm called “data science.” Let us quickly review how science has evolved over the centuries.
Empirical Science – The proof of concept is based on verifiable experience and evidence rather than pure theory or logic.
Theoretical Science – The proof of concept is derived theoretically (Newton’s laws, Kepler’s laws, etc.) rather than by conducting experiments, since producing evidence is difficult for many complex problems. Working through derivations that run to thousands of pages was also infeasible.
Computational Science – Deriving equations over thousands of pages to solve problems such as weather prediction, protein structure evaluation, genome analysis, puzzles, games, and human-computer interaction (such as conversation) typically took a huge amount of time. Applying specialized computer systems to solve such problems is called computational science. A mathematical model is developed, programmed, and fed into the computer along with the input. This paradigm deals with calculation-intensive tasks (which humans cannot complete in a short time).
Data Science – Deals with data-intensive (massive data) computing. Data science aims to handle big data analytics comprehensively in order to discover unknown or hidden patterns, trends, associations, relationships, or any other useful, understandable, and actionable information (insight/knowledge) that leads to decision making.

1.1 BIG DATA

New technologies, devices, and social applications exponentially increase the volume of digital data every year. The digital data created up to 2003 amounted to 4000 million GB, which would fill an entire football ground if piled up on disks. The same quantity was created every two days in 2011 and every 10 minutes in 2013, and the growth continues. Data becomes meaningful and useful only when it is processed. “Big data refers to a collection of datasets that are huge, or flow at a large enough rate, or contain diverse types of data, or any combination of these, such that they outpace our traditional storage (RDBMS), computing, and algorithm ability to store, process, analyze, and understand them in a cost-effective way.” How big is “big data”? In simple terms, any amount of data that is beyond the storage capacity, computing, and algorithm ability of a machine is called big data. For example:
• A 10 GB high-definition video could be big data for a smartphone but not for a high-end desktop.
• Rendering video from 100 GB of 3D graphics data could be big data for a laptop or desktop machine but not for a high-end server.
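To put the creation rates quoted at the start of this section in perspective, the short sketch below converts them into rough daily figures. It assumes decimal units (1 EB = 10^9 GB), and the function name `daily_rate_eb` is purely illustrative, not something taken from the book.

```python
# Back-of-the-envelope conversion of the data-creation rates quoted above.
# Assumption: decimal units, so 1 EB = 10**9 GB.

BASELINE_GB = 4000e6          # 4000 million GB: everything created up to 2003
GB_PER_EB = 1e9
MINUTES_PER_DAY = 24 * 60


def daily_rate_eb(interval_minutes: float) -> float:
    """Exabytes created per day if BASELINE_GB is produced every `interval_minutes`."""
    baselines_per_day = MINUTES_PER_DAY / interval_minutes
    return BASELINE_GB * baselines_per_day / GB_PER_EB


rate_2011 = daily_rate_eb(2 * MINUTES_PER_DAY)   # same quantity every two days
rate_2013 = daily_rate_eb(10)                    # same quantity every 10 minutes

print(f"2011: about {rate_2011:.0f} EB per day")
print(f"2013: about {rate_2013:.0f} EB per day")
print(f"Speed-up between 2011 and 2013: {rate_2013 / rate_2011:.0f}x")
```

Under these assumptions, the step from "every two days" to "every 10 minutes" corresponds to a 288-fold increase in the daily rate.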
A decade ago, size was the first, and at times the only, dimension that indicated big data. Therefore, we might tend to conclude as follows:
Big (huge) + data (volume + velocity + variety) → huge data volume + huge data velocity + huge data variety
However, volume is only one of the factors that can choke system capability; each of the other factors can do the same on its own. Even though the expression above is true, volume, velocity, and variety need not all be present for a dataset to be called big data. Any one of the factors (volume, velocity, or variety) is enough to say that a field faces big data problems if it chokes system capability. By definition, “big data” emerged not only from the storage capacity (volume) point of view, but also from the “processing capability and algorithm ability” of a machine, because hardware processing capability and algorithm ability determine how much data a computer can process in a specified amount of time. Therefore, some definitions focus on what data is, while others focus on what data does.
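The "any one factor is enough" rule can be sketched as a small predicate. This is only an illustration: the class names, field names, and all numeric limits below are hypothetical and were chosen to mirror the smartphone/desktop examples, not taken from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    """Hypothetical limits of one machine (illustrative values only)."""
    max_volume_gb: float          # how much data it can store and process
    max_velocity_mb_s: float      # how fast it can ingest incoming data
    supported_formats: frozenset  # data types its tools can handle


@dataclass(frozen=True)
class Workload:
    volume_gb: float
    velocity_mb_s: float
    formats: frozenset


def choking_factors(workload: Workload, capacity: Capacity) -> list:
    """Return the factors (volume, velocity, variety) that exceed the machine's limits."""
    factors = []
    if workload.volume_gb > capacity.max_volume_gb:
        factors.append("volume")
    if workload.velocity_mb_s > capacity.max_velocity_mb_s:
        factors.append("velocity")
    if not workload.formats <= capacity.supported_formats:  # subset test
        factors.append("variety")
    return factors


def is_big_data(workload: Workload, capacity: Capacity) -> bool:
    """A workload is 'big data' for a machine if ANY single factor chokes it."""
    return bool(choking_factors(workload, capacity))


# Made-up machines and workloads.
phone = Capacity(4, 50, frozenset({"video", "image"}))
desktop = Capacity(500, 200, frozenset({"video", "image", "text"}))
hd_video = Workload(10, 5, frozenset({"video"}))
sensor_stream = Workload(1, 900, frozenset({"text"}))

print(choking_factors(hd_video, phone))         # volume alone is enough
print(choking_factors(hd_video, desktop))       # nothing chokes the desktop
print(choking_factors(sensor_stream, desktop))  # small but fast: velocity chokes
```

The same 10 GB video counts as big data on the phone and not on the desktop, while a 1 GB stream can still be big data when only its velocity exceeds what the machine can absorb.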

Some interesting facts on big data

The International Data Corporation (IDC) is a market research firm that monitors and measures the data created worldwide. It reports that
• the data created almost doubles every year.
• over 16 ZB was created in 2016.
• over 163 ZB will be created by 2020.
• of today’s digital data, 90% was created in the last couple of years; 95% of it is in semi-structured or unstructured form, and less than 5% is structured.
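As a rough consistency check on the figures above, the sketch below compares the "almost doubles every year" claim with the growth implied by the two quoted data points. It treats "over 16 ZB" and "over 163 ZB" as exactly 16 and 163, which is a simplifying assumption.

```python
# Compare "almost doubles every year" with the growth implied by the two
# quoted figures. Assumption: "over 16 ZB" and "over 163 ZB" are treated
# as exactly 16 and 163 for the arithmetic.

start_year, start_zb = 2016, 16.0
end_year, end_zb = 2020, 163.0

years = end_year - start_year

# Compound annual growth factor g such that start_zb * g**years == end_zb.
implied_factor = (end_zb / start_zb) ** (1 / years)

# What strict yearly doubling would give over the same period.
doubling_projection = start_zb * 2 ** years

print(f"Implied yearly growth factor: {implied_factor:.2f}")
print(f"Strict doubling would reach {doubling_projection:.0f} ZB by {end_year}")
```

The implied factor comes out somewhat below 2, which fits the wording "almost doubled" rather than a strict doubling.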

1.1.1 BIG DATA SOURCES

Anything capable of producing digital data contributes to data accumulation. However, the way data is generated has changed completely over the last 40 years. For example,
before 1980 – devices generated data.
1980–2000 – employees generated data as end users.
since 2000 – people started contributing data via social applications, e-mails, etc.
after 2005 – every piece of hardware, software, and application generated log data.
It is hard to find any activity on the Internet that does not generate data. We are letting somebody else watch us and monitor our activities over the Internet. Figure 1.1 [1] illustrates what happened every 60 seconds in the digital world in 2017 across Internet-based companies.
• YouTube users upload 400 hours of new video and watch 700,000 hours of videos.
• 3.8 million searches are done on Google.
• Over 243,000 images are uploaded, and 70,000 hours of video are watched on Facebook.
• 350,000 tweets are generated on Twitter.
• Over 65,000 images are uploaded on Instagram.
• More than 210,000 snaps are sent on Snapchat.
• 120 new users join LinkedIn.
• 156 million e-mails are exchanged.
• 29 million messages, over 175,000 video messages, and 1 million images are processed in WhatsApp every day.
• 87,000 hours of video are watched on Netflix.
• Over 25,000 posts are shared on Tumblr.
• 500,000 applications are downloaded.
• Over 80 new domains are registered.
• Minimum of 1,000,000 swi...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. About the Authors
  6. A Message from Kaniyan
  7. Table of Contents
  8. Abbreviations
  9. Preface
  10. Dedication and Acknowledgment
  11. Introduction
  12. 1. Big Data
  13. 2. Hadoop Framework
  14. 3. Hadoop 1.2.1 Installation
  15. 4. Hadoop Ecosystem
  16. 5. Hadoop 2.7.0
  17. 6. Hadoop 2.7.0 Installation
  18. 7. Data Science
  19. APPENDIX A: Public Datasets
  20. APPENDIX B: MapReduce Exercise
  21. APPENDIX C: Case Study: Application Development for NYSE Dataset
  22. Web References
  23. Index