Tera-Tom Genius Series - Hadoop Architecture and SQL
eBook - ePub

Tom Coffing, Jason Nolander

  1. 546 pages
  2. English

About This Book

Hadoop is one of the most exciting technologies ever to emerge, and it is transforming the computer industry. Although Hadoop was designed to process MapReduce jobs, it has evolved to accept SQL, which it then converts to MapReduce. This has opened the door for millions of customers who want to use their SQL knowledge to query Hadoop systems. Tera-Tom Genius Series - Hadoop Architecture and SQL, written by Tom Coffing and Jason Nolander, details the architecture of Hadoop and the SQL commands available. This book is perfect for anyone who wants to query Hadoop with SQL. It teaches readers how to create tables, how the data is distributed, and how the system processes the data. In addition, it includes hundreds of pages of SQL examples and explanations. The authors, Tera-Tom Coffing, who has written over 75 successful books on data warehousing, and Jason Nolander, who has over 20 years of financial industry experience, have written a book that is sure to be your "go to" book on Hadoop.

Information

Year: 2016
ISBN: 9781940540375

Chapter 1 – The Concepts of Hadoop

“Let me once again explain the rules. Hadoop Rules!”
- Tera-Tom Coffing

What is Hadoop All About?

Hadoop is all about lower costs and better value! Hadoop leverages inexpensive commodity hardware servers and inexpensive disk storage. In all previous systems the servers were in the same location, but Hadoop allows for the servers to be scattered around the world. The disks are called JBOD (Just a Bunch of Disks) because they are just unsophisticated disks attached to the commodity hardware. This approach enables incredible capabilities while keeping costs down.

There is a Named Node and Up to 4000 Data Nodes

Hadoop is all about parallel processing, full table scans, unstructured data and commodity hardware. A single server, called the “Named Node” (written NameNode in the Apache Hadoop documentation), keeps track of all of the data files on the “Data Nodes”. The named node sends out a heartbeat each minute, and any data node that does not respond is deemed dead. The named node holds a master directory of all databases created, decides which tables reside on which data nodes, and determines where each data block of a table resides.
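
The heartbeat bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the idea as the text presents it, not Hadoop code: the function name, the node names and the timestamps are all hypothetical, and the one-minute interval is taken from the paragraph above.

```python
# Hypothetical sketch of the heartbeat bookkeeping described in the text.
HEARTBEAT_INTERVAL_SECONDS = 60  # "each minute", per the text


def find_dead_nodes(last_response, now, interval=HEARTBEAT_INTERVAL_SECONDS):
    """Return the data nodes that have not answered within one interval.

    last_response maps a data node name to the time (in seconds) at which
    it last responded to the named node.
    """
    return sorted(
        node for node, seen in last_response.items() if now - seen > interval
    )


# Three made-up data nodes; node3 has been silent for 150 seconds.
last_response = {"node1": 100.0, "node2": 95.0, "node3": 0.0}
dead = find_dead_nodes(last_response, now=150.0)
print(dead)
```

Here only node3 is reported, because node1 and node2 both answered within the last 60 seconds.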

The Named Node's Directory Tree

image
The named node keeps the directory tree (seen above) of all files in the Hadoop Distributed File System (HDFS) and tracks where across the cluster the file data is kept. It also sends out a heartbeat and keeps track of the health of the data nodes. In addition, it helps clients with reads and writes by receiving their requests and redirecting them to the appropriate data nodes. The named node acts as the host, and the data nodes read and write the data as requested.
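
As a rough mental model, the directory tree can be pictured as a lookup from a file path to its blocks, and from each block to the data nodes that hold it. The sketch below assumes a plain Python dictionary; the paths, block ids and node names are invented for illustration and do not reflect how the named node actually stores its metadata.

```python
# Hypothetical in-memory model of the named node's directory tree.
directory_tree = {
    "/warehouse/sales_table": [
        {"block_id": "blk_1", "data_nodes": ["node1", "node2", "node4"]},
    ],
    "/warehouse/customer_table": [
        {"block_id": "blk_2", "data_nodes": ["node2", "node3", "node5"]},
        {"block_id": "blk_3", "data_nodes": ["node1", "node3", "node4"]},
    ],
}


def locate_blocks(tree, path):
    """Answer a client's read request with (block_id, data_nodes) pairs.

    The named node only hands back locations; the client then reads the
    block contents from the data nodes themselves.
    """
    if path not in tree:
        raise FileNotFoundError(path)
    return [(block["block_id"], list(block["data_nodes"])) for block in tree[path]]


locations = locate_blocks(directory_tree, "/warehouse/customer_table")
for block_id, nodes in locations:
    print(block_id, "->", nodes)
```

The point of the design is that the named node never serves the data itself; it only redirects the client to the data nodes listed for each block.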

The Data Nodes

The named node sends out a heartbeat each minute and the data nodes respond, or they are deemed dead. The data nodes read and write the data that they are assigned. They also make a copy of each block of data they have and send it to two other nodes in the cluster as a backup in case they are deemed dead or they have a disk failure. There are three copies of every block in a Hadoop cluster as a failsafe mechanism. The data nodes also send a block report to the named node.

Hive MetaStore

Hadoop places data in files on commodity hardware, and that data can be structured or unstructured. Data stored does not have to be defined.
The Hive MetaStore stores table definitions and metadata. This allows users to define table structures on data as applications need them.
Hive keeps all table definitions and related metadata in the Hive MetaStore. Hive uses an Object Relational Mapper (ORM) to access the relational database behind the MetaStore. The list of valid Hive MetaStore databases is growing and currently consists of MySQL, Oracle, PostgreSQL and Derby.
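
The idea of keeping the table definition apart from the data, and applying it only when the data is read, can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the `metastore` dictionary, the `read_table` function, the column names and the sample rows are hypothetical and are not Hive's API or storage format.

```python
import csv
import io

# Raw file contents as they might sit in a file: no schema is stored with the data.
raw_file = "1001,Laptop,899.99\n1002,Mouse,19.50\n"

# Hypothetical metastore entry: a table definition kept apart from the data.
metastore = {
    "sales_table": {
        "columns": [("order_id", int), ("product", str), ("price", float)],
        "delimiter": ",",
    }
}


def read_table(table_name, raw_text, store):
    """Apply the stored table definition to raw text at read time."""
    definition = store[table_name]
    reader = csv.reader(io.StringIO(raw_text), delimiter=definition["delimiter"])
    rows = []
    for fields in reader:
        rows.append(
            {
                name: cast(value)
                for (name, cast), value in zip(definition["columns"], fields)
            }
        )
    return rows


rows = read_table("sales_table", raw_file, metastore)
print(rows[0])
```

Because the definition lives outside the file, a different application could register another definition over the same raw data without rewriting it.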

Data Layout and Protection – Step 1

The Named Node holds a master directory of all databases created, decides which tables reside on which data nodes, and determines where each data block of a table resides. Watch exactly what happens when the Sales_Table and others are built. In the first step, the named node determines that the Sales_Table has one block of data and that it will be written to node 1. The block is then written.

Data Layout and Protection – Step 2

Data node 1 has written a block of the Sales_Table to its disk. Data node 1 now communicates directly with two other data nodes to back up its Sales_Table block in case of a disaster. The block is copied to those two data nodes.
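
Steps 1 and 2 can be sketched as a single write followed by two backup copies. The function below is a simplified illustration, assuming the backup nodes are simply given as input; how Hadoop actually chooses them is not covered by the text, and the block id and node names are hypothetical.

```python
def write_with_backups(block_id, first_node, backup_nodes, storage):
    """Write a block to first_node, then copy it to the backup data nodes.

    storage maps a data node name to the set of block ids on its disk.
    The backup nodes are passed in by the caller; this sketch does not
    model how they are selected.
    """
    if first_node in backup_nodes:
        raise ValueError("backups must go to other data nodes")
    storage[first_node].add(block_id)
    for node in backup_nodes:
        storage[node].add(block_id)
    return [first_node, *backup_nodes]


# Four made-up data nodes with empty disks.
storage = {node: set() for node in ["node1", "node2", "node3", "node4"]}
holders = write_with_backups("sales_table_blk_1", "node1", ["node2", "node4"], storage)
print(holders)
```

After the call, three nodes hold the block and node3 holds nothing, which matches the layout the later steps assume.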

Data Layout and Protection – Step 3

At a timed interval, all of the data nodes will provide a current block report to the named node. The named node will place this in its directory tree. The Sales_Table block is now stored in triplicate, just in case there is a disaster or a disk failure.
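
One way to picture the block reports is as a list of block ids per data node, which the named node inverts into a map from each block to the nodes that hold it. The sketch below assumes that simple shape; the report format and all names are hypothetical.

```python
from collections import defaultdict


def apply_block_reports(reports):
    """Build a block -> data nodes map from each node's block report.

    reports maps a data node name to the list of block ids it holds.
    """
    block_locations = defaultdict(set)
    for node, block_ids in reports.items():
        for block_id in block_ids:
            block_locations[block_id].add(node)
    return dict(block_locations)


# Made-up reports: three nodes hold the Sales_Table block, node3 holds nothing.
reports = {
    "node1": ["sales_table_blk_1"],
    "node2": ["sales_table_blk_1"],
    "node3": [],
    "node4": ["sales_table_blk_1"],
}
block_locations = apply_block_reports(reports)
print(sorted(block_locations["sales_table_blk_1"]))
```

With the reports merged, the named node can see at a glance that the block exists in triplicate.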

Data Layout and Protection – Step 4

When the named node sent out a heartbeat to check on all of the nodes, node 1 failed to respond and was deemed dead. The named node then sends a message to data nodes 2 and 4, and one of them copies the block to another node. The block reports are sent back to the named node, and the named node updates its directory tree.
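
The recovery in Step 4 amounts to noticing that a block has dropped below three copies and ordering one surviving holder to copy it elsewhere. The following sketch assumes a simple block-to-nodes map and picks source and target nodes alphabetically, which is a simplification made for the example rather than Hadoop's actual policy; all names are hypothetical.

```python
def handle_dead_node(dead_node, block_locations, live_nodes, copies=3):
    """Restore the copy count of every block the dead node was holding.

    block_locations maps a block id to the set of data nodes holding it.
    Returns a list of (block_id, source_node, target_node) copy orders.
    Picking the first eligible node alphabetically is a simplification.
    """
    orders = []
    for block_id, holders in block_locations.items():
        holders.discard(dead_node)
        while len(holders) < copies:
            candidates = sorted(set(live_nodes) - holders)
            if not candidates or not holders:
                break  # nowhere to copy to, or no surviving copy to copy from
            source, target = sorted(holders)[0], candidates[0]
            holders.add(target)
            orders.append((block_id, source, target))
    return orders


# Node 1 is dead; nodes 2 and 4 still hold the Sales_Table block.
block_locations = {"sales_table_blk_1": {"node1", "node2", "node4"}}
orders = handle_dead_node("node1", block_locations, ["node2", "node3", "node4"])
print(orders)
```

In this run, node2 is asked to copy the block to node3, which brings the block back to three copies.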

How are Blocks Distributed Amongst the Cluster?

image
The table above is 1 GB. By default, the system puts 16 blocks of 64 MB each across the cluster, which equals 1 GB.
Size of data matters. If you have a table with less than 64 MB of data, then it will be stored in only one block (replicated twice for disaster recovery). If the default block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would force the named node to manage an enormous amount of metadata. That is why Apache Hadoop defaults the block size to 64 MB, and in Cloudera Hadoop the default is 128 MB. Large blocks are distributed a...
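
The block arithmetic above can be checked with a short calculation. The helper below is a minimal sketch, assuming a non-empty file and block sizes expressed in bytes; the function name is hypothetical, and the 1 MB case is included only to show how quickly the block count grows when the block size shrinks.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a non-empty file (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


one_gb = 1024 * MB

blocks_64 = block_count(one_gb, 64 * MB)     # 64 MB blocks, as in the text
blocks_128 = block_count(one_gb, 128 * MB)   # 128 MB blocks
blocks_small = block_count(50 * MB, 64 * MB) # a table smaller than one block
blocks_1mb = block_count(one_gb, 1 * MB)     # illustrative tiny block size

print(blocks_64, blocks_128, blocks_small, blocks_1mb)
print("stored copies at 64 MB:", blocks_64 * 3)  # three copies of every block
```

A 1 GB table needs 16 blocks at 64 MB and 8 blocks at 128 MB, while a 1 MB block size would require 1,024 blocks for the named node to track, before replication is counted.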
