Pig Design Patterns
eBook - ePub

  1. 310 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

In Detail

Pig Design Patterns is a comprehensive guide that enables readers to readily use design patterns that simplify the creation of complex data pipelines at various stages of data management. This book focuses on using Pig in an enterprise context, bridging the gap between theoretical understanding and practical implementation. Each chapter contains a set of design patterns that pose and then solve technical challenges relevant to enterprise use cases.

The book covers the journey of Big Data from the time it enters the enterprise to its eventual use in analytics, in the form of a report or a predictive model. By the end of the book, readers will appreciate Pig's real power in addressing each and every problem encountered when creating an analytics-based data product. Each design pattern comes with a suggested solution, an analysis of the trade-offs of implementing the solution in a different way, an explanation of how the code works, and the results.

Approach

A comprehensive practical guide that walks you through the multiple stages of data management in the enterprise and gives you numerous design patterns with appropriate code examples to solve frequent problems in each of these stages. The chapters are organized to mimic the sequential data flow seen in analytics platforms, but they can also be read independently to solve a particular group of problems in the Big Data life cycle.

Who this book is for

This book is for experienced developers who are already familiar with Pig and are looking for a use-case perspective from which they can relate to the problems of data ingestion, profiling, cleansing, transformation, and egress encountered in the enterprise. Knowledge of Hadoop and Pig is necessary to grasp the intricacies of Pig design patterns.

Table of Contents

Pig Design Patterns
Credits
Foreword
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
Motivation for this book
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Third-party libraries
Datasets
Errata
Piracy
Questions
1. Setting the Context for Design Patterns in Pig
Understanding design patterns
The scope of design patterns in Pig
Hadoop demystified – a quick reckoner
The enterprise context
Common challenges of distributed systems
The advent of Hadoop
Hadoop under the covers
Understanding the Hadoop Distributed File System
HDFS design goals
Working of HDFS
Understanding MapReduce
Understanding how MapReduce works
The MapReduce internals
Pig – a quick intro
Understanding the rationale of Pig
Understanding the relevance of Pig in the enterprise
Working of Pig – an overview
Firing up Pig
The use case
Code listing
The dataset
Understanding Pig through the code
Pig's extensibility
Operators used in code
The EXPLAIN operator
Understanding Pig's data model
Primitive types
Complex types
The relevance of schemas
Summary
2. Data Ingest and Egress Patterns
The context of data ingest and egress
Types of data in the enterprise
Ingest and egress patterns for multistructured data
Considerations for log ingestion
The Apache log ingestion pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Code for the CommonLogLoader class
Code for the CombinedLogLoader class
Results
Additional information
The Custom log ingestion pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The image ingress and egress pattern
Background
Motivation
Use cases
Pattern implementation
The image ingress implementation
The image egress implementation
Code snippets
The image ingress
Pig script
Image to a sequence UDF snippet
The image egress
Pig script
Sequence to an image UDF
Results
Additional information
The ingress and egress patterns for the NoSQL data
MongoDB ingress and egress patterns
Background
Motivation
Use cases
Pattern implementation
The ingress implementation
The egress implementation
Code snippets
The ingress code
The egress code
Results
Additional information
The HBase ingress and egress pattern
Background
Motivation
Use cases
Pattern implementation
The ingress implementation
The egress implementation
Code snippets
The ingress code
The egress code
Results
Additional information
The ingress and egress patterns for structured data
The Hive ingress and egress patterns
Background
Motivation
Use cases
Pattern implementation
The ingress implementation
The egress implementation
Code snippets
The ingress code
Importing data using RCFile
Importing data using HCatalog
The egress code
Results
Additional information
The ingress and egress patterns for semi-structured data
The mainframe ingestion pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
XML ingest and egress patterns
Background
Motivation
Motivation for ingesting raw XML
Motivation for ingesting binary XML
Motivation for egression of XML
Use cases
Pattern implementation
The implementation of the XML raw ingestion
The implementation of the XML binary ingestion
Code snippets
The XML raw ingestion code
The XML binary ingestion code
The XML egress code
Pig script
The XML storage
Results
Additional information
JSON ingress and egress patterns
Background
Motivation
Use cases
Pattern implementation
The ingress implementation
The egress implementation
Code snippets
The ingress code
The code for simple JSON
The code for nested JSON
The egress code
Results
Additional information
Summary
3. Data Profiling Patterns
Data profiling for Big Data
Big Data profiling dimensions
Sampling considerations for profiling Big Data
Sampling support in Pig
Rationale for using Pig in data profiling
The data type inference pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Pig script
Java UDF
Results
Additional information
The basic statistical profiling pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Pig script
Macro
Results
Additional information
The pattern-matching pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Pig script
Macro
Results
Additional information
The string profiling pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Pig script
Macro
Results
Additional information
The unstructured text profiling pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Pig script
Java UDF for stemming
Java UDF for generating TF-IDF
Results
Additional information
Summary
4. Data Validation and Cleansing Patterns
Data validation and cleansing for Big Data
Choosing Pig for validation and cleansing
The constraint validation and cleansing design pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The regex validation and cleansing design pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The corrupt data validation and cleansing design pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The unstructured text data validation and cleansing design pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
Summary
5. Data Transformation Patterns
Data transformation processes
The structured-to-hierarchical transformation pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The data normalization pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The data integration pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The aggregation pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The data generalization pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
Summary
6. Understanding Data Reduction Patterns
Data reduction – a quick introduction
Data reduction considerations for Big Data
Dimensionality reduction – the Principal Component Analysis design pattern
Background
Motivation
Use cases
Pattern implementation
Limitations of PCA implementation
Code snippets
Results
Additional information
Numerosity reduction – the histogram design pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
Numerosity reduction – sampling design pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
Numerosity reduction – clustering design pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
Summary
7. Advanced Patterns and Future Work
The clustering pattern
Background
Motivation
Use cases
Pattern implementation
Code snippets
Results
Additional information
The topic discovery pattern
Background
Motivation
Use cases
Pattern im...

Pig Design Patterns by Pradeep Pasupuleti is available in PDF and/or ePUB format, alongside other books in Computer Science & Data Mining.