Data Lakehouse in Action
Pradeep Menon
- 206 pages
- English
About This Book
Proposes a new scalable data architecture paradigm, the Data Lakehouse, that addresses the limitations of current data architecture patterns.
Key Features
- Understand how data is ingested, stored, served, governed, and secured to enable data analytics
- Explore a practical way to implement a Data Lakehouse using cloud computing platforms such as Azure
- Combine multiple architectural patterns based on an organization's needs and maturity level
Book Description
The Data Lakehouse architecture is a new paradigm that enables large-scale analytics. This book will guide you in developing data architecture the right way to ensure your organization's success.
The first part of the book discusses the different data architectural patterns used in the past, the need for a new architectural paradigm, and the drivers that have caused this change. It covers the principles that govern the target architecture, the components that form the Data Lakehouse architecture, and the rationale and need for those components. The second part deep dives into the different layers of the Data Lakehouse. It covers various scenarios and components for data ingestion, storage, data processing, data serving, analytics, governance, and data security. The third part focuses on the practical implementation of the Data Lakehouse architecture in a cloud computing platform. It covers various ways to combine the Data Lakehouse pattern into macro-patterns, such as Data Mesh and Data Hub-Spoke, based on the organization's needs and maturity level.
The frameworks introduced are practical, and organizations can readily benefit from their application. By the end of this book, you'll clearly understand how to implement the Data Lakehouse architecture pattern in a scalable, agile, and cost-effective manner.
What you will learn
- Understand the evolution of data architecture patterns for analytics
- Become well versed in the Data Lakehouse pattern and how it enables data analytics
- Focus on methods to ingest, process, store, and govern data in a Data Lakehouse architecture
- Learn techniques to serve data and perform analytics in a Data Lakehouse architecture
- Cover methods to secure the data in a Data Lakehouse architecture
- Implement a Data Lakehouse in a cloud computing platform such as Azure
- Combine the Data Lakehouse in a macro-architecture pattern such as Data Mesh
Who this book is for
This book is for data architects, big data engineers, data strategists and practitioners, data stewards, and cloud computing practitioners looking to become well versed with modern data architecture patterns to enable large-scale analytics. Basic knowledge of data architecture and familiarity with data warehousing concepts are required.
PART 1: Architectural Patterns for Analytics
- Chapter 1, Introducing the Evolution of Data Analytics Patterns
- Chapter 2, The Data Lakehouse Architecture Overview
Chapter 1: Introducing the Evolution of Data Analytics Patterns
- What is analytics? Analytics is defined as any action that converts data into insights.
- What is data architecture? Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.
- Discovering the enterprise data warehouse era
- Exploring the five factors of change
- Investigating the data lake era
- Introducing the data lakehouse paradigm
Discovering the enterprise data warehouse era
- Creating a data structure that is optimized for storage and modeled for reporting
- Focusing on the reporting requirements of the business
- Harnessing the structured data into actionable insights
- Since most of the analytical requirements were related to reporting, this pattern effectively addressed many organizations' reporting requirements.
- Large enterprise data models were able to structure an organization's data into logical and physical models. This pattern gave a structure to manage the organization's data in a modular and efficient manner.
- Since this pattern catered only to structured data, the technology required to harness structured data had matured and was readily available. Relational Database Management Systems (RDBMSes) evolved, and their features were well suited to reporting workloads.
- This pattern was not as agile as changing business requirements demanded. Any change in reporting requirements had to go through a long-winded process of data model changes, ETL code changes, and corresponding changes to the reporting system. ETL development was a specialized skill and often became a bottleneck, lengthening the time from data to insight. The nature of analytics is unique: the more output you see, the more you demand. Many EDW projects were deemed failures, not from a technical perspective but from a business one. Operationally, the design changes required to cater to these fast-evolving requirements were too difficult to handle.
- As the data volumes grew, this pattern proved too cost prohibitive. Massive parallel-processing database technologies started evolving that specialized in data warehouse workloads. The cost of maintaining these databases was prohibitive as well. It involved expensive software prices, frequent hardware refreshes, and a substantial staffing cost. The return on investment was no longer justifiable.
- As the format of data started evolving, the challenges associated with the EDW became more evident. Database technologies were developed to cater to semi-structured data (for example, JSON), but the fundamental concept was still RDBMS-based, and the underlying technology could not effectively cater to these new types of data. There was more value in analyzing data that was not structured, and the sheer variety of data was too complex for EDWs to handle.
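As a small illustration of why semi-structured data strained the EDW model (this sketch and its field names are hypothetical, not from the book): a nested JSON event with a variable-length array does not map directly onto a fixed relational table, so an ETL step must flatten it into rows before loading.

```python
import json

# A hypothetical clickstream event: nested objects and a variable-length
# "items" array, which a fixed relational schema handles poorly.
event = json.loads("""
{
  "user_id": 42,
  "action": "purchase",
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ],
  "device": {"os": "iOS", "app_version": "3.1"}
}
""")

def flatten(record):
    """Explode the nested 'items' array into flat rows, one per item,
    as an ETL step might do before loading a relational EDW table."""
    base = {
        "user_id": record["user_id"],
        "action": record["action"],
        "device_os": record.get("device", {}).get("os"),
    }
    return [dict(base, sku=i["sku"], qty=i["qty"]) for i in record.get("items", [])]

rows = flatten(event)
for row in rows:
    print(row)
```

Every new field or deeper level of nesting in the source data forces schema and ETL changes like this, which is the rigidity the bullet above describes.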
- The EDW was focused predominantly on Business Intelligence (BI). It facilitated the creation of scheduled reports, ad hoc data analysis, and self-service BI. Although it catered to most of the personas who performed analysis, it was not conducive to AI/ML use cases. The data in the EDW was already cleansed and structured with a razor-sharp focus on reporting. This left little room for a data scientist (statistical modelers at that time) to explore data and create a new hypothesis. In short, the EDW was primarily focused on BI.