eBook - ePub

Data Lakehouse in Action

Name: Data Lakehouse in Action
Author: Pradeep Menon

Pradeep Menon

Condividi libro

206 pagine
English
ePUB (disponibile sull'app)
Disponibile su iOS e Android

eBook - ePub

Data Lakehouse in Action

Pradeep Menon

Dettagli del libro

Anteprima del libro

Indice dei contenuti

Citazioni

Informazioni sul libro

Propose a new scalable data architecture paradigm, Data Lakehouse, that addresses the limitations of current data architecture patternsKey Features• Understand how data is ingested, stored, served, governed, and secured for enabling data analytics• Explore a practical way to implement Data Lakehouse using cloud computing platforms like Azure• Combine multiple architectural patterns based on an organization's needs and maturity levelBook DescriptionThe Data Lakehouse architecture is a new paradigm that enables large-scale analytics. This book will guide you in developing data architecture in the right way to ensure your organization's success.The first part of the book discusses the different data architectural patterns used in the past and the need for a new architectural paradigm, as well as the drivers that have caused this change. It covers the principles that govern the target architecture, the components that form the Data Lakehouse architecture, and the rationale and need for those components. The second part deep dives into the different layers of Data Lakehouse. It covers various scenarios and components for data ingestion, storage, data processing, data serving, analytics, governance, and data security. The book's third part focuses on the practical implementation of the Data Lakehouse architecture in a cloud computing platform. It focuses on various ways to combine the Data Lakehouse pattern to realize macro-patterns, such as Data Mesh and Data Hub-Spoke, based on the organization's needs and maturity level. The frameworks introduced will be practical and organizations can readily benefit from their application.By the end of this book, you'll clearly understand how to implement the Data Lakehouse architecture pattern in a scalable, agile, and cost-effective manner.What you will learn• Understand the evolution of the Data Architecture patterns for analytics• Become well versed in the Data Lakehouse pattern and how it enables data analytics• Focus on methods to ingest, process, store, and govern data in a Data Lakehouse architecture• Learn techniques to serve data and perform analytics in a Data Lakehouse architecture• Cover methods to secure the data in a Data Lakehouse architecture• Implement Data Lakehouse in a cloud computing platform such as Azure• Combine Data Lakehouse in a macro-architecture pattern such as Data MeshWho this book is forThis book is for data architects, big data engineers, data strategists and practitioners, data stewards, and cloud computing practitioners looking to become well-versed with modern data architecture patterns to enable large-scale analytics. Basic knowledge of data architecture and familiarity with data warehousing concepts are required.

Domande frequenti

Come faccio ad annullare l'abbonamento?

È semplicissimo: basta accedere alla sezione Account nelle Impostazioni e cliccare su "Annulla abbonamento". Dopo la cancellazione, l'abbonamento rimarrà attivo per il periodo rimanente già pagato. Per maggiori informazioni, clicca qui

È possibile scaricare libri? Se sì, come?

Al momento è possibile scaricare tramite l'app tutti i nostri libri ePub mobile-friendly. Anche la maggior parte dei nostri PDF è scaricabile e stiamo lavorando per rendere disponibile quanto prima il download di tutti gli altri file. Per maggiori informazioni, clicca qui

Che differenza c'è tra i piani?

Entrambi i piani ti danno accesso illimitato alla libreria e a tutte le funzionalità di Perlego. Le uniche differenze sono il prezzo e il periodo di abbonamento: con il piano annuale risparmierai circa il 30% rispetto a 12 rate con quello mensile.

Cos'è Perlego?

Perlego è un servizio di abbonamento a testi accademici, che ti permette di accedere a un'intera libreria online a un prezzo inferiore rispetto a quello che pagheresti per acquistare un singolo libro al mese. Con oltre 1 milione di testi suddivisi in più di 1.000 categorie, troverai sicuramente ciò che fa per te! Per maggiori informazioni, clicca qui.

Perlego supporta la sintesi vocale?

Cerca l'icona Sintesi vocale nel prossimo libro che leggerai per verificare se è possibile riprodurre l'audio. Questo strumento permette di leggere il testo a voce alta, evidenziandolo man mano che la lettura procede. Puoi aumentare o diminuire la velocità della sintesi vocale, oppure sospendere la riproduzione. Per maggiori informazioni, clicca qui.

Data Lakehouse in Action è disponibile online in formato PDF/ePub?

Sì, puoi accedere a Data Lakehouse in Action di Pradeep Menon in formato PDF e/o ePub, così come ad altri libri molto apprezzati nelle sezioni relative a Computer Science e Data Modelling & Design. Scopri oltre 1 milione di libri disponibili nel nostro catalogo.

Informazioni

Editore

Packt Publishing

Anno

2022

ISBN

9781801815109

Edizione

Argomento

Computer Science

Categoria

Data Modelling & Design

PART 1: Architectural Patterns for Analytics

This section describes the evolution of data architecture patterns for analytics. It addresses the challenges posed by different architectural patterns and establishes a new paradigm, that is, the data lakehouse. An overview of the data lakehouse architecture is also provided, which includes coverage of the principles that govern the target architecture, the components that form the data lakehouse architecture, the rationale and need for those components, and the architectural principles adopted to make a data lake scalable and robust.

This section comprises the following chapters:

Chapter 1, Introducing the Evolution of Data Analytics Patterns
Chapter 2, The Data Lakehouse Architecture Overview

Chapter 1: Introducing the Evolution of Data Analytics Patterns

Data analytics is an ever-changing field. A little history will help you appreciate the strides in this field and how data architectural patterns have evolved to fulfill the ever-changing need for analytics.

First, let's start with some definitions:

What is analytics? Analytics is defined as any action that converts data into insights.
What is data architecture? Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.

Analytics and the data architecture that enables analytics goes a long way. Let's now explore some of the patterns that have evolved over the last few decades.

This chapter explores the genesis of data growth and explains the need for a new paradigm in data architecture. This chapter starts by examining the predominant paradigm, the enterprise data warehouse, popular in the 1990s and 2000s. It explores the challenges associated with this paradigm and then covers the drivers that caused an explosion in data. It further examines the rise of a new paradigm, the data lake, and its challenges. Furthermore, this chapter ends by advocating the need for a new paradigm, the data lakehouse. It clarifies the key benefits delivered by a well-architected data lakehouse.

We'll cover all of this in the following topics:

Discovering the enterprise data warehouse era
Exploring the five factors of change
Investigating the data lake era
Introducing the data lakehouse paradigm

Discovering the enterprise data warehouse era

The Enterprise Data Warehouse (EDW) pattern, popularized by Ralph Kimball and Bill Inmon, was predominant in the 1990s and 2000s. The needs of this era were relatively straightforward (at least compared to the current context). The focus was predominantly on optimizing database structures to satisfy reporting requirements. Analytics was synonymous with reporting. Machine learning was a specialized field and was not ubiquitous in enterprises.

A typical EDW pattern is depicted in the following figure:

Figure 1.1 – A typical EDW pattern

As shown in Figure 1.1, the pattern entailed source systems composed of databases or flat-file structures. The data sources are predominantly structured, that is, rows and columns. A process called Extract-Transform-Load (ETL) first extracts the data from the source systems. Then, the process transforms the data into a shape and form that is conducive for analysis. Once the data is transformed, it is loaded into an EDW. From there, the subsets of data are then populated to downstream data marts. Data marts can be conceived of as mini data warehouses that cater to the business requirements of a specific department.

As you can imagine, this pattern primarily was focused on the following:

Creating a data structure that is optimized for storage and modeled for reporting
Focusing on the reporting requirements of the business
Harnessing the structured data into actionable insights

Every coin has two sides. The EDW pattern is not an exception. It has its pros and it has its cons. This pattern has survived the test of time. It was widespread and well adopted because of the following key advantages:

Since most of the analytical requirements were related to reporting, this pattern effectively addressed many organizations' reporting requirements.
Large enterprise data models were able to structure an organization's data into logical and physical models. This pattern gave a structure to manage the organization's data in a modular and efficient manner.
Since this pattern catered only to structured data, the technology required to harness structured data was evolved and readily available. Relational Database Management Systems (RDBMSes) evolved and were juxtaposed appropriately to harness its features for reporting.

However, it also had its own set of challenges that surfaced as the data volumes grew and new data formats started emerging. A few challenges associated with the EDW pattern are as follows:

This pattern was not as agile as the changing business requirements wanted it to be. Any change in the reporting requirement had to go through a long-winded process of data model changes, ETL code changes, and respective changes to the reporting system. Often, the ETL process was a specialized skill and became a bottleneck for reducing data to insight turnover time. The nature of analytics is unique. The more you see the output, the more you demand. Many EDW projects were deemed a failure. The failure was not from a technical perspective, but from a business perspective. Operationally, the design changes required to cater to these fast-evolving requirements were too difficult to handle.
As the data volumes grew, this pattern proved too cost prohibitive. Massive parallel-processing database technologies started evolving that specialized in data warehouse workloads. The cost of maintaining these databases was prohibitive as well. It involved expensive software prices, frequent hardware refreshes, and a substantial staffing cost. The return on investment was no longer justifiable.
As the format of data started evolving, the challenges associated with the EDW became more evident. Database technologies were developed to cater to semi-structured data (JSON). However, the fundamental concept was still RDBMS-based. The underlying technology was not able to effectively cater to these new types of data. There was more value in analyzing data that was not structured. The sheer variety of data was too complex for EDWs to handle.
The EDW was focused predominantly on Business Intelligence (BI). It facilitated the creation of scheduled reports, ad hoc data analysis, and self-service BI. Although it catered to most of the personas who performed analysis, it was not conducive to AI/ML use cases. The data in the EDW was already cleansed and structured with a razor-sharp focus on reporting. This left little room for a data scientist (statistical modelers at that time) to explore data and create a new hypothesis. In short, the EDW was primarily focused on BI.

While the EDW pattern was becoming mainstream, a perfect storm was flourishing that changed the landscape. The following section will focus on five different factors that came together to change the data architecture pattern for good.

Exploring the five factors of change

The year 2007 changed the world as we know it; the day Steve Jobs took the stage and announced the iPhone launch was a turning point in the age of data. That day brewed the perfect "data" storm.

A perfect storm is a meteorological event that occurs as a result of a rare combination of factors. In the world of data evolution, such a perfect storm occurred in the last decade, one that has catapulted data as a strategic enterprise asset. Five ingredients caused the perfect "data" storm.

Figure 1.2 – Ingredients of the perfect "data" storm

As depicted in Figure 1.2, there were five fac...