Data Lakehouse in Action
eBook - ePub

Data Lakehouse in Action

Pradeep Menon

Compartir libro
  1. 206 páginas
  2. English
  3. ePUB (apto para móviles)
  4. Disponible en iOS y Android
eBook - ePub

Data Lakehouse in Action

Pradeep Menon

Detalles del libro
Vista previa del libro
Índice
Citas

Información del libro

Propose a new scalable data architecture paradigm, Data Lakehouse, that addresses the limitations of current data architecture patternsKey Features• Understand how data is ingested, stored, served, governed, and secured for enabling data analytics• Explore a practical way to implement Data Lakehouse using cloud computing platforms like Azure• Combine multiple architectural patterns based on an organization's needs and maturity levelBook DescriptionThe Data Lakehouse architecture is a new paradigm that enables large-scale analytics. This book will guide you in developing data architecture in the right way to ensure your organization's success.The first part of the book discusses the different data architectural patterns used in the past and the need for a new architectural paradigm, as well as the drivers that have caused this change. It covers the principles that govern the target architecture, the components that form the Data Lakehouse architecture, and the rationale and need for those components. The second part deep dives into the different layers of Data Lakehouse. It covers various scenarios and components for data ingestion, storage, data processing, data serving, analytics, governance, and data security. The book's third part focuses on the practical implementation of the Data Lakehouse architecture in a cloud computing platform. It focuses on various ways to combine the Data Lakehouse pattern to realize macro-patterns, such as Data Mesh and Data Hub-Spoke, based on the organization's needs and maturity level. The frameworks introduced will be practical and organizations can readily benefit from their application.By the end of this book, you'll clearly understand how to implement the Data Lakehouse architecture pattern in a scalable, agile, and cost-effective manner.What you will learn• Understand the evolution of the Data Architecture patterns for analytics• Become well versed in the Data Lakehouse pattern and how it enables data analytics• Focus on methods to ingest, process, store, and govern data in a Data Lakehouse architecture• Learn techniques to serve data and perform analytics in a Data Lakehouse architecture• Cover methods to secure the data in a Data Lakehouse architecture• Implement Data Lakehouse in a cloud computing platform such as Azure• Combine Data Lakehouse in a macro-architecture pattern such as Data MeshWho this book is forThis book is for data architects, big data engineers, data strategists and practitioners, data stewards, and cloud computing practitioners looking to become well-versed with modern data architecture patterns to enable large-scale analytics. Basic knowledge of data architecture and familiarity with data warehousing concepts are required.

Preguntas frecuentes

¿Cómo cancelo mi suscripción?
Simplemente, dirígete a la sección ajustes de la cuenta y haz clic en «Cancelar suscripción». Así de sencillo. Después de cancelar tu suscripción, esta permanecerá activa el tiempo restante que hayas pagado. Obtén más información aquí.
¿Cómo descargo los libros?
Por el momento, todos nuestros libros ePub adaptables a dispositivos móviles se pueden descargar a través de la aplicación. La mayor parte de nuestros PDF también se puede descargar y ya estamos trabajando para que el resto también sea descargable. Obtén más información aquí.
¿En qué se diferencian los planes de precios?
Ambos planes te permiten acceder por completo a la biblioteca y a todas las funciones de Perlego. Las únicas diferencias son el precio y el período de suscripción: con el plan anual ahorrarás en torno a un 30 % en comparación con 12 meses de un plan mensual.
¿Qué es Perlego?
Somos un servicio de suscripción de libros de texto en línea que te permite acceder a toda una biblioteca en línea por menos de lo que cuesta un libro al mes. Con más de un millón de libros sobre más de 1000 categorías, ¡tenemos todo lo que necesitas! Obtén más información aquí.
¿Perlego ofrece la función de texto a voz?
Busca el símbolo de lectura en voz alta en tu próximo libro para ver si puedes escucharlo. La herramienta de lectura en voz alta lee el texto en voz alta por ti, resaltando el texto a medida que se lee. Puedes pausarla, acelerarla y ralentizarla. Obtén más información aquí.
¿Es Data Lakehouse in Action un PDF/ePUB en línea?
Sí, puedes acceder a Data Lakehouse in Action de Pradeep Menon en formato PDF o ePUB, así como a otros libros populares de Computer Science y Data Modelling & Design. Tenemos más de un millón de libros disponibles en nuestro catálogo para que explores.

Información

Año
2022
ISBN
9781801815109
Edición
1

PART 1: Architectural Patterns for Analytics

This section describes the evolution of data architecture patterns for analytics. It addresses the challenges posed by different architectural patterns and establishes a new paradigm, that is, the data lakehouse. An overview of the data lakehouse architecture is also provided, which includes coverage of the principles that govern the target architecture, the components that form the data lakehouse architecture, the rationale and need for those components, and the architectural principles adopted to make a data lake scalable and robust.
This section comprises the following chapters:
  • Chapter 1, Introducing the Evolution of Data Analytics Patterns
  • Chapter 2, The Data Lakehouse Architecture Overview

Chapter 1: Introducing the Evolution of Data Analytics Patterns

Data analytics is an ever-changing field. A little history will help you appreciate the strides in this field and how data architectural patterns have evolved to fulfill the ever-changing need for analytics.
First, let's start with some definitions:
  • What is analytics? Analytics is defined as any action that converts data into insights.
  • What is data architecture? Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.
Analytics and the data architecture that enables analytics goes a long way. Let's now explore some of the patterns that have evolved over the last few decades.
This chapter explores the genesis of data growth and explains the need for a new paradigm in data architecture. This chapter starts by examining the predominant paradigm, the enterprise data warehouse, popular in the 1990s and 2000s. It explores the challenges associated with this paradigm and then covers the drivers that caused an explosion in data. It further examines the rise of a new paradigm, the data lake, and its challenges. Furthermore, this chapter ends by advocating the need for a new paradigm, the data lakehouse. It clarifies the key benefits delivered by a well-architected data lakehouse.
We'll cover all of this in the following topics:
  • Discovering the enterprise data warehouse era
  • Exploring the five factors of change
  • Investigating the data lake era
  • Introducing the data lakehouse paradigm

Discovering the enterprise data warehouse era

The Enterprise Data Warehouse (EDW) pattern, popularized by Ralph Kimball and Bill Inmon, was predominant in the 1990s and 2000s. The needs of this era were relatively straightforward (at least compared to the current context). The focus was predominantly on optimizing database structures to satisfy reporting requirements. Analytics was synonymous with reporting. Machine learning was a specialized field and was not ubiquitous in enterprises.
A typical EDW pattern is depicted in the following figure:
Figure 1.1 – A typical EDW pattern
Figure 1.1 – A typical EDW pattern
As shown in Figure 1.1, the pattern entailed source systems composed of databases or flat-file structures. The data sources are predominantly structured, that is, rows and columns. A process called Extract-Transform-Load (ETL) first extracts the data from the source systems. Then, the process transforms the data into a shape and form that is conducive for analysis. Once the data is transformed, it is loaded into an EDW. From there, the subsets of data are then populated to downstream data marts. Data marts can be conceived of as mini data warehouses that cater to the business requirements of a specific department.
As you can imagine, this pattern primarily was focused on the following:
  • Creating a data structure that is optimized for storage and modeled for reporting
  • Focusing on the reporting requirements of the business
  • Harnessing the structured data into actionable insights
Every coin has two sides. The EDW pattern is not an exception. It has its pros and it has its cons. This pattern has survived the test of time. It was widespread and well adopted because of the following key advantages:
  • Since most of the analytical requirements were related to reporting, this pattern effectively addressed many organizations' reporting requirements.
  • Large enterprise data models were able to structure an organization's data into logical and physical models. This pattern gave a structure to manage the organization's data in a modular and efficient manner.
  • Since this pattern catered only to structured data, the technology required to harness structured data was evolved and readily available. Relational Database Management Systems (RDBMSes) evolved and were juxtaposed appropriately to harness its features for reporting.
However, it also had its own set of challenges that surfaced as the data volumes grew and new data formats started emerging. A few challenges associated with the EDW pattern are as follows:
  • This pattern was not as agile as the changing business requirements wanted it to be. Any change in the reporting requirement had to go through a long-winded process of data model changes, ETL code changes, and respective changes to the reporting system. Often, the ETL process was a specialized skill and became a bottleneck for reducing data to insight turnover time. The nature of analytics is unique. The more you see the output, the more you demand. Many EDW projects were deemed a failure. The failure was not from a technical perspective, but from a business perspective. Operationally, the design changes required to cater to these fast-evolving requirements were too difficult to handle.
  • As the data volumes grew, this pattern proved too cost prohibitive. Massive parallel-processing database technologies started evolving that specialized in data warehouse workloads. The cost of maintaining these databases was prohibitive as well. It involved expensive software prices, frequent hardware refreshes, and a substantial staffing cost. The return on investment was no longer justifiable.
  • As the format of data started evolving, the challenges associated with the EDW became more evident. Database technologies were developed to cater to semi-structured data (JSON). However, the fundamental concept was still RDBMS-based. The underlying technology was not able to effectively cater to these new types of data. There was more value in analyzing data that was not structured. The sheer variety of data was too complex for EDWs to handle.
  • The EDW was focused predominantly on Business Intelligence (BI). It facilitated the creation of scheduled reports, ad hoc data analysis, and self-service BI. Although it catered to most of the personas who performed analysis, it was not conducive to AI/ML use cases. The data in the EDW was already cleansed and structured with a razor-sharp focus on reporting. This left little room for a data scientist (statistical modelers at that time) to explore data and create a new hypothesis. In short, the EDW was primarily focused on BI.
While the EDW pattern was becoming mainstream, a perfect storm was flourishing that changed the landscape. The following section will focus on five different factors that came together to change the data architecture pattern for good.

Exploring the five factors of change

The year 2007 changed the world as we know it; the day Steve Jobs took the stage and announced the iPhone launch was a turning point in the age of data. That day brewed the perfect "data" storm.
A perfect storm is a meteorological event that occurs as a result of a rare combination of factors. In the world of data evolution, such a perfect storm occurred in the last decade, one that has catapulted data as a strategic enterprise asset. Five ingredients caused the perfect "data" storm.
Figure 1.2 – Ingredients of the perfect
Figure 1.2 – Ingredients of the perfect "data" storm
As depicted in Figure 1.2, there were five fac...

Índice