Data Lakehouse in Action
Pradeep Menon
- 206 pages
- English
About This Book
Proposes a new scalable data architecture paradigm, the Data Lakehouse, that addresses the limitations of current data architecture patterns.
Key Features
- Understand how data is ingested, stored, served, governed, and secured to enable data analytics
- Explore a practical way to implement a Data Lakehouse using cloud computing platforms such as Azure
- Combine multiple architectural patterns based on an organization's needs and maturity level
Book Description
The Data Lakehouse architecture is a new paradigm that enables large-scale analytics. This book will guide you in developing data architecture the right way to ensure your organization's success.
The first part of the book discusses the different data architectural patterns used in the past, the need for a new architectural paradigm, and the drivers that have caused this change. It covers the principles that govern the target architecture, the components that form the Data Lakehouse architecture, and the rationale and need for those components. The second part deep dives into the different layers of the Data Lakehouse. It covers various scenarios and components for data ingestion, storage, data processing, data serving, analytics, governance, and data security. The third part focuses on the practical implementation of the Data Lakehouse architecture in a cloud computing platform. It covers various ways to combine the Data Lakehouse pattern into macro-patterns, such as Data Mesh and Data Hub-Spoke, based on the organization's needs and maturity level.
The frameworks introduced are practical, and organizations can readily benefit from their application. By the end of this book, you'll clearly understand how to implement the Data Lakehouse architecture pattern in a scalable, agile, and cost-effective manner.
What you will learn
- Understand the evolution of data architecture patterns for analytics
- Become well versed in the Data Lakehouse pattern and how it enables data analytics
- Focus on methods to ingest, process, store, and govern data in a Data Lakehouse architecture
- Learn techniques to serve data and perform analytics in a Data Lakehouse architecture
- Cover methods to secure the data in a Data Lakehouse architecture
- Implement a Data Lakehouse in a cloud computing platform such as Azure
- Combine the Data Lakehouse in a macro-architecture pattern such as Data Mesh
Who this book is for
This book is for data architects, big data engineers, data strategists and practitioners, data stewards, and cloud computing practitioners looking to become well versed with modern data architecture patterns to enable large-scale analytics. Basic knowledge of data architecture and familiarity with data warehousing concepts are required.
PART 1: Architectural Patterns for Analytics
- Chapter 1, Introducing the Evolution of Data Analytics Patterns
- Chapter 2, The Data Lakehouse Architecture Overview
Chapter 1: Introducing the Evolution of Data Analytics Patterns
- What is analytics? Analytics is defined as any action that converts data into insights.
- What is data architecture? Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.
- Discovering the enterprise data warehouse era
- Exploring the five factors of change
- Investigating the data lake era
- Introducing the data lakehouse paradigm
Discovering the enterprise data warehouse era
- Creating a data structure that is optimized for storage and modeled for reporting
- Focusing on the reporting requirements of the business
- Harnessing the structured data into actionable insights
- Since most of the analytical requirements were related to reporting, this pattern effectively addressed many organizations' reporting requirements.
- Large enterprise data models were able to structure an organization's data into logical and physical models. This pattern gave a structure to manage the organization's data in a modular and efficient manner.
- Since this pattern catered only to structured data, the technology required to harness structured data had matured and was readily available. Relational Database Management Systems (RDBMSes) evolved, and their features were well suited to reporting workloads.
- This pattern was not as agile as changing business requirements demanded. Any change in reporting requirements had to go through a long-winded process of data model changes, ETL code changes, and corresponding changes to the reporting system. ETL development was a specialized skill and often became a bottleneck, lengthening the time from data to insight. The nature of analytics is unique: the more output you see, the more you demand. Many EDW projects were deemed failures, not from a technical perspective but from a business one. Operationally, the design changes required to cater to these fast-evolving requirements were too difficult to handle.
- As the data volumes grew, this pattern proved too cost prohibitive. Massive parallel-processing database technologies started evolving that specialized in data warehouse workloads. The cost of maintaining these databases was prohibitive as well. It involved expensive software prices, frequent hardware refreshes, and a substantial staffing cost. The return on investment was no longer justifiable.
- As the format of data started evolving, the challenges associated with the EDW became more evident. Database technologies were developed to cater to semi-structured data (for example, JSON), but the fundamental concept was still RDBMS-based, and the underlying technology could not effectively cater to these new types of data. There was more value in analyzing data that was not structured, and the sheer variety of data was too complex for EDWs to handle.
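As a small illustration of why semi-structured data strained the EDW model (this sketch and its field names are hypothetical, not from the book): a nested JSON event with a variable-length array does not map directly onto a fixed relational table, so an ETL step must flatten it into rows before loading.

```python
import json

# A hypothetical clickstream event: nested objects and a variable-length
# "items" array, which a fixed relational schema handles poorly.
event = json.loads("""
{
  "user_id": 42,
  "action": "purchase",
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ],
  "device": {"os": "iOS", "app_version": "3.1"}
}
""")

def flatten(record):
    """Explode the nested 'items' array into flat rows, one per item,
    as an ETL step might do before loading a relational EDW table."""
    base = {
        "user_id": record["user_id"],
        "action": record["action"],
        "device_os": record.get("device", {}).get("os"),
    }
    return [dict(base, sku=i["sku"], qty=i["qty"]) for i in record.get("items", [])]

rows = flatten(event)
for row in rows:
    print(row)
```

Every new field or deeper level of nesting in the source data forces schema and ETL changes like this, which is the rigidity the bullet above describes.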
- The EDW was focused predominantly on Business Intelligence (BI). It facilitated the creation of scheduled reports, ad hoc data analysis, and self-service BI. Although it catered to most of the personas who performed analysis, it was not conducive to AI/ML use cases. The data in the EDW was already cleansed and structured with a razor-sharp focus on reporting. This left little room for a data scientist (statistical modelers at that time) to explore data and create a new hypothesis. In short, the EDW was primarily focused on BI.