Data Lakes
Anne Laurent, Dominique Laurent, Cédrine Madera
About This Book
Although the concept of a data lake is less than 10 years old, data lakes are already widely implemented within large companies. Their goal is to deal efficiently with ever-growing volumes of heterogeneous data while meeting varied and sophisticated user needs. However, defining and building a data lake remains a challenge, as no consensus has been reached so far. Data Lakes presents recent outcomes and trends in the field of data repositories. The main topics discussed are the data-driven architecture of a data lake; the management of metadata – supplying key information about the stored data, master data and reference data; the roles of linked data and fog computing in a data lake ecosystem; and how gravity principles apply in the context of data lakes. A variety of case studies are also presented, providing the reader with practical examples of data lake management.
1. Introduction to Data Lakes: Definitions and Discussions
1.1. Introduction to data lakes
1.2. Literature review and discussion
- storing data in their native form at low cost. Low cost is achieved because (1) data servers are cheap (typically based on standard x86 technology) and (2) no data transformation, cleaning or preparation is required (thus avoiding very costly steps);
- storing various types of data, such as blobs, data from relational DBMSs, semi-structured data or multimedia data;
- transforming the data only on exploitation. This reduces the cost of data modeling and integration incurred in standard data warehouse design. This feature is known as the schema-on-read approach;
- requiring specific analysis tools to use the data. This is required because data lakes store raw data;
- allowing data to be identified or eliminated;
- providing users with information on data provenance, such as the data source, the history of changes or data versioning.
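To make the schema-on-read and provenance points above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular data lake product: the in-memory list `lake`, the functions `ingest` and `read_with_schema`, and the sample payloads are all hypothetical names invented for this example.

```python
import json
from datetime import datetime, timezone

# Hypothetical in-memory "lake": each entry keeps the payload untouched,
# alongside simple provenance metadata (source and ingestion time).
lake = []


def ingest(raw: str, source: str) -> None:
    """Store the payload exactly as received; no cleaning or modeling."""
    lake.append({
        "raw": raw,
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })


def read_with_schema(schema: dict) -> list:
    """Apply a schema only at read time (schema-on-read).

    `schema` maps a field name to a converter function. Entries that are
    not JSON objects, or that do not fit the schema, are skipped by this
    reader but remain in the lake for other consumers.
    """
    rows = []
    for entry in lake:
        try:
            record = json.loads(entry["raw"])
        except json.JSONDecodeError:
            continue  # not JSON: ignored by this particular reader
        if not isinstance(record, dict):
            continue
        try:
            row = {name: convert(record[name]) for name, convert in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # missing field or failed conversion
        row["_source"] = entry["source"]  # carry provenance to the consumer
        rows.append(row)
    return rows


# Heterogeneous payloads are accepted as-is at ingestion time.
ingest('{"sensor": "t1", "temp": "21.5"}', source="iot-gateway")
ingest('{"sensor": "t2", "temp": "19.0", "unit": "C"}', source="iot-gateway")
ingest("free-text log line", source="app-logs")

# The schema is chosen by the consumer, at exploitation time.
readings = read_with_schema({"sensor": str, "temp": float})
```

In this sketch the cost of typing and filtering the data is paid only by the reader that needs it, while the raw payloads and their source information stay available for other uses with different schemas.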
1) only Apache Hadoop technology is considered;
2) criteria for preventing the movement of the data are not taken into account;
3) data governance is decoupled from data lakes;
4) data lakes are seen as data warehouse “killers”.