
- 520 pages
- English
- ePUB (mobile friendly)
- Available on iOS & Android
Cloud Scale Analytics with Azure Data Services
About this book
A practical guide to implementing a scalable and fast state-of-the-art analytical data estate

Key Features
- Store and analyze data with enterprise-grade security and auditing
- Perform batch, streaming, and interactive analytics to optimize your big data solutions with ease
- Develop and run parallel data processing programs using real-world enterprise scenarios

Book Description
Azure Data Lake, the modern data warehouse architecture, and related data services on Azure enable organizations to build their own customized analytical platform to fit any analytical requirements in terms of volume, speed, and quality. This book is your guide to learning all the features and capabilities of Azure data services for storing, processing, and analyzing data (structured, unstructured, and semi-structured) of any size. You will explore key techniques for ingesting and storing data and perform batch, streaming, and interactive analytics. The book also shows you how to overcome various challenges and complexities relating to productivity and scaling. Next, you will be able to develop and run massive data workloads to perform different actions. Using a cloud-based big data/modern data warehouse analytics setup, you will also be able to build secure, scalable data estates for enterprises. Finally, you will not only learn how to develop a data warehouse but also understand how to create big data programs with enterprise-grade security and auditing.
By the end of this Azure book, you will have learned how to develop a powerful and efficient analytical platform to meet enterprise needs.

What you will learn
- Implement data governance with Azure services
- Use the integrated monitoring in the Azure Portal and integrate Azure Data Lake Storage into Azure Monitor
- Explore the serverless feature for ad hoc data discovery, logical data warehousing, and data wrangling
- Implement networking with Synapse Analytics and Spark pools
- Create and run Spark jobs with Databricks clusters
- Implement streaming using Azure Functions, a serverless runtime environment on Azure
- Explore the predefined ML services in Azure and use them in your app

Who this book is for
This book is for data architects, ETL developers, and anyone who wants to get well-versed with Azure data services to implement an analytical data estate for their enterprise. The book will also appeal to data scientists and data analysts who want to explore all the capabilities of Azure data services, which can be used to store, process, and analyze any kind of data. A beginner-level understanding of data analysis and streaming is required.
Section 1: Data Warehousing and Considerations Regarding Cloud Computing
- Chapter 1, Balancing the Benefits of Data Lakes Over Data Warehouses
- Chapter 2, Connecting Requirements and Technology
Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses
- Distinguishing between Data Warehouses and Data Lakes
- Understanding the opportunities of modern cloud computing
- Exploring the benefits of AI and ML
- Answering the question
Distinguishing between Data Warehouses and Data Lakes
Understanding Data Warehouse patterns
- Star-Join/Snowflake: This is probably the best-known method for Data Warehouse modeling. Fact tables are placed in the center of the model, while Dimension tables are arranged around them, inheriting their Primary Keys into the Fact table. In the Star-Join method, we find a lot of redundancy in the tables, since all the Dimension data, including all hierarchical information (such as Product Group -> Product SubCategory -> Product Category) regarding a certain artifact (Product, Customer, and so on), is stored in one table. In a Snowflake schema, hierarchies are spread across additional tables per hierarchy level that are linked to each other through relationships. Drawn as a diagram, this arrangement resembles a snowflake pattern.
- Data Vault: This is a newer method that reflects the rising structural volatility of data sources and offers higher flexibility and faster development. Entities that need to be analyzed are stored across Hubs, Satellites, and Links. Hubs simply record the presence of an entity by storing its ID and some audit information, such as its data source, create times, and so on. Each Hub can have one or more Satellites, which store all the descriptive information about the entity. If we need to change the system and add new information about an entity, we can add another Satellite to the model that reflects just the new data. This brings the benefit of non-destructive deployments to the production system during rollout. In the Data Vault, Customer data will be stored in one Hub (the CustomerID and audit columns) and one or more Satellites (the rest of the customer information). Relationships in the model are finally expressed through Links, which store the Primary Keys of the related Hubs and, again, some metadata. Additionally, Links can have Satellites of their own that describe the relationship the Link represents. Therefore, the connection between a Customer and the Products they bought will be reflected in a Link, where the Customer-Hub-Key and the Product-Hub-Key are stored together with audit columns. A Link Satellite can, for example, reflect characteristics of the relationship, such as the amount bought, the date, or a discount. Finally, we can even add a Star-Join view schema on top to abstract all the tables of the Data Vault and make the model easier for users to understand.
- 3rd Normal Form: This is the typical database modeling technique that is also used in (and was first created for) so-called Online Transactional Processing (OLTP) databases. Artifacts are broken up into their atomic information and spread over several tables so that no redundancy is stored in any table. The Product information of a system might be split into separate tables for the product's name, color, size, price information, and many more. Deriving all the information of a 3rd Normal Form model therefore requires a lot of joining.
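The Star-Join pattern described above can be sketched with a few lines of SQL. This is a minimal, hypothetical example using Python's built-in sqlite3 module; all table and column names (DimProduct, FactSales, and so on) are illustrative, not taken from the book.

```python
import sqlite3

# Minimal Star-Join sketch: one Dimension table holding all (redundant)
# hierarchical Product data, and a Fact table inheriting its Primary Key.
con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension: descriptive and hierarchical data lives in ONE table.
    CREATE TABLE DimProduct (
        ProductKey INTEGER PRIMARY KEY,
        ProductName TEXT,
        ProductSubCategory TEXT,
        ProductCategory TEXT
    );
    -- Fact: the Dimension's Primary Key is inherited as a foreign key.
    CREATE TABLE FactSales (
        SalesKey INTEGER PRIMARY KEY,
        ProductKey INTEGER REFERENCES DimProduct(ProductKey),
        Amount REAL
    );
""")
con.execute("INSERT INTO DimProduct VALUES (1, 'Trail Bike', 'Bikes', 'Sports')")
con.execute("INSERT INTO FactSales VALUES (1, 1, 499.9)")

# Analytical queries join the Fact to its Dimensions over the inherited key.
row = con.execute("""
    SELECT d.ProductCategory, SUM(f.Amount)
    FROM FactSales f
    JOIN DimProduct d ON d.ProductKey = f.ProductKey
    GROUP BY d.ProductCategory
""").fetchone()
print(row)  # ('Sports', 499.9)
```

A Snowflake variant would move ProductSubCategory and ProductCategory into their own tables, linked by keys, trading the redundancy for extra joins.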
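The Data Vault's Hub/Satellite/Link split can be sketched the same way. Again a hypothetical sketch with sqlite3; the hash keys, table names, and audit columns are simplified illustrations of the pattern, not the book's schema.

```python
import sqlite3

NOW = "2021-01-01T00:00:00Z"  # fixed audit timestamp for the sketch

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hub: only the business key plus audit columns.
    CREATE TABLE HubCustomer (
        CustomerHashKey TEXT PRIMARY KEY,
        CustomerID TEXT, LoadDate TEXT, RecordSource TEXT
    );
    -- Satellite: descriptive attributes; another Satellite can be added
    -- later without touching the Hub (non-destructive deployment).
    CREATE TABLE SatCustomerDetails (
        CustomerHashKey TEXT REFERENCES HubCustomer(CustomerHashKey),
        Name TEXT, City TEXT, LoadDate TEXT
    );
    CREATE TABLE HubProduct (
        ProductHashKey TEXT PRIMARY KEY,
        ProductID TEXT, LoadDate TEXT, RecordSource TEXT
    );
    -- Link: stores the Hub keys of the related entities plus audit data.
    CREATE TABLE LinkCustomerProduct (
        LinkHashKey TEXT PRIMARY KEY,
        CustomerHashKey TEXT, ProductHashKey TEXT, LoadDate TEXT
    );
""")
con.execute("INSERT INTO HubCustomer VALUES ('c1', 'CUST-001', ?, 'CRM')", (NOW,))
con.execute("INSERT INTO SatCustomerDetails VALUES ('c1', 'Ada', 'Zurich', ?)", (NOW,))
con.execute("INSERT INTO HubProduct VALUES ('p1', 'PROD-042', ?, 'Shop')", (NOW,))
con.execute("INSERT INTO LinkCustomerProduct VALUES ('l1', 'c1', 'p1', ?)", (NOW,))

# Joining Link, Satellite, and Hub reconstructs the business view.
row = con.execute("""
    SELECT s.Name, p.ProductID
    FROM LinkCustomerProduct l
    JOIN SatCustomerDetails s ON s.CustomerHashKey = l.CustomerHashKey
    JOIN HubProduct p ON p.ProductHashKey = l.ProductHashKey
""").fetchone()
print(row)  # ('Ada', 'PROD-042')
```

A Star-Join view layered on top would hide these joins from end users, as the text suggests.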
Investigating ETL/ELT
Understanding Data Warehouse layers
- Landing or staging area: Data is imported into this layer in its rawest format. Nothing is changed on its way into this area; only audit information is added so that every load can be tracked.
- QS, transient, or cleansing area: This is where the work is done. Only in a few projects will you find fully consistent data. Values, sometimes even mandatory ones from the source, may be missing, content might be formatted incorrectly or even corrupted, and so on. In this zone, all the issues with ...
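The division of labor between these two layers can be illustrated with a small sketch. This is a hypothetical example; the function names, audit columns, and cleansing rule are invented for illustration and stand in for whatever a real pipeline would do.

```python
# Sketch of the first two Data Warehouse layers: rows land unchanged
# except for added audit columns, then get cleansed in a second step.

def land(rows, source):
    """Landing/staging: keep data in its rawest form, add audit info only."""
    load_ts = "2021-01-01T00:00:00Z"  # fixed load timestamp for the sketch
    return [{**r, "_source": source, "_loaded_at": load_ts} for r in rows]

def cleanse(rows):
    """Cleansing area: fix formatting and fill missing mandatory values."""
    return [
        {**r, "country": (r.get("country") or "UNKNOWN").strip().upper()}
        for r in rows
    ]

raw = [
    {"customer": "Ada", "country": " ch "},   # badly formatted value
    {"customer": "Bob", "country": None},     # missing mandatory value
]
staged = land(raw, source="crm")
cleaned = cleanse(staged)
print(cleaned[0]["country"])  # CH
print(cleaned[1]["country"])  # UNKNOWN
```

The key point the layers encode: the landing area stays bit-for-bit faithful to the source (plus audit columns), so any cleansing decision can later be re-run or revised against the original data.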
Table of contents
- Cloud Scale Analytics with Azure Data Services
- Contributors
- Preface
- Section 1: Data Warehousing and Considerations Regarding Cloud Computing
- Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses
- Chapter 2: Connecting Requirements and Technology
- Section 2: The Storage Layer
- Chapter 3: Understanding the Data Lake Storage Layer
- Chapter 4: Understanding Synapse SQL Pools and SQL Options
- Section 3: Cloud-Scale Data Integration and Data Transformation
- Chapter 5: Integrating Data into Your Modern Data Warehouse
- Chapter 6: Using Synapse Spark Pools
- Chapter 7: Using Databricks Spark Clusters
- Chapter 8: Streaming Data into Your MDWH
- Chapter 9: Integrating Azure Cognitive Services and Machine Learning
- Chapter 10: Loading the Presentation Layer
- Section 4: Data Presentation, Dashboarding, and Distribution
- Chapter 11: Developing and Maintaining the Presentation Layer
- Chapter 12: Distributing Data
- Chapter 13: Introducing Industry Data Models
- Chapter 14: Establishing Data Governance
- Other Books You May Enjoy