Cloud Scale Analytics with Azure Data Services
eBook - ePub

  1. 520 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

A practical guide to implementing a scalable and fast state-of-the-art analytical data estate.

Key Features
  • Store and analyze data with enterprise-grade security and auditing
  • Perform batch, streaming, and interactive analytics to optimize your big data solutions with ease
  • Develop and run parallel data processing programs using real-world enterprise scenarios

Book Description
Azure Data Lake, the modern data warehouse architecture, and related data services on Azure enable organizations to build their own customized analytical platform to fit any analytical requirements in terms of volume, speed, and quality. This book is your guide to learning all the features and capabilities of Azure data services for storing, processing, and analyzing data (structured, unstructured, and semi-structured) of any size. You will explore key techniques for ingesting and storing data and perform batch, streaming, and interactive analytics. The book also shows you how to overcome various challenges and complexities relating to productivity and scaling. Next, you will be able to develop and run massive data workloads to perform different actions. Using a cloud-based big data-modern data warehouse-analytics setup, you will also be able to build secure, scalable data estates for enterprises. Finally, you will not only learn how to develop a data warehouse but also understand how to apply enterprise-grade security and auditing to big data programs. By the end of this Azure book, you will have learned how to develop a powerful and efficient analytical platform to meet enterprise needs.

What you will learn
  • Implement data governance with Azure services
  • Use integrated monitoring in the Azure portal and integrate Azure Data Lake Storage into Azure Monitor
  • Explore the serverless feature for ad hoc data discovery, logical data warehousing, and data wrangling
  • Implement networking with Synapse Analytics and Spark pools
  • Create and run Spark jobs with Databricks clusters
  • Implement streaming using Azure Functions, a serverless runtime environment on Azure
  • Explore the predefined ML services in Azure and use them in your app

Who this book is for
This book is for data architects, ETL developers, or anyone who wants to get well-versed with Azure data services to implement an analytical data estate for their enterprise. The book will also appeal to data scientists and data analysts who want to explore all the capabilities of Azure data services, which can be used to store, process, and analyze any kind of data. A beginner-level understanding of data analysis and streaming is required.

Cloud Scale Analytics with Azure Data Services by Patrik Borosch is available in PDF and ePUB format.

Section 1: Data Warehousing and Considerations Regarding Cloud Computing

This section examines the question of whether Data Warehouses are still required given the rise of the enterprise Data Lake and provides a brief overview of trends and developments in the data and AI market. As cloud computing adds flexible and scalable services, there are hardly any limits left in terms of the source formats and volumes that can be processed for AI requirements, and since AI and machine learning are on everybody's mind at the moment, the book asks what all this entails and where we are heading. In addition, we'll take a technology-agnostic look at the components that make up a successful analytical system. From this agnostic viewpoint, we will then try to find the right Azure services to build a modern data warehouse.
This section comprises the following chapters:
  • Chapter 1, Balancing the Benefits of Data Lakes over Data Warehouses
  • Chapter 2, Connecting Requirements and Technology

Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses

Is the Data Warehouse dead with the advent of Data Lakes? There is disagreement everywhere about the need for Data Warehousing in a modern data estate. With the rise of Data Lakes and Big Data technology, many people turn to newer technologies than relational databases for their analytical efforts. Establishing a data-driven company seems to be possible without all those narrow definitions and planned structures, the ETL/ELT, and all the indexing for performance. But when we examine the technology carefully, and compare the requirements formulated in analytical projects, free of prejudice, with the functionality that the chosen services or software packages can deliver, we often find gaps on both ends. This chapter discusses the capabilities of Data Warehousing and Data Lakes and introduces the concept of the Modern Data Warehouse.
With all the innovations that have been brought to us in recent years, such as faster hardware, new technologies, and new paradigms like the Data Lake, older concepts and methods are being questioned and challenged. In this chapter, I would like to explore the evolution of the analytical world and try to answer the question: is the Data Warehouse really obsolete?
We'll find out by covering the following topics:
  • Distinguishing between Data Warehouses and Data Lakes
  • Understanding the opportunities of modern cloud computing
  • Exploring the benefits of AI and ML
  • Answering the question

Distinguishing between Data Warehouses and Data Lakes

There are several definitions of Data Warehousing on the internet. The narrower ones characterize a warehouse as just the database and the model used in that database; the wider descriptions treat the term as a method and as the collection of all organizational and technological components that make up a BI solution, covering everything from the Extract, Transform, Load (ETL) tool to the database, the model, and, of course, the reporting and dashboarding solution.

Understanding Data Warehouse patterns

When we look at the Data Warehousing method in general, at its heart, we find a database that offers a certain table structure. We almost always find two main types of artifacts in the database: Facts and Dimensions.
Facts provide all the measurable information that we want to analyze; for example, the quantities of products sold per customer, per region, per sales representative, and per time. Facts are normally quite narrow objects, but with a lot of rows stored.
In the Dimensions, we find all the descriptive information that can be linked to the Facts for analysis. Every piece of information that a user puts on a report or dashboard to aggregate, group, filter, and view the fact data is collected in the Dimensions. All the descriptive data about entities such as Customer, Product, Contract, Address, and so on that might need to be analyzed and correlated is stored here. Typically, these objects are stored as tables in the database and are joined using their key columns. Dimensions are normally wide objects, sometimes with controlled redundancy, depending on the chosen modeling method.
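To make the idea concrete, here is a minimal T-SQL sketch of a Fact/Dimension pair and a typical analytical query; the table and column names are purely illustrative and not taken from the book:

-- Illustrative only: a narrow Fact table with measures and a wide Dimension table
-- with descriptive attributes.
CREATE TABLE DimCustomer (
    CustomerKey   INT            NOT NULL PRIMARY KEY,   -- surrogate key
    CustomerName  NVARCHAR(100)  NOT NULL,
    Region        NVARCHAR(50)   NOT NULL                -- attribute used to group and filter
);

CREATE TABLE FactSales (
    DateKey       INT            NOT NULL,               -- key of a date dimension (not shown)
    CustomerKey   INT            NOT NULL,               -- key of DimCustomer
    Quantity      INT            NOT NULL,               -- measure
    SalesAmount   DECIMAL(18,2)  NOT NULL                -- measure
);

-- The Dimension groups and filters the measures stored in the Fact table.
SELECT d.Region, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimCustomer AS d ON d.CustomerKey = f.CustomerKey
GROUP BY d.Region;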
Over the years of Data Warehouse evolution, three main methods for modeling the Facts and Dimensions within the database have crystallized:
  • Star-Join/Snowflake: This is probably the most famous method for Data Warehouse modeling. Fact tables are put in the center of the model, while Dimension tables are arranged around them, with their Primary Keys carried into the Fact table. In the Star-Join method, we find a lot of redundancy in the tables, since all the Dimension data, including all hierarchical information (such as Product Group -> Product SubCategory -> Product Category) regarding a certain artifact (Product, Customer, and so on), is stored in one table. In a Snowflake schema, hierarchies are split into additional tables, one per hierarchy level, that are linked to each other through relationships. Drawn as a graph, this turns out to show a kind of snowflake pattern.
  • Data Vault: This is a newer method that reflects the rising structural volatility of data sources and offers higher flexibility and speed during development. Entities that need to be analyzed are stored across Hubs, Satellites, and Links (a simplified sketch follows this list). Hubs simply reflect the presence of an entity by storing its ID and some audit information such as its data source, create times, and so on. Each Hub can have one or more Satellites. These Satellites store all the descriptive information about the entity. If we need to change the system and add new information about an entity, we can add another Satellite to the model, reflecting just the new data. This brings the benefit of non-destructive deployments to the productive system during rollout. In the Data Vault, the Customer data will be stored in one Hub (the CustomerID and audit columns) and one or more Satellites (the rest of the customer information). The structure of the model is finally provided by Links, which store the Primary Keys of the Hubs they connect and, again, some metadata. Additionally, Links can have Satellites of their own that describe the relationship the Link represents. Therefore, the connection between a Customer and the Products they bought will be reflected in a Link, where the Customer-Hub-Key and the Product-Hub-Key are stored together with audit columns. A Link Satellite can, for example, reflect some characteristics of the relationship, such as the amount bought, the date, or a discount. Finally, we can even add a Star-Join-View schema to abstract all the tables of the Data Vault and make it easier for users to understand.
  • 3rd Normal Form: This is the typical database modeling technique that is also used in (and was first created for) so-called Online Transactional Processing (OLTP) databases. Artifacts are broken up into their atomic information and are spread over several tables so that no redundancy is stored in any table. The Product information of a system might be split into separate tables for the product's name, color, size, price information, and many more. To derive all the information of a 3rd Normal Form model, a lot of joining is necessary.
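As a rough, purely illustrative sketch of the Data Vault structure described above (hypothetical names, simplified audit columns, no constraints beyond primary keys), the Customer Hub, one of its Satellites, and a Link to a Product Hub could look like this:

-- Illustrative Data Vault sketch; names, key types, and audit columns are simplified assumptions.
CREATE TABLE HubCustomer (
    CustomerHashKey  CHAR(32)       NOT NULL PRIMARY KEY,  -- hash of the business key
    CustomerID       NVARCHAR(50)   NOT NULL,              -- business key from the source
    LoadDate         DATETIME2      NOT NULL,              -- audit column
    RecordSource     NVARCHAR(100)  NOT NULL               -- audit column
);

CREATE TABLE SatCustomerDetails (
    CustomerHashKey  CHAR(32)       NOT NULL,              -- refers to HubCustomer
    LoadDate         DATETIME2      NOT NULL,
    RecordSource     NVARCHAR(100)  NOT NULL,
    CustomerName     NVARCHAR(100)  NULL,                   -- descriptive attributes live in Satellites
    Address          NVARCHAR(200)  NULL,
    PRIMARY KEY (CustomerHashKey, LoadDate)
);

CREATE TABLE LinkCustomerProduct (
    LinkHashKey      CHAR(32)       NOT NULL PRIMARY KEY,  -- hash of the combined business keys
    CustomerHashKey  CHAR(32)       NOT NULL,              -- refers to HubCustomer
    ProductHashKey   CHAR(32)       NOT NULL,              -- refers to a HubProduct (not shown)
    LoadDate         DATETIME2      NOT NULL,
    RecordSource     NVARCHAR(100)  NOT NULL
);

A new requirement, for example loyalty data, would then be added as a further Satellite on HubCustomer rather than by altering existing tables, which is what makes the deployments non-destructive.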

Investigating ETL/ELT

But how does data finally land in the Data Warehouse database? The process and the related tools are named Extract, Transform, Load (ETL), although, depending on the sequence in which we implement the steps, it may be referred to as ELT. There are several possible ways to implement data loading into a Data Warehouse. It can be done with specialized ETL tools such as Azure Data Factory in the cloud, SQL Server Integration Services (SSIS), Informatica, Talend, or IBM DataStage, for example.
The biggest advantage of these tools is the availability of wide catalogues of ready-to-use source and target connectors. They can connect directly to a source, query the needed data, and even transform it while it is being transported to the target. In the end, data is loaded into the Data Warehouse database. Other advantages include their graphical interfaces, where complex logic can be implemented on a point-and-click basis, making it easy to understand and maintain.
There are other options as well. Data is often pushed by source applications and a direct connection for the data extraction process is not wanted at all. Many times, files are provided that are stored somewhere near the Data Warehouse database and then need to be imported. Maybe there is no ETL tool available. Since nearly every database nowadays provides loader tools, the import can be accomplished using those tools in a scripted environment. Once the data has made its way to the database tables, the transformational steps are done using Stored Procedures that then move the data through different DWH stages or layers to the final Core Data Warehouse.
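As a hedged sketch of this scripted pattern (all object names, columns, and the file path are hypothetical), a load could first bulk-insert a delivered file into a staging table and then call a Stored Procedure that moves the rows on to the core layer:

-- Illustrative only: import a delivered file into a staging table, then move the data
-- to the core layer with a Stored Procedure. Names and the file path are assumptions.
BULK INSERT stage.Customer
FROM 'C:\data\incoming\customer.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
GO

CREATE OR ALTER PROCEDURE dbo.LoadCoreCustomer
AS
BEGIN
    -- Update changed customers and insert new ones in the core table.
    MERGE core.Customer AS tgt
    USING stage.Customer AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.Address      = src.Address
    WHEN NOT MATCHED THEN
        INSERT (CustomerID, CustomerName, Address)
        VALUES (src.CustomerID, src.CustomerName, src.Address);
END;
GO

-- The orchestration (a script, a scheduler, or an ADF pipeline) then simply calls:
EXEC dbo.LoadCoreCustomer;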

Understanding Data Warehouse layers

Talking about the Data Warehouse layers, we nearly always find several steps that are processed before the data is provided for reporting or dashboarding. Typically, there are at least the following stages or layers:
  • Landing or staging area: Data is imported into this layer in its rawest format. Nothing is changed on its way into this area; only audit information is added so that every load can be tracked.
  • QS, transient, or cleansing area: This is where the work is done. You will only find a few projects where data is consistent. Values, sometimes even mandatory ones from the source, may be missing, content might be formatted incorrectly or even corrupted, and so on. In this zone, all the issues with ...

Table of contents

  1. Cloud Scale Analytics with Azure Data Services
  2. Contributors
  3. Preface
  4. Section 1: Data Warehousing and Considerations Regarding Cloud Computing
  5. Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses
  6. Chapter 2: Connecting Requirements and Technology
  7. Section 2: The Storage Layer
  8. Chapter 3: Understanding the Data Lake Storage Layer
  9. Chapter 4: Understanding Synapse SQL Pools and SQL Options
  10. Section 3: Cloud-Scale Data Integration and Data Transformation
  11. Chapter 5: Integrating Data into Your Modern Data Warehouse
  12. Chapter 6: Using Synapse Spark Pools
  13. Chapter 7: Using Databricks Spark Clusters
  14. Chapter 8: Streaming Data into Your MDWH
  15. Chapter 9: Integrating Azure Cognitive Services and Machine Learning
  16. Chapter 10: Loading the Presentation Layer
  17. Section 4: Data Presentation, Dashboarding, and Distribution
  18. Chapter 11: Developing and Maintaining the Presentation Layer
  19. Chapter 12: Distributing Data
  20. Chapter 13: Introducing Industry Data Models
  21. Chapter 14: Establishing Data Governance
  22. Other Books You May Enjoy