eBook - ePub

Data Engineering on Azure

Name: Data Engineering on Azure
Author: Vlad Riscutia

Vlad Riscutia

Buch teilen

336 Seiten
English
ePUB (handyfreundlich)
Über iOS und Android verfügbar

eBook - ePub

Data Engineering on Azure

Vlad Riscutia

Angaben zum Buch

Buchvorschau

Inhaltsverzeichnis

Quellenangaben

Über dieses Buch

Build a data platform to the industry-leading standards set by Microsoft's own infrastructure. Summary
In Data Engineering on Azure you will learn how to: Pick the right Azure services for different data scenarios
Manage data inventory
Implement production quality data modeling, analytics, and machine learning workloads
Handle data governance
Using DevOps to increase reliability
Ingesting, storing, and distributing data
Apply best practices for compliance and access control Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft's own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring an engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning. Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications. About the technology
Build secure, stable data platforms that can scale to loads of any size. When a project moves from the lab into production, you need confidence that it can stand up to real-world challenges. This book teaches you to design and implement cloud-based data infrastructure that you can easily monitor, scale, and modify. About the book
In Data Engineering on Azure you'll learn the skills you need to build and maintain big data platforms in massive enterprises. This invaluable guide includes clear, practical guidance for setting up infrastructure, orchestration, workloads, and governance. As you go, you'll set up efficient machine learning pipelines, and then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms. What's inside Data inventory and data governance
Assure data quality, compliance, and distribution
Build automated pipelines to increase reliability
Ingest, store, and distribute data
Production-quality data modeling, analytics, and machine learning About the reader
For data engineers familiar with cloud computing and DevOps. About the author
Vlad Riscutia is a software architect at Microsoft. Table of Contents 1 Introduction
PART 1 INFRASTRUCTURE
2 Storage
3 DevOps
4 Orchestration
PART 2 WORKLOADS
5 Processing
6 Analytics
7 Machine learning
PART 3 GOVERNANCE
8 Metadata
9 Data quality
10 Compliance
11 Distributing data

Häufig gestellte Fragen

Wie kann ich mein Abo kündigen?

Gehe einfach zum Kontobereich in den Einstellungen und klicke auf „Abo kündigen“ – ganz einfach. Nachdem du gekündigt hast, bleibt deine Mitgliedschaft für den verbleibenden Abozeitraum, den du bereits bezahlt hast, aktiv. Mehr Informationen hier.

(Wie) Kann ich Bücher herunterladen?

Derzeit stehen all unsere auf Mobilgeräte reagierenden ePub-Bücher zum Download über die App zur Verfügung. Die meisten unserer PDFs stehen ebenfalls zum Download bereit; wir arbeiten daran, auch die übrigen PDFs zum Download anzubieten, bei denen dies aktuell noch nicht möglich ist. Weitere Informationen hier.

Welcher Unterschied besteht bei den Preisen zwischen den Aboplänen?

Mit beiden Aboplänen erhältst du vollen Zugang zur Bibliothek und allen Funktionen von Perlego. Die einzigen Unterschiede bestehen im Preis und dem Abozeitraum: Mit dem Jahresabo sparst du auf 12 Monate gerechnet im Vergleich zum Monatsabo rund 30 %.

Was ist Perlego?

Wir sind ein Online-Abodienst für Lehrbücher, bei dem du für weniger als den Preis eines einzelnen Buches pro Monat Zugang zu einer ganzen Online-Bibliothek erhältst. Mit über 1 Million Büchern zu über 1.000 verschiedenen Themen haben wir bestimmt alles, was du brauchst! Weitere Informationen hier.

Unterstützt Perlego Text-zu-Sprache?

Achte auf das Symbol zum Vorlesen in deinem nächsten Buch, um zu sehen, ob du es dir auch anhören kannst. Bei diesem Tool wird dir Text laut vorgelesen, wobei der Text beim Vorlesen auch grafisch hervorgehoben wird. Du kannst das Vorlesen jederzeit anhalten, beschleunigen und verlangsamen. Weitere Informationen hier.

Ist Data Engineering on Azure als Online-PDF/ePub verfügbar?

Ja, du hast Zugang zu Data Engineering on Azure von Vlad Riscutia im PDF- und/oder ePub-Format sowie zu anderen beliebten Büchern aus Informatique & Cloud Computing. Aus unserem Katalog stehen dir über 1 Million Bücher zur Verfügung.

Information

Verlag

Manning

Jahr

2021

ISBN

9781638356912

Thema

Informatique

Thema

Cloud Computing

1 Introduction

This chapter covers

Defining data engineering
Anatomy of a data platform
Benefits of the cloud
Getting started with Azure
Overview of an Azure data platform

With the advent of cloud computing, the amount of data generated every moment reached an unprecedented scale. The discipline of data science flourishes in this environment, deriving knowledge and insights from massive amounts of data. As data science becomes critical to business, its processes must be treated with the same rigor as other components of business IT. For example, software engineering teams today embrace DevOps to develop and operate services with 99.99999% availability guarantees. Data engineering brings a similar rigor to data science, so data-centric processes run reliably, smoothly, and in a compliant way.

For the past few years, I’ve had the privilege of being a software architect for Microsoft’s Customer Growth and Analytics team. Our team’s motto is “Using Azure to understand Azure.” We connect many datapoints across the Microsoft business to better understand our customers and to empower teams across the company. Privacy is important to us, so we never look at our customers’ data, but we do have access to telemetry from Azure, commercial transactions, and other operational pipelines. This gives us a unique perspective on Azure in understanding how customers can get the most value from our offerings.

As a few examples, we help marketing, sales, support, finance, operations, and business planning with key insights, while simultaneously providing operational excellence recommendations to our customers through Azure Advisor. While our data science and machine learning (ML) teams focus on the insights, our data engineering teams ensure we can operate at the scale of an Azure business with high reliability because any outage in our platform can impact our customers or our business.

Our data platform is fully built on Azure, and we are working closely with service teams to preview features and give product feedback. This book is inspired by some of our learnings over the years. The technologies presented are close to what my team uses on a day-to-day basis.

1.1 What is data engineering?

This book is about practical data engineering in a production environment, so let’s start by defining data engineering. But to define data engineering, we first need to talk about data science.

“Data is the new oil,” as the saying goes. In a connected world, more and more data is available for analysis, inference, and ML. The field of data science deals with extracting knowledge and insights from data. Many times, these insights prove invaluable to a business. Consider a scenario like the movies Netflix recommends to a customer. The better the recommendations, the more likely to retain a customer.

While many data science projects start as exploratory, once these show real value, they need to be supported in an ongoing, reliable fashion. In the software engineering world, this is the equivalent of taking a research, proof-of-concept, or hackathon project and graduating it into a fully production-ready solution. While a hack or a prototype can take many shortcuts and focus on “the meat of the problem” it addresses, a production-ready system does not cut any corners. This is where the engineering part of software engineering comes into play, providing the rigor to build and run a reliable system. This includes a plethora of concerns like architecture and design, performance, security, accessibility, telemetry, debuggability, extensibility, and so on.

Definition Data engineering is the part of data science that deals with the practical applications of collecting and analyzing data. It aims to bring engineering rigor to the process of building and supporting reliable data systems.

The ML part of data science deals with building a model. In the Netflix scenario, the data model recommends, based on your viewing history, which movies you are likely to enjoy next. The data engineering part of the discipline deals with building a system that continuously gathers and cleans up the viewing history, then runs the model at scale on the data of all users and distributes the results to the recommendation user interface. All of this provided in an automated fashion with monitoring and alerting build around each step of the process.

Data engineering deals with building and operating big data platforms to support all data science scenarios. There are various other terms used for some of these aspects: DataOps refers to moving data in a data system, MLOps refers to running ML at scale as in our Netflix example. (ML combined with DevOps is also known as MLOps.) Our definition of data engineering encompasses all of these and looks at how we can implement DevOps for data science.

1.2 Who this book is for

This is a book for data scientists, software engineers, and software architects turned data engineers and tasked with building a data platform to support analytics and/or ML at scale. You should know what the cloud is, have some experience working with data and code, and not mind using a shell. We’ll touch on the basics of all of these, but the focus for this book will be on data platform building.

Data engineering is surprisingly similar to software engineering and frustratingly different. While we can leverage a lot of the lessons from the software engineering world, as we will see in this book, there is a unique set of challenges we will have to address. Some of the common themes are making sure everything is tracked in source control, automatic deployments, monitoring, and alerting. A key difference between data and code is that code is static: once the bugs are worked out, a piece of code is expected to work consistently and reliably. On the other hand, data moves continuously into and out of a data platform, and it is likely failures will occur due to various external reasons. Governance is another major topic that is specific to data: access control, cataloguing, privacy, and regulatory concerns are a big part of a data platform.

The main theme of the book is bringing some of the lessons learned from software engineering over the past few decades to the data space so you can build a data platform exhibiting the properties of a solid software solution: scale, reliability, security, and so on. This book tackles some of these challenges, goes over patterns and best practices, and provides examples of how these could be applied in the Azure cloud. For the examples, we will use the Azure CLI (CLI stands for command-line interface), KQL (the Kusto Query Language), and a little bit of Python. The focus won’t be on the services themselves though. Instead, we will focus on data engineering challenges (and solutions) in a production environment.

1.3 What is a data platform?

Just as many data science projects start as an exploration of a data space and what insights can be derived from the data, many data science teams start in a similar exploratory fashion. A small team comes up with some good insights at first, and then as the team grows, so do the needs of the underlying platform supporting the team.

What first used to be an ad hoc process now requires automation. Once there were just two data scientists on the team, so who got to see which data was not as much of a concern as it is now, when there are 100 data scientists, some interns, and some external vendors. What used to be a monthly email is now a live system integrated with the company’s website. Different scenarios that used to be achieved through different means must now be supported by a robust data platform.

definition A data platform is a software solution for collecting, processing, managing, and sharing data for strategic business purposes.

Let’s look at an analogy to software engineering. You can write code on your laptop (for example, a web service like GIPHY) that, when given some keywords, returns a set of topical animations. Even if the code does exactly what it is meant to, that doesn’t mean it can scale to a production environment. If you want to host the same service at web scale and expect that anyone around the world can access it at any time, there is an additional set of concerns to consider: performance, scaling to millions of users, low latency, a failover solution in case things go wrong, a way to deploy an update without downtime, and so on. We can call the first part, writing code on your laptop, software development or coding. The second part, operatin...