eBook - ePub

Data Engineering on Azure

Name: Data Engineering on Azure
Author: Vlad Riscutia

Vlad Riscutia

Share book

336 pages
English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Data Engineering on Azure

Vlad Riscutia

Book details

Book preview

Table of contents

Citations

About This Book

Build a data platform to the industry-leading standards set by Microsoft's own infrastructure. Summary
In Data Engineering on Azure you will learn how to: Pick the right Azure services for different data scenarios
Manage data inventory
Implement production quality data modeling, analytics, and machine learning workloads
Handle data governance
Using DevOps to increase reliability
Ingesting, storing, and distributing data
Apply best practices for compliance and access control Data Engineering on Azure reveals the data management patterns and techniques that support Microsoft's own massive data infrastructure. Author Vlad Riscutia, a data engineer at Microsoft, teaches you to bring an engineering rigor to your data platform and ensure that your data prototypes function just as well under the pressures of production. You'll implement common data modeling patterns, stand up cloud-native data platforms on Azure, and get to grips with DevOps for both analytics and machine learning. Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications. About the technology
Build secure, stable data platforms that can scale to loads of any size. When a project moves from the lab into production, you need confidence that it can stand up to real-world challenges. This book teaches you to design and implement cloud-based data infrastructure that you can easily monitor, scale, and modify. About the book
In Data Engineering on Azure you'll learn the skills you need to build and maintain big data platforms in massive enterprises. This invaluable guide includes clear, practical guidance for setting up infrastructure, orchestration, workloads, and governance. As you go, you'll set up efficient machine learning pipelines, and then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms. What's inside Data inventory and data governance
Assure data quality, compliance, and distribution
Build automated pipelines to increase reliability
Ingest, store, and distribute data
Production-quality data modeling, analytics, and machine learning About the reader
For data engineers familiar with cloud computing and DevOps. About the author
Vlad Riscutia is a software architect at Microsoft. Table of Contents 1 Introduction
PART 1 INFRASTRUCTURE
2 Storage
3 DevOps
4 Orchestration
PART 2 WORKLOADS
5 Processing
6 Analytics
7 Machine learning
PART 3 GOVERNANCE
8 Metadata
9 Data quality
10 Compliance
11 Distributing data

Frequently asked questions

How do I cancel my subscription?

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.

Can/how do I download books?

At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.

What is the difference between the pricing plans?

Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.

What is Perlego?

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.

Do you support text-to-speech?

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.

Is Data Engineering on Azure an online PDF/ePUB?

Yes, you can access Data Engineering on Azure by Vlad Riscutia in PDF and/or ePUB format, as well as other popular books in Informatique & Cloud Computing. We have over one million books available in our catalogue for you to explore.

Information

Publisher

Manning

Year

2021

ISBN

9781638356912

Topic

Informatique

Subtopic

Cloud Computing

1 Introduction

This chapter covers

Defining data engineering
Anatomy of a data platform
Benefits of the cloud
Getting started with Azure
Overview of an Azure data platform

With the advent of cloud computing, the amount of data generated every moment reached an unprecedented scale. The discipline of data science flourishes in this environment, deriving knowledge and insights from massive amounts of data. As data science becomes critical to business, its processes must be treated with the same rigor as other components of business IT. For example, software engineering teams today embrace DevOps to develop and operate services with 99.99999% availability guarantees. Data engineering brings a similar rigor to data science, so data-centric processes run reliably, smoothly, and in a compliant way.

For the past few years, I’ve had the privilege of being a software architect for Microsoft’s Customer Growth and Analytics team. Our team’s motto is “Using Azure to understand Azure.” We connect many datapoints across the Microsoft business to better understand our customers and to empower teams across the company. Privacy is important to us, so we never look at our customers’ data, but we do have access to telemetry from Azure, commercial transactions, and other operational pipelines. This gives us a unique perspective on Azure in understanding how customers can get the most value from our offerings.

As a few examples, we help marketing, sales, support, finance, operations, and business planning with key insights, while simultaneously providing operational excellence recommendations to our customers through Azure Advisor. While our data science and machine learning (ML) teams focus on the insights, our data engineering teams ensure we can operate at the scale of an Azure business with high reliability because any outage in our platform can impact our customers or our business.

Our data platform is fully built on Azure, and we are working closely with service teams to preview features and give product feedback. This book is inspired by some of our learnings over the years. The technologies presented are close to what my team uses on a day-to-day basis.

1.1 What is data engineering?

This book is about practical data engineering in a production environment, so let’s start by defining data engineering. But to define data engineering, we first need to talk about data science.

“Data is the new oil,” as the saying goes. In a connected world, more and more data is available for analysis, inference, and ML. The field of data science deals with extracting knowledge and insights from data. Many times, these insights prove invaluable to a business. Consider a scenario like the movies Netflix recommends to a customer. The better the recommendations, the more likely to retain a customer.

While many data science projects start as exploratory, once these show real value, they need to be supported in an ongoing, reliable fashion. In the software engineering world, this is the equivalent of taking a research, proof-of-concept, or hackathon project and graduating it into a fully production-ready solution. While a hack or a prototype can take many shortcuts and focus on “the meat of the problem” it addresses, a production-ready system does not cut any corners. This is where the engineering part of software engineering comes into play, providing the rigor to build and run a reliable system. This includes a plethora of concerns like architecture and design, performance, security, accessibility, telemetry, debuggability, extensibility, and so on.

Definition Data engineering is the part of data science that deals with the practical applications of collecting and analyzing data. It aims to bring engineering rigor to the process of building and supporting reliable data systems.

The ML part of data science deals with building a model. In the Netflix scenario, the data model recommends, based on your viewing history, which movies you are likely to enjoy next. The data engineering part of the discipline deals with building a system that continuously gathers and cleans up the viewing history, then runs the model at scale on the data of all users and distributes the results to the recommendation user interface. All of this provided in an automated fashion with monitoring and alerting build around each step of the process.

Data engineering deals with building and operating big data platforms to support all data science scenarios. There are various other terms used for some of these aspects: DataOps refers to moving data in a data system, MLOps refers to running ML at scale as in our Netflix example. (ML combined with DevOps is also known as MLOps.) Our definition of data engineering encompasses all of these and looks at how we can implement DevOps for data science.

1.2 Who this book is for

This is a book for data scientists, software engineers, and software architects turned data engineers and tasked with building a data platform to support analytics and/or ML at scale. You should know what the cloud is, have some experience working with data and code, and not mind using a shell. We’ll touch on the basics of all of these, but the focus for this book will be on data platform building.

Data engineering is surprisingly similar to software engineering and frustratingly different. While we can leverage a lot of the lessons from the software engineering world, as we will see in this book, there is a unique set of challenges we will have to address. Some of the common themes are making sure everything is tracked in source control, automatic deployments, monitoring, and alerting. A key difference between data and code is that code is static: once the bugs are worked out, a piece of code is expected to work consistently and reliably. On the other hand, data moves continuously into and out of a data platform, and it is likely failures will occur due to various external reasons. Governance is another major topic that is specific to data: access control, cataloguing, privacy, and regulatory concerns are a big part of a data platform.

The main theme of the book is bringing some of the lessons learned from software engineering over the past few decades to the data space so you can build a data platform exhibiting the properties of a solid software solution: scale, reliability, security, and so on. This book tackles some of these challenges, goes over patterns and best practices, and provides examples of how these could be applied in the Azure cloud. For the examples, we will use the Azure CLI (CLI stands for command-line interface), KQL (the Kusto Query Language), and a little bit of Python. The focus won’t be on the services themselves though. Instead, we will focus on data engineering challenges (and solutions) in a production environment.

1.3 What is a data platform?

Just as many data science projects start as an exploration of a data space and what insights can be derived from the data, many data science teams start in a similar exploratory fashion. A small team comes up with some good insights at first, and then as the team grows, so do the needs of the underlying platform supporting the team.

What first used to be an ad hoc process now requires automation. Once there were just two data scientists on the team, so who got to see which data was not as much of a concern as it is now, when there are 100 data scientists, some interns, and some external vendors. What used to be a monthly email is now a live system integrated with the company’s website. Different scenarios that used to be achieved through different means must now be supported by a robust data platform.

definition A data platform is a software solution for collecting, processing, managing, and sharing data for strategic business purposes.

Let’s look at an analogy to software engineering. You can write code on your laptop (for example, a web service like GIPHY) that, when given some keywords, returns a set of topical animations. Even if the code does exactly what it is meant to, that doesn’t mean it can scale to a production environment. If you want to host the same service at web scale and expect that anyone around the world can access it at any time, there is an additional set of concerns to consider: performance, scaling to millions of users, low latency, a failover solution in case things go wrong, a way to deploy an update without downtime, and so on. We can call the first part, writing code on your laptop, software development or coding. The second part, operatin...