eBook - ePub

Distributed Data Systems with Azure Databricks

Name: Distributed Data Systems with Azure Databricks
Author: Alan Bernardo Palacio

Create, deploy, and manage enterprise data pipelines

Alan Bernardo Palacio

414 pages
English
ePUB (mobile friendly)
Available on iOS & Android

About This Book

Quickly build and deploy massive data pipelines and improve productivity using Azure Databricks

Key Features

Get to grips with the distributed training and deployment of machine learning and deep learning models
Learn how ETLs are integrated with Azure Data Factory and Delta Lake
Explore deep learning and machine learning models in a distributed computing infrastructure

Book Description

Microsoft Azure Databricks helps you to harness the power of distributed computing and apply it to create robust data pipelines, along with training and deploying machine learning and deep learning models. Databricks' advanced features enable developers to process, transform, and explore data. Distributed Data Systems with Azure Databricks will help you to put your knowledge of Databricks to work to create big data pipelines.

The book provides a hands-on approach to implementing Azure Databricks and its associated methodologies that will make you productive in no time. Complete with detailed explanations of essential concepts, practical examples, and self-assessment questions, you'll begin with a quick introduction to Databricks' core functionalities, before performing distributed model training and inference using TensorFlow and Spark MLlib. As you advance, you'll explore MLflow Model Serving on Azure Databricks and implement distributed training pipelines using HorovodRunner in Databricks.

Finally, you'll discover how to transform, use, and obtain insights from massive amounts of data to train predictive models and create complete, fully working data pipelines. By the end of this MS Azure book, you'll have gained a solid understanding of how to work with Databricks to create and manage an entire big data pipeline.

What you will learn

Create ETLs for big data in Azure Databricks
Train, manage, and deploy machine learning and deep learning models
Integrate Databricks with Azure Data Factory for extract, transform, load (ETL) pipeline creation
Discover how to use Horovod for distributed deep learning
Find out how to use Delta Engine to query and process data from Delta Lake
Understand how to use Data Factory in combination with Databricks
Use Structured Streaming in a production-like environment

Who this book is for

This book is for software engineers, machine learning engineers, data scientists, and data engineers who are new to Azure Databricks and want to build high-quality data pipelines without worrying about infrastructure. Knowledge of Azure Databricks basics is required to learn the concepts covered in this book more effectively. A basic understanding of machine learning concepts and beginner-level Python programming knowledge is also recommended.

Frequently asked questions

Is Distributed Data Systems with Azure Databricks an online PDF/ePUB?

Yes, you can access Distributed Data Systems with Azure Databricks by Alan Bernardo Palacio in PDF and/or ePUB format, as well as other popular books in Computer Science & Enterprise Applications. We have over one million books available in our catalogue for you to explore.

Information

Publisher

Packt Publishing

Year

2021

ISBN

9781838642693

Topic

Computer Science

Subtopic

Enterprise Applications

Index

Computer Science

Section 1: Introducing Databricks

This section introduces Databricks to new users and discusses its functionality, as well as the advantages it offers when dealing with massive amounts of data.

This section contains the following chapters:

Chapter 1, Introduction to Azure Databricks
Chapter 2, Creating an Azure Databricks Workspace

Chapter 1: Introduction to Azure Databricks

Modern information systems work with massive amounts of data, with a constant flow that increases every day at an exponential rate. This flow comes from different sources, including sales information, transactional data, social media, and more. Organizations have to work with this information in processes that include transformation and aggregation to develop applications that seek to extract value from this data.

Apache Spark was developed to process this massive amount of data. Azure Databricks is built on top of Apache Spark: it abstracts away most of the complexity of implementing Spark and adds the benefits of integration with other Azure services. This book aims to provide an introduction to Azure Databricks and explore its applications in modern data pipelines to transform, visualize, and extract insights from large amounts of data in a distributed computation environment.

In this introductory chapter, we will explore these topics:

Introducing Apache Spark
Introducing Azure Databricks
Discovering core concepts and terminology
Interacting with the Azure Databricks workspace
Using Azure Databricks notebooks
Exploring data management
Exploring computation management
Exploring authentication and authorization

These concepts will help us to later understand all of the aspects of the execution of our jobs in Azure Databricks and to move easily between all its assets.

Technical requirements

To understand the topics presented in this book, you must be familiar with data science and data engineering terms, and have a good understanding of Python, which is the main programming language used in this book, although we will also use SQL to make queries on views and tables.

In terms of the resources required, to execute the steps in this section and those presented in this book, you will need an Azure account as well as an active subscription. Bear in mind that this is a paid service, so you will have to enter your credit card details to create an account. When you create a new account, you will receive a certain amount of free credit, but some options are limited to premium users. Always remember to stop all the services when you are not using them.

Introducing Apache Spark

Apache Spark was created to work with the huge amount of information available to modern consumers. It is a distributed, cluster-based computing system and a highly popular framework for big data, with capabilities that provide speed and ease of use. It includes APIs that support the following use cases:

Easy cluster management
Data integration and ETL procedures
Interactive advanced analytics
ML and deep learning
Real-time data processing

Spark can run very quickly on large datasets thanks to its in-memory processing design, which allows it to run with very few read/write disk operations. It has a SQL-like interface, and its object-oriented design makes it easy to understand and write code for; it also has a large support community.
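To make the idea of distributed, in-memory processing a little more concrete, the following is a minimal sketch that uses only the Python standard library. It is not Spark code and does not use any Spark API; it is a toy analogy in which the dataset is split into partitions, each partition is processed independently in memory, and the partial results are combined at the end. The function names and the sample sales records are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Split a list into contiguous chunks that play the role of partitions."""
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def partial_sum(partition):
    """Work done independently on one partition: sum the sale amounts."""
    return sum(amount for _, amount in partition)


def distributed_total(records, num_partitions=4):
    """Process every partition in parallel, then combine the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for cluster workers here; this is only an analogy.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return reduce(lambda a, b: a + b, partials, 0)


# Hypothetical sales records in the form (region, amount).
sales = [("north", 10), ("south", 20), ("north", 5), ("east", 15), ("south", 50)]
total = distributed_total(sales, num_partitions=2)
print(total)  # 100
```

In a real cluster, the partitions would live in the memory of different machines rather than in threads of one process, which is where both the speed and the high memory requirements discussed in this section come from.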

Despite its numerous benefits, Apache Spark has its limitations. These limitations include the following:

Users need to provide a database infrastructure to store the information to work with.
The in-memory processing feature makes it fast to run, but also implies that it has high memory requirements.
It isn't well suited for real-time analytics.
It has an inherent complexity with a significant learning curve.
Because of its open source nature, it lacks dedicated training and customer support.

Let's look at a solution to these issues: Azure Databricks.

Introducing Azure Databricks

Databricks was designed with these and other limitations in mind. It is a cloud-based platform that uses Apache Spark as a backend and builds on top of it to add features including the following:

Highly reliable data pipelines
Data science at scale

Distributed Data Systems with Azure Databricks
Contributors
Preface
Section 1: Introducing Databricks
Chapter 1: Introduction to Azure Databricks
Chapter 2: Creating an Azure Databricks Workspace
Section 2: Data Pipelines with Databricks
Chapter 3: Creating ETL Operations with Azure Databricks
Chapter 4: Delta Lake with Azure Databricks
Chapter 5: Introducing Delta Engine
Chapter 6: Introducing Structured Streaming
Section 3: Machine and Deep Learning with Databricks
Chapter 7: Using Python Libraries in Azure Databricks
Chapter 8: Databricks Runtime for Machine Learning
Chapter 9: Databricks Runtime for Deep Learning
Chapter 10: Model Tracking and Tuning in Azure Databricks
Chapter 11: Managing and Serving Models with MLflow and MLeap
Chapter 12: Distributed Deep Learning in Azure Databricks
Other Books You May Enjoy