Distributed Data Systems with Azure Databricks
eBook - ePub

Create, deploy, and manage enterprise data pipelines

Alan Bernardo Palacio

  • 414 pages
  • English

About This Book

Quickly build and deploy massive data pipelines and improve productivity using Azure Databricks

Key Features

  • Get to grips with the distributed training and deployment of machine learning and deep learning models
  • Learn how ETLs are integrated with Azure Data Factory and Delta Lake
  • Explore deep learning and machine learning models in a distributed computing infrastructure

Book Description

Microsoft Azure Databricks helps you to harness the power of distributed computing and apply it to create robust data pipelines, along with training and deploying machine learning and deep learning models. Databricks' advanced features enable developers to process, transform, and explore data. Distributed Data Systems with Azure Databricks will help you to put your knowledge of Databricks to work to create big data pipelines.

The book provides a hands-on approach to implementing Azure Databricks and its associated methodologies that will make you productive in no time. Complete with detailed explanations of essential concepts, practical examples, and self-assessment questions, you'll begin with a quick introduction to Databricks core functionalities, before performing distributed model training and inference using TensorFlow and Spark MLlib. As you advance, you'll explore MLflow Model Serving on Azure Databricks and implement distributed training pipelines using HorovodRunner in Databricks.

Finally, you'll discover how to transform, use, and obtain insights from massive amounts of data to train predictive models and create complete, fully working data pipelines. By the end of this MS Azure book, you'll have gained a solid understanding of how to work with Databricks to create and manage an entire big data pipeline.

What you will learn

  • Create ETLs for big data in Azure Databricks
  • Train, manage, and deploy machine learning and deep learning models
  • Integrate Databricks with Azure Data Factory for extract, transform, load (ETL) pipeline creation
  • Discover how to use Horovod for distributed deep learning
  • Find out how to use Delta Engine to query and process data from Delta Lake
  • Understand how to use Data Factory in combination with Databricks
  • Use Structured Streaming in a production-like environment

Who this book is for

This book is for software engineers, machine learning engineers, data scientists, and data engineers who are new to Azure Databricks and want to build high-quality data pipelines without worrying about infrastructure. Knowledge of Azure Databricks basics is required to learn the concepts covered in this book more effectively. A basic understanding of machine learning concepts and beginner-level Python programming knowledge is also recommended.


Section 1: Introducing Databricks

This section introduces Databricks to new users and discusses its functionalities, as well as the advantages it offers when dealing with massive amounts of data.
This section contains the following chapters:
  • Chapter 1, Introduction to Azure Databricks
  • Chapter 2, Creating an Azure Databricks Workspace

Chapter 1: Introduction to Azure Databricks

Modern information systems work with massive amounts of data, arriving in a constant flow that grows every day at an exponential rate. This flow comes from different sources, including sales information, transactional data, social media, and more. To build applications that extract value from this data, organizations have to run it through processes that include transformation and aggregation.
Apache Spark was developed to process data at this scale. Azure Databricks is built on top of Apache Spark: it abstracts away most of the complexity of running Spark and adds the benefits of integration with other Azure services. This book aims to provide an introduction to Azure Databricks and to explore its applications in modern data pipelines that transform, visualize, and extract insights from large amounts of data in a distributed computation environment.
In this introductory chapter, we will explore these topics:
  • Introducing Apache Spark
  • Introducing Azure Databricks
  • Discovering core concepts and terminology
  • Interacting with the Azure Databricks workspace
  • Using Azure Databricks notebooks
  • Exploring data management
  • Exploring computation management
  • Exploring authentication and authorization
These concepts will later help us understand every aspect of how our jobs execute in Azure Databricks and move easily between all its assets.

Technical requirements

To understand the topics presented in this book, you must be familiar with data science and data engineering terms and have a good understanding of Python, which is the main programming language used in this book. We will also use SQL to query views and tables.
To execute the steps in this section and in the rest of the book, you will need an Azure account with an active subscription. Bear in mind that this is a paid service, so you will have to enter your credit card details to create an account. When you create a new account, you will receive a certain amount of free credit, but some options are limited to premium users. Always remember to stop all services when you are not using them.

Introducing Apache Spark

Apache Spark was created to work with the huge amounts of information that modern systems produce. It is a distributed, cluster-based computing system and a highly popular framework for big data. It offers speed and ease of use, and includes APIs that support the following use cases:
  • Easy cluster management
  • Data integration and ETL procedures
  • Interactive advanced analytics
  • ML and deep learning
  • Real-time data processing
It can run very quickly on large datasets thanks to its in-memory processing design, which keeps disk read/write operations to a minimum. It has a SQL-like interface, and its object-oriented design makes code easy to understand and write. It also has a large support community.
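To make the idea of partitioned, in-memory processing more concrete, the following is a toy sketch that uses only the Python standard library. It is not Spark's API and not code from the book: the function names (`split_into_partitions`, `count_by_source`, `distributed_count`) and the sample events are hypothetical, and a thread pool on one machine stands in for the workers of a cluster. It only illustrates the general pattern of splitting data into partitions, aggregating each partition in memory, and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into roughly equal chunks, one per (hypothetical) worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_by_source(partition):
    """Per-partition step: aggregate locally, entirely in memory."""
    return Counter(record["source"] for record in partition)


def distributed_count(records, num_partitions=3):
    """Run the per-partition step in parallel, then merge the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_by_source, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return dict(total)


# Made-up sample events standing in for a much larger dataset.
events = [
    {"source": "sales", "amount": 120.0},
    {"source": "social", "amount": 0.0},
    {"source": "sales", "amount": 80.0},
    {"source": "transactions", "amount": 42.5},
    {"source": "sales", "amount": 15.0},
]

counts = distributed_count(events)
print(counts)
```

In a real cluster, the partitions would live on different machines and the framework would handle scheduling and failures. The sketch keeps everything in one process so that the split, aggregate, and merge steps are easy to follow.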
Despite its numerous benefits, Apache Spark has its limitations. These limitations include the following:
  • Users need to provide a database infrastructure to store the information to work with.
  • The in-memory processing feature makes it fast to run, but also implies that it has high memory requirements.
  • It isn't well suited for analytics with strict real-time requirements.
  • It has an inherent complexity with a significant learning curve.
  • Because of its open source nature, it lacks dedicated training and customer support.
Let's look at the solution to these issues: Azure Databricks.

Introducing Azure Databricks

Databricks was designed with these and other limitations in mind. It is a cloud-based platform that uses Apache Spark as a backend and builds on top of it to add features including the following:
  • Highly reliable data pipelines
  • Data science at scale
  • ...
