eBook - ePub

Data Engineering with Google Cloud Platform

Name: Data Engineering with Google Cloud Platform
ISBN: 9781800565067

A practical guide to operationalizing scalable data analytics systems on GCP

Adi Wijaya

440 pages
English

About this book

Build and deploy your own data pipelines on GCP, make key architectural decisions, and gain the confidence to boost your career as a data engineer

Key Features

Understand data engineering concepts, the role of a data engineer, and the benefits of using GCP for building your solution
Learn how to use the various GCP products to ingest, consume, and transform data and orchestrate pipelines
Discover tips to prepare for and pass the Professional Data Engineer exam

Book Description

With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines, from storing and processing data and orchestrating workflows to presenting data through visualization dashboards. Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP. By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.

What you will learn

Load data into BigQuery and materialize its output for downstream consumption
Build data pipeline orchestration using Cloud Composer
Develop Airflow jobs to orchestrate and automate a data warehouse
Build a Hadoop data lake, create ephemeral clusters, and run jobs on the Dataproc cluster
Leverage Pub/Sub for messaging and ingestion for event-driven systems
Use Dataflow to perform ETL on streaming data
Unlock the power of your data with Data Studio
Calculate the GCP cost estimation for your end-to-end data solutions

Who this book is for

This book is for data engineers, data analysts, and anyone looking to design and manage data processing pipelines using GCP. You'll find this book useful if you are preparing to take Google's Professional Data Engineer exam. Beginner-level understanding of data science, the Python programming language, and Linux commands is necessary. A basic understanding of data processing and cloud computing, in general, will help you make the most out of this book.

Publisher: Packt Publishing
Year: 2022
Topic: Computer Science
eBook ISBN: 9781800565067
Subtopic: Data Mining
Index: Computer Science

Section 1: Getting Started with Data Engineering with GCP

This section covers the purpose, value, and concepts of big data and cloud computing, and how GCP products are relevant to data engineering. You will learn about a data engineer's core responsibilities, how they differ from those of a data scientist, and how to facilitate the flow of data through an organization to derive insights.

This section comprises the following chapters:

Chapter 1, Fundamentals of Data Engineering
Chapter 2, Big Data Capabilities on GCP

Chapter 1: Fundamentals of Data Engineering

Years ago, when I first entered the data science world, I used to think data was clean: ready to use, available in one place, and waiting for fun data science work. I was so excited to experiment with machine learning models, find unusual patterns in data, and play around with clean data. But after years of experience working with data, I realized that data science in big organizations isn't straightforward.

Eighty percent of the effort goes into collecting, cleaning, and transforming the data. If you have had any experience in working with data, I am sure you've noticed something similar. But the good news is that almost all of these processes can be automated with proper planning, design, and engineering skills. That was the point where I realized that data engineering would be the most critical role in the data science world from that day forward.

To develop a successful data ecosystem in any organization, the most crucial part is how they design the data architecture. If the organization fails to make the best decision on the data architecture, the future process will be painful. Here are some common examples: the system is not scalable, querying data is slow, business users don't trust your data, the infrastructure cost is very high, and data is leaked. There is so much more that can go wrong without proper data engineering practice.

In this chapter, we are going to learn the fundamental knowledge behind data engineering. The goal is to introduce you to common terms that are often used in this field and will be mentioned often in the later chapters.

In particular, we will be covering the following topics:

Understanding the data life cycle
Know the roles of a data engineer before starting
Foundational concepts for data engineering

Understanding the data life cycle

The first principle to learn to become a data engineer is understanding the data life cycle. If you've worked with data, you know that data doesn't stay in one place; it moves from one storage system to another and from one database to others. Understanding the data life cycle means being able to answer these sorts of questions whenever you want to display information to your end user:

Who will consume the data?
What data sources should I use?
Where should I store the data?
When should the data arrive?
Why does the data need to be stored in this place?
How should the data be processed?

To answer all those questions, we'll start by looking back a little bit at the history of data technologies.

Understanding the need for a data warehouse

The data warehouse is not a new concept; I believe you've at least heard of it. In fact, the term is no longer appealing. In my experience, no one gets excited when talking about data warehouses in the 2020s, especially when compared to terms such as big data, cloud computing, and artificial intelligence.

So, why do we need to know about data warehouses? Because almost every data engineering challenge, from the old days to today, is conceptually the same. The challenge is always about moving data from the data source to other environments so the business can use it to get information. What changes over time is only the how, as newer technologies appear. If we understand why people needed data warehouses in the past, we will have a better foundation for understanding the data engineering space and, more specifically, the data life cycle.

Data warehouses were first developed in the 1980s to transform data from operational systems to decision-making support systems. The key principle of a data warehouse is combining data from many different sources into a single location and then transforming it into a format the data warehouse can process and store.

For example, in the financial industry, say a bank wants to know how many credit card customers also have mortgages. It is a simple enough question, yet it's not that easy to answer. Why?

Most traditional banks that I have worked with had a different operational system for each of their products, including a specific system for credit cards and specific systems for mortgages, savings products, websites, customer service, and many others. So, in order to answer the question, data from multiple systems needs to be stored in one place first.

See the following diagram on how each department is independent:

Figure 1.1 – Data silos

Often, independence applies not only to the organization structure but also to the data. Data that is isolated in different places like this is called a data silo. This is very common in large organizations where each department has different goals, responsibilities, and priorities.

In summary, what we need to understand from the data warehouse concept is the following:

Data silos have always occurred in large organizations, even back in the 1980s.
Data comes from many operational systems.
In order to process the data, we need to store the data in one place.
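
The bank question from earlier (how many credit card customers also have mortgages?) can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the two lists stand in for extracts from the separate credit card and mortgage systems, an in-memory SQLite database stands in for the "one place", and every table name, column name, and customer record is invented for the example. It also assumes both systems share the same customer ID, which is often not true in practice.

```python
import sqlite3

# Hypothetical extracts from two independent systems (the data silos).
credit_card_system = [(1, "Ana"), (2, "Budi"), (3, "Citra")]
mortgage_system = [(2, "Budi"), (3, "Citra"), (4, "Dewi")]


def count_customers_with_both(cards, mortgages):
    """Load both extracts into one database and count customers in both."""
    conn = sqlite3.connect(":memory:")  # the single location
    try:
        conn.execute(
            "CREATE TABLE credit_card_customers (customer_id INTEGER, name TEXT)"
        )
        conn.execute(
            "CREATE TABLE mortgage_customers (customer_id INTEGER, name TEXT)"
        )
        conn.executemany("INSERT INTO credit_card_customers VALUES (?, ?)", cards)
        conn.executemany("INSERT INTO mortgage_customers VALUES (?, ?)", mortgages)
        # Once the data sits side by side, the question becomes a simple join.
        row = conn.execute(
            """
            SELECT COUNT(DISTINCT c.customer_id)
            FROM credit_card_customers AS c
            JOIN mortgage_customers AS m ON m.customer_id = c.customer_id
            """
        ).fetchone()
        return row[0]
    finally:
        conn.close()


both = count_customers_with_both(credit_card_system, mortgage_system)
print(both)
```

The join itself is trivial; the hard part in a real organization is the step the sketch skips, which is getting reliable extracts out of each system and agreeing on a common customer identifier.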

What does a typical data warehouse stack look like?

This diagram represents the four logical building blocks in a data warehouse, which are Storage, Compute, Schema, and SQL Interface:

Figure 1.2 – Data warehouse main components

Data warehouse products can mostly store and process data seamlessly, and the user can use SQL to access the data in tables with a structured schema. This is basic knowledge, but an important point to be aware of is that the four logical building blocks in a data warehouse are designed as one monolithic piece of software. That design evolved over the following years, and its evolution was the start of the data lake.
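
To make the four building blocks concrete, here is a toy sketch in Python. SQLite is not a data warehouse and is used here only as a small stand-in that happens to bundle the same four pieces in one product; the table, its columns, and the sample rows are all invented for illustration.

```python
import sqlite3

# Storage: the database (here in memory) is where the rows live.
conn = sqlite3.connect(":memory:")

# Schema: the table structure is declared before any data is loaded.
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, product TEXT, amount REAL)"
)

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "credit_card", 120.0),
        (2, "credit_card", 80.0),
        (2, "mortgage", 1500.0),
    ],
)

# SQL interface: the user describes the result they want in SQL.
# Compute: the engine scans, groups, and sums the rows to produce it.
totals = conn.execute(
    "SELECT product, SUM(amount) FROM transactions "
    "GROUP BY product ORDER BY product"
).fetchall()
conn.close()

for product, total in totals:
    print(product, total)
```

All four pieces ship inside the same library here, which mirrors the monolithic design described above: storage, compute, schema, and the SQL interface cannot be swapped out independently.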

Getting familiar with the differences between a data warehouse and a data lake

Fast forward to 2008, when an open source data technology named Hadoop was first published, and people started to use the data lake terminology. If you try to find the definition of data lake on the internet, it will mostly be described as a centralized repository that allows you to store all your structured and unstructured data.

So, what is the difference between a data lake and a data warehouse? Both share the idea of storing data in centralized storage. Is it simply that a data lake stores unstructured data and a data warehouse doesn't?

What if ...

Data Engineering with Google Cloud Platform
Contributors
Preface
Section 1: Getting Started with Data Engineering with GCP
Chapter 1: Fundamentals of Data Engineering
Chapter 2: Big Data Capabilities on GCP
Section 2: Building Solutions with GCP Components
Chapter 3: Building a Data Warehouse in BigQuery
Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer
Chapter 5: Building a Data Lake Using Dataproc
Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio
Chapter 8: Building Machine Learning Solutions on Google Cloud Platform
Section 3: Key Strategies for Architecting Top-Notch Data Pipelines
Chapter 9: User and Project Management in GCP
Chapter 10: Cost Strategy in GCP
Chapter 11: CI/CD on Google Cloud Platform for Data Engineers
Chapter 12: Boosting Your Confidence as a Data Engineer
Other Books You May Enjoy
