Data Engineering with Google Cloud Platform

Adi Wijaya

About This Book

Build and deploy your own data pipelines on GCP, make key architectural decisions, and gain the confidence to boost your career as a data engineer.

Key Features
  • Understand data engineering concepts, the role of a data engineer, and the benefits of using GCP for building your solution
  • Learn how to use the various GCP products to ingest, consume, and transform data and orchestrate pipelines
  • Discover tips to prepare for and pass the Professional Data Engineer exam

Book Description
With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines, from storing and processing data and orchestrating workflows to presenting data through visualization dashboards. Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP. By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.

What You Will Learn
  • Load data into BigQuery and materialize its output for downstream consumption
  • Build data pipeline orchestration using Cloud Composer
  • Develop Airflow jobs to orchestrate and automate a data warehouse
  • Build a Hadoop data lake, create ephemeral clusters, and run jobs on the Dataproc cluster
  • Leverage Pub/Sub for messaging and ingestion in event-driven systems
  • Use Dataflow to perform ETL on streaming data
  • Unlock the power of your data with Data Studio
  • Calculate the GCP cost estimation for your end-to-end data solutions

Who This Book Is For
This book is for data engineers, data analysts, and anyone looking to design and manage data processing pipelines using GCP. You'll find this book useful if you are preparing to take Google's Professional Data Engineer exam. A beginner-level understanding of data science, the Python programming language, and Linux commands is necessary. A basic understanding of data processing and cloud computing in general will help you make the most out of this book.


Information

Year: 2022
ISBN: 9781800565067

Section 1: Getting Started with Data Engineering with GCP

This part covers the purpose, value, and concepts of big data and cloud computing, and how GCP products are relevant to data engineering. You will learn about a data engineer's core responsibilities, how they differ from those of a data scientist, and how to facilitate the flow of data through an organization to derive insights.
This section comprises the following chapters:
  • Chapter 1, Fundamentals of Data Engineering
  • Chapter 2, Big Data Capabilities on GCP

Chapter 1: Fundamentals of Data Engineering

Years ago, when I first entered the data science world, I used to think data was clean: ready to use, available in one place, and prepared for fun data science purposes. I was so excited to experiment with machine learning models, find unusual patterns in data, and play around with clean data. But after years of experience working with data, I realized that data science in big organizations isn't that straightforward.
Eighty percent of the effort goes into collecting, cleaning, and transforming the data. If you have any experience working with data, I am sure you've noticed something similar. But the good news is, we know that almost all of those processes can be automated with proper planning, design, and engineering skills. That was the point where I realized that data engineering would be the most critical role for the future of the data science world.
To develop a successful data ecosystem in any organization, the most crucial part is how the data architecture is designed. If the organization fails to make the best decisions on data architecture, every downstream process will be painful. Here are some common examples: the system doesn't scale, querying data is slow, business users don't trust your data, infrastructure costs are very high, and data gets leaked. So much more can go wrong without proper data engineering practice.
In this chapter, we are going to cover the fundamentals of data engineering. The goal is to introduce the common terms that are used in this field and that will be mentioned often in later chapters.
In particular, we will be covering the following topics:
  • Understanding the data life cycle
  • Knowing the roles of a data engineer before starting
  • Foundational concepts for data engineering

Understanding the data life cycle

The first principle to learn as a data engineer is understanding the data life cycle. If you've worked with data, you know that data doesn't stay in one place; it moves from one storage system to another, from one database to other databases. Understanding the data life cycle means being able to answer these sorts of questions when you want to display information to your end user:
  • Who will consume the data?
  • What data sources should I use?
  • Where should I store the data?
  • When should the data arrive?
  • Why does the data need to be stored in this place?
  • How should the data be processed?
To answer all those questions, we'll start by looking back a little bit at the history of data technologies.

Understanding the need for a data warehouse

The data warehouse is not a new concept; I believe you've at least heard of it. In fact, the terminology is no longer appealing. In my experience, no one gets excited when talking about data warehouses in the 2020s, especially compared to terms such as big data, cloud computing, and artificial intelligence.
So, why do we need to know about data warehouses? Because almost every data engineering challenge, from the old times to the present day, is conceptually the same. The challenge is always about moving data from the data source to other environments so that the business can use it to get information. What differs from time to time is only the how, as newer technologies become available. If we understand why people needed data warehouses back then, we will have a better foundation for understanding the data engineering space and, more specifically, the data life cycle.
Data warehouses were first developed in the 1980s to transform data from operational systems into decision-making support systems. The key principle of a data warehouse is combining data from many different sources into a single location and then transforming it into a format the data warehouse can process and store.
For example, in the financial industry, say a bank wants to know how many credit card customers also have mortgages. It is a simple enough question, yet it's not that easy to answer. Why?
Most traditional banks that I have worked with had different operational systems for each of their products: a specific system for credit cards and separate systems for mortgages, savings products, the website, customer service, and many more. So, in order to answer the question, data from multiple systems first needs to be stored in one place.
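Once those records sit together in one warehouse, the bank's question reduces to a single join. The following is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical placeholders, not from the book:

from google.cloud import bigquery

# Assumes credentials are configured and that the credit card and mortgage
# tables have already been loaded into the warehouse from their source systems.
client = bigquery.Client(project="my-bank-project")

query = """
SELECT COUNT(DISTINCT cc.customer_id) AS customers_with_both
FROM `my-bank-project.warehouse.credit_card_accounts` AS cc
JOIN `my-bank-project.warehouse.mortgage_accounts` AS m
  ON cc.customer_id = m.customer_id
"""

# The warehouse executes the join; we simply read back the single result row.
for row in client.query(query).result():
    print(f"Credit card customers who also have a mortgage: {row.customers_with_both}")

The hard part, as we'll see throughout this book, is everything that happens before this query is possible.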
Before the data is brought together, each department's system is independent, as the following diagram shows:
Figure 1.1 – Data silos
Often, this independence applies not only to the organizational structure but also to the data. When data is located in different, disconnected places, we call them data silos. This is very common in large organizations where each department has different goals, responsibilities, and priorities.
In summary, what we need to understand from the data warehouse concept is the following:
  • Data silos have always occurred in large organizations, even back in the 1980s.
  • Data comes from many different operational systems.
  • In order to process the data, we need to store the data in one place.
What does a typical data warehouse stack look like?
The following diagram represents the four logical building blocks in a data warehouse, which are Storage, Compute, Schema, and SQL Interface:
Figure 1.2 – Data warehouse main components
Most data warehouse products can store and process data seamlessly, and users can access the data in tables with a structured schema format using the SQL language. This is basic knowledge, but an important point to be aware of is that the four logical building blocks in a data warehouse are designed as one monolithic piece of software; how this design evolved over the later years is what led to the data lake.
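To make the four blocks concrete, here is a minimal sketch of how each one surfaces through BigQuery's Python client; as before, the project, dataset, and table names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-bank-project")

# Schema: a structured definition of the table's columns.
schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("product", "STRING"),
    bigquery.SchemaField("balance", "NUMERIC"),
]

# Storage: the managed table that will physically hold the data.
table = client.create_table(
    bigquery.Table("my-bank-project.warehouse.accounts", schema=schema)
)

# SQL Interface: plain SQL over the structured schema. Compute is the
# capacity the service allocates behind the scenes to run the query job.
job = client.query(
    "SELECT product, COUNT(*) AS n "
    "FROM `my-bank-project.warehouse.accounts` GROUP BY product"
)
for row in job.result():
    print(row.product, row.n)

Note that a classic on-premises warehouse ships all four blocks as one monolithic product, whereas BigQuery presents them through one interface while managing storage and compute separately underneath; that decoupling is part of the story that leads to the data lake.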

Getting familiar with the differences between a data warehouse and a data lake

Fast forward to 2008, when an open source data technology named Hadoop was first published, and people started to use the term data lake. If you try to find the definition of a data lake on the internet, it will mostly be described as a centralized repository that allows you to store all your structured and unstructured data.
So, what is the difference between a data lake and a data warehouse? Both share the same idea of storing data in centralized storage. Is it simply that a data lake stores unstructured data while a data warehouse doesn't?
What if ...
