Section 1: Getting Started with Data Engineering with GCP
This section covers the purpose, value, and concepts of big data and cloud computing, and how GCP products are relevant to data engineering. You will learn about a data engineer's core responsibilities, how they differ from those of a data scientist, and how to facilitate the flow of data through an organization to derive insights.
This section comprises the following chapters:
- Chapter 1, Fundamentals of Data Engineering
- Chapter 2, Big Data Capabilities on GCP
Chapter 1: Fundamentals of Data Engineering
Years ago, when I first entered the data science world, I used to think data was clean. Clean in terms of readiness, available in one place, and ready for fun data science purposes. I was so excited to experiment with machine learning models, finding unusual patterns in data and playing around with clean data. But after years of experience working with data, I realized that data science in big organizations isn't straightforward.
In my experience, about eighty percent of the effort goes into collecting, cleaning, and transforming the data. If you have had any experience working with data, I am sure you've noticed something similar. The good news is that almost all of these processes can be automated with proper planning, design, and engineering skills. That was the point at which I realized that data engineering would be the most critical role in the data science world from that day onward.
To develop a successful data ecosystem in any organization, the most crucial part is how the organization designs its data architecture. If it fails to make good decisions about the data architecture, everything that follows will be painful. Here are some common examples: the system is not scalable, querying data is slow, business users don't trust the data, the infrastructure cost is very high, and data is leaked. There is much more that can go wrong without proper data engineering practice.
In this chapter, we are going to learn the fundamentals of data engineering. The goal is to introduce you to common terms that are used in this field and that will come up often in later chapters.
In particular, we will be covering the following topics:
- Understanding the data life cycle
- Know the roles of a data engineer before starting
- Foundational concepts for data engineering
Understanding the data life cycle
The first principle to learn to become a data engineer is understanding the data life cycle. If you've worked with data, you know that data doesn't stay in one place; it moves from one storage system to another and from one database to another. Understanding the data life cycle means being able to answer these sorts of questions whenever you want to display information to your end user:
- Who will consume the data?
- What data sources should I use?
- Where should I store the data?
- When should the data arrive?
- Why does the data need to be stored in this place?
- How should the data be processed?
To answer all those questions, we'll start by looking back a little bit at the history of data technologies.
Understanding the need for a data warehouse
The data warehouse is not a new concept; I believe you've at least heard of it. In fact, the terminology is no longer appealing. In my experience, no one gets excited when talking about data warehouses in the 2020s, especially when compared to terminologies such as big data, cloud computing, and artificial intelligence.
So, why do we need to know about data warehouses? The answer is that almost every data engineering challenge, from the old days until now, is conceptually the same. The challenge is always about moving data from the data source to other environments so the business can use it to get information. What changes over time is only the how and the technologies available. If we understand why people needed data warehouses in the past, we will have a better foundation for understanding the data engineering space and, more specifically, the data life cycle.
Data warehouses were first developed in the 1980s to transform data from operational systems into decision-support systems. The key principle of a data warehouse is combining data from many different sources into a single location and then transforming it into a format the data warehouse can process and store.
For example, in the financial industry, say a bank wants to know how many credit card customers also have mortgages. It is a simple enough question, yet it's not that easy to answer. Why?
Most traditional banks that I have worked with had a different operational system for each of their products: a specific system for credit cards, and separate systems for mortgages, savings products, websites, customer service, and many others. So, in order to answer the question, data from multiple systems needs to be stored in one place first.
See the following diagram on how each department is independent:
Figure 1.1 – Data silos
Often, this independence applies not only to the organization structure but also to the data. When data is scattered across separate places like this, we call them data silos. This is very common in large organizations where each department has different goals, responsibilities, and priorities.
In summary, what we need to understand from the data warehouse concept is the following:
- Data silos have always occurred in large organizations, even back in the 1980s.
- Data comes from many operational systems.
- In order to process the data, we need to store the data in one place.
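To make the bank example concrete, here is a minimal Python sketch. Everything in it is hypothetical: the customer lists, the table names, and the assumption that both systems share the same customer_id. In real organizations, that shared key often does not exist, which is part of what makes the question hard. An in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical extracts from two independent operational systems.
credit_card_customers = [(1, "Ana"), (2, "Budi"), (3, "Citra")]
mortgage_customers = [(2, "Budi"), (3, "Citra"), (4, "Dewi")]

# Step 1: bring both extracts into one place. The in-memory SQLite
# database is only a stand-in for a real data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE credit_card (customer_id INTEGER PRIMARY KEY, name TEXT)"
)
warehouse.execute(
    "CREATE TABLE mortgage (customer_id INTEGER PRIMARY KEY, name TEXT)"
)
warehouse.executemany("INSERT INTO credit_card VALUES (?, ?)", credit_card_customers)
warehouse.executemany("INSERT INTO mortgage VALUES (?, ?)", mortgage_customers)

# Step 2: once the data sits side by side, the business question
# becomes a single query.
query = """
    SELECT COUNT(*)
    FROM credit_card AS c
    JOIN mortgage AS m ON m.customer_id = c.customer_id
"""
customers_with_both = warehouse.execute(query).fetchone()[0]
print(customers_with_both)  # 2
```

The query itself is trivial. The hard part is step 1, getting data out of each silo and into one location, which is exactly the work a data warehouse was designed to support.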
What does a typical data warehouse stack look like?
This diagram represents the four logical building blocks in a data warehouse, which are Storage, Compute, Schema, and SQL Interface:
Figure 1.2 – Data warehouse main components
Most data warehouse products can store and process data seamlessly, and the user can use SQL to access the data in tables with a structured schema. This is basic knowledge, but an important point to be aware of is that the four logical building blocks in a data warehouse are designed as one monolithic piece of software. That design evolved over the following years, and this evolution was the start of the data lake.
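As a rough illustration of how the four building blocks fit together in a single piece of software, consider the following Python sketch. SQLite is not a data warehouse; it is used here only as a small stand-in that bundles storage, compute, schema, and a SQL interface in one engine. The transactions table and its values are made up for this example.

```python
import sqlite3

# Storage: a single database (in memory here) holds the data.
db = sqlite3.connect(":memory:")

# Schema: the table structure is declared before any data is loaded.
db.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
db.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 10.0), (1, 15.5), (2, 7.25)],
)

# SQL interface: the user describes the desired result in SQL.
# Compute: the same engine that stores the rows runs the aggregation.
totals = db.execute(
    "SELECT customer_id, SUM(amount) FROM transactions "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(totals)  # [(1, 25.5), (2, 7.25)]
```

Notice that the user never chooses where the rows are stored or which process computes the sum. All four pieces come as one package, which is what monolithic means here.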
Getting familiar with the differences between a data warehouse and a data lake
Fast forward to the mid-2000s, when an open source data technology named Hadoop was first released (2006). In the years that followed, people started to use the data lake terminology. If you try to find the definition of a data lake on the internet, it will mostly be described as a centralized repository that allows you to store all your structured and unstructured data.
So, what is the difference between a data lake and a data warehouse? Both share the idea of storing data in centralized storage. Is it simply that a data lake stores unstructured data and a data warehouse doesn't?
What if ...