Section 1: AWS Data Engineering Concepts and Trends
To start with, we examine why data is so important to organizations today, and introduce foundational concepts of data engineering, including coverage of governance and security topics. We also learn about the AWS services that form part of the data engineer’s toolkit, and get hands-on with creating an AWS account and using services such as Amazon S3, AWS Lambda, and AWS Identity and Access Management (IAM).
This section comprises the following chapters:
- Chapter 1, An Introduction to Data Engineering
- Chapter 2, Data Management Architectures for Analytics
- Chapter 3, The AWS Data Engineer’s Toolkit
- Chapter 4, Data Cataloging, Security, and Governance
Chapter 1: An Introduction to Data Engineering
Data engineering is a fast-growing career path, and a role in high demand, as data becomes ever more critical to organizations of all sizes. For those who enjoy the challenge of putting together the "puzzle pieces" that build out complex data pipelines to ingest raw data, and to then transform and optimize that data for various data consumers, it can be a truly rewarding career.
In this chapter, we look at the many ways that data has become an important and valuable corporate asset. We also review some of the challenges that organizations face as they deal with increasing volumes of data, and how data engineers can use cloud-based services to help overcome these challenges. We then set the foundations for the rest of the hands-on activities in this book by providing step-by-step details on creating a new Amazon Web Services (AWS) account.
Throughout this book, we are going to cover a number of topics that teach the fundamentals of developing data engineering pipelines on AWS, but we'll get started in this chapter with these topics:
- The rise of big data as a corporate asset
- The challenges of ever-growing datasets
- The role of the data engineer as a big data enabler
- The benefits of the cloud when building big data analytic solutions
- Hands-on - creating or accessing an AWS account so that you can follow along with the hands-on activities in this book
Technical requirements
You can find the code files of this chapter in the GitHub repository using the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS/tree/main/Chapter01
The rise of big data as a corporate asset
You don't need to look too far or too hard these days to hear about how big data and data analytics are transforming organizations and having an impact on society as a whole. We hear about how companies such as TikTok analyze large quantities of data to make personalized recommendations about which clip to show a user next. Also, we know how Amazon recommends products a customer may be interested in based on their purchase history. We read headlines about how big data could revolutionize the healthcare industry, or how stock pickers turn to big data to find the next breakout stock performer when the markets are down.
The most valuable companies in the US today are those that have mastered the management of huge data assets, with the top five most valuable companies in Q4 2021 being the following:
- Microsoft
- Apple
- Alphabet (Google)
- Amazon
- Tesla
For a long time, it was companies that managed natural gas and oil resources, such as ExxonMobil, that were high on the list of the most valuable companies on the US stock exchange. Today, ExxonMobil will often not even make the list of the top 30 companies. It is no wonder that the number of job listings for people with skillsets related to big data is on the rise.
There is also no doubt that data, when harnessed correctly and optimized for maximum analytic value, can be a game-changer for an organization. At the same time, those companies that are unable to effectively utilize their data assets risk losing a competitive advantage to others that do have a comprehensive data strategy and effective analytic and machine learning programs.
Organizations today tend to be in one of the following three states:
- They have an effective data analytics and machine learning program that differentiates them from their competitors.
- They are conducting proof-of-concept projects to evaluate how analytic and machine learning programs can help them achieve a competitive advantage.
- Their leaders are having sleepless nights worrying about how their competitors are using analytics and machine learning programs to achieve a competitive advantage over them.
No matter where an organization currently is in their data journey, if they have been in existence for a while, they have likely faced a number of common data-related challenges. Let's look at how organizations have typically handled the challenge of ever-growing datasets.
The challenges of ever-growing datasets
Organizations have many assets, such as physical assets, intellectual property, the knowledge of their employees, and trade secrets. But for too long, organizations did not fully recognize that they had another extremely valuable asset, and they failed to maximize the use of it—the vast quantities of data that they had gathered over time.
That is not to say that organizations ignored these data assets, but rather, due to the expense and complex nature of storing and managing this data, organizations tended to only keep a subset of data.
Initially, data may have been stored in a single database, but as organizations, and their data requirements, grew, the number of databases exponentially increased. Today, with the modern application development approach of microservices, companies commonly have hundreds, or even thousands, of databases. Faced with many data silos, organizations invested in data warehousing systems that would enable them to ingest data from multiple siloed databases into a central location for analytics. But due to the expense of these systems, there were limitations on how much data could be stored, and some datasets would either be excluded or only aggregate data would be loaded into the data warehouse. Data would also only be kept for a limited period of time as data storage for these systems was expensive, and therefore it was not economical to keep historical data for long periods. There was also a lack of widely available tools and compute power to enable the analysis of extremely large, comprehensive datasets.
As an organization continued to grow, multiple data warehouses and data marts would be implemented for different business units or groups, and organizations still lacked a centralized, single-source-of-truth repository for their data. Organizations were also faced with new types of data, such as semi-structured or even unstructured data, and analyzing these datasets with traditional tooling was a challenge.
As a result, new technologies were invented that were better able to work with very large datasets and different data types. Hadoop grew out of an open source search engine project in the early 2000s that aimed to index over 1 billion web pages, and was subsequently developed extensively at Yahoo. Over the next few years, Hadoop, and the underlying MapReduce technology, became a popular way for all types of companies to store and process much larger datasets. However, running a Hadoop cluster was a complex and expensive operation requiring specialized skills.
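To make the MapReduce model concrete, the following is a toy sketch in plain Python of the three conceptual phases (map, shuffle, reduce) applied to the classic word-count problem. This is not Hadoop's actual Java API; the function names and the in-process "cluster" are illustrative assumptions only, since in a real Hadoop job the map and reduce tasks run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all intermediate values by key, as the framework
    does between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the grouped values; here, sum the counts per word."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data is big", "data pipelines move data"]

# On a real cluster the map tasks would run in parallel; here we simply
# concatenate each document's intermediate output.
intermediate = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(intermediate))
print(word_counts)
```

The key idea the sketch illustrates is that each phase operates only on key-value pairs, which is what lets the framework partition work across a cluster without the programmer writing any distribution logic.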
The next evolution in big data processing was the development of Spark (later taken on as an Apache project and now known as Apache Spark), a new framework for working with big data. Spark showed significant performance gains when working with large datasets because it performed most of its processing in memory...