Section 1 ā Operational/OLTP Databases
Databases are used in everyday applications such as webshops, CRM systems, and financial systems. These are our Line Of Business (LOB) applications. In this section, we will learn how to choose between a SQL database and a Cosmos DB database. We will also learn how to design a database for optimal performance and scalability.
This section comprises the following chapters:
- Chapter 1, Introduction to Databases
- Chapter 2, Entity Analysis
- Chapter 3, Normalizing Data
- Chapter 4, Provisioning and Implementing an Azure SQL DB
- Chapter 5, Designing a NoSQL Database
- Chapter 6, Provisioning and Implementing an Azure Cosmos DB Database
Chapter 1: Introduction to Databases
Data has become increasingly important over the last few years. Almost all applications use data, whether the application is a Customer Relationship Management (CRM) system at work or a social media app on your phone. All that data is stored in databases. Since the 1980s, almost all those databases have been relational databases. Nowadays, with the advent of big data, there are different ways to store and process huge amounts of data. Some of them can be classified as so-called NoSQL databases. NoSQL stands for "not only" SQL. This means that we are seeing other types of databases emerge and being used alongside relational databases. NoSQL databases are important in the area of big data. The "SQL" in NoSQL stands for Structured Query Language. This is the programming language of relational databases and has become the "equivalent" of relational databases.
In this chapter, you will learn the basics of databases. A lot of the theory discussed in this chapter stems from relational databases, although the majority is applicable to other database systems as well.
We will discuss the following topics in this chapter:
- Overview of relational databases
- Introduction to Structured Query Language
- Impact of intended usage patterns on database design
- Understanding relational theory
- Keys
- Types of workload
Overview of relational databases
Databases hadn't yet been invented when we first started programming computer applications. All data had to be stored in files. Oftentimes, those files were simple comma-separated value files (CSV files). An example of a CSV file can be seen in the following screenshot:
Figure 1.1 ā Person.csv
As you can see, it is just some data without anything else.
Files
Using files to store data for use in applications entailed a number of issues. After trying file formats other than CSV files, developers started using databases instead of plain files. Plain files or flat files are files with just data stored in them. Figure 1.1 is an example of a flat file. Let's look into the issues that using flat files posed.
From the header of the screenshot in Figure 1.1, it is clear that the file is called Person.csv. We may infer that the data in the files represents persons. However, it is not clear whether those people are patients, customers, employees, or even someone completely different. Furthermore, you cannot ascertain that extra information from the file or its content.
Drawbacks
The use of these types of flat files to store data comes with three drawbacks:
- You cannot infer from the file itself what the data is about.
- It is not flexible from a programming perspective and is bad for performance when working with the data.
- It is (almost) impossible for multiple persons to work with flat files simultaneously.
We will now examine each of these drawbacks in turn.
Drawback 1 ā You cannot infer from the file itself what the data is about
It is clear from looking at the screenshot that each line has two commas, meaning that there are three columns per row. The second column very likely holds a first name. This is a reasonable assumption based on our knowledge of names, although you may require a knowledge of Dutch names to make this assumption. The third column is more difficult to guess. It could be the year of birth of the person in question, but it could also be a postal code or perhaps a monthly salary.
The file only stores the actual data and not the metadata. It may be that you can guess what the values mean, but you cannot infer it from the file itself. Metadata is the data describing the data. Column names are an example of metadata. Column names such as PatientID, Patient_FirstName, and PostalCode would already render it more readable. That is why we often add those column names as a first row in flat files.
Note
Metadata is data that describes the "actual" data.
There is even more to ascertain regarding this data. You cannot perform calculations with postal codes, such as adding up two postal codes (it may be that you can, but it doesn't make any sense). A postal code is an alphanumeric code that you cannot perform computations with. When the last column in Figure 1.1 is a salary and not a postal code, you do want (and need) to be able to perform calculations on this column, for instance, to calculate an annual salary from the monthly salaries. In this case, the column would have been numerical. In other words, it would be beneficial to know a column's data type. Generally speaking, data can be numerical, alphanumerical (text), or dates. Nowadays, of course, there are a lot of variations, such as binary data for pictures.
With data stored in flat files, the data itself and the metadata are stored separately.
Today, we have overcome some of these issues by not using flat files but storing data as XML or as JSON files. Both file types allow you to store metadata with the actual data in the file itself. In the (recent) past, this was too expensive to do. Only recently has storage become cheap enough and compute power plentiful enough to work with text files by storing data and metadata in the way that JSON does.
Drawback 2 ā It is not flexible from a programming perspective and is bad for performance when working with the data
It gets nastier when we start using (old-fashioned) program code to work with the data. Suppose you need to know the postal code of the person called Janneke. Your code would now look something like this:
- Read a line.
- Read the second column.
- If the value you read equals Janneke, then return the third column.
- Repeat lines 1 to 3 until there are no more lines in the file.
With only two lines in the file, this is pretty fast. This code will become more problematic, however, when a file contains many, many rows. It will become really slow.
It gets even worse when someone changes the file structure. Suppose we add a new column, storing the patient's family name between the second and third columns. The code we just wrote will break because it assumes that the postal code is the third column. However, following the change, it is the fourth column. Retrieving the postal code should be independent of which column it actually is.
Drawback 3 ā It is (almost) impossible for multiple persons to work with flat files simultaneously
In most applications, there will be multiple users working with the data simultaneously. What if your webshop could only have one visitor at a time? A database should make it easy for multiple people or processes to work with the same data at the same time. In the case of relational databases, this is part of their core. A relational database has what is known as the ACID properties to cater to multi-user workloads. You will learn more about the ACID properties in Chapter 5, Designing a NoSQL Database. Without a database system, whether relational or not, multiple users working with the same data would not be i...