CHAPTER 1
Data Virtualization
Imagine you have a list of information. If the list is relatively small, you can read it from start to end, and summarize it extremely fast and without difficulty. And even if it is the size of a small book, it is still doable, although it will require more time and probably some summarizing techniques you already know. But as the size of the list grows, the amount of effort and time you must put into it will also increase … until you get to a point where your brain can't process all of the information. This is exactly the case with state-of-the-art Big Data, Internet of Things (IoT), data mining … you name it: there are massive data sets that need to be analyzed as fast as possible, but with a rational number of resources to keep the costs low. How can you access the information contained in these massive data sets using the tools and languages you already know, that is, without having to put a lot of effort in learning how to use them, and without having to build complex structures and processes, or moving vast amounts of data that will take insane amounts of time? Will you be able to do it several times a day?
Structure
In this chapter, you will learn the following topics:
- Filtering the information
- Link relational data with storage/file system data
- What you would have to do without data virtualization
- How data virtualization simplifies querying external data
- How learning PolyBase can help you irrespective of your role
Objectives
After studying this chapter, you will be able to do the following:
- Identify on which side of a computer communication network the information should be filtered
- Understand the importance of relational data
- Understand the importance of storage and file system data
- Understand the benefits of data virtualization
- Understand how PolyBase can help different roles
Filtering the information
You have two computers, A and B. In computer A, you have 1,000,000 entries, and in computer B, you have 1,000 entries; note that the entries are somehow related to each other. If you move all entries from computer A to B, it means this data will go through the network, consuming your network bandwidth, and not allowing other information to be transmitted among other computers on the network because of the reduced network space or the lack of it. This also means you must have enough memory in computer B to store these entries, not to mention you have duplicated your information. Finally, computer B needs to use its CPU, memory, and disk to process the entries and link them with its local entries. Which device has adequate memory to link them? The answer is not computer A, as you may be thinking. Computer B can process the data faster:
- If it has more memory,
- If it has additional or faster CPUs,
- If it has additional or faster disks, or
- If it is a distributed system.
So, you must achieve a balance between moving massive amounts of data through the network and processing the data in a computer with additional resources.
But what if you can move the 1,000 entries from computer B to A, do the filtering on computer A, and return only those entries to computer B? In that case, you will not saturate the network, duplicate information, or need more storage. Thus, it's not enough to compare the computing resources between two environments' it's also crucial to consider how to process the data efficiently and effectively, even testing different setups to find the one that provides the most benefits.
Link relational data with storage/file system data
You may ask why you need external data when you already have a relational database. Using a relational database, you can process daily business operations such as modifying the stored information through updates (for example, a customer that has moved to another city) and deletes (for example, a customer has cancelled a pre-order). However, you also need to consider insert operations (for example, new orders) and read operations (for example, reports), and how to guarantee data integrity between concurrent reads and writes; both fast reads and fast writes cannot be satisfied at the same time.
Despite these benefits, you must consider the several different types of database management systems available, each with advantages over the other. Moreover, you may not be able to migrate all of your data onto a single one, or the data may come from third-party software that is unsupported in a different database. And you may not have enough time and money to switch to another system or develop your own.
You may also ask why you need a relational database when you have storage and file system. The storage and file system offer the advantage of fast reads and writes, but at the expense of data integrity. Further, updating the information of a customer means updating hundreds of records, which is a slow and costly operation. Therefore, it is better suited for information generated sequentially (which won't contain customer information) and for archiving purposes (which may never need to be updated).
The storage and file system is used because it can be optimized for parallel processing, provide cost-effective distributed and scalable processing, allow unstructured information storage and retrieval, provide real-time analysis mechanisms, and support deep learning and streaming workloads.
While working on the field with real businesses, you will use components that are already purchased and licensed, and therefore knowing how to interconnect them is a must. Your customer has heterogeneous database management systems and storage and file system data; trying to change this is an enormous and costly operation that won't generate any value. It is possible that you store the customer information in a relational database, which must be transactional and concurrent, and documents, pre-orders, orders, etc. in a storage and file system data where you have fast storage and retrieval. The only way to know the relationship between both, and to extract information from one into another, is by linking them. While doing so, you want to keep the benefits each of these technologies provides.
What you would have to do without data virtualization
Let's say your relational data is stored in SQL Server (a relational database management system), and your storage and file system data is stored in HDFS (Hadoop File System). If you're familiar with SQL Server, you know you can create a linked server (a data connection) between both. Here is an article that describes how to create it: https://runops.wordpress.com/2015/10/17/create-sql-server-linked-server-to-hadoop/. Once that is established, you can retrieve the information from Hadoop into SQL Server to have it in the same format and link it with your relational data, but you won't be using the benefits of Hadoop; so it will probably end up being a long-running operation, consuming network bandwidth and memory.
Another possible way is to load the SQL Server data into Hadoop to have it in the same format, and then link the information and get the insight you wanted. However, this means you will need to learn about the distributed architecture (nodes) and the communication model between these nodes, how to load SQL Server data into it, how to write HiveQL (which is similar to T-SQL), how to write MapReduce jobs for summarizing information, and how to export this back into SQL Server. This will be a several months' project for each team member, and for new hires as well.
If your data is stored in another database, like Oracle, you could create a link to SQL Server and link the information within Oracle. Here is an article that describes how to create it: https://www.sqlservercentral.com/articles/perform-data-filtering-in-oracle-link-to-sql-server. However, this means you will need to learn about Oracle, how to connect to it, write PL/SQL queries, and have elevated permissions to create the link. Further, if you want to process the information back in SQL Server, you will need a way for it, write custom logic to link and integrate it at an application server, or create a complex setup for this.
Wouldn't it be great to be able to query any external information within SQL Server (a tool you already know) using T-SQL (a language you're familiar with), while utilizing the characteristics each external system offers, like parallel processing and fast storage and retrieval?
How data virtualization simplifies querying external data
PolyBase enables your SQL Server instance to read data from external sources through T-SQL statements that, first, specify the details at the moment of the table creation (for example, how the external data is structured), and then, query the external source as normal tables, irrespective of whether they're database management systems or a storage or file system. As the data in these external sources comes back in the form of tables, you can easily link them to your SQL Server data tables, and then combine both. And because PolyBase uses T-SQL for this purpose, you don't need any knowledge about the external source, or about how to configure or query it in its own language.
With PolyBase, you're also not required to install additional software in your external environment, and you won't need a separate ETL or import tool to link the data. And that's what data virtualization means: allowing the data to stay in its original location while virtually (not realistically) having it available in your SQL Server instance.
In the specific case of Hadoop, you can query unstructured information, and you can push the computation to be made remotely in the Hadoop server when it helps optimizing the overall performance. The decision to do the processing on Hadoop is based on statistics kept in SQL Server about the external table, but if the computation is chosen to be made in Hadoop, it automatically creates MapReduce jobs for the task without you knowing how to create them and leverages the distributed computational resources of Hadoop.
Moreover, if you need enhanced performance owing to the nature of your data, you can create SQL Server scale-out groups that enable parallel data transfer between each Hadoop node and each SQL Server instance, also allowing to operate on this external data using each instance's computing resources.
How learning PolyBase can help you irrespective of your role
Now that you know the benefits of PolyBase, you may be wondering why you need to learn it and how it can help you in your current role. The use cases are diverse, and I'm sure there's one that fits your organization. I'm citing only a few use cases here, but with these I hope to give you enough insight about how important this technology is and what it enables you to do, with the goal that this will help you do your job faster and easier and allow you to propose it within your organization for a situation where it fits well.
- As a database administrator (DBA), you have long-running processes that move information from one place to another, and that information is critical for the business decision support systems. When there is delay, or the process fails, your customer starts losing money and won't be willing to wait for the process to be restarted or lose a whole day of work. PolyBase can accelerate this process thanks to the parallel processing it offers.
- As a data engineer, you divide and sample the information from all data stores, which requires you to learn each data store system's basics and then gather the required information. PolyBase doesn't require you to know anything other than SQL Server and T-SQL, thus simplifying your job.
- As a data scientist, you perform exploratory data analysis before working on the whole data, which requires you to work on large amounts of information using large number of resources. PolyBase allows you to easily work on subsets of data using only SQL Server.
- As a developer, your main goal is to develop fast and efficient programs irrespective of where the data is located. PolyBase allows you to avoid using a linked server, which is slow.
- In a business intelligence (BI) role, you're more interested in the external data to be available than the details about how it works. PolyBase allows you to query the external data without moving all of it, and before all of it has been moved from one point to the other.
- In a machine learning (ML...