Hands-on Data Virtualization with Polybase

Administer Big Data, SQL Queries and Data Accessibility Across Hadoop, Azure, Spark, Cassandra, MongoDB, CosmosDB, MySQL and PostgreSQL (English Edition)


About this book

Run queries and analysis on big data clusters across relational and non-relational databases

Description
This book provides comprehensive coverage on establishing and managing data virtualization using PolyBase. It teaches how to configure PolyBase for almost all relational and non-relational databases. You will learn to set up a test environment for any tool or software instantly and without hassle, and you will practice designing and building high-performing data warehousing solutions in a matter of minutes. You will become proficient in connecting to databases including Hadoop, Cassandra, MySQL, PostgreSQL, MariaDB, and Oracle Database. This book also brings exclusive coverage on building data clusters on Azure and using Azure Synapse Analytics. By the end of this book, you will not just administer PolyBase for managing big data clusters; you will also learn to optimize and boost its performance to enable data analytics and ease of data accessibility.

What you will learn
  • Learn to configure PolyBase and process Transact-SQL queries with ease.
  • Create a Docker container with SQL Server 2019 on Windows and PolyBase.
  • Connect a SQL Server instance with any other software or tool using PolyBase.
  • Connect with Cassandra, MongoDB, MySQL, PostgreSQL, MariaDB, and IBM DB2.

Who this book is for
This book is for database developers and administrators familiar with the SQL language and command prompt. Managers and decision-makers will also find this book useful. No prior knowledge of any other technology or language is required.

Table of Contents
1. What is Data Virtualization (PolyBase)
2. History of PolyBase
3. PolyBase current state
4. Differences with other technologies
5. Usage
6. Future
7. SQL Server
8. Hadoop Cloudera and Hortonworks
9. Windows Azure Storage Blob
10. Spark
11. From Azure Synapse Analytics
12. From Big Data Clusters
13. Oracle
14. Teradata
15. Cassandra
16. MongoDB
17. CosmosDB
18. MySQL
19. PostgreSQL
20. MariaDB
21. SAP HANA
22. IBM DB2
23. Excel

Frequently asked questions

Yes, you can access Hands-on Data Virtualization with Polybase by Pablo Alejandro Echeverria Barrios in PDF and/or ePUB format, as well as other popular books in Computer Science & Desktop Applications. We have over one million books available in our catalogue for you to explore.

CHAPTER 1

Data Virtualization

Imagine you have a list of information. If the list is relatively small, you can read it from start to end and summarize it extremely fast and without difficulty. Even if it is the size of a small book, it is still doable, although it will require more time and probably some summarizing techniques you already know. But as the size of the list grows, the amount of effort and time you must put into it also increases … until you reach a point where your brain can't process all of the information. This is exactly the case with state-of-the-art Big Data, Internet of Things (IoT), data mining … you name it: there are massive data sets that need to be analyzed as fast as possible, but with a reasonable amount of resources to keep costs low. How can you access the information contained in these massive data sets using the tools and languages you already know, that is, without having to put a lot of effort into learning new ones, without having to build complex structures and processes, and without moving vast amounts of data that would take insane amounts of time? And will you be able to do it several times a day?

Structure

In this chapter, you will learn the following topics:
  • Filtering the information
  • Link relational data with storage/file system data
  • What you would have to do without data virtualization
  • How data virtualization simplifies querying external data
  • How learning PolyBase can help you irrespective of your role

Objectives

After studying this chapter, you will be able to do the following:
  • Identify on which side of a computer communication network the information should be filtered
  • Understand the importance of relational data
  • Understand the importance of storage and file system data
  • Understand the benefits of data virtualization
  • Understand how PolyBase can help different roles

Filtering the information

You have two computers, A and B. Computer A holds 1,000,000 entries and computer B holds 1,000 entries, and the entries are somehow related to each other. If you move all entries from computer A to B, this data will go through the network, consuming your network bandwidth and preventing other information from being transmitted among the other computers on the network because of the reduced available bandwidth, or the lack of it. You must also have enough memory in computer B to store these entries, not to mention that you have now duplicated your information. Finally, computer B needs to use its CPU, memory, and disk to process the entries and link them with its local entries. Which computer is best equipped to link them? The answer is not computer A, as you may be thinking. Computer B can process the data faster:
  • If it has more memory,
  • If it has additional or faster CPUs,
  • If it has additional or faster disks, or
  • If it is a distributed system.
So, you must achieve a balance between moving massive amounts of data through the network and processing the data in a computer with additional resources.
But what if you could move the 1,000 entries from computer B to A, do the filtering on computer A, and return only the matching entries to computer B? In that case, you would not saturate the network, duplicate information, or need more storage. Thus, it's not enough to compare the computing resources between two environments; it's also crucial to consider how to process the data efficiently and effectively, even testing different setups to find the one that provides the most benefits.
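The trade-off can be sketched with a linked-server example in T-SQL; the server, database, table, and column names here are all hypothetical:

```sql
-- Pull everything: all remote entries cross the network and the
-- filtering happens locally (four-part linked-server name):
SELECT r.Id, r.Amount
FROM RemoteServer.SalesDb.dbo.Entries AS r
JOIN dbo.LocalEntries AS l ON l.Id = r.Id;

-- Filter remotely: OPENQUERY ships the query text to the remote
-- computer, so only the matching rows travel back over the network:
SELECT *
FROM OPENQUERY(RemoteServer,
    'SELECT Id, Amount FROM SalesDb.dbo.Entries WHERE Region = ''West''');
```

Depending on the provider and the available statistics, the optimizer can sometimes push predicates through four-part names as well, which is why testing different setups, as suggested above, is worthwhile.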

Link relational data with storage/file system data

You may ask why you need external data when you already have a relational database. Using a relational database, you can process daily business operations such as modifying the stored information through updates (for example, a customer who has moved to another city) and deletes (for example, a customer who has cancelled a pre-order). However, you also need to consider insert operations (for example, new orders) and read operations (for example, reports), and how to guarantee data integrity between concurrent reads and writes; fast reads and fast writes cannot both be satisfied at the same time.
Despite these benefits, you must consider the several different types of database management systems available, each with advantages over the other. Moreover, you may not be able to migrate all of your data onto a single one, or the data may come from third-party software that is unsupported in a different database. And you may not have enough time and money to switch to another system or develop your own.
You may also ask why you need a relational database when you have storage and file system. The storage and file system offer the advantage of fast reads and writes, but at the expense of data integrity. Further, updating the information of a customer means updating hundreds of records, which is a slow and costly operation. Therefore, it is better suited for information generated sequentially (which won't contain customer information) and for archiving purposes (which may never need to be updated).
The storage and file system is used because it can be optimized for parallel processing, provide cost-effective distributed and scalable processing, allow unstructured information storage and retrieval, provide real-time analysis mechanisms, and support deep learning and streaming workloads.
While working in the field with real businesses, you will use components that are already purchased and licensed, so knowing how to interconnect them is a must. Your customer has heterogeneous database management systems and storage and file system data; trying to change this is an enormous and costly operation that won't generate any value. It is possible that you store the customer information in a relational database, which must be transactional and concurrent, and documents, pre-orders, orders, etc. in a storage and file system where you have fast storage and retrieval. The only way to know the relationship between both, and to extract information from one into another, is by linking them. While doing so, you want to keep the benefits each of these technologies provides.

What you would have to do without data virtualization

Let's say your relational data is stored in SQL Server (a relational database management system), and your storage and file system data is stored in HDFS (Hadoop Distributed File System). If you're familiar with SQL Server, you know you can create a linked server (a data connection) between both. Here is an article that describes how to create it: https://runops.wordpress.com/2015/10/17/create-sql-server-linked-server-to-hadoop/. Once that is established, you can retrieve the information from Hadoop into SQL Server to have it in the same format and link it with your relational data, but you won't be using the benefits of Hadoop; so it will probably end up being a long-running operation, consuming network bandwidth and memory.
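A linked-server setup of this kind can be sketched as follows; the DSN name, table, and columns are hypothetical, and it assumes a Hive ODBC driver is already installed and configured:

```sql
-- Register a linked server over ODBC (MSDASQL) pointing at a
-- pre-configured Hive system DSN (the DSN name is hypothetical):
EXEC master.dbo.sp_addlinkedserver
    @server = N'HADOOP',
    @srvproduct = N'HIVE',
    @provider = N'MSDASQL',
    @datasrc = N'HiveDSN';

-- Pull the remote rows into SQL Server and join locally; note that the
-- join itself does not use Hadoop's distributed processing:
SELECT c.CustomerId, h.amount
FROM dbo.Customers AS c
JOIN OPENQUERY(HADOOP, 'SELECT customerid, amount FROM orders') AS h
    ON h.customerid = c.CustomerId;
```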
Another possible way is to load the SQL Server data into Hadoop to have it in the same format, and then link the information and get the insight you wanted. However, this means you will need to learn about the distributed architecture (nodes) and the communication model between these nodes, how to load SQL Server data into it, how to write HiveQL (which is similar to T-SQL), how to write MapReduce jobs for summarizing information, and how to export this back into SQL Server. This will be a several months' project for each team member, and for new hires as well.
If your data is stored in another database, like Oracle, you could create a link to SQL Server and link the information within Oracle. Here is an article that describes how to create it: https://www.sqlservercentral.com/articles/perform-data-filtering-in-oracle-link-to-sql-server. However, this means you will need to learn about Oracle: how to connect to it and how to write PL/SQL queries, and you will need elevated permissions to create the link. Further, if you want to process the information back in SQL Server, you will need a way to return it: write custom logic to link and integrate it at an application server, or create a complex setup for this purpose.
Wouldn't it be great to be able to query any external information within SQL Server (a tool you already know) using T-SQL (a language you're familiar with), while utilizing the characteristics each external system offers, like parallel processing and fast storage and retrieval?

How data virtualization simplifies querying external data

PolyBase enables your SQL Server instance to read data from external sources through T-SQL statements that, first, specify the details at the moment of the table creation (for example, how the external data is structured), and then, query the external source as normal tables, irrespective of whether they're database management systems or a storage or file system. As the data in these external sources comes back in the form of tables, you can easily link them to your SQL Server data tables, and then combine both. And because PolyBase uses T-SQL for this purpose, you don't need any knowledge about the external source, or about how to configure or query it in its own language.
With PolyBase, you're also not required to install additional software in your external environment, and you won't need a separate ETL or import tool to link the data. And that's what data virtualization means: allowing the data to stay in its original location while virtually (not physically) having it available in your SQL Server instance.
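As a preview of the syntax covered in detail in later chapters, a minimal PolyBase setup for a Hadoop source might look like this; the cluster address, file path, and table columns are illustrative:

```sql
-- External data source pointing at the Hadoop name node (address is illustrative):
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

-- File format describing the delimited files stored in HDFS:
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- External table: the data stays in HDFS, but queries see a normal table:
CREATE EXTERNAL TABLE dbo.OrdersExternal
(
    OrderId    INT,
    CustomerId INT,
    Amount     DECIMAL(10, 2)
)
WITH (LOCATION = '/data/orders/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = CsvFormat);

-- Join external data with a local table using plain T-SQL:
SELECT c.Name, SUM(o.Amount) AS Total
FROM dbo.Customers AS c
JOIN dbo.OrdersExternal AS o ON o.CustomerId = c.CustomerId
GROUP BY c.Name;
```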
In the specific case of Hadoop, you can query unstructured information, and you can push the computation to run remotely on the Hadoop cluster when it helps optimize overall performance. The decision to do the processing on Hadoop is based on statistics kept in SQL Server about the external table; if the computation is chosen to run in Hadoop, SQL Server automatically creates MapReduce jobs for the task, without your having to know how to create them, and leverages the distributed computational resources of Hadoop.
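SQL Server also exposes query hints to override the statistics-based pushdown decision. For example, assuming a hypothetical external table dbo.OrdersExternal backed by a Hadoop source:

```sql
-- Force the filtering/aggregation to run as MapReduce jobs on Hadoop:
SELECT CustomerId, COUNT(*) AS Orders
FROM dbo.OrdersExternal
WHERE Amount > 100
GROUP BY CustomerId
OPTION (FORCE EXTERNALPUSHDOWN);

-- Or keep all of the processing on the SQL Server side:
SELECT CustomerId, COUNT(*) AS Orders
FROM dbo.OrdersExternal
WHERE Amount > 100
GROUP BY CustomerId
OPTION (DISABLE EXTERNALPUSHDOWN);
```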
Moreover, if you need enhanced performance owing to the nature of your data, you can create SQL Server scale-out groups that enable parallel data transfer between each Hadoop node and each SQL Server instance, also allowing you to operate on this external data using each instance's computing resources.

How learning PolyBase can help you irrespective of your role

Now that you know the benefits of PolyBase, you may be wondering why you need to learn it and how it can help you in your current role. The use cases are diverse, and I'm sure there's one that fits your organization. I'm citing only a few use cases here, but with these I hope to give you enough insight about how important this technology is and what it enables you to do, with the goal that this will help you do your job faster and easier and allow you to propose it within your organization for a situation where it fits well.
  • As a database administrator (DBA), you have long-running processes that move information from one place to another, and that information is critical for the business decision support systems. When there is a delay, or the process fails, your customer starts losing money and won't be willing to wait for the process to restart or to lose a whole day of work. PolyBase can accelerate this process thanks to the parallel processing it offers.
  • As a data engineer, you divide and sample the information from all data stores, which requires you to learn each data store system's basics and then gather the required information. PolyBase doesn't require you to know anything other than SQL Server and T-SQL, thus simplifying your job.
  • As a data scientist, you perform exploratory data analysis before working on the whole data, which requires you to work on large amounts of information using a large number of resources. PolyBase allows you to easily work on subsets of data using only SQL Server.
  • As a developer, your main goal is to develop fast and efficient programs irrespective of where the data is located. PolyBase allows you to avoid using a linked server, which is slow.
  • In a business intelligence (BI) role, you're more interested in the external data to be available than the details about how it works. PolyBase allows you to query the external data without moving all of it, and before all of it has been moved from one point to the other.
  • In a machine learning (ML...

Table of contents

  1. Cover Page
  2. Title Page
  3. Copyright Page
  4. Dedication Page
  5. About the Author
  6. About the Reviewer
  7. Acknowledgement
  8. Preface
  9. Errata
  10. Table of Contents
  11. 1. Data Virtualization
  12. 2. History of PolyBase
  13. 3. PolyBase Current State
  14. 4. Difference between PolyBase and Other Technologies
  15. 5. Usage
  16. 6. Future of PolyBase
  17. 7. SQL Server
  18. 8. Hadoop Cloudera and Hortonworks
  19. 9. Azure Storage
  20. 10. Spark
  21. 11. Azure Synapse Analytics
  22. 12. Big Data Clusters
  23. 13. Oracle
  24. 14. Teradata
  25. 15. Cassandra
  26. 16. MongoDB
  27. 17. Cosmos DB
  28. 18. MySQL
  29. 19. PostgreSQL
  30. 20. MariaDB
  31. 21. SAP HANA
  32. 22. IBM Db2
  33. 23. Excel
  34. Index