Hands-On Data Science with SQL Server 2017
eBook - ePub

Hands-On Data Science with SQL Server 2017

Perform end-to-end data analysis to gain efficient data insight

Marek Chmel, Vladimír Mužný

Share book
  1. 506 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
eBook - ePub

Hands-On Data Science with SQL Server 2017

Perform end-to-end data analysis to gain efficient data insight

Marek Chmel, Vladimír Mužný

Book details
Book preview
Table of contents

About This Book

Find, explore, and extract big data to transform into actionable insights

Key Features

  • Perform end-to-end data analysis—from exploration to visualization
  • Real-world examples, tasks, and interview queries to be a proficient data scientist
  • Understand how SQL is used for big data processing using HiveQL and SparkSQL

Book Description

SQL Server is a relational database management system that enables you to cover end-to-end data science processes using various inbuilt services and features.

Hands-On Data Science with SQL Server 2017 starts with an overview of data science with SQL to understand the core tasks in data science. You will learn intermediate-to-advanced level concepts to perform analytical tasks on data using SQL Server. The book has a unique approach, covering best practices, tasks, and challenges to test your abilities at the end of each chapter. You will explore the ins and outs of performing various key tasks such as data collection, cleaning, manipulation, aggregations, and filtering techniques. As you make your way through the chapters, you will turn raw data into actionable insights by wrangling and extracting data from databases using T-SQL. You will get to grips with preparing and presenting data in a meaningful way, using Power BI to reveal hidden patterns. In the concluding chapters, you will work with SQL Server integration services to transform data into a useful format and delve into advanced examples covering machine learning concepts such as predictive analytics using real-world examples.

By the end of this book, you will be in a position to handle the growing amounts of data and perform everyday activities that a data science professional performs.

What you will learn

  • Understand what data science is and how SQL Server is used for big data processing
  • Analyze incoming data with SQL queries and visualizations
  • Create, train, and evaluate predictive models
  • Make predictions using trained models and establish regular retraining courses
  • Incorporate data source querying into SQL Server
  • Enhance built-in T-SQL capabilities using SQLCLR
  • Visualize data with Reporting Services, Power View, and Power BI
  • Transform data with R, Python, and Azure

Who this book is for

Hands-On Data Science with SQL Server 2017 is intended for data scientists, data analysts, and big data professionals who want to master their skills learning SQL and its applications. This book will be helpful even for beginners who want to build their career as data science professionals using the power of SQL Server 2017. Basic familiarity with SQL language will aid with understanding the concepts covered in this book.

Frequently asked questions

How do I cancel my subscription?
Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.
Can/how do I download books?
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
What is the difference between the pricing plans?
Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.
What is Perlego?
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Do you support text-to-speech?
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Is Hands-On Data Science with SQL Server 2017 an online PDF/ePUB?
Yes, you can access Hands-On Data Science with SQL Server 2017 by Marek Chmel, Vladimír Mužný in PDF and/or ePUB format, as well as other popular books in Informatique & Programmation en SQL. We have over one million books available in our catalogue for you to explore.



Data Exploration and Statistics with T-SQL

When creating our data science solutions, the data that we want to use in these tasks should be explored carefully. As we learned in Chapter 5, Data Transformation and Cleansing with T-SQL, when data is loaded into a desired format, we need to find the distributions and patterns within it. We should also use data exploration during the staging process to check and improve data quality.
In this chapter, we will learn how to use T-SQL language to get statistical results from our data. To do this, we will use the following techniques:
  • T-SQL aggregate queries: This section explains what the aggregate query is and which statistical measures it can show.
  • Ranking, framing, and windowing with T-SQL: Using framing and windowing helps to obtain results enriched by sorting or ranking. In this section, we will play with framing and windowing from the perspective of data exploration.
  • Running aggregates with T-SQL: This section will join knowledge from the aggregation queries section and the framing and windowing section to help us to create running aggregates or comparisons of values between rows.

Technical requirements

To play with all the scripts and projects in this chapter, SQL Server 2017 is required.
For all SQL scripts provided within this chapter, Management Studio version 17.5 or higher is recommended.
All scripts and other files are provided at https://github.com/PacktPublishing/Hands-On-Data-Science-with-SQL-Server-2017/tree/master/CH6_Data_Exploration_and_Statistics_with_T_SQL. Scripts on GitHub are also ordered accordingly with scripts reported throughout the chapter.

T-SQL aggregate queries

Data exploration and descriptive or comparative statistics are very important tasks that have to be done repeatedly and iteratively during every data science project. This gives us better insight into the data that we want to process throughout all projects. T-SQL aggregate queries are an important part of data exploration.
A T-SQL aggregate query is a kind of query that basically summarizes groups of records from underlying datasets and typically provides aggregated numeric values for each group of records generated from the dataset. Groups of records are not needed for every case or every assignment. Such aggregation queries give an aggregation of summarized values over whole underlying datasets.
The simplest aggregation query does not require grouping. With or without grouping, aggregate queries use special kinds of functions, called aggregate functions. Almost every aggregate function takes a column of values as a parameter, and a result of the function is a number. The only exception to this is a function called SUM(*), which simply counts records. SQL Server provides several aggregate functions, which will be covered in this chapter.

Common properties of aggregate functions

All aggregate functions have the same common behaviors:
  • Every aggregate function has one parameter of a typically numeric column that computes one value over that column. Two exceptions are made through the following functions:
    • COUNT(*): This function computes an amount of records.
    • STRING_AGG: This function concatenates strings from a column into one string so that it accepts the varchar or nvarchar columns as a parameter.
  • Almost every aggregate function except COUNT(*) ignores records in which the aggregated column contains NULL. This is because the COUNT(*) function does not work with certain columns, but with whole records.
  • Every aggregate function can be used in SELECT and HAVING clauses. In a SELECT clause, it provides scalar results. The HAVING clause serves as a conditional clause similar to the WHERE clause, but for the result of aggregations that are not yet known in a WHERE clause. We will cover more about the HAVING clause in the dedicated section later in this chapter.
Ignoring columns with NULL values by aggregate functions is a very important property of aggregate functions, and we must always consider this behavior. If we miss it, we can obtain inaccurate results from our queries.
Now it's time to introduce all aggregate functions using examples.

Aggregate functions

Now that we know the common properties of aggregate functions, let's start to describe and explore the aggregate functions one by one.


The simplest aggregate function is probably COUNT(*). This function just computes an amount of records. The following script sample demonstrates how to use this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail
The preceding statement goes over all records from the Sales.SalesOrderDetail table and counts how many rows there are. As seen in the sample, the aggregate function (or functions) can be written into a SELECT clause in the same way as any other expression. The rest of the query can be combined in the usual manner. For example, this may be the case if we want to compute the amount of records belonging only to product ID 710. The following sample statement shows this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail WHERE ProductID = 710
The result of the preceding query is 44 rows. This is because the query-only filters records when ProductID is equal to 710. In addition, the aggregation function only works with records for which the condition is true.
All aggregation functions work in the same way. Only records that meet whole WHERE predicates are processed using the aggregation function.
COUNT(*) is probably the simplest aggregate function provided by SQL Server. Sometimes it's changed with other aggregate functions, called COUNT. Even though both functions have the same name, the difference lies in the parameter. COUNT(*) uses the star (*) symbol to say that it just counts records, but the COUNT function uses certain columns to calculate a number of valid values. In other words, the COUNT fun...

Table of contents