Data Analysis with Python and PySpark
eBook - ePub

Jonathan Rioux

  1. 456 pages
  2. English
  3. ePub (mobile friendly)
  4. Available on iOS and Android

About This Book

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:
  • Manage your data as it scales across multiple machines
  • Scale up your data programs with full confidence
  • Read and write data to and from a variety of sources and formats
  • Deal with messy data with PySpark's data manipulation functionality
  • Discover new data sets and perform exploratory data analysis
  • Build automated data pipelines that transform, summarize, and get insights from data
  • Troubleshoot common PySpark errors
  • Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you've learned and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.

About the technology
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark's core engine with a Python-based API. It helps simplify Spark's steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.

About the book
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You'll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that's Hadoop clusters, cloud data storage, or local data files. Once you've covered the fundamentals, you'll explore the full versatility of PySpark by building machine learning pipelines and blending Python, pandas, and PySpark code.

What's inside
  • Organizing your PySpark code
  • Managing your data, no matter the size
  • Scaling up your data programs with full confidence
  • Troubleshooting common data pipeline problems
  • Creating reliable long-running jobs

About the reader
Written for data scientists and data engineers comfortable with Python.

About the author
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Table of Contents

1 Introduction
PART 1 GET ACQUAINTED: FIRST STEPS IN PYSPARK
2 Your first data program in PySpark
3 Submitting and scaling your first PySpark program
4 Analyzing tabular data with pyspark.sql
5 Data frame gymnastics: Joining and grouping
PART 2 GET PROFICIENT: TRANSLATE YOUR IDEAS INTO CODE
6 Multidimensional data frames: Using PySpark with JSON data
7 Bilingual PySpark: Blending Python and SQL code
8 Extending PySpark with Python: RDD and UDFs
9 Big data is just a lot of small data: Using pandas UDFs
10 Your data under a different lens: Window functions
11 Faster PySpark: Understanding Spark's query planning
PART 3 GET CONFIDENT: USING MACHINE LEARNING WITH PYSPARK
12 Setting the stage: Preparing features for machine learning
13 Robust machine learning with ML Pipelines
14 Building custom ML transformers and estimators

Information

Publisher
Manning
Year
2022
ISBN
9781638350668

1 Introduction

This chapter covers
  • What PySpark is
  • Why PySpark is a useful tool for analytics
  • The versatility of the Spark platform and its limitations
  • PySpark’s way of processing data
According to pretty much every news outlet, data is everything, everywhere. It’s the new oil, the new electricity, the new gold, plutonium, even bacon! We call it powerful, intangible, precious, dangerous. At the same time, data itself is not enough: it is what you do with it that matters. After all, for a computer, any piece of data is a collection of zeroes and ones, and it is our responsibility, as users, to make sense of how it translates to something useful.
Just like oil, electricity, gold, plutonium, and bacon (especially bacon!), our appetite for data is growing. So much, in fact, that computers aren't keeping up. Data is growing in size and in complexity, yet consumer hardware has been stalling a little. RAM for most laptops hovers around 8 to 16 GB, and SSDs get prohibitively expensive past a few terabytes. Is the solution for the burgeoning data analyst to triple-mortgage their life to afford top-of-the-line hardware to tackle big data problems?
Here is where Apache Spark (which I'll call Spark throughout the book) and its companion PySpark come in. They take a few pages from the supercomputer playbook—powerful, but manageable, compute units meshed in a network of machines—and bring them to the masses. Add on top a powerful set of data structures ready for any work you're willing to throw at them, and you have a tool that will grow (pun intended) with you.
A goal for this book is to provide you with the tools to analyze data using PySpark, whether you need to answer a quick data-driven question or build an ML model. It covers just enough theory to get you comfortable while giving you enough opportunities to practice. Most chapters contain a few exercises to anchor what you just learned. The exercises are all solved and explained in appendix A.

1.1 What is PySpark?

What's in a name? Actually, quite a lot. Just by splitting PySpark in two, you can already deduce that this will be related to Spark and Python. And you would be right!
At its core, PySpark can be summarized as the Python API to Spark. While this is an accurate definition, it doesn't say much unless you know the meaning of Python and Spark. Still, let's break down the summary definition by first answering "What is Spark?" With that under our belt, we will then look at why Spark becomes especially powerful when combined with Python and its incredible array of analytical (and machine learning) libraries.

1.1.1 Taking it from the start: What is Spark?

According to the authors of the software, Apache Sparkℱ, which I’ll call Spark throughout this book, is a “unified analytics engine for large-scale data processing” (see https://spark.apache.org/). This is a very accurate, if a little dry, definition. As a mental image, we can compare Spark to an analytics factory. The raw material—here, data—comes in, and data, insights, visualizations, models, you name it, comes out.
Just like a factory will often gain more capacity by increasing its footprint, Spark can process an increasingly vast amount of data by scaling out (across multiple smaller machines) instead of scaling up (adding more resources, such as CPU, RAM, and disk space, to a single machine). RAM, unlike most things in this world, gets more expensive the more you buy (e.g., one stick of 128 GB is more than the price of two sticks of 64 GB). This means that, instead of buying thousands of dollars of RAM to accommodate your data set, you’ll rely on multiple computers, splitting the job between them. In a world where two modest computers are less costly than one large one, scaling out is less expensive than scaling up, which keeps more money in your pockets.
Cloud cost and RAM
In the cloud, prices tend to scale more proportionally with machine size. For instance, as of January 2022, a 16-core/128 GB RAM machine can cost about twice as much as an 8-core/64 GB RAM machine. As the data size grows, Spark can help control costs by scaling the number of workers and executors for a given job. For example, if you have a data transformation job on a modest data set (a few TB), you can limit yourself to a small number of machines—say, five—and scale up to 60 when you want to do machine learning. Some vendors, such as Databricks (see appendix B), offer auto-scaling, meaning that they increase and decrease the number of machines during a job depending on the pressure on the cluster. The implementation of auto-scaling/cost control is 100% vendor-dependent. (Check out chapter 11 for an introduction to the resources making up a Spark cluster, as well as their purpose.)
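As a rough sketch of what capping a job can look like (using plain Spark 3.x dynamic allocation settings rather than any vendor's auto-scaling, with bounds that simply mirror the example above), you can limit how many executors a job may use through configuration:

from pyspark.sql import SparkSession

# These settings usually go on spark-submit or in the cluster configuration;
# building them into the session here just keeps the sketch self-contained.
spark = (
    SparkSession.builder
    .appName("modest-transformation-job")
    # Let Spark grow and shrink the executor count with demand ...
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # ... but never below 5 or above 60 executors.
    .config("spark.dynamicAllocation.minExecutors", "5")
    .config("spark.dynamicAllocation.maxExecutors", "60")
    .getOrCreate()
)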
A single computer can crash or behave unpredictably at times. If instead of one you have a hundred, the chance that at least one of them goes down is much higher.1 Spark therefore jumps through a lot of hoops to manage, scale, and babysit the cluster so that you can focus on what you want, which is to work with data.
This is, in fact, one of the key things about Spark: it’s a good tool because of what you can do with it, but especially because of what you don’t have to do with it. Spark provides a powerful API (application programming interface, the set of functions, classes, and variables provided for you to interact with) that makes it look like you’re working with a cohesive source of data while also working hard in the background to optimize your program to use all the power available. You don’t have to be an expert in the arcane art of distributed computing; you just need to be familiar with the language you’ll use to build your program.
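To make this concrete, here is a small, self-contained sketch (the file and column names are invented for illustration) of what that cohesive view looks like in practice:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("first-look").getOrCreate()

# Read a (hypothetical) CSV file as if it were a single table, however it is stored.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Describe the result you want; Spark plans and distributes the actual work.
summary = (
    orders
    .where(F.col("amount") > 0)
    .groupBy("country")
    .agg(F.sum("amount").alias("total_amount"))
)

summary.show(5)  # Nothing runs until an action such as show() is called.

The same few lines read the same whether the data sits on your laptop or across a hundred machines.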

1.1.2 PySpark = Spark + Python

PySpark provides an entry point to Python in the computational model of Spark. Spark itself is coded in Scala.2 The authors did a great job of providing a coherent interface between languages while preserving the idiosyncrasies of each language where appropriate. It will, therefore, be quite easy for a Scala/Spark programmer to read your PySpark program, as well as for a fellow Python programmer who hasn’t jumped into the deep end (yet).
Python is a dynamic, general-purpose language, available on many platforms and for a variety of tasks. Its versatility and expressiveness make it an especially good fit for PySpark. The language is one of the most popular for a variety of domains, and currently it is a major force in data analysis and science. The syntax is easy to learn and read, and the number of libraries available means that you’ll often find one (or more!) that’s just the right fit for your problem.
PySpark provides access not only to the core Spark API but also to a set of bespoke functionality to scale out regular Python code, as well as pandas transformations. In Python’s data analysis ecosystem, pandas is the de facto data frame library for memory-bound data frames (the entire data frame needs to reside on a single machine’s memory). It’s not a matter of PySpark or pandas now, but PySpark and pandas. Chapters 8 and 9 are dedicated to combining Python, pandas, and PySpark in one happy program. For those really committed to the pandas syntax (or if you have a large pandas program you want to scale to PySpark), Koalas (now called pyspark.pandas and part of Spark as of version 3.2.0; https://koalas.readthedocs.io/) provides a pandas-like porcelain on top of PySpark. If you are starting a new Spark program in Python, I recommend using the PySpark syntax—covered thoroughly in this book—reserving Koalas for when you want to ease the transition from pandas to PySpark. Your program will work faster and, in my opinion, will read better.
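As a quick taste of that blend (a sketch with made-up values; chapters 8 and 9 cover the real thing), a pandas UDF lets Spark apply ordinary pandas code to chunks of a distributed column:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-blend").getOrCreate()

df = spark.createDataFrame([(100.0,), (250.0,), (400.0,)], schema=["amount"])

@pandas_udf(DoubleType())
def add_tax(amount: pd.Series) -> pd.Series:
    # Plain pandas logic; Spark applies it to each batch of the column in parallel.
    return amount * 1.15

df.select(add_tax("amount").alias("amount_with_tax")).show()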

1.1.3 Why PySpark?

There is no shortage of libraries and frameworks to work with data. Why should one spend their time learning PySpark specifically?
PySpark has a lot of advantages for modern data workloads. It sits at the intersection of fast, expressive, and versatile. This section covers the many advantages of PySpark, why its value proposition goes beyond just “Spark, with Python,” and when it is better to reach for another tool.
PySpark is fast
If you search for “big data” in a search engine, there is a very good chance that Hadoop will come up within the first few results. There is a good reason for this: Hadoop popularized the famous MapReduce framework that Google pioneered in 2004 and inspired ...
