eBook - ePub

Learning PySpark

Name: Learning PySpark
Author: Tomasz Drabas, Denny Lee

274 pages
English
ePUB (mobile friendly)
Available on iOS and Android

Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

About This Book

Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0
Develop and deploy efficient, scalable real-time Spark solutions
Take your understanding of using Spark with Python to the next level with this jump start guide

Who This Book Is For

If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory.

What You Will Learn

Learn about Apache Spark and the Spark 2.0 architecture
Build and interact with Spark DataFrames using Spark SQL
Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames, respectively
Read, transform, and understand data and use it to train machine learning models
Build machine learning models with MLlib and ML
Learn how to submit your applications programmatically using spark-submit
Deploy locally built applications to a cluster

In Detail

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.

You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. You will also get a thorough overview of PySpark's machine learning capabilities using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Style and approach

This book takes a comprehensive, step-by-step approach so that you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in an easy-to-understand manner, with a focus on both the hows and the whys of each concept.

Information

Publisher: Packt Publishing
Year: 2017
ISBN: 9781786463708
Topics: Computer Science, Data Mining

Learning PySpark

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Understanding Spark

What is Apache Spark?

Spark Jobs and APIs

Execution process

Resilient Distributed Dataset

DataFrames

Datasets

Catalyst Optimizer

Project Tungsten

Spark 2.0 architecture

Unifying Datasets and DataFrames

Introducing SparkSession

Tungsten phase 2

Structured Streaming

Continuous applications

Summary

2. Resilient Distributed Datasets

Internal workings of an RDD

Creating RDDs

Schema

Reading from files

Lambda expressions

Global versus local scope

Transformations

The .map(...) transformation

The .filter(...) transformation

The .flatMap(...) transformation

The .distinct(...) transformation

The .sample(...) transformation

The .leftOuterJoin(...) transformation

The .repartition(...) transformation

Actions

The .take(...) method

The .collect(...) method

The .reduce(...) method

The .count(...) method

The .saveAsTextFile(...) method

The .foreach(...) method

Summary

3. DataFrames

Python to RDD communications

Catalyst Optimizer refresh

Speeding up PySpark with DataFrames

Creating DataFrames

Generating our own JSON data

Creating a DataFrame

Creating a temporary table

Simple DataFrame queries

DataFrame API query

SQL query

Interoperating with RDDs

Inferring the schema using reflection

Programmatically specifying the schema

Querying with the DataFrame API

Number of rows

Running filter statements

Querying with SQL

Number of rows

Running filter statements using the where Clauses

DataFrame scenario – on-time flight performance

Preparing the source datasets

Joining flight performance and airports

Visualizing our flight-performance data

Spark Dataset API

Summary

4. Prepare Data for Modeling

Checking for duplicates, missing observations, and outliers

Duplicates

Missing observations

Outliers

Getting familiar with your data

Descriptive statistics

Correlations

Visualization

Histograms

Interactions between features

Summary

5. Introducing MLlib

Overview of the package

Loading and transforming the data

Getting to know your data

Descriptive statistics

Correlations

Statistical testing

Creating the final dataset

Creating an RDD of LabeledPoints

Splitting into training and testing

Predicting infant survival

Logistic regression in MLlib

Selecting only the most predictable features

Random forest in MLlib

Summary

6. Introducing the ML Package

Overview of the package

Transformer

Estimators

Classification

Regression

Clustering

Pipeline

Predicting the chances of infant survival with ML

Loading the data

Creating transformers

Creating an estimator

Creating a pipeline

Fitting the model

Evaluating the performance of the model

Saving the model

Parameter hyper-tuning

Grid search

Train-validation splitting

Other features of PySpark ML in action

Feature extraction

NLP-related feature extractors

Discretizing continuous variables

Standardizing continuous variables

Classification

Clustering

Finding clusters in the births dataset

Topic mining

Regression

Summary

7. GraphFrames

Introducing GraphFrames

Installing GraphFrames

Creating a library

Preparing your flights dataset

Building the graph

Executing simple queries

Determining the number of airports and trips

Determining the longest delay in this dataset

Determining the number of delayed versus on-time/early flights

What flights departing Seattle are most likely to have significant delays?

What states tend to have significant delays departing from Seattle?

Understanding vertex degrees

Determining the top transfer airports

Understanding motifs

Determining airport ranking using PageRank

Determining the most popular non-stop flights

Using Breadth-First Search

Visualizing flights using D3

Summary

8. TensorFrames

What is Deep Learning?

The need for neural networks and Deep Learning

What is feature engineering?

Bridging the data and algorithm

What is TensorFlow?

Installing Pip

Installing TensorFlow

Matrix multiplication using constants

Matrix multiplication using placeholders

Running the model

Running another model

Discussion

Introducing TensorFrames

TensorFrames – quick start

Configuration and setup

Launching a Spark cluster

Creating a TensorFrames library

Installing TensorFlow on your cluster

Using TensorFlow to add a constant to an existing column

Executing the Tensor graph

Blockwise reducing operations example

Building a DataFrame of vectors

Analysing the DataFrame

Computing elementwise sum and min of all vectors

Summary

9. Polyglot Persistence with Blaze

Installing Blaze

Polyglot persistence

Abstracting data

Working with NumPy arrays

Working with pandas' DataFrame

Working with files

Working with databases

Interacting with relational databases

Interacting with the MongoDB database

Data operations

Accessing columns

Symbolic transformations

Operations on colum...
