eBook - ePub

Learning PySpark

Name: Learning PySpark
Author: Tomasz Drabas, Denny Lee


274 pages
English
ePUB (mobile-friendly)
Available on iOS and Android

About this book

Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

About This Book

Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0
Develop and deploy efficient, scalable real-time Spark solutions
Take your understanding of using Spark with Python to the next level with this jump start guide

Who This Book Is For

If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory.

What You Will Learn

Learn about Apache Spark and the Spark 2.0 architecture
Build and interact with Spark DataFrames using Spark SQL
Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames, respectively
Read, transform, and understand data and use it to train machine learning models
Build machine learning models with MLlib and ML
Learn how to submit your applications programmatically using spark-submit
Deploy locally built applications to a cluster

In Detail

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.

You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.
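
As a rough illustration of the spark-submit step mentioned above, the sketch below assembles a spark-submit command line from Python and prints it. It is not taken from the book: the helper names, the script name my_app.py, and the argument input.csv are hypothetical, and only the spark-submit executable and its --master option are standard Spark. The example defaults to a dry run, so it does not need Spark installed.

```python
import shlex
import subprocess


def build_spark_submit_command(app_path, master="local[2]", app_args=()):
    """Return the argument list for a spark-submit invocation.

    app_path and app_args are hypothetical placeholders for your own
    application; --master is a standard spark-submit option.
    """
    command = ["spark-submit", "--master", master, app_path]
    command.extend(str(arg) for arg in app_args)
    return command


def submit(command, dry_run=True):
    """Print the command line; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        # Requires spark-submit to be available on the PATH.
        subprocess.run(command, check=True)


# Hypothetical application script and argument, shown as a dry run.
command = build_spark_submit_command("my_app.py", app_args=["input.csv"])
submit(command)
```

Keeping the command as a list rather than a single string avoids shell quoting problems with values such as local[2]. Passing a different master URL (for example, yarn) would target a cluster instead of the local machine.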

Style and approach

This book takes a comprehensive, step-by-step approach so that you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in an easy-to-understand manner, with a focus on both the hows and the whys of each concept.

Frequently asked questions

Is Learning PySpark an online PDF/ePUB?

Yes, you can access Learning PySpark by Tomasz Drabas and Denny Lee in PDF and/or ePUB format, along with other popular books in Computer Science and Data Mining. Our catalog holds more than one million books to explore.

Information

Publisher

Packt Publishing

Year

2017

ISBN

9781786463708

Edition

Subject

Computer Science

Subtopic

Data Mining

Learning PySpark

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Understanding Spark

What is Apache Spark?

Spark Jobs and APIs

Execution process

Resilient Distributed Dataset

DataFrames

Datasets

Catalyst Optimizer

Project Tungsten

Spark 2.0 architecture

Unifying Datasets and DataFrames

Introducing SparkSession

Tungsten phase 2

Structured Streaming

Continuous applications

Summary

2. Resilient Distributed Datasets

Internal workings of an RDD

Creating RDDs

Schema

Reading from files

Lambda expressions

Global versus local scope

Transformations

The .map(...) transformation

The .filter(...) transformation

The .flatMap(...) transformation

The .distinct(...) transformation

The .sample(...) transformation

The .leftOuterJoin(...) transformation

The .repartition(...) transformation

Actions

The .take(...) method

The .collect(...) method

The .reduce(...) method

The .count(...) method

The .saveAsTextFile(...) method

The .foreach(...) method

Summary

3. DataFrames

Python to RDD communications

Catalyst Optimizer refresh

Speeding up PySpark with DataFrames

Creating DataFrames

Generating our own JSON data

Creating a DataFrame

Creating a temporary table

Simple DataFrame queries

DataFrame API query

SQL query

Interoperating with RDDs

Inferring the schema using reflection

Programmatically specifying the schema

Querying with the DataFrame API

Number of rows

Running filter statements

Querying with SQL

Number of rows

Running filter statements using the where Clauses

DataFrame scenario – on-time flight performance

Preparing the source datasets

Joining flight performance and airports

Visualizing our flight-performance data

Spark Dataset API

Summary

4. Prepare Data for Modeling

Checking for duplicates, missing observations, and outliers

Duplicates

Missing observations

Outliers

Getting familiar with your data

Descriptive statistics

Correlations

Visualization

Histograms

Interactions between features

Summary

5. Introducing MLlib

Overview of the package

Loading and transforming the data

Getting to know your data

Descriptive statistics

Correlations

Statistical testing

Creating the final dataset

Creating an RDD of LabeledPoints

Splitting into training and testing

Predicting infant survival

Logistic regression in MLlib

Selecting only the most predictable features

Random forest in MLlib

Summary

6. Introducing the ML Package

Overview of the package

Transformer

Estimators

Classification

Regression

Clustering

Pipeline

Predicting the chances of infant survival with ML

Loading the data

Creating transformers

Creating an estimator

Creating a pipeline

Fitting the model

Evaluating the performance of the model

Saving the model

Parameter hyper-tuning

Grid search

Train-validation splitting

Other features of PySpark ML in action

Feature extraction

NLP-related feature extractors

Discretizing continuous variables

Standardizing continuous variables

Classification

Clustering

Finding clusters in the births dataset

Topic mining

Regression

Summary

7. GraphFrames

Introducing GraphFrames

Installing GraphFrames

Creating a library

Preparing your flights dataset

Building the graph

Executing simple queries

Determining the number of airports and trips

Determining the longest delay in this dataset

Determining the number of delayed versus on-time/early flights

What flights departing Seattle are most likely to have significant delays?

What states tend to have significant delays departing from Seattle?

Understanding vertex degrees

Determining the top transfer airports

Understanding motifs

Determining airport ranking using PageRank

Determining the most popular non-stop flights

Using Breadth-First Search

Visualizing flights using D3

Summary

8. TensorFrames

What is Deep Learning?

The need for neural networks and Deep Learning

What is feature engineering?

Bridging the data and algorithm

What is TensorFlow?

Installing Pip

Installing TensorFlow

Matrix multiplication using constants

Matrix multiplication using placeholders

Running the model

Running another model

Discussion

Introducing TensorFrames

TensorFrames – quick start

Configuration and setup

Launching a Spark cluster

Creating a TensorFrames library

Installing TensorFlow on your cluster

Using TensorFlow to add a constant to an existing column

Executing the Tensor graph

Blockwise reducing operations example

Building a DataFrame of vectors

Analysing the DataFrame

Computing elementwise sum and min of all vectors

Summary

9. Polyglot Persistence with Blaze

Installing Blaze

Polyglot persistence

Abstracting data

Working with NumPy arrays

Working with pandas' DataFrame

Working with files

Working with databases

Interacting with relational databases

Interacting with the MongoDB database

Data operations

Accessing columns

Symbolic transformations

Operations on colum...
