eBook - ePub

Learning PySpark

Name: Learning PySpark
Author: Tomasz Drabas, Denny Lee

274 pages
English
ePUB (mobile friendly)
Available on iOS and Android

Book information

Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

About This Book

Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0
Develop and deploy efficient, scalable real-time Spark solutions
Take your understanding of using Spark with Python to the next level with this jump start guide

Who This Book Is For

If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory.

What You Will Learn

Learn about Apache Spark and the Spark 2.0 architecture
Build and interact with Spark DataFrames using Spark SQL
Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames, respectively
Read, transform, and understand data and use it to train machine learning models
Build machine learning models with MLlib and ML
Learn how to submit your applications programmatically using spark-submit
Deploy locally built applications to a cluster
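
As a rough illustration of the spark-submit item above, the following sketch assembles a spark-submit command line from Python. It is not taken from the book: the helper name `build_spark_submit_command`, the script name `my_app.py`, and the argument `input.csv` are hypothetical, and only the standard `--master`, `--conf`, and `--py-files` options are assumed.

```python
import shlex


def build_spark_submit_command(app_path, master="local[*]", app_args=(),
                               conf=None, py_files=()):
    """Return the spark-submit argument list for a PySpark application.

    Hypothetical helper for illustration; it only builds the command
    and does not start Spark.
    """
    cmd = ["spark-submit", "--master", master]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # Extra Python dependencies are shipped as one comma-separated list.
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    # The application script comes last, followed by its own arguments.
    cmd.append(app_path)
    cmd.extend(app_args)
    return cmd


command = build_spark_submit_command(
    "my_app.py",                           # hypothetical script name
    master="local[2]",                     # run locally with two worker threads
    app_args=["input.csv"],                # hypothetical application argument
    conf={"spark.executor.memory": "2g"},
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

On a machine where Spark is installed, the resulting list could be handed to `subprocess.run(command, check=True)`. Switching the `master` value from a local setting to a cluster URL is one way the same script might move from a laptop to a cluster.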

In Detail

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.

You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. You will also get a thorough overview of PySpark's machine learning capabilities using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Style and approach

This book takes a comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in an easy-to-understand manner, with a focus on both the hows and the whys of each concept.

Information

Publisher: Packt Publishing
Year: 2017
ISBN: 9781786463708
Categories: Computer Science, Data Mining

Learning PySpark

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Understanding Spark

What is Apache Spark?

Spark Jobs and APIs

Execution process

Resilient Distributed Dataset

DataFrames

Datasets

Catalyst Optimizer

Project Tungsten

Spark 2.0 architecture

Unifying Datasets and DataFrames

Introducing SparkSession

Tungsten phase 2

Structured Streaming

Continuous applications

Summary

2. Resilient Distributed Datasets

Internal workings of an RDD

Creating RDDs

Schema

Reading from files

Lambda expressions

Global versus local scope

Transformations

The .map(...) transformation

The .filter(...) transformation

The .flatMap(...) transformation

The .distinct(...) transformation

The .sample(...) transformation

The .leftOuterJoin(...) transformation

The .repartition(...) transformation

Actions

The .take(...) method

The .collect(...) method

The .reduce(...) method

The .count(...) method

The .saveAsTextFile(...) method

The .foreach(...) method

Summary

3. DataFrames

Python to RDD communications

Catalyst Optimizer refresh

Speeding up PySpark with DataFrames

Creating DataFrames

Generating our own JSON data

Creating a DataFrame

Creating a temporary table

Simple DataFrame queries

DataFrame API query

SQL query

Interoperating with RDDs

Inferring the schema using reflection

Programmatically specifying the schema

Querying with the DataFrame API

Number of rows

Running filter statements

Querying with SQL

Number of rows

Running filter statements using the where clauses

DataFrame scenario – on-time flight performance

Preparing the source datasets

Joining flight performance and airports

Visualizing our flight-performance data

Spark Dataset API

Summary

4. Prepare Data for Modeling

Checking for duplicates, missing observations, and outliers

Duplicates

Missing observations

Outliers

Getting familiar with your data

Descriptive statistics

Correlations

Visualization

Histograms

Interactions between features

Summary

5. Introducing MLlib

Overview of the package

Loading and transforming the data

Getting to know your data

Descriptive statistics

Correlations

Statistical testing

Creating the final dataset

Creating an RDD of LabeledPoints

Splitting into training and testing

Predicting infant survival

Logistic regression in MLlib

Selecting only the most predictable features

Random forest in MLlib

Summary

6. Introducing the ML Package

Overview of the package

Transformer

Estimators

Classification

Regression

Clustering

Pipeline

Predicting the chances of infant survival with ML

Loading the data

Creating transformers

Creating an estimator

Creating a pipeline

Fitting the model

Evaluating the performance of the model

Saving the model

Parameter hyper-tuning

Grid search

Train-validation splitting

Other features of PySpark ML in action

Feature extraction

NLP-related feature extractors

Discretizing continuous variables

Standardizing continuous variables

Classification

Clustering

Finding clusters in the births dataset

Topic mining

Regression

Summary

7. GraphFrames

Introducing GraphFrames

Installing GraphFrames

Creating a library

Preparing your flights dataset

Building the graph

Executing simple queries

Determining the number of airports and trips

Determining the longest delay in this dataset

Determining the number of delayed versus on-time/early flights

What flights departing Seattle are most likely to have significant delays?

What states tend to have significant delays departing from Seattle?

Understanding vertex degrees

Determining the top transfer airports

Understanding motifs

Determining airport ranking using PageRank

Determining the most popular non-stop flights

Using Breadth-First Search

Visualizing flights using D3

Summary

8. TensorFrames

What is Deep Learning?

The need for neural networks and Deep Learning

What is feature engineering?

Bridging the data and algorithm

What is TensorFlow?

Installing Pip

Installing TensorFlow

Matrix multiplication using constants

Matrix multiplication using placeholders

Running the model

Running another model

Discussion

Introducing TensorFrames

TensorFrames – quick start

Configuration and setup

Launching a Spark cluster

Creating a TensorFrames library

Installing TensorFlow on your cluster

Using TensorFlow to add a constant to an existing column

Executing the Tensor graph

Blockwise reducing operations example

Building a DataFrame of vectors

Analysing the DataFrame

Computing elementwise sum and min of all vectors

Summary

9. Polyglot Persistence with Blaze

Installing Blaze

Polyglot persistence

Abstracting data

Working with NumPy arrays

Working with pandas' DataFrame

Working with files

Working with databases

Interacting with relational databases

Interacting with the MongoDB database

Data operations

Accessing columns

Symbolic transformations

Operations on colum...
