eBook - ePub

Hands-On Big Data Analytics with PySpark

Name: Hands-On Big Data Analytics with PySpark
ISBN: 9781838648831

Analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs

Rudy Lai, Bartłomiej Potaczek

182 pages
English

About this book

Use PySpark to easily crush messy data at scale and discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs

Key Features

Work with large amounts of agile data using distributed datasets and in-memory caching
Source data from all popular data hosting platforms, such as HDFS, Hive, JSON, and S3
Employ the easy-to-use PySpark API to deploy big data analytics for production

Book Description

Apache Spark is an open source parallel-processing framework that has been around for quite some time now. One of the many uses of Apache Spark is for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immunizing, and parallelizing Spark jobs.

You will learn how to source data from all popular data hosting platforms, including HDFS, Hive, JSON, and S3, and deal with large datasets with PySpark to gain practical big data experience. This book will help you work on prototypes on local machines and subsequently go on to handle messy data in production and at scale. This book covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn how to implement some practical and proven techniques to improve certain aspects of programming and administration in Apache Spark.

By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively.

What you will learn

Get practical big data experience while working on messy datasets
Analyze patterns with Spark SQL to improve your business intelligence
Use PySpark's interactive shell to speed up development time
Create highly concurrent Spark programs by leveraging immutability
Discover ways to avoid the most expensive operation in the Spark API: the shuffle operation
Re-design your jobs to use reduceByKey instead of groupBy
Create robust processing pipelines by testing Apache Spark jobs

Who this book is for

This book is for developers, data scientists, business analysts, or anyone who needs to reliably analyze large amounts of real-world data. Whether you're tasked with creating your company's business intelligence function or creating great data platforms for your machine learning models, or are looking to use code to magnify the impact of your business, this book is for you.


Publisher: Packt Publishing
Year: 2019
Print ISBN: 9781838644130
eBook ISBN: 9781838648831
Topic: Computer Science
Subtopic: Artificial Intelligence (AI) & Semantics

Leveraging the Spark GraphX API

In this chapter, we will learn how to create a graph from a data source. We will then experiment with the Vertex API and the Edge API. By the end of this chapter, you will know how to calculate the degree of a vertex and PageRank.

In this chapter, we will cover the following topics:

Creating a graph from a data source
Using the Vertex API
Using the Edge API
Calculating the degree of vertex
Calculating PageRank

Creating a graph from a data source

We will create a loader component to load the data, revisit the graph format, and load a Spark graph from a file.

Creating the loader component

The graph.g file is an edge list: each line holds a pair of vertex IDs that defines one edge. A line containing 1 and 2 means that there is an edge from vertex ID 1 to vertex ID 2. The second line defines an edge from vertex ID 1 to 3, followed by an edge from 2 to 3, and finally one from 3 to 5.
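The file listing itself is not reproduced here, so the following is a minimal sketch of the format as described above. The sample contents are reconstructed from that description, and EdgeListFormat and parse are hypothetical names used only for illustration; they are not part of the book's code.

```scala
// Hypothetical helper (not from the book): parses edge-list text in which
// each line holds "sourceId targetId" separated by whitespace.
object EdgeListFormat {

  // Assumed contents of graph.g, reconstructed from the description in the text.
  val sample: String = "1 2\n1 3\n2 3\n3 5\n"

  def parse(text: String): List[(Long, Long)] =
    text.split("\n").toList
      .map(_.trim)
      // Skip blank lines and '#' comment lines, which GraphLoader.edgeListFile also ignores.
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val parts = line.split("\\s+")
        (parts(0).toLong, parts(1).toLong)
      }

  def main(args: Array[String]): Unit =
    parse(sample).foreach { case (src, dst) => println(s"$src -> $dst") }
}
```

Each parsed pair corresponds to one directed edge, which is how the loader in the next section interprets the file.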

We will load the graph.g file and see what results it produces in Spark. First, we need the path to our graph.g file, which we obtain using the getClass.getResource() method, as follows:

package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class CreatingGraph extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

  test("should load graph from a file") {
    //given
    val path = getClass.getResource("/graph.g").getPath

Revisiting the graph format

Next, we call GraphBuilder.loadFromFile(), which is our own component:

    //when
    val graph = GraphBuilder.loadFromFile(spark, path)

The following GraphBuilder.scala file defines our GraphBuilder object:

package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}

object GraphBuilder {

  def loadFromFile(sc: SparkContext, path: String): Graph[Int, Int] = {
    GraphLoader.edgeListFile(sc, path)
  }
}

It uses the GraphLoader object from the org.apache.spark.graphx package, and the method we call specifies the input format.

The format used here is edgeListFile. We pass the sc parameter, which is the SparkContext, and the path parameter, which contains the location of the file. The result is a Graph[Int, Int]: the first type parameter is the type of the data attached to each vertex, and the second is the type of the data attached to each edge. Vertex identifiers are always of type VertexId.
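To make the meaning of the two type parameters concrete, here is a minimal sketch that builds an equivalent graph without a file, using Graph.fromEdgeTuples. The inline edges are assumed to match the graph.g contents described earlier, and the GraphTypeParameters object and its method names are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Hypothetical illustration (not from the book).
object GraphTypeParameters {

  def build(sc: SparkContext): Graph[Int, Int] = {
    // The four edges described for graph.g, written inline (assumed contents).
    val rawEdges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 5L)))
    // Every vertex receives the Int attribute 1; fromEdgeTuples also gives each edge an Int attribute.
    Graph.fromEdgeTuples(rawEdges, defaultValue = 1)
  }

  // Vertex identifiers are VertexId (an alias for Long), independent of the attribute types.
  def vertexIds(graph: Graph[Int, Int]): Array[VertexId] =
    graph.vertices.map(_._1).collect().sorted

  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[2]").appName("graph-type-parameters")
      .getOrCreate().sparkContext
    println(vertexIds(build(sc)).mkString(", "))
    sc.stop()
  }
}
```

Changing the default value to a String would change the first type parameter only; the vertex identifiers would remain Long values.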

Loading a Spark graph from file

We pass the spark and path parameters to our GraphBuilder.loadFromFile() method, which returns a Graph[Int, Int], as follows:

    val graph = GraphBuilder.loadFromFile(spark, path)

To validate that our graph was loaded properly, we iterate over the triplets of the graph. Each triplet combines an edge with its source and destination vertices, so printing the triplets shows the structure that was loaded:

    //then
    graph.triplets.foreach(println(_))

At the end, we are asserting that we get 4 triplets (as shown earlier in the Creating the loader component section, we have four definitions from the graph.g file):

    assert(graph.triplets.count() == 4)
  }

}

We will start the test and see whether we are able to load our graph properly.

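The test prints one line per triplet. The pairs that appear in the output, (2,1), (3,1), (3,1), (5,1), (1,1), (2,1), (1,1), and (3,1), are (vertex ID, vertex attribute) pairs for the two endpoints of each edge. The attribute is 1 for every vertex because edgeListFile assigns a default vertex attribute of 1.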

Hence, according to the output, we were able to load our graph using Spark.
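The same check can be sketched without a test resource by writing the edge list to a temporary file first. This is only an illustration under stated assumptions: the file contents are the four edges described earlier, and LoadEdgeListSketch and its method names are hypothetical rather than taken from the book.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}
import org.apache.spark.sql.SparkSession

// Hypothetical illustration (not from the book).
object LoadEdgeListSketch {

  def loadSample(sc: SparkContext): Graph[Int, Int] = {
    // Write the assumed graph.g contents to a temporary file on the local filesystem.
    val file = Files.createTempFile("graph", ".g")
    Files.write(file, "1 2\n1 3\n2 3\n3 5\n".getBytes(StandardCharsets.UTF_8))
    GraphLoader.edgeListFile(sc, file.toString)
  }

  // Reduce each triplet to (source ID, destination ID, edge attribute) and sort for stable output.
  def edgeSummary(graph: Graph[Int, Int]): List[(Long, Long, Int)] =
    graph.triplets.map(t => (t.srcId, t.dstId, t.attr)).collect().sorted.toList

  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[2]").appName("load-edge-list")
      .getOrCreate().sparkContext
    edgeSummary(loadSample(sc)).foreach(println)
    sc.stop()
  }
}
```

Collecting and sorting the triplets on the driver, rather than calling foreach(println) on the RDD, keeps the printed order deterministic, which is convenient when comparing against expected edges.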

Using the Vertex API

In this section, we will construct a graph from vertices and edges. We will learn how to use the Vertex API and also leverage edge transformations.

Constructing a graph using the vertex

Constructing a graph is not a trivial task; we need to supply the vertices and the edges between them. Let's focus on the vertices first. Our vertices are users, an RDD of (VertexId, String) pairs, as follows:

package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class VertexAPI extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

  test("Should use Vertex API") {
    //given
    val users: RDD[(VertexId, String)] =
      spark.parallelize(Array(
        (1L, "a"),
        (2L, "b"),
        (3L, "c"),
        (4L, "d")
      ))

VertexId is only a type alias for Long:

type VertexId = Long

Because a graph can contain a very large number of vertices, the identifier is a 64-bit Long, and every vertex in our vertices RDD must have a unique VertexId. The custom data associated with a vertex can be of any class, but we will use String for simplicity. We create a vertex with ID 1 and string data a, then a vertex with ID 2 and string data b, a vertex with ID 3 and string data c, and a vertex with ID 4 and string data d, as follows:

    val users: RDD[(VertexId, String)] =
      spark.parallelize(Array(
        (1L, "a"),
        (2L, "b"),
        (3L, "c"),
        (4L, "d")
      ))
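As a rough sketch of how such a vertex RDD can be combined with edges into a graph, the following example passes both to Graph(...). The relationships RDD and its "friend" labels are invented for illustration and do not come from the book; UsersGraphSketch is likewise a hypothetical name.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical illustration (not from the book).
object UsersGraphSketch {

  def build(sc: SparkContext): Graph[String, String] = {
    // The same four vertices as in the text: a unique VertexId plus String data.
    val users: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")))

    // Assumed edges, made up for this sketch: each Edge holds a source ID,
    // a destination ID, and an attribute of the edge type (String here).
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend"), Edge(3L, 4L, "friend")))

    Graph(users, relationships)
  }

  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[2]").appName("users-graph")
      .getOrCreate().sparkContext
    val graph = build(sc)
    println(s"vertices: ${graph.vertices.count()}, edges: ${graph.edges.count()}")
    sc.stop()
  }
}
```

Here the graph type is Graph[String, String] because both the vertex data and the edge data are strings.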

Creating a graph from only vert...

Title Page
Copyright and Credits
About Packt
Contributors
Preface
Installing Pyspark and Setting up Your Development Environment
Getting Your Big Data into the Spark Environment Using RDDs
Big Data Cleaning and Wrangling with Spark Notebooks
Aggregating and Summarizing Data into Useful Reports
Powerful Exploratory Data Analysis with MLlib
Putting Structure on Your Big Data with SparkSQL
Transformations and Actions
Immutable Design
Avoiding Shuffle and Reducing Operational Expenses
Saving Data in the Correct Format
Working with the Spark Key/Value API
Testing Apache Spark Jobs
Leveraging the Spark GraphX API
Other Books You May Enjoy
