Hands-On Big Data Analytics with PySpark
eBook - ePub

Hands-On Big Data Analytics with PySpark

Analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs

  1. 182 pages
  2. English
About this book

Use PySpark to easily crush messy data at-scale and discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs

Key Features

  • Work with large amounts of agile data using distributed datasets and in-memory caching
  • Source data from all popular data hosting platforms, such as HDFS, Hive, JSON, and S3
  • Employ the easy-to-use PySpark API to deploy big data Analytics for production

Book Description

Apache Spark is an open source parallel-processing framework that has been around for quite some time now. One of the many uses of Apache Spark is for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing Spark jobs, making them immutable, and parallelizing them.

You will learn how to source data from all popular data hosting platforms, including HDFS, Hive, JSON, and S3, and deal with large datasets with PySpark to gain practical big data experience. This book will help you work on prototypes on local machines and subsequently go on to handle messy data in production and at scale. This book covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn how to implement some practical and proven techniques to improve certain aspects of programming and administration in Apache Spark.

By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively.

What you will learn

  • Get practical big data experience while working on messy datasets
  • Analyze patterns with Spark SQL to improve your business intelligence
  • Use PySpark's interactive shell to speed up development time
  • Create highly concurrent Spark programs by leveraging immutability
  • Discover ways to avoid the most expensive operation in the Spark API: the shuffle operation
  • Re-design your jobs to use reduceByKey instead of groupBy
  • Create robust processing pipelines by testing Apache Spark jobs

Who this book is for

This book is for developers, data scientists, business analysts, or anyone who needs to reliably analyze large amounts of real-world data. Whether you're tasked with creating your company's business intelligence function or creating great data platforms for your machine learning models, or are looking to use code to magnify the impact of your business, this book is for you.

Information

Year
2019
Print ISBN
9781838644130
Edition
1
eBook ISBN
9781838648831

Leveraging the Spark GraphX API

In this chapter, we will learn how to create a graph from a data source. We will then carry out experiments with the Edge API and Vertex API. By the end of this chapter, you will know how to calculate the degree of a vertex and PageRank.
In this chapter, we will cover the following topics:
  • Creating a graph from a data source
  • Using the Vertex API
  • Using the Edge API
  • Calculating the degree of vertex
  • Calculating PageRank

Creating a graph from a data source

We will create a loader component to load the data, revisit the graph format, and load a Spark graph from a file.

Creating the loader component

The graph.g file is an edge list: each line holds two vertex IDs and defines an edge from the first vertex to the second. In the following graph.g file, the first line, 1 2, means that there is an edge from vertex ID 1 to vertex ID 2. The remaining lines define edges from vertex ID 1 to 3, from 2 to 3, and finally from 3 to 5:
1 2
1 3
2 3
3 5
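The edge-list format described above can be illustrated without Spark. The following is a minimal sketch, not the book's code: EdgeListSketch, parseLine, and parse are hypothetical names introduced only for this illustration, and treating blank lines and lines starting with # as skippable is an assumption of the sketch.

```scala
// Hypothetical helper for illustration only; not part of the book's code or of GraphX.
object EdgeListSketch {

  // Parse one "src dst" line into a (source ID, destination ID) pair.
  // Assumption of this sketch: blank lines and lines starting with '#' are skipped.
  def parseLine(line: String): Option[(Long, Long)] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else {
      val parts = trimmed.split("\\s+")
      require(parts.length >= 2, s"Malformed edge line: $line")
      Some((parts(0).toLong, parts(1).toLong))
    }
  }

  // Parse every line, keeping only the lines that define an edge.
  def parse(lines: Seq[String]): Seq[(Long, Long)] =
    lines.flatMap(line => parseLine(line))

  def main(args: Array[String]): Unit = {
    // The four lines of graph.g shown above.
    val lines = Seq("1 2", "1 3", "2 3", "3 5")
    parse(lines).foreach { case (src, dst) => println(s"edge from $src to $dst") }
  }
}
```

Each parsed pair corresponds to one edge, so the four lines of graph.g yield four edges.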
We will take the graph.g file, load it, and see what results it gives in Spark. First, we need the path to our graph.g file, which is a classpath resource. We obtain it with the getClass.getResource() method, as follows:
package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class CreatingGraph extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

  test("should load graph from a file") {
    //given
    val path = getClass.getResource("/graph.g").getPath

Revisiting the graph format

Next, we call GraphBuilder, which is our own component:
 //when
val graph = GraphBuilder.loadFromFile(spark, path)
The following is the GraphBuilder.scala file that defines our GraphBuilder object:
package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}

object GraphBuilder {

  def loadFromFile(sc: SparkContext, path: String): Graph[Int, Int] = {
    GraphLoader.edgeListFile(sc, path)
  }
}
It uses the GraphLoader object from the org.apache.spark.graphx package, and the method we call on it determines the file format.
The format used here is an edge list file (edgeListFile). We pass the sc parameter, which is the SparkContext, and the path parameter, which contains the location of the file. The resulting graph is a Graph[Int, Int]: the first Int is the type of the attribute attached to each vertex, and the second is the type of the attribute attached to each edge. The vertices themselves are identified by the IDs read from the file.
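To make the meaning of Graph[Int, Int] concrete, the following sketch loads the same four edges and prints the vertex and edge attributes. It is an illustration under stated assumptions, not the book's code: EdgeListFileSketch, the application name, and the temporary file standing in for graph.g are all hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

// Hypothetical object for illustration only.
object EdgeListFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .master("local[2]")
      .appName("edge-list-file-sketch") // hypothetical application name
      .getOrCreate()
      .sparkContext

    // Write the four edges of graph.g to a temporary file (an assumption of this
    // sketch; the book reads graph.g from the classpath instead).
    val path = Files.createTempFile("graph", ".g")
    Files.write(path, "1 2\n1 3\n2 3\n3 5\n".getBytes("UTF-8"))

    val graph = GraphLoader.edgeListFile(sc, path.toString)

    // Vertex IDs come from the file; the Int attributes are placeholders
    // that edgeListFile sets to 1 for every vertex and every edge.
    graph.vertices.collect().sortBy(_._1).foreach { case (id, attr) =>
      println(s"vertex $id has attribute $attr")
    }
    graph.edges.collect().foreach { e =>
      println(s"edge ${e.srcId} -> ${e.dstId} has attribute ${e.attr}")
    }

    sc.stop()
  }
}
```

Because the file carries only vertex IDs, both attribute types are filled with a default Int value; richer attributes have to be supplied when the graph is built from explicit vertex and edge RDDs.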

Loading a Spark graph from file

Once we pass the spark and path parameters to our GraphBuilder.loadFromFile() method, we have a graph of type Graph[Int, Int], as follows:
 val graph = GraphBuilder.loadFromFile(spark, path)
To validate that our graph was loaded properly, we will iterate over the triplets of the graph. Each triplet combines a source vertex, a destination vertex, and the edge between them. Printing the triplets shows whether the structure of the graph was loaded properly:
 //then
graph.triplets.foreach(println(_))
At the end, we assert that we get 4 triplets (as shown earlier in the Creating the loader component section, the graph.g file defines four edges):
 assert(graph.triplets.count() == 4)
}

}
We will start the test and see whether we are able to load our graph properly.
In the output, each vertex is printed as a pair of its ID and its attribute. Here, we have (2,1), (3,1), (3,1), (5,1), (1,1), (2,1), (1,1), and (3,1).
Hence, according to the output, we were able to load our graph using Spark.
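Instead of printing whole triplets, the individual fields of each triplet can be read directly. The following is a minimal sketch under stated assumptions: TripletsSketch and the application name are hypothetical, and the graph is built in memory with Graph.fromEdgeTuples rather than loaded from graph.g.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.SparkSession

// Hypothetical object for illustration only.
object TripletsSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .master("local[2]")
      .appName("triplets-sketch") // hypothetical application name
      .getOrCreate()
      .sparkContext

    // The same four edges as graph.g, given as (source ID, destination ID) pairs.
    val rawEdges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 5L)))

    // Every vertex gets the default attribute 1; fromEdgeTuples uses Int edge attributes.
    val graph = Graph.fromEdgeTuples(rawEdges, defaultValue = 1)

    // Convert each triplet to a String inside the map, before collecting,
    // so the driver only ever holds plain strings.
    val described = graph.triplets
      .map(t => s"${t.srcId}(${t.srcAttr}) -> ${t.dstId}(${t.dstAttr}), edge attribute ${t.attr}")
      .collect()
      .sorted

    described.foreach(println)
    assert(described.length == 4) // one triplet per edge

    sc.stop()
  }
}
```

A triplet exposes srcId, srcAttr, dstId, dstAttr, and attr, which is often more convenient for assertions than comparing printed output.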

Using the Vertex API

In this section, we will construct a graph from vertices and edges. We will learn to use the Vertex API and also leverage edge transformations.

Constructing a graph using the vertex

Constructing a graph is not a trivial task; we need to supply vertices and the edges between them. Let's focus on the first part, the vertices. They represent our users: users is an RDD of pairs of VertexId and String, as follows:
package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class VertexAPI extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

  test("Should use Vertex API") {
    //given
    val users: RDD[(VertexId, (String))] =
      spark.parallelize(Array(
        (1L, "a"),
        (2L, "b"),
        (3L, "c"),
        (4L, "d")
      ))
VertexId is only a type alias for Long:
type VertexId = Long
Because a graph can contain a very large number of vertices, a 64-bit Long leaves plenty of room for identifiers. Every vertex in our vertices' RDD should have a unique VertexId. The custom data associated with the vertex can be any class, but we will go for simplicity with the String class. First, we are creating a vertex with ID 1 and string data a, the next with ID 2 and string data b, the next with ID 3 and string data c, and similarly for the data with ID 4 and string d, as follows:
 val users: RDD[(VertexId, (String))] =
spark.parallelize(Array(
(1L, "a"),
(2L, "b"),
(3L, "c"),
(4L, "d")
))
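As a rough sketch of how such a vertex RDD combines with edges, the following builds a graph from the four users and two edges. The edges and their "friend" labels are invented for this illustration and do not come from the book; VertexGraphSketch and the application name are likewise hypothetical.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical object for illustration only.
object VertexGraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .master("local[2]")
      .appName("vertex-graph-sketch") // hypothetical application name
      .getOrCreate()
      .sparkContext

    // The same four users as in the text: a unique VertexId plus String data.
    val users: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "a"),
      (2L, "b"),
      (3L, "c"),
      (4L, "d")
    ))

    // Assumption: two edges with a String label, made up for this sketch.
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "friend"),
      Edge(3L, 4L, "friend")
    ))

    // Vertex attribute type is String, edge attribute type is String.
    val graph: Graph[String, String] = Graph(users, relationships)

    assert(graph.numVertices == 4)
    assert(graph.numEdges == 2)

    sc.stop()
  }
}
```

The two type parameters of Graph follow directly from the RDDs passed in: the vertex data type of users and the attribute type of the edges.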
Creating a graph from only vert...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. About Packt
  4. Contributors
  5. Preface
  6. Installing Pyspark and Setting up Your Development Environment
  7. Getting Your Big Data into the Spark Environment Using RDDs
  8. Big Data Cleaning and Wrangling with Spark Notebooks
  9. Aggregating and Summarizing Data into Useful Reports
  10. Powerful Exploratory Data Analysis with MLlib
  11. Putting Structure on Your Big Data with SparkSQL
  12. Transformations and Actions
  13. Immutable Design
  14. Avoiding Shuffle and Reducing Operational Expenses
  15. Saving Data in the Correct Format
  16. Working with the Spark Key/Value API
  17. Testing Apache Spark Jobs
  18. Leveraging the Spark GraphX API
  19. Other Books You May Enjoy

Yes, you can access Hands-On Big Data Analytics with PySpark by Rudy Lai, Bartłomiej Potaczek in PDF and/or ePUB format, as well as other popular books in Computer Science & Artificial Intelligence (AI) & Semantics. We have over 1.5 million books available in our catalogue for you to explore.