eBook - ePub

Apache Spark Graph Processing

Name: Apache Spark Graph Processing
ISBN: 9781784391805

Rindra Ramamonjison

148 pages
English

About this book

Build, process and analyze large-scale graph data effectively with Spark

About This Book

Find solutions for every stage of data processing from loading and transforming graph data to
Improve the scalability of your graphs with a variety of real-world applications with complete Scala code.
A concise guide to processing large-scale networks with Apache Spark.

Who This Book Is For

This book is for data scientists and big data developers who want to learn how to process and analyze graph datasets at scale. Basic programming experience with Scala and basic knowledge of Spark are assumed.

What You Will Learn

Write, build and deploy Spark applications with the Scala Build Tool.
Build and analyze large-scale network datasets.
Analyze and transform graphs using RDD and graph-specific operations.
Implement new custom graph operations tailored to specific needs.
Develop iterative and efficient graph algorithms using message aggregation and the Pregel abstraction.
Extract subgraphs and use them to discover common clusters.
Analyze graph data and solve various data science problems using real-world datasets.

In Detail

Apache Spark is the next standard open-source cluster-computing engine for processing big data. Many practical computing problems concern large graphs, like the Web graph and various social networks. The scale of these graphs, in some cases billions of vertices and trillions of edges, poses challenges to their efficient processing. The Apache Spark GraphX API combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework.

This book teaches graph programming in Apache Spark and explains the entire process of graph data analysis. You will journey through the creation of graphs, their uses, and their exploration and analysis, and finally cover the conversion of graph elements into graph structures.

This book begins with an introduction to the Spark system, its libraries and the Scala Build Tool. Using a hands-on approach, it quickly teaches you how to install Spark and use it interactively on the command line and in a standalone Scala program. Then, it presents the methods for building Spark graphs using illustrative network datasets. Next, it walks you through the process of exploring, visualizing and analyzing different network characteristics. This book also teaches you how to transform raw datasets into a usable form. In addition, you will learn powerful operations that can be used to transform graph elements and graph structures. You will also learn how to create custom graph operations that are tailored to specific needs with efficiency in mind. The later chapters cover more advanced topics such as clustering graphs, implementing graph-parallel iterative algorithms and learning methods from graph data.

Style and approach

A step-by-step guide that will walk you through the key ideas and techniques for processing big graph data at scale, with practical examples that will ensure an overall understanding of the concepts of Spark.

Information

Publisher: Packt Publishing
Year: 2015
eBook ISBN: 9781784391805
Topic: Computer Science
Subtopic: Data Processing

Apache Spark Graph Processing

Credits

Foreword

About the Author

About the Reviewer

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

Distinctive features

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Getting Started with Spark and GraphX

Downloading and installing Spark 1.4.1

Experimenting with the Spark shell

Getting started with GraphX

Building a tiny social network

Loading the data

The property graph

Transforming RDDs to VertexRDD and EdgeRDD

Introducing graph operations

Building and submitting a standalone application

Writing and configuring a Spark program

Building the program with the Scala Build Tool

Deploying and running with spark-submit

Summary

2. Building and Exploring Graphs

Network datasets

The communication network

Flavor networks

Social ego networks

Graph builders

The Graph factory method

edgeListFile

fromEdges

fromEdgeTuples

Building graphs

Building directed graphs

Building a bipartite graph

Building a weighted social ego network

Computing the degrees of the network nodes

In-degree and out-degree of the Enron email network

Degrees in the bipartite food network

Degree histogram of the social ego networks

Summary

3. Graph Analysis and Visualization

Network datasets

The graph visualization

Installing the GraphStream and BreezeViz libraries

Visualizing the graph data

Plotting the degree distribution

The analysis of network connectedness

Finding the connected components

Counting triangles and computing clustering coefficients

The network centrality and PageRank

How PageRank works

Ranking web pages

Scala Build Tool revisited

Organizing build definitions

Managing library dependencies

A preview of the steps

Step 1 – Enable the sbt-assembly plugin

Step 2 – Create a build.sbt file

Step 3 – Declare library dependencies and resolvers

Step 4 – Set up the sbt-assembly plugin

Step 5 – Create the uber JAR

Running tasks with SBT commands

Summary

4. Transforming and Shaping Up Graphs to Your Needs

Transforming the vertex and edge attributes

mapVertices

mapEdges

mapTriplets

Modifying graph structures

The reverse operator

The subgraph operator

The mask operator

The groupEdges operator

Joining graph datasets

joinVertices

outerJoinVertices

Example – Hollywood movie graph

Data operations on VertexRDD and EdgeRDD

Mapping VertexRDD and EdgeRDD

Filtering VertexRDDs

Joining VertexRDDs

Joining EdgeRDDs

Reversing edge directions

Collecting neighboring information

Example – from food network to flavor pairing

Summary

5. Creating Custom Graph Aggregation Operators

NCAA College Basketball datasets

The aggregateMessages operator

EdgeContext

Abstracting out the aggregation

Keeping things DRY

Coach wants more numbers

Calculating average points per game

Defense stats – D matters as in direction

Joining average stats into a graph

Performance optimization

The MapReduceTriplets operator

Summary

6. Iterative Graph-Parallel Processing with Pregel

The Pregel computational model

Example – iterating towards the social equality

The Pregel API in GraphX

Community detection through label propagation

The Pregel implementation of PageRank

Summary

7. Learning Graph Structures

Community clustering in graphs

Spectral clustering

Power iteration clustering

Applications – music fan community detection

Step 1 – load the data into a Spark graph property

Step 2 – extract the features of nodes

Step 3 – define a similarity measure between two nodes

Step 4 – create an affinity matrix

Step 5 – run k-means clustering on the affinity matrix

Exercise – collaborative clustering through playlists

Summary

A. References

Chapter 2, Building and Exploring Graphs

Chapter 3, Graph Analysis and Visualization

Chapter 7, Learning Graph Structures

Index

Apache Spark Graph Processing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2015

Production reference: 1040915

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-180-5

www.packtpub.com

Credits

Author

Rindra Ramamonjison

Reviewers

Thomas W. Dinsmore

Ryan Mccune

Francoise Provencher

Commissioning Editor

Amit Ghodke

Acquisition Editor

Larissa Pinto

Content Development Editor

Dharmesh Parmar

Technical Editor

Prajakta Mhatre

Copy Editor

Yesha Gangani

Project Coordinator

Nikhil Nair

Proofreader

Safis Editing

Indexer

Tejal Soni

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

Foreword

Apache Spark is one of the most compelling technologies in the big data space, and for good reason. It allows data scientists and data engineers alike to work in their language of choice (Java, Scala, Python, SQL, and R as of this writing) to make sense of their data. As Reynold Xin noted, Apache Spark is the Swiss Army Knife of big data analytics tools. It allows you to use one tool to do many things, from real-time streaming to advanced analytics. And in no small part, the versatility and power of GraphX have helped propel Spark forward.

Apache Spark Graph Processing follows Rindra's journey into solving complex analytics problems. As a PhD g...
