Apache Spark Graph Processing
eBook - ePub

  1. 148 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Build, process and analyze large-scale graph data effectively with Spark

About This Book

  • Find solutions for every stage of data processing from loading and transforming graph data to
  • Improve the scalability of your graphs with a variety of real-world applications with complete Scala code.
  • A concise guide to processing large-scale networks with Apache Spark.

Who This Book Is For

This book is for data scientists and big data developers who want to learn how to process and analyze graph datasets at scale. Basic programming experience with Scala and basic knowledge of Spark are assumed.

What You Will Learn

  • Write, build and deploy Spark applications with the Scala Build Tool.
  • Build and analyze large-scale network datasets.
  • Analyze and transform graphs using RDD and graph-specific operations.
  • Implement new custom graph operations tailored to specific needs.
  • Develop iterative and efficient graph algorithms using message aggregation and the Pregel abstraction.
  • Extract subgraphs and use them to discover common clusters.
  • Analyze graph data and solve various data science problems using real-world datasets.

In Detail

Apache Spark is emerging as the standard open-source cluster-computing engine for processing big data. Many practical computing problems concern large graphs, such as the Web graph and various social networks. The scale of these graphs, in some cases billions of vertices and trillions of edges, poses challenges to their efficient processing. The Apache Spark GraphX API combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework.

This book teaches you graph programming in Apache Spark and explains the entire process of graph data analysis. You will journey through the creation of graphs, their uses, and their exploration and analysis, and finally cover the conversion of graph elements into graph structures.

This book begins with an introduction to the Spark system, its libraries and the Scala Build Tool. Using a hands-on approach, it quickly teaches you how to install and leverage Spark interactively on the command line and in a standalone Scala program. Then, it presents all the methods for building Spark graphs using illustrative network datasets. Next, it walks you through the process of exploring, visualizing and analyzing different network characteristics. This book also teaches you how to transform raw datasets into a usable form. In addition, you will learn powerful operations that can be used to transform graph elements and graph structures. Furthermore, this book teaches how to create custom graph operations that are tailored to specific needs with efficiency in mind. The later chapters cover more advanced topics such as clustering graphs, implementing graph-parallel iterative algorithms and learning methods from graph data.

Style and approach

A step-by-step guide that will walk you through the key ideas and techniques for processing big graph data at scale, with practical examples that will ensure an overall understanding of the concepts of Spark.

Information

Table of Contents

Apache Spark Graph Processing
Credits
Foreword
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
Distinctive features
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Spark and GraphX
Downloading and installing Spark 1.4.1
Experimenting with the Spark shell
Getting started with GraphX
Building a tiny social network
Loading the data
The property graph
Transforming RDDs to VertexRDD and EdgeRDD
Introducing graph operations
Building and submitting a standalone application
Writing and configuring a Spark program
Building the program with the Scala Build Tool
Deploying and running with spark-submit
Summary
2. Building and Exploring Graphs
Network datasets
The communication network
Flavor networks
Social ego networks
Graph builders
The Graph factory method
edgeListFile
fromEdges
fromEdgeTuples
Building graphs
Building directed graphs
Building a bipartite graph
Building a weighted social ego network
Computing the degrees of the network nodes
In-degree and out-degree of the Enron email network
Degrees in the bipartite food network
Degree histogram of the social ego networks
Summary
3. Graph Analysis and Visualization
Network datasets
The graph visualization
Installing the GraphStream and BreezeViz libraries
Visualizing the graph data
Plotting the degree distribution
The analysis of network connectedness
Finding the connected components
Counting triangles and computing clustering coefficients
The network centrality and PageRank
How PageRank works
Ranking web pages
Scala Build Tool revisited
Organizing build definitions
Managing library dependencies
A preview of the steps
Step 1 – Enable the sbt-assembly plugin
Step 2 – Create a build.sbt file
Step 3 – Declare library dependencies and resolvers
Step 4 – Set up the sbt-assembly plugin
Step 5 – Create the uber JAR
Running tasks with SBT commands
Summary
4. Transforming and Shaping Up Graphs to Your Needs
Transforming the vertex and edge attributes
mapVertices
mapEdges
mapTriplets
Modifying graph structures
The reverse operator
The subgraph operator
The mask operator
The groupEdges operator
Joining graph datasets
joinVertices
outerJoinVertices
Example – Hollywood movie graph
Data operations on VertexRDD and EdgeRDD
Mapping VertexRDD and EdgeRDD
Filtering VertexRDDs
Joining VertexRDDs
Joining EdgeRDDs
Reversing edge directions
Collecting neighboring information
Example – from food network to flavor pairing
Summary
5. Creating Custom Graph Aggregation Operators
NCAA College Basketball datasets
The aggregateMessages operator
EdgeContext
Abstracting out the aggregation
Keeping things DRY
Coach wants more numbers
Calculating average points per game
Defense stats – D matters as in direction
Joining average stats into a graph
Performance optimization
The MapReduceTriplets operator
Summary
6. Iterative Graph-Parallel Processing with Pregel
The Pregel computational model
Example – iterating towards the social equality
The Pregel API in GraphX
Community detection through label propagation
The Pregel implementation of PageRank
Summary
7. Learning Graph Structures
Community clustering in graphs
Spectral clustering
Power iteration clustering
Applications – music fan community detection
Step 1 – load the data into a Spark graph property
Step 2 – extract the features of nodes
Step 3 – define a similarity measure between two nodes
Step 4 – create an affinity matrix
Step 5 – run k-means clustering on the affinity matrix
Exercise – collaborative clustering through playlists
Summary
A. References
Chapter 2, Building and Exploring Graphs
Chapter 3, Graph Analysis and Visualization
Chapter 7, Learning Graph Structures
Index

Apache Spark Graph Processing

Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2015
Production reference: 1040915
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-180-5
www.packtpub.com

Credits

Author
Rindra Ramamonjison
Reviewer
Thomas W. Dinsmore
Ryan Mccune
Francoise Provencher
Commissioning Editor
Amit Ghodke
Acquisition Editor
Larissa Pinto
Content Development Editor
Dharmesh Parmar
Technical Editor
Prajakta Mhatre
Copy Editor
Yesha Gangani
Project Coordinator
Nikhil Nair
Proofreader
Safis Editing
Indexer
Tejal Soni
Production Coordinator
Aparna Bhagat
Cover Work
Aparna Bhagat

Foreword

Apache Spark is one of the most compelling technologies in the big data space, and for good reason. It allows data scientists and data engineers alike to work in their language of choice (Java, Scala, Python, SQL, and R as of this writing) to make sense of their data. As Reynold Xin noted, Apache Spark is the Swiss Army Knife of big data analytics tools. It allows you to use one tool to do many things, from real-time streaming to advanced analytics. And in no small part, the versatility and power of GraphX have helped propel Spark forward.
Apache Spark Graph Processing follows Rindra's journey into solving complex analytics problems. As a PhD g...
