Mastering Apache Spark 2.x - Second Edition
eBook - ePub

Romeo Kienzler

  1. 332 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Advanced analytics on your Big Data with the latest Apache Spark 2.x.

  • An advanced guide with a combination of instructions and practical examples to extend the most up-to-date Spark functionality.
  • Extend your data processing capabilities to process huge chunks of data in minimum time using advanced concepts in Spark.
  • Master the art of real-time processing with the help of Apache Spark 2.x.

Who This Book Is For

If you are a developer with some experience with Spark and want to strengthen your knowledge of how to get around in the world of Spark, then this book is ideal for you. Basic knowledge of Linux, Hadoop, and Spark is assumed. Reasonable knowledge of Scala is expected.

What You Will Learn

  • Examine advanced machine learning and deep learning with MLlib, SparkML, SystemML, H2O, and DeepLearning4j
  • Study highly optimized unified batch and real-time data processing using SparkSQL and Structured Streaming
  • Evaluate large-scale graph processing and analysis using GraphX and GraphFrames
  • Apply Apache Spark in elastic deployments using Jupyter and Zeppelin notebooks, Docker, Kubernetes, and the IBM Cloud
  • Understand the internal details of the cost-based optimizers used in Catalyst, SystemML, and GraphFrames
  • Learn how specific parameter settings affect the overall performance of an Apache Spark cluster
  • Leverage Scala, R, and Python for your data science projects

In Detail

Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality such as graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to expand Spark's functionality and implement your data flows and machine/deep learning programs on top of the platform.

The book commences with an overview of the Spark ecosystem. It introduces you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x. You will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed things up dramatically. The book then shows how to incorporate H2O, SystemML, and Deeplearning4j for machine learning, and Jupyter notebooks and Kubernetes/Docker for cloud-based Spark. Along the way, you will learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and the unification of DataFrames and Datasets.

You will also learn about the updates to the APIs and how DataFrames and Datasets affect SQL, machine learning, graph processing, and streaming. You will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.

Style and Approach

This book is an extensive guide to Apache Spark modules and tools, showing with worked examples how Spark's functionality can be extended for real-time processing and storage.


Information

Year: 2017
ISBN: 9781785285226

Deep Learning on Apache Spark with DeepLearning4j and H2O

This chapter introduces Deep Learning and shows how you can use third-party machine learning libraries on top of Apache Spark to apply it. Deep Learning is outperforming a variety of state-of-the-art machine learning algorithms, and it is a very active area of research, so there is more to come. It is therefore important to know how Deep Learning works and how it can be applied in a parallel data processing environment such as Apache Spark.
This chapter will cover the following topics in detail:
  • Introduction to the installation and usage of the H2O framework
  • Introduction to Deeplearning4j with an IoT anomaly detection example

H2O

H2O is an open source machine learning system developed in Java by H2O.ai (http://h2o.ai/). It offers a rich set of machine learning algorithms and a web-based data processing user interface, and it lets you develop in a range of languages: Java, Scala, Python, and R.
It can also interface with Spark, HDFS, SQL, and NoSQL databases. This chapter concentrates on H2O's integration with Apache Spark using H2O's Sparkling Water component. A simple example, developed in Scala and based on real data, will be used to create a deep learning model.
The next step is to provide an overview of the H2O functionality and the Sparkling Water architecture used in this chapter.
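Before that overview, here is a rough sketch of what the Sparkling Water integration looks like from Scala: an H2OContext is started on top of the Spark context, and Spark data is converted into H2O frames that the H2O algorithms consume. This is only an illustration, not the chapter's exact code; the entry point differs between releases (older Sparkling Water builds such as the 0.2.x line used new H2OContext(sc).start(), while newer ones provide H2OContext.getOrCreate), and the CSV path is a placeholder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o._

// Start (or attach to) the H2O cloud embedded in the Spark executors
val spark = SparkSession.builder().appName("sparkling-water-sketch").getOrCreate()
val h2oContext = H2OContext.getOrCreate(spark) // older releases: new H2OContext(sc).start()

// Convert a Spark DataFrame into an H2O frame so that the H2O algorithms can consume it
val df = spark.read.option("header", "true").csv("/path/to/data.csv") // placeholder path
val h2oFrame = h2oContext.asH2OFrame(df)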

Overview

Since it is only possible to examine and use a small amount of H2O's functionality in this chapter, we thought it would be useful to provide a list of all of the functional areas that it covers. The list is taken from the http://h2o.ai/ website at http://h2o.ai/product/algorithms/ and is organized around wrangling data (Process), modeling the data (Model), and scoring the resulting models (Score tool); the deep learning entry in the Model group is sketched in code right after the list:
  • Process: data profiling; summary statistics; aggregate, filter, bin, and derive columns; slice, log transform, and anonymize; variable creation; PCA; training and validation sampling plan
  • Model: generalized linear models (GLM); decision trees; gradient boosting machine (GBM); K-means; anomaly detection; deep learning; Naive Bayes; grid search
  • Score tool: predict; confusion matrix; AUC; hit ratio; PCA score; multi-model scoring
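To make the Model and Score tool groups more concrete, the following is a minimal sketch of training and scoring an H2O deep learning model from Scala. The frame arguments and the response column are placeholders, and field names can differ slightly between H2O releases; the full, working example is developed later in this chapter:

import water.fvec.Frame
import hex.deeplearning.DeepLearning
import hex.deeplearning.DeepLearningModel
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Sketch only: the frames would normally be produced from Spark data via Sparkling Water
def trainAndScore(trainFrame: Frame, testFrame: Frame, responseColumn: String): Frame = {
  val dlParams = new DeepLearningParameters()
  dlParams._train = trainFrame._key          // H2O frame holding the training data
  dlParams._response_column = responseColumn // the column to predict
  dlParams._epochs = 10.0                    // number of passes over the training data

  // Train the model (blocks until the H2O job completes), then score the test frame
  val dlModel: DeepLearningModel = new DeepLearning(dlParams).trainModel().get()
  dlModel.score(testFrame)
}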
The following section will explain the environment used for the Spark and H2O examples in this chapter and some of the problems encountered.
For completeness, we will show you how we downloaded, installed, and used H2O. Although we finally settled on version 0.2.12-95, we first downloaded and used 0.2.12-92. This section is based on the earlier install, but the approach used to source the software is the same. The download link changes over time, so follow the Sparkling Water download option at http://h2o.ai/download/.
This will source the zipped Sparkling Water release, as shown in the file listing here:
 [hadoop@hc2r1m2 h2o]$ pwd ; ls -l
/home/hadoop/h2o
total 15892
-rw-r--r-- 1 hadoop hadoop 16272364 Apr 11 12:37 sparkling-water-0.2.12-92.zip
This zipped release file is unpacked using the Linux unzip command, and it results in a Sparkling Water release file tree:
 [hadoop@hc2r1m2 h2o]$ unzip sparkling-water-0.2.12-92.zip

[hadoop@hc2r1m2 h2o]$ ls -d sparkling-water*
sparkling-water-0.2.12-92 sparkling-water-0.2.12-92.zip
We have moved the release tree to the /usr/local/ area using the root account and created a simple symbolic link to the release called h2o. This means that our H2O-based build can refer to this link, and it doesn't need to change as new versions of Sparkling Water are sourced. We have also made sure, using the Linux chown command, that our development account, hadoop, has access to the release:
[hadoop@hc2r1m2 h2o]$ su -
[root@hc2r1m2 ~]# cd /home/hadoop/h2o
[root@hc2r1m2 h2o]# mv sparkling-water-0.2.12-92 /usr/local
[root@hc2r1m2 h2o]# cd /usr/local

[root@hc2r1m2 local]# chown -R hadoop:hadoop sparkling-water-0.2.12-92
[root@hc2r1m2 local]# ln -s sparkling-water-0.2.12-92 h2o

[root@hc2r1m2 local]# ls -lrt | grep sparkling
total 52
drwxr-xr-x 6 hadoop hadoop 4096 Mar 28 02:27 sparkling-water-0.2.12-92
lrwxrwxrwx 1 root root 25 Apr 11 12:43 h2o -> sparkling-water-0.2.12-92
The release has been installed on all the nodes of our Hadoop clusters.

The build environment

From past examples, you know that we favor sbt as a build tool for developing Scala source examples.
We have created a development environment on the Linux server called hc2r1m2 using the Hadoop development account. The development directory is called h2o_spark_1_2:
[hadoop@hc2r1m2 h2o_spark_1_2]$ pwd
/home/hadoop/spark/h2o_spark_1_2
Our SBT build configuration file named h2o.sbt is located here; it contains the following:
 [hadoop@hc2r1m2 h2o_spark_1_2]$ more h2o.sbt

name := "H 2 O"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"

libraryDependencies += "org.apache.spark" % "spark-core" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"

libraryDependencies += "org.apache.spark" % "mllib" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"

libraryDependencies += "org.apache.spark" % "sql" % "1.2.0" from "file:///usr/hdp/2.6.0.3-8/spark/lib/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar"

libraryDependencies += "org.apache.spark" % "h2o" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-0.2.12-95-all.jar"

libraryDependencies += "hex.deeplearning" % "DeepLearningModel" % "0.2.12-95" from "file:///usr/local/h2o/assembly/build/libs/sparkling-water-assembly-...

