Apache Spark 2.x Machine Learning Cookbook
eBook - ePub

  1. 666 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Simplify machine learning model implementations with Spark.

About This Book
  • Solve the day-to-day problems of data science with Spark
  • This unique cookbook consists of exciting and intuitive numerical recipes
  • Optimize your work by acquiring, cleaning, analyzing, predicting, and visualizing your data

Who This Book Is For
This book is for Scala developers with fairly good exposure to and understanding of machine learning techniques, but who lack practical implementations with Spark. A solid knowledge of machine learning algorithms is assumed, as well as hands-on experience of implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and ecosystem.

What You Will Learn
  • Get to know how Scala and Spark go hand in hand for developers when building ML systems with Spark
  • Build a recommendation engine that scales with Spark
  • Find out how to build unsupervised clustering systems to classify data in Spark
  • Build machine learning systems with the Decision Tree and Ensemble models in Spark
  • Deal with the curse of high dimensionality in big data using Spark
  • Implement text analytics for search engines in Spark
  • Implement a streaming machine learning system with Spark

In Detail
Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience of applying these principles using Apache Spark, a resilient cluster computing system well suited to large-scale machine learning tasks.

This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of code examples that will be covered in the various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We progress by uncovering the various Spark APIs and the implementation of ML algorithms, developing classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we focus on building high-end applications and explain the various unsupervised methodologies and challenges to tackle when implementing big data ML systems.

Style and approach
This book is packed with intuitive recipes supported by line-by-line explanations to help you understand how to optimize your workflow and resolve problems when working with complex data modeling tasks and predictive algorithms. It is a valuable resource for data scientists and those working on large-scale data projects.

Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I

In this chapter, we will cover the following recipes:
  • Fitting a linear regression line to data the old-fashioned way
  • Generalized linear regression in Spark 2.0
  • Linear regression API with Lasso and L-BFGS in Spark 2.0
  • Linear regression API with Lasso and auto optimization selection in Spark 2.0
  • Linear regression API with ridge regression and auto optimization selection in Spark 2.0
  • Isotonic regression in Apache Spark 2.0
  • Multilayer perceptron classifier in Apache Spark 2.0
  • One versus Rest classifier (One-vs-All) in Apache Spark 2.0
  • Survival regression - parametric AFT model in Apache Spark 2.0

Introduction

This chapter, along with the next, covers the fundamental techniques for regression and classification available in the Spark 2.0 ML and MLlib libraries. Spark 2.0 signals a new direction by moving the RDD-based regressions (see the next chapter) into maintenance mode while emphasizing linear regression and generalized regression going forward.
At a high level, the new API design favors elastic net parameterization to produce ridge regression, Lasso regression, and everything in between, as opposed to named APIs (for example, LassoWithSGD). The new approach is a much cleaner design, and it pushes you to learn elastic net and its power in feature engineering, which remains an art in data science. We provide adequate examples, variations, and notes to guide you through the complexities of these techniques.
The following figure depicts the regression and classification coverage (part 1) in this chapter:
First, you will learn how to implement linear regression from scratch using algebraic equations, plain Scala code, and RDDs, to gain insight into the math and into why we need an iterative optimization method to estimate the solution for a large system of regressions. Second, we explore the generalized linear model (GLM) and its various statistical distribution families and link functions, while stressing its limitation to 4,096 parameters in the current implementation. Third, we tackle the linear regression model (LRM) and how to use elastic net parameterization to mix and match the L1 and L2 penalty functions to achieve ridge, Lasso, and everything in between. We also explore the solver (that is, optimizer) setting and how to configure it to use L-BFGS optimization, automatic optimizer selection, and so on.
After exploring the GLM and linear regression recipes, we proceed to provide recipes for more exotic regression/classification methods, such as isotonic regression, the multilayer perceptron (that is, a form of neural network), One-vs-Rest, and survival regression, to demonstrate Spark 2.0's power and completeness in dealing with cases that are not addressed by linear techniques. With the increased risks in the financial world in the early 21st century and new advancements in genomics, Spark 2.0 also pulls together four important methods (isotonic regression, multilayer perceptron, One-vs-Rest, and survival regression, or parametric AFT) in an easy-to-use machine learning library. The parametric AFT method at scale should be of particular interest to financial, data science, and actuarial professionals alike.
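To see why iterative optimization matters, consider that the closed-form normal equations become expensive for large systems, so solvers such as L-BFGS or SGD instead search for the minimum of the squared-error cost step by step. The following is a minimal sketch of that idea using plain gradient descent in pure Scala on a tiny synthetic dataset (y = 2x + 1); it is an illustration only, not the book's Spark code:

```scala
// Iterative alternative to the closed-form solution: gradient descent
// on the mean squared error of a simple linear model y = w * x + b.
// Synthetic data generated from y = 2x + 1 (hypothetical example).
val data = Seq((1.0, 3.0), (2.0, 5.0), (3.0, 7.0))
val n = data.size.toDouble

var w = 0.0       // slope estimate, initialized at zero
var b = 0.0       // intercept estimate, initialized at zero
val lr = 0.05     // learning rate (step size)

for (_ <- 1 to 5000) {
  // Gradients of the mean squared error with respect to w and b
  val dw = data.map { case (x, y) => 2 * (w * x + b - y) * x }.sum / n
  val db = data.map { case (x, y) => 2 * (w * x + b - y) }.sum / n
  w -= lr * dw
  b -= lr * db
}

println(f"w = $w%.3f, b = $b%.3f") // converges toward w = 2, b = 1
```

Spark's L-BFGS solver applies the same principle with a far more sophisticated update rule, distributing the gradient computation across the cluster.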
Even though some of these methods, such as the LinearRegression() API, have theoretically been available since Spark 1.3, it is important to note that Spark 2.0 pulls all of them together in an easy-to-use and maintainable (that is, backward-compatible) API, in a glmnet-like (R) manner, as the RDD-based regression API moves into maintenance mode. The L-BFGS optimizer and the normal equations take the front seat, while SGD remains available in the RDD-based APIs for backward compatibility.
Elastic net is the preferred method for regularization: not only can it handle L1 (Lasso regression) and L2 (ridge regression) in absolute terms, but it also provides a dial-like mechanism that lets the user fine-tune the penalty function (parameter shrinkage versus selection). While we recall using the elastic net function as far back as Spark 1.4.2, Spark 2.0 pulls it all together without the need to deal with each individual API for parameter tuning (important when selecting a model dynamically based on the latest data). As we start diving into the recipes, we strongly encourage you to explore various setElasticNetParam() and setSolver() configurations to master these powerful APIs. It is important not to confuse the penalty function setElasticNetParam(value: Double) (L1, L2, OLS, or elastic net: a linear mix of L1/L2), which is a regularization or model penalty scheme, with the optimization techniques (normal, L-BFGS, auto, and so on), which relate to minimizing the cost function.
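The distinction between penalty and solver is easiest to see in the elastic net penalty itself. As documented for Spark's linear models, the regularization term controlled by setRegParam() (lambda) and setElasticNetParam() (alpha) takes, roughly, the form:

```latex
\lambda \left( \alpha \, \lVert \beta \rVert_1 \;+\; \frac{1 - \alpha}{2} \, \lVert \beta \rVert_2^2 \right)
```

Setting alpha = 1 yields a pure L1 penalty (Lasso), alpha = 0 yields a pure L2 penalty (ridge), and intermediate values linearly mix the two. The solver choice (normal, l-bfgs, auto) decides only how this penalized cost function is minimized, not what is being penalized.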
It is critical to note that the RDD-based regressions are still very important since there are a lot of current ML implementation systems that rely heavily on the previous API regime and its SGD optimizer. Please see the next chapter for complete treatment with teaching notes covering RDD-based regressions.

Fitting a linear regression line to data the old-fashioned way

In this recipe, we use RDDs and a closed-form formula to code a simple linear regression from scratch. We use this as the first recipe to demonstrate that you can always implement any given statistical learning algorithm via RDDs to achieve computational scale with Apache Spark.
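The closed-form formula referred to here is the classic one for simple linear regression: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). A minimal pure-Scala sketch of that computation follows; plain Scala collections stand in for the RDD, since on a cluster the same sums would simply be computed with RDD map/reduce operations (the data is a hypothetical example, not the book's dataset):

```scala
// Closed-form simple linear regression:
//   slope     = cov(x, y) / var(x)
//   intercept = mean(y) - slope * mean(x)
// Plain Scala collections stand in for the RDD in this sketch.
val data = Seq((1.0, 3.0), (2.0, 5.0), (3.0, 7.0)) // generated from y = 2x + 1
val n = data.size.toDouble

val meanX = data.map(_._1).sum / n
val meanY = data.map(_._2).sum / n

// Sums of cross-deviations and squared deviations
val covXY = data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
val varX  = data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum

val slope     = covXY / varX
val intercept = meanY - slope * meanX

println(s"y = $slope * x + $intercept") // recovers slope 2.0, intercept 1.0
```

With an RDD, each `map`/`sum` pair above would become a `map` followed by a `reduce` (or a single `aggregate`), which is what lets the same formula scale to billions of rows.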

How to do it...

  1. Start a new project in IntelliJ or in an IDE of ...

Table of contents

  1. Title Page
  2. Copyright
  3. Credits
  4. About the Authors
  5. About the Reviewer
  6. www.PacktPub.com
  7. Customer Feedback
  8. Preface
  9. Practical Machine Learning with Spark Using Scala
  10. Just Enough Linear Algebra for Machine Learning with Spark
  11. Spark's Three Data Musketeers for Machine Learning - Perfect Together
  12. Common Recipes for Implementing a Robust Machine Learning System
  13. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I
  14. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II
  15. Recommendation Engine that Scales with Spark
  16. Unsupervised Clustering with Apache Spark 2.0
  17. Optimization - Going Down the Hill with Gradient Descent
  18. Building Machine Learning Systems with Decision Tree and Ensemble Models
  19. Curse of High-Dimensionality in Big Data
  20. Implementing Text Analytics with Spark 2.0 ML Library
  21. Spark Streaming and Machine Learning Library

Apache Spark 2.x Machine Learning Cookbook, by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei, is available in PDF and ePUB format.