Mastering Scala Machine Learning
eBook - ePub

  1. 310 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Advance your skills in efficient data analysis and data processing using the powerful tools of Scala, Spark, and Hadoop

About This Book

  • This is a primer on functional-programming-style techniques to help you efficiently process and analyze all of your data
  • Get acquainted with the best and newest tools available such as Scala, Spark, Parquet and MLlib for machine learning
  • Learn the best practices to incorporate new Big Data machine learning in your data-driven enterprise to gain future scalability and maintainability

Who This Book Is For

Mastering Scala Machine Learning is intended for enthusiasts who want to plunge into the new pool of emerging techniques for machine learning. Some familiarity with standard statistical techniques is required.

What You Will Learn

  • Sharpen your functional programming skills in Scala using REPL
  • Apply standard and advanced machine learning techniques using Scala
  • Get acquainted with Big Data technologies and grasp why we need a functional approach to Big Data
  • Discover new data structures, algorithms, approaches, and habits that will allow you to work effectively with large amounts of data
  • Understand the principles of supervised and unsupervised learning in machine learning
  • Work with unstructured data and serialize it using Kryo, Protobuf, Avro, and AvroParquet
  • Construct reliable and robust data pipelines and manage data in a data-driven enterprise
  • Implement scalable model monitoring and alerts with Scala

In Detail

Since the advent of object-oriented programming, new technologies related to Big Data have been constantly appearing on the market. One such technology is Scala, which many consider to be a successor to Java in the area of Big Data, much as Java was to C/C++ in the area of distributed programming.

This book aims to take your knowledge to the next level and help you apply that knowledge to build advanced applications such as social media mining, intelligent news portals, and more. After a quick refresher on functional programming concepts using the REPL, you will see some practical examples of setting up the development environment and tinkering with data. We will then explore working with Spark and MLlib using k-means and decision trees.

Most of the data that we produce today is unstructured and raw, and you will learn to tackle this type of data with advanced topics such as regression, classification, integration, and working with graph algorithms. Finally, you will discover how to use Scala to perform complex concept analysis, to monitor model performance, and to build a model repository. By the end of this book, you will have gained expertise in performing Scala machine learning and will be able to build complex machine learning projects using Scala.

Style and approach

This hands-on guide dives straight into implementing Scala for machine learning without delving much into mathematical proofs or validations. There are ample code examples and tricks that will help you sail through using the standard techniques and libraries. This book provides practical examples from the field on how to correctly tackle data analysis problems, particularly for modern Big Data datasets.

Table of Contents

Mastering Scala Machine Learning
Credits
About the Author
Acknowledgement
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Exploratory Data Analysis
Getting started with Scala
Distinct values of a categorical field
Summarization of a numeric field
Grepping across multiple fields
Basic, stratified, and consistent sampling
Working with Scala and Spark Notebooks
Basic correlations
Summary
2. Data Pipelines and Modeling
Influence diagrams
Sequential trials and dealing with risk
Exploration and exploitation
Unknown unknowns
Basic components of a data-driven system
Data ingest
Data transformation layer
Data analytics and machine learning
UI component
Actions engine
Correlation engine
Monitoring
Optimization and interactivity
Feedback loops
Summary
3. Working with Spark and MLlib
Setting up Spark
Understanding Spark architecture
Task scheduling
Spark components
MQTT, ZeroMQ, Flume, and Kafka
HDFS, Cassandra, S3, and Tachyon
Mesos, YARN, and Standalone
Applications
Word count
Streaming word count
Spark SQL and DataFrame
ML libraries
SparkR
Graph algorithms – GraphX and GraphFrames
Spark performance tuning
Running Hadoop HDFS
Summary
4. Supervised and Unsupervised Learning
Records and supervised learning
Iris dataset
Labeled point
SVMWithSGD
Logistic regression
Decision tree
Bagging and boosting – ensemble learning methods
Unsupervised learning
Problem dimensionality
Summary
5. Regression and Classification
What regression stands for?
Continuous space and metrics
Linear regression
Logistic regression
Regularization
Multivariate regression
Heteroscedasticity
Regression trees
Classification metrics
Multiclass problems
Perceptron
Generalization error and overfitting
Summary
6. Working with Unstructured Data
Nested data
Other serialization formats
Hive and Impala
Sessionization
Working with traits
Working with pattern matching
Other uses of unstructured data
Probabilistic structures
Projections
Summary
7. Working with Graph Algorithms
A quick introduction to graphs
SBT
Graph for Scala
Adding nodes and edges
Graph constraints
JSON
GraphX
Who is getting e-mails?
Connected components
Triangle counting
Strongly connected components
PageRank
SVD++
Summary
8. Integrating Scala with R and Python
Integrating with R
Setting up R and SparkR
Linux
Mac OS
Windows
Running SparkR via scripts
Running Spark via R's command line
DataFrames
Linear models
Generalized linear model
Reading JSON files in SparkR
Writing Parquet files in SparkR
Invoking Scala from R
Using Rserve
Integrating with Python
Setting up Python
PySpark
Calling Python from Java/Scala
Using sys.process._
Spark pipe
Jython and JSR 223
Summary
9. NLP in Scala
Text analysis pipeline
Simple text analysis
MLlib algorithms in Spark
TF-IDF
LDA
Segmentation, annotation, and chunking
POS tagging
Using word2vec to find word relationships
A Porter Stemmer implementation of the code
Summary
10. Advanced Model Monitoring
System monitoring
Process monitoring
Model monitoring
Performance over time
Criteria for model retiring
A/B testing
Summary
Index

Mastering Scala Machine Learning

Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2016
Production reference: 1220616
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-088-9
www.packtpub.com

Credits

Author
Alex Kozlov
Reviewer
Rok Kralj
Commissioning Editor
Dipika Gaonkar
Acquisition Editor
Kirk D'costa
Content Development Editor
Samantha Gonsalves
Technical Editor
Suwarna Patil
Copy Editor
Vibha Shukla
Project Coordinator
Sanchita Mandal
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta

About the Author

Alex Kozlov is a multidisciplinary big data scientist. He came to Silicon Valley in 1991, got his Ph.D. from Stanford University under the supervision of Prof. Daphne Koller and Prof. John Hennessy in 1998, and has been around a few computer and data management companies since. His latest stint was with Cloudera, the leader in Hadoop, where he was one of the early employees and ended up heading the solution architects group on the West Coast. Before that, he spent time with an online advertising company, Turn, Inc.; and before that, he had the privilege to work with HP Labs researchers at HP Inc., and on data mining software at SGI, Inc. Currently, Alexander is the chief solutions architect at an enterprise security startup, E8 Security, where he came to understand the intricacies of catching bad guys in the Internet universe.
On the non-professional side, Alexander lives i...
