eBook - ePub

Large Scale Machine Learning with Python

Name: Large Scale Machine Learning with Python
ISBN: 9781785887215

Bastiaan Sjardin, Luca Massaron, Alberto Boschetti

420 pages
English
ePUB (mobile friendly)
Available on iOS & Android

About this book

Learn to build powerful machine learning models quickly and deploy large-scale predictive applications

About This Book

Design, engineer and deploy scalable machine learning solutions with the power of Python
Take command of Hadoop and Spark with Python for effective machine learning on a MapReduce framework
Build state-of-the-art models and develop personalized recommendations to perform machine learning at scale

Who This Book Is For

This book is for anyone who intends to work with large and complex data sets. Familiarity with basic Python and machine learning concepts is recommended. A working knowledge of statistics and computational mathematics would also be helpful.

What You Will Learn

Apply the most scalable machine learning algorithms
Work with modern state-of-the-art large-scale machine learning techniques
Increase predictive accuracy with deep learning and scalable data-handling techniques
Improve your work by combining the MapReduce framework with Spark
Build powerful ensembles at scale
Use data streams to train linear and non-linear predictive models from extremely large datasets using a single machine
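
As a rough illustration of the last point, the sketch below trains a linear model one instance at a time from a data stream, so the full dataset never has to fit in memory. It is not taken from the book: the generator `stream_examples`, the function `sgd_fit`, the synthetic data, and the learning rate are all assumptions made for illustration, and it is written in plain Python rather than with any of the libraries the book covers.

```python
import random


def stream_examples(n_rows, seed=0):
    """Yield (features, target) pairs one at a time, as a file reader would.

    The data is synthetic and the coefficients are hypothetical.
    """
    rng = random.Random(seed)
    for _ in range(n_rows):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        y = 3.0 * x[0] - 2.0 * x[1] + 0.5
        yield x, y


def sgd_fit(stream, n_features, learning_rate=0.05):
    """Fit a linear model with stochastic gradient descent on squared error.

    Only the current instance and the model parameters are held in memory.
    """
    weights = [0.0] * n_features
    bias = 0.0
    for x, y in stream:
        prediction = bias + sum(w * xi for w, xi in zip(weights, x))
        error = prediction - y
        # One gradient step per instance.
        weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * error
    return weights, bias


weights, bias = sgd_fit(stream_examples(20000), n_features=2)
print(weights, bias)
```

In practice the generator would read rows from a file or a database instead of synthesizing them; the training loop would stay the same.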

In Detail

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. But finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with a high predictive accuracy.

Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques for the MapReduce framework, using Hadoop and Spark from Python.

Style and Approach

This efficient and practical title is packed with the techniques, tips, and tools you need to ensure your large-scale Python machine learning runs swiftly and seamlessly.

Large-scale machine learning tackles a different set of problems from those addressed by most titles currently on the market. Those working with Hadoop clusters and in data-intensive environments can now learn effective ways of building powerful machine learning models, from prototype to production.

This book is written in a style that programmers from other languages (R, Julia, Java, MATLAB) can follow.

Publisher

Packt Publishing

Year

2016

eBook ISBN

9781785887215

Topic

Computer Science

Subtopic

Data Mining

Index

Computer Science

Large Scale Machine Learning with Python

Credits

About the Authors

About the Reviewer

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. First Steps to Scalability

Explaining scalability in detail

Making large scale examples

Introducing Python

Scale up with Python

Scale out with Python

Python for large scale machine learning

Choosing between Python 2 and Python 3

Installing Python

Step-by-step installation

The installation of packages

Package upgrades

Scientific distributions

Introducing Jupyter/IPython

Python packages

NumPy

SciPy

Pandas

Scikit-learn

The matplotlib package

Gensim

H2O

XGBoost

Theano

TensorFlow

The sknn library

Theanets

Keras

Other useful packages to install on your system

Summary

2. Scalable Learning in Scikit-learn

Out-of-core learning

Subsampling as a viable option

Optimizing one instance at a time

Building an out-of-core learning system

Streaming data from sources

Datasets to try the real thing yourself

The first example – streaming the bike-sharing dataset

Using pandas I/O tools

Working with databases

Paying attention to the ordering of instances

Stochastic learning

Batch gradient descent

Stochastic gradient descent

The Scikit-learn SGD implementation

Defining SGD learning parameters

Feature management with data streams

Describing the target

The hashing trick

Other basic transformations

Testing and validation in a stream

Trying SGD in action

Summary

3. Fast SVM Implementations

Datasets to experiment with on your own

The bike-sharing dataset

The covertype dataset

Support Vector Machines

Hinge loss and its variants

Understanding the Scikit-learn SVM implementation

Pursuing nonlinear SVMs by subsampling

Achieving SVM at scale with SGD

Feature selection by regularization

Including non-linearity in SGD

Trying explicit high-dimensional mappings

Hyperparameter tuning

Other alternatives for SVM fast learning

Nonlinear and faster with Vowpal Wabbit

Installing VW

Understanding the VW data format

Python integration

A few examples using reductions for SVM and neural nets

Faster bike-sharing

The covertype dataset crunched by VW

Summary

4. Neural Networks and Deep Learning

The neural network architecture

What and how neural networks learn

Choosing the right architecture

The input layer

The hidden layer

The output layer

Neural networks in action

Parallelization for sknn

Neural networks and regularization

Neural networks and hyperparameter optimization

Neural networks and decision boundaries

Deep learning at scale with H2O

Large scale deep learning with H2O

Gridsearch on H2O

Deep learning and unsupervised pretraining

Deep learning with theanets

Autoencoders and unsupervised learning

Autoencoders

Summary

5. Deep Learning with TensorFlow

TensorFlow installation

TensorFlow operations

GPU computing

Linear regression with SGD

A neural network from scratch in TensorFlow

Machine learning on TensorFlow with SkFlow

Deep learning with large files – incremental learning

Keras and TensorFlow installation

Convolutional Neural Networks in TensorFlow through Keras

The convolution layer

The pooling layer

The fully connected layer

CNNs with an incremental approach

GPU Computing

Summary

6. Classification and Regression Trees at Scale

Bootstrap aggregation

Random forest and extremely randomized forest

Fast parameter optimization with randomized search

Extremely randomized trees and large datasets

CART and boosting

Gradient Boosting Machines

max_depth

learning_rate

Subsample

Faster GBM with warm_start

Speeding up GBM with warm_start

Training and storing GBM models

XGBoost

XGBoost regression

XGBoost and variable importance

XGBoost streaming large datasets

XGBoost model persistence

Out-of-core CART with H2O

Random forest and gridsearch on H2O

Stochastic gradient boosting and gridsearch on H2O

Summary

7. Unsupervised Learning at Scale

Unsupervised methods

Feature decomposition – PCA

Randomized PCA

Incremental PCA

Sparse PCA

PCA with H2O

Clustering – K-means

Initialization methods

K-means assumptions

Selection of the best K

Scaling K-means – mini-batch

K-means with H2O

LDA

Scaling LDA – memory, CPUs, and machines

Summary

8. Distributed Environments – Hadoop and Spark

From a standalone machine to a bunch of nodes

Why do we need a distributed framework?

Setting up the VM

VirtualBox

Vagrant

Using the VM

The Hadoop ecosystem

Architecture

HDFS

MapReduce

YARN

Spark

pySpark

Summary

9. Practical Machine Learning with Spark

Setting up the VM for this chapter

Sharing variables across cluster nodes

Broadcast read-only variables

Accumulators write-only variables

Broadcast and accumulators together – an example

Data preprocessing in Spark

JSON files and Spark DataFrames

Dealing with missing data

Grouping and creating tables in-memory

Writing the preprocessed DataFrame or RDD to disk

Working with Spark DataFrames

Machine learning with Spark

Spark on the KDD99 dataset

Reading the dataset

Feature engineering

Training a learner

Evaluating a learner's performance

The power of the ML pipeline

Manual tuning

Cross-validation

Final cleanup

Summary

A. Introduction to GPUs and Theano

GPU computing

Theano – parallel computing on the GPU

Installing Theano

Index

Large Scale Machine Learning with Python

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016

Production reference: 1270716

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-721-5

www.packtpub.com

Credits

Authors

Bastiaan Sjardin

Luca Massaron

Alberto Boschetti

Reviewers

Oleg Okun

Kai Londenberg

Commissioning Editor

Akram Hussain

Acquisition Editor

Sonali Vernekar

Content Development Editor

Sumeet Sawant

Technical Editor

Manthan Raja

Copy Editor

Tasneem Fatehi

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Mariammal Chettiyar

Graphics

Disha Haria

Kirk D'Penha

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

About the Authors

Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has an MSc degree in cognitive science from the University of Leiden, complemented by on-campus courses at the Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera for the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.

Luca Massaron is a data scientist and marketing research director who is specialized in multivariate statistical analysis, machine learning, and customer insight, with over...
