eBook - ePub

Large Scale Machine Learning with Python

Author: Bastiaan Sjardin, Luca Massaron, Alberto Boschetti

420 pages
English
ePUB (mobile friendly)
Available on iOS and Android

Book information

Learn to build powerful machine learning models quickly and deploy large-scale predictive applications

About This Book

Design, engineer and deploy scalable machine learning solutions with the power of Python
Take command of Hadoop and Spark with Python for effective machine learning on a MapReduce framework
Build state-of-the-art models and develop personalized recommendations to perform machine learning at scale

Who This Book Is For

This book is for anyone who intends to work with large and complex data sets. Familiarity with basic Python and machine learning concepts is recommended. A working knowledge of statistics and computational mathematics would also be helpful.

What You Will Learn

Apply the most scalable machine learning algorithms
Work with modern state-of-the-art large-scale machine learning techniques
Increase predictive accuracy with deep learning and scalable data-handling techniques
Improve your work by combining the MapReduce framework with Spark
Build powerful ensembles at scale
Use data streams to train linear and non-linear predictive models from extremely large datasets using a single machine
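
As a rough illustration of the last item, the sketch below trains a linear model from a data stream on a single machine, updating the model one row at a time so that memory use stays constant regardless of file size. It is a minimal pure-Python stand-in, not the book's own code (the table of contents indicates the book uses Scikit-learn's SGD implementation for this). The names `stream_rows` and `sgd_fit`, the learning rate, and the synthetic data are all assumptions made for the example.

```python
import csv
import io
import random


def stream_rows(handle):
    """Yield (features, target) one row at a time; the source is never fully loaded."""
    for row in csv.reader(handle):
        *features, target = map(float, row)
        yield features, target


def sgd_fit(rows, n_features, learning_rate=0.05):
    """One pass of stochastic gradient descent for least-squares linear regression."""
    weights = [0.0] * n_features
    bias = 0.0
    for features, target in rows:
        prediction = bias + sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        # Update the model from this single example, then discard the example.
        bias -= learning_rate * error
        weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    return weights, bias


# Hypothetical stand-in for a large file on disk: y = 2*x1 - 3*x2 + 1, no noise.
rng = random.Random(0)
buffer = io.StringIO()
writer = csv.writer(buffer)
for _ in range(5000):
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    writer.writerow([x1, x2, 2 * x1 - 3 * x2 + 1])
buffer.seek(0)

weights, bias = sgd_fit(stream_rows(buffer), n_features=2)
print(weights, bias)
```

Because `stream_rows` is a generator, the same `sgd_fit` call would work unchanged on an open file handle of any size; only the current row is held in memory.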

In Detail

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. Yet there is a growing need to find algorithms and to design and build platforms that can deal with large sets of data. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands while retaining high predictive accuracy.

Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework with Hadoop and Spark in Python.

Style and Approach

This efficient and practical title is packed with the techniques, tips, and tools you need to ensure that your large-scale Python machine learning runs swiftly and seamlessly.

Large-scale machine learning tackles a different problem from what existing titles on the market cover. Those working with Hadoop clusters and in data-intensive environments can now learn effective ways of building powerful machine learning models from prototype to production.

This book is written in a style that programmers from other languages (R, Julia, Java, MATLAB) can follow.

Information

Publisher: Packt Publishing
Year: 2016
ISBN: 9781785887215
Categories: Computer Science, Data Mining

Large Scale Machine Learning with Python

Credits

About the Authors

About the Reviewer

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. First Steps to Scalability

Explaining scalability in detail

Making large scale examples

Introducing Python

Scale up with Python

Scale out with Python

Python for large scale machine learning

Choosing between Python 2 and Python 3

Installing Python

Step-by-step installation

The installation of packages

Package upgrades

Scientific distributions

Introducing Jupyter/IPython

Python packages

NumPy

SciPy

Pandas

Scikit-learn

The matplotlib package

Gensim

H2O

XGBoost

Theano

TensorFlow

The sknn library

Theanets

Keras

Other useful packages to install on your system

Summary

2. Scalable Learning in Scikit-learn

Out-of-core learning

Subsampling as a viable option

Optimizing one instance at a time

Building an out-of-core learning system

Streaming data from sources

Datasets to try the real thing yourself

The first example – streaming the bike-sharing dataset

Using pandas I/O tools

Working with databases

Paying attention to the ordering of instances

Stochastic learning

Batch gradient descent

Stochastic gradient descent

The Scikit-learn SGD implementation

Defining SGD learning parameters

Feature management with data streams

Describing the target

The hashing trick

Other basic transformations

Testing and validation in a stream

Trying SGD in action

Summary

3. Fast SVM Implementations

Datasets to experiment with on your own

The bike-sharing dataset

The covertype dataset

Support Vector Machines

Hinge loss and its variants

Understanding the Scikit-learn SVM implementation

Pursuing nonlinear SVMs by subsampling

Achieving SVM at scale with SGD

Feature selection by regularization

Including non-linearity in SGD

Trying explicit high-dimensional mappings

Hyperparameter tuning

Other alternatives for SVM fast learning

Nonlinear and faster with Vowpal Wabbit

Installing VW

Understanding the VW data format

Python integration

A few examples using reductions for SVM and neural nets

Faster bike-sharing

The covertype dataset crunched by VW

Summary

4. Neural Networks and Deep Learning

The neural network architecture

What and how neural networks learn

Choosing the right architecture

The input layer

The hidden layer

The output layer

Neural networks in action

Parallelization for sknn

Neural networks and regularization

Neural networks and hyperparameter optimization

Neural networks and decision boundaries

Deep learning at scale with H2O

Large scale deep learning with H2O

Gridsearch on H2O

Deep learning and unsupervised pretraining

Deep learning with theanets

Autoencoders and unsupervised learning

Autoencoders

Summary

5. Deep Learning with TensorFlow

TensorFlow installation

TensorFlow operations

GPU computing

Linear regression with SGD

A neural network from scratch in TensorFlow

Machine learning on TensorFlow with SkFlow

Deep learning with large files – incremental learning

Keras and TensorFlow installation

Convolutional Neural Networks in TensorFlow through Keras

The convolution layer

The pooling layer

The fully connected layer

CNNs with an incremental approach

GPU Computing

Summary

6. Classification and Regression Trees at Scale

Bootstrap aggregation

Random forest and extremely randomized forest

Fast parameter optimization with randomized search

Extremely randomized trees and large datasets

CART and boosting

Gradient Boosting Machines

max_depth

learning_rate

Subsample

Faster GBM with warm_start

Speeding up GBM with warm_start

Training and storing GBM models

XGBoost

XGBoost regression

XGBoost and variable importance

XGBoost streaming large datasets

XGBoost model persistence

Out-of-core CART with H2O

Random forest and gridsearch on H2O

Stochastic gradient boosting and gridsearch on H2O

Summary

7. Unsupervised Learning at Scale

Unsupervised methods

Feature decomposition – PCA

Randomized PCA

Incremental PCA

Sparse PCA

PCA with H2O

Clustering – K-means

Initialization methods

K-means assumptions

Selection of the best K

Scaling K-means – mini-batch

K-means with H2O

LDA

Scaling LDA – memory, CPUs, and machines

Summary

8. Distributed Environments – Hadoop and Spark

From a standalone machine to a bunch of nodes

Why do we need a distributed framework?

Setting up the VM

VirtualBox

Vagrant

Using the VM

The Hadoop ecosystem

Architecture

HDFS

MapReduce

YARN

Spark

pySpark

Summary

9. Practical Machine Learning with Spark

Setting up the VM for this chapter

Sharing variables across cluster nodes

Broadcast read-only variables

Accumulators write-only variables

Broadcast and accumulators together – an example

Data preprocessing in Spark

JSON files and Spark DataFrames

Dealing with missing data

Grouping and creating tables in-memory

Writing the preprocessed DataFrame or RDD to disk

Working with Spark DataFrames

Machine learning with Spark

Spark on the KDD99 dataset

Reading the dataset

Feature engineering

Training a learner

Evaluating a learner's performance

The power of the ML pipeline

Manual tuning

Cross-validation

Final cleanup

Summary

A. Introduction to GPUs and Theano

GPU computing

Theano – parallel computing on the GPU

Installing Theano

Index

Large Scale Machine Learning with Python

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016

Production reference: 1270716

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-721-5

www.packtpub.com

Credits

Authors

Bastiaan Sjardin

Luca Massaron

Alberto Boschetti

Reviewers

Oleg Okun

Kai Londenberg

Commissioning Editor

Akram Hussain

Acquisition Editor

Sonali Vernekar

Content Development Editor

Sumeet Sawant

Technical Editor

Manthan Raja

Copy Editor

Tasneem Fatehi

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Mariammal Chettiyar

Graphics

Disha Haria

Kirk D'Penha

Production Coordinator

Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

About the Authors

Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has an MSc degree in cognitive science from the University of Leiden and has taken on-campus courses at the Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera in the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.

Luca Massaron is a data scientist and marketing research director who specializes in multivariate statistical analysis, machine learning, and customer insight, with over...
