Large Scale Machine Learning with Python
eBook - ePub

Large Scale Machine Learning with Python

Bastiaan Sjardin, Luca Massaron, Alberto Boschetti

  1. 420 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android

Book information

Learn to build powerful machine learning models quickly and deploy large-scale predictive applications

About This Book

  • Design, engineer and deploy scalable machine learning solutions with the power of Python
  • Take command of Hadoop and Spark with Python for effective machine learning on a MapReduce framework
  • Build state-of-the-art models and develop personalized recommendations to perform machine learning at scale

Who This Book Is For

This book is for anyone who intends to work with large and complex datasets. Familiarity with basic Python and machine learning concepts is recommended. A working knowledge of statistics and computational mathematics would also be helpful.

What You Will Learn

  • Apply the most scalable machine learning algorithms
  • Work with modern state-of-the-art large-scale machine learning techniques
  • Increase predictive accuracy with deep learning and scalable data-handling techniques
  • Improve your work by combining the MapReduce framework with Spark
  • Build powerful ensembles at scale
  • Use data streams to train linear and non-linear predictive models from extremely large datasets using a single machine

In Detail

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. At the same time, there is a growing need to find algorithms, and to design and build platforms, that can deal with large sets of data. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands while retaining high predictive accuracy.

Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework with Hadoop and Spark in Python.

Style and Approach

This efficient and practical title is packed with the techniques, tips, and tools you need to ensure your large-scale Python machine learning runs swiftly and seamlessly.

Large-scale machine learning tackles issues that differ from those covered by titles currently on the market. Those working with Hadoop clusters and in data-intensive environments can now learn effective ways of building powerful machine learning models from prototype to production.

This book is written in a style that programmers coming from other languages (R, Julia, Java, MATLAB) can follow.

Information

Year
2016
ISBN
9781785887215
Edition
1
Category
Computer Science
Category
Data mining

Table of Contents

Large Scale Machine Learning with Python
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. First Steps to Scalability
Explaining scalability in detail
Making large scale examples
Introducing Python
Scale up with Python
Scale out with Python
Python for large scale machine learning
Choosing between Python 2 and Python 3
Installing Python
Step-by-step installation
The installation of packages
Package upgrades
Scientific distributions
Introducing Jupyter/IPython
Python packages
NumPy
SciPy
Pandas
Scikit-learn
The matplotlib package
Gensim
H2O
XGBoost
Theano
TensorFlow
The sknn library
Theanets
Keras
Other useful packages to install on your system
Summary
2. Scalable Learning in Scikit-learn
Out-of-core learning
Subsampling as a viable option
Optimizing one instance at a time
Building an out-of-core learning system
Streaming data from sources
Datasets to try the real thing yourself
The first example – streaming the bike-sharing dataset
Using pandas I/O tools
Working with databases
Paying attention to the ordering of instances
Stochastic learning
Batch gradient descent
Stochastic gradient descent
The Scikit-learn SGD implementation
Defining SGD learning parameters
Feature management with data streams
Describing the target
The hashing trick
Other basic transformations
Testing and validation in a stream
Trying SGD in action
Summary
3. Fast SVM Implementations
Datasets to experiment with on your own
The bike-sharing dataset
The covertype dataset
Support Vector Machines
Hinge loss and its variants
Understanding the Scikit-learn SVM implementation
Pursuing nonlinear SVMs by subsampling
Achieving SVM at scale with SGD
Feature selection by regularization
Including non-linearity in SGD
Trying explicit high-dimensional mappings
Hyperparameter tuning
Other alternatives for SVM fast learning
Nonlinear and faster with Vowpal Wabbit
Installing VW
Understanding the VW data format
Python integration
A few examples using reductions for SVM and neural nets
Faster bike-sharing
The covertype dataset crunched by VW
Summary
4. Neural Networks and Deep Learning
The neural network architecture
What and how neural networks learn
Choosing the right architecture
The input layer
The hidden layer
The output layer
Neural networks in action
Parallelization for sknn
Neural networks and regularization
Neural networks and hyperparameter optimization
Neural networks and decision boundaries
Deep learning at scale with H2O
Large scale deep learning with H2O
Gridsearch on H2O
Deep learning and unsupervised pretraining
Deep learning with theanets
Autoencoders and unsupervised learning
Autoencoders
Summary
5. Deep Learning with TensorFlow
TensorFlow installation
TensorFlow operations
GPU computing
Linear regression with SGD
A neural network from scratch in TensorFlow
Machine learning on TensorFlow with SkFlow
Deep learning with large files – incremental learning
Keras and TensorFlow installation
Convolutional Neural Networks in TensorFlow through Keras
The convolution layer
The pooling layer
The fully connected layer
CNN's with an incremental approach
GPU Computing
Summary
6. Classification and Regression Trees at Scale
Bootstrap aggregation
Random forest and extremely randomized forest
Fast parameter optimization with randomized search
Extremely randomized trees and large datasets
CART and boosting
Gradient Boosting Machines
max_depth
learning_rate
Subsample
Faster GBM with warm_start
Speeding up GBM with warm_start
Training and storing GBM models
XGBoost
XGBoost regression
XGBoost and variable importance
XGBoost streaming large datasets
XGBoost model persistence
Out-of-core CART with H2O
Random forest and gridsearch on H2O
Stochastic gradient boosting and gridsearch on H2O
Summary
7. Unsupervised Learning at Scale
Unsupervised methods
Feature decomposition – PCA
Randomized PCA
Incremental PCA
Sparse PCA
PCA with H2O
Clustering – K-means
Initialization methods
K-means assumptions
Selection of the best K
Scaling K-means – mini-batch
K-means with H2O
LDA
Scaling LDA – memory, CPUs, and machines
Summary
8. Distributed Environments – Hadoop and Spark
From a standalone machine to a bunch of nodes
Why do we need a distributed framework?
Setting up the VM
VirtualBox
Vagrant
Using the VM
The Hadoop ecosystem
Architecture
HDFS
MapReduce
YARN
Spark
pySpark
Summary
9. Practical Machine Learning with Spark
Setting up the VM for this chapter
Sharing variables across cluster nodes
Broadcast read-only variables
Accumulators write-only variables
Broadcast and accumulators together – an example
Data preprocessing in Spark
JSON files and Spark DataFrames
Dealing with missing data
Grouping and creating tables in-memory
Writing the preprocessed DataFrame or RDD to disk
Working with Spark DataFrames
Machine learning with Spark
Spark on the KDD99 dataset
Reading the dataset
Feature engineering
Training a learner
Evaluating a learner's performance
The power of the ML pipeline
Manual tuning
Cross-validation
Final cleanup
Summary
A. Introduction to GPUs and Theano
GPU computing
Theano – parallel computing on the GPU
Installing Theano
Index

Large Scale Machine Learning with Python

Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2016
Production reference: 1270716
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-721-5
www.packtpub.com

Credits

Authors
Bastiaan Sjardin
Luca Massaron
Alberto Boschetti
Reviewers
Oleg Okun
Kai Londenberg
Commissioning Editor
Akram Hussain
Acquisition Editor
Sonali Vernekar
Content Development Editor
Sumeet Sawant
Technical Editor
Manthan Raja
Copy Editor
Tasneem Fatehi
Project Coordinator
Shweta H Birwatkar
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Kirk D'Penha
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta

About the Authors

Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has an MSc degree in cognitive science from the University of Leiden, together with on-campus courses at the Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera in the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.
Luca Massaron is a data scientist and marketing research director who is specialized in multivariate statistical analysis, machine learning, and customer insight, with over...
