Large Scale Machine Learning with Python

Bastiaan Sjardin, Luca Massaron, Alberto Boschetti

  1. 420 pages
  2. English
  3. ePUB (mobile friendly)

About This Book

Learn to build powerful machine learning models quickly and deploy large-scale predictive applications

  • Design, engineer and deploy scalable machine learning solutions with the power of Python
  • Take command of Hadoop and Spark with Python for effective machine learning on a MapReduce framework
  • Build state-of-the-art models and develop personalized recommendations to perform machine learning at scale

Who This Book Is For

This book is for anyone who intends to work with large and complex datasets. Familiarity with basic Python and machine learning concepts is recommended. A working knowledge of statistics and computational mathematics would also be helpful.

What You Will Learn

  • Apply the most scalable machine learning algorithms
  • Work with modern state-of-the-art large-scale machine learning techniques
  • Increase predictive accuracy with deep learning and scalable data-handling techniques
  • Improve your work by combining the MapReduce framework with Spark
  • Build powerful ensembles at scale
  • Use data streams to train linear and non-linear predictive models from extremely large datasets using a single machine

In Detail

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. Yet there is a growing need to find algorithms, and to design and build platforms, that can deal with large datasets. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands while maintaining high predictive accuracy.

Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework with Hadoop and Spark, using Python.

Style and Approach

This efficient and practical title is packed with the techniques, tips, and tools you need to ensure that your large-scale Python machine learning projects run swiftly and seamlessly.

Large-scale machine learning addresses a problem that titles currently on the market do not cover. Those working with Hadoop clusters and in data-intensive environments can now learn effective ways of building powerful machine learning models from prototype to production.

This book is written in a style that programmers coming from other languages (R, Julia, Java, Matlab) can follow.

Information

Year: 2016
ISBN: 9781785887215
Subtopic: Data mining
Edition: 1

Table of Contents

Large Scale Machine Learning with Python
Credits
About the Authors
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. First Steps to Scalability
Explaining scalability in detail
Making large scale examples
Introducing Python
Scale up with Python
Scale out with Python
Python for large scale machine learning
Choosing between Python 2 and Python 3
Installing Python
Step-by-step installation
The installation of packages
Package upgrades
Scientific distributions
Introducing Jupyter/IPython
Python packages
NumPy
SciPy
Pandas
Scikit-learn
The matplotlib package
Gensim
H2O
XGBoost
Theano
TensorFlow
The sknn library
Theanets
Keras
Other useful packages to install on your system
Summary
2. Scalable Learning in Scikit-learn
Out-of-core learning
Subsampling as a viable option
Optimizing one instance at a time
Building an out-of-core learning system
Streaming data from sources
Datasets to try the real thing yourself
The first example – streaming the bike-sharing dataset
Using pandas I/O tools
Working with databases
Paying attention to the ordering of instances
Stochastic learning
Batch gradient descent
Stochastic gradient descent
The Scikit-learn SGD implementation
Defining SGD learning parameters
Feature management with data streams
Describing the target
The hashing trick
Other basic transformations
Testing and validation in a stream
Trying SGD in action
Summary
3. Fast SVM Implementations
Datasets to experiment with on your own
The bike-sharing dataset
The covertype dataset
Support Vector Machines
Hinge loss and its variants
Understanding the Scikit-learn SVM implementation
Pursuing nonlinear SVMs by subsampling
Achieving SVM at scale with SGD
Feature selection by regularization
Including non-linearity in SGD
Trying explicit high-dimensional mappings
Hyperparameter tuning
Other alternatives for SVM fast learning
Nonlinear and faster with Vowpal Wabbit
Installing VW
Understanding the VW data format
Python integration
A few examples using reductions for SVM and neural nets
Faster bike-sharing
The covertype dataset crunched by VW
Summary
4. Neural Networks and Deep Learning
The neural network architecture
What and how neural networks learn
Choosing the right architecture
The input layer
The hidden layer
The output layer
Neural networks in action
Parallelization for sknn
Neural networks and regularization
Neural networks and hyperparameter optimization
Neural networks and decision boundaries
Deep learning at scale with H2O
Large scale deep learning with H2O
Gridsearch on H2O
Deep learning and unsupervised pretraining
Deep learning with theanets
Autoencoders and unsupervised learning
Autoencoders
Summary
5. Deep Learning with TensorFlow
TensorFlow installation
TensorFlow operations
GPU computing
Linear regression with SGD
A neural network from scratch in TensorFlow
Machine learning on TensorFlow with SkFlow
Deep learning with large files – incremental learning
Keras and TensorFlow installation
Convolutional Neural Networks in TensorFlow through Keras
The convolution layer
The pooling layer
The fully connected layer
CNN's with an incremental approach
GPU Computing
Summary
6. Classification and Regression Trees at Scale
Bootstrap aggregation
Random forest and extremely randomized forest
Fast parameter optimization with randomized search
Extremely randomized trees and large datasets
CART and boosting
Gradient Boosting Machines
max_depth
learning_rate
Subsample
Faster GBM with warm_start
Speeding up GBM with warm_start
Training and storing GBM models
XGBoost
XGBoost regression
XGBoost and variable importance
XGBoost streaming large datasets
XGBoost model persistence
Out-of-core CART with H2O
Random forest and gridsearch on H2O
Stochastic gradient boosting and gridsearch on H2O
Summary
7. Unsupervised Learning at Scale
Unsupervised methods
Feature decomposition – PCA
Randomized PCA
Incremental PCA
Sparse PCA
PCA with H2O
Clustering – K-means
Initialization methods
K-means assumptions
Selection of the best K
Scaling K-means – mini-batch
K-means with H2O
LDA
Scaling LDA – memory, CPUs, and machines
Summary
8. Distributed Environments – Hadoop and Spark
From a standalone machine to a bunch of nodes
Why do we need a distributed framework?
Setting up the VM
VirtualBox
Vagrant
Using the VM
The Hadoop ecosystem
Architecture
HDFS
MapReduce
YARN
Spark
pySpark
Summary
9. Practical Machine Learning with Spark
Setting up the VM for this chapter
Sharing variables across cluster nodes
Broadcast read-only variables
Accumulators write-only variables
Broadcast and accumulators together – an example
Data preprocessing in Spark
JSON files and Spark DataFrames
Dealing with missing data
Grouping and creating tables in-memory
Writing the preprocessed DataFrame or RDD to disk
Working with Spark DataFrames
Machine learning with Spark
Spark on the KDD99 dataset
Reading the dataset
Feature engineering
Training a learner
Evaluating a learner's performance
The power of the ML pipeline
Manual tuning
Cross-validation
Final cleanup
Summary
A. Introduction to GPUs and Theano
GPU computing
Theano – parallel computing on the GPU
Installing Theano
Index

Large Scale Machine Learning with Python

Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2016
Production reference: 1270716
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-721-5
www.packtpub.com

Credits

Authors: Bastiaan Sjardin, Luca Massaron, Alberto Boschetti
Reviewers: Oleg Okun, Kai Londenberg
Commissioning Editor: Akram Hussain
Acquisition Editor: Sonali Vernekar
Content Development Editor: Sumeet Sawant
Technical Editor: Manthan Raja
Copy Editor: Tasneem Fatehi
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Disha Haria, Kirk D'Penha
Production Coordinator: Arvindkumar Gupta
Cover Work: Arvindkumar Gupta

About the Authors

Bastiaan Sjardin is a data scientist and founder with a background in artificial intelligence and mathematics. He has an MSc degree in cognitive science from the University of Leiden and has taken on-campus courses at the Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera in the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.
Luca Massaron is a data scientist and marketing research director who is specialized in multivariate statistical analysis, machine learning, and customer insight, with over...
