Mastering Predictive Analytics with Python
Table of Contents
Credits
About the Author
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. From Data to Decisions – Getting Started with Analytic Applications
Designing an advanced analytic solution
Data layer: warehouses, lakes, and streams
Modeling layer
Deployment layer
Reporting layer
Case study: sentiment analysis of social media feeds
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Case study: targeted e-mail campaigns
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Summary
2. Exploratory Data Analysis and Visualization in Python
Exploring categorical and numerical data in IPython
Installing IPython notebook
The notebook interface
Loading and inspecting data
Basic manipulations – grouping, filtering, mapping, and pivoting
Charting with Matplotlib
Time series analysis
Cleaning and converting
Time series diagnostics
Joining signals and correlation
Working with geospatial data
Loading geospatial data
Working in the cloud
Introduction to PySpark
Creating the SparkContext
Creating an RDD
Creating a Spark DataFrame
Summary
3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
Similarity and distance metrics
Numerical distance metrics
Correlation similarity metrics and time series
Similarity metrics for categorical data
K-means clustering
Affinity propagation – automatically choosing cluster numbers
k-medoids
Agglomerative clustering
Where agglomerative clustering fails
Streaming clustering in Spark
Summary
4. Connecting the Dots with Models – Regression Methods
Linear regression
Data preparation
Model fitting and evaluation
Statistical significance of regression outputs
Generalized estimating equations
Mixed effects models
Time series data
Generalized linear models
Applying regularization to linear models
Tree methods
Decision trees
Random forest
Scaling out with PySpark – predicting year of song release
Summary
5. Putting Data in its Place – Classification Methods and Analysis
Logistic regression
Multiclass logistic classifiers: multinomial regression
Formatting a dataset for classification problems
Learning pointwise updates with stochastic gradient descent
Jointly optimizing all parameters with second-order methods
Fitting the model
Evaluating classification models
Strategies for improving classification models
Separating nonlinear boundaries with support vector machines
Fitting an SVM to the census data
Boosting – combining small models to improve accuracy
Gradient boosted decision trees
Comparing classification methods
Case study: fitting classifier models in PySpark
Summary
6. Words and Pixels – Working with Unstructured Data
Working with textual data
Cleaning textual data
Extracting features from textual data
Using dimensionality reduction to simplify datasets
Principal component analysis
Latent Dirichlet Allocation
Using dimensionality reduction in predictive modeling
Images
Cleaning image data
Thresholding images to highlight objects
Dimensionality reduction for image analysis
Case study: training a recommender system in PySpark
Summary
7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
Learning patterns with neural networks
A network of one – the perceptron
Combining perceptrons – a single-layer neural network
Parameter fitting with back-propagation
Discriminative versus generative models
Vanishing gradients and explaining away
Pretraining belief networks
Using dropout to regularize networks
Convolutional networks and rectified units
Compressing data with autoencoder networks
Optimizing the learning rate
The TensorFlow library and digit recognition
The MNIST data
Constructing the network
Summary
8. Sharing Models with Prediction Services
The architecture of a prediction service
Clients and making requests
The GET request
The POST request
The HEAD request
The PUT request
The DELETE request
Server – the web traffic controller
Application – the engine of the predictive services
Persisting information with database systems
Case study – logistic regression service
Setting up the database
The web server
The web application
The flow of a prediction service – training a model
On-demand and bulk prediction
Summary
9. Reporting and Testing – Iterating on Analytic Systems
Checking the health of models with diagnostics
Evaluating changes in model performance
Changes in feature importance
Changes in unsupervised model performance
Iterating on models through A/B testing
Experimental allocation – assigning customers to experiments
Deciding a sample size
Multiple hypothesis testing
Guidelines for communication
Translate terms to business values
Visualizing results
Case study: building a reporting service
The report server
The report application
The visualization layer
Summary
Index
Mastering Predictive Analytics with Python
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2016
Production reference: 1290816
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-271-5
www.packtpub.com
Author: Joseph Babcock
Reviewer: Dipanjan Deb
Commissioning Editor: Kartikey Pandey
Acquisition Editor: Aaron Lazar
Content Development Editor: Sumeet Sawant
Technical Editor: Utkarsha S. Kadam
Copy Editor: Vikrant Phadke
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Monica Ajmera Mehta
Graphics: Kirk D'Pinha
Production Coordinator: Nilesh Mohite
Cover Work: Nilesh Mohite
Joseph Babcock has spent almost a decade exploring complex datasets and combining predictive modeling with visualization to understand correlations and forecast anticipated outcomes. He received a PhD from the Solomon H. Snyder Department of Neuroscience at The Johns Hopkins University School of Medicine, where he used machine learning to predict adverse cardiac side effects of drugs. Outside the academy, he has tackled big data challenges in...