Regression Analysis with Python
Table of Contents
Regression Analysis with Python
Credits
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Regression – The Workhorse of Data Science
Regression analysis and data science
Exploring the promise of data science
The challenge
The linear models
What you are going to find in the book
Python for data science
Installing Python
Choosing between Python 2 and Python 3
Step-by-step installation
Installing packages
Package upgrades
Scientific distributions
Introducing Jupyter or IPython
Python packages and functions for linear models
NumPy
SciPy
Statsmodels
Scikit-learn
Summary
2. Approaching Simple Linear Regression
Defining a regression problem
Linear models and supervised learning
Reflecting on predictive variables
Reflecting on response variables
The family of linear models
Preparing to discover simple linear regression
Starting from the basics
A measure of linear relationship
Extending to linear regression
Regressing with Statsmodels
The coefficient of determination
Meaning and significance of coefficients
Evaluating the fitted values
Correlation is not causation
Predicting with a regression model
Regressing with Scikit-learn
Minimizing the cost function
Explaining the reason for using squared errors
Pseudoinverse and other optimization methods
Gradient descent at work
Summary
3. Multiple Regression in Action
Using multiple features
Model building with Statsmodels
Using formulas as an alternative
The correlation matrix
Revisiting gradient descent
Feature scaling
Unstandardizing coefficients
Estimating feature importance
Inspecting standardized coefficients
Comparing models by R-squared
Interaction models
Discovering interactions
Polynomial regression
Testing linear versus cubic transformation
Going for higher-degree solutions
Introducing underfitting and overfitting
Summary
4. Logistic Regression
Defining a classification problem
Formalization of the problem: binary classification
Assessing the classifier's performance
Defining a probability-based approach
More on the logistic and logit functions
Let's see some code
Pros and cons of logistic regression
Revisiting gradient descent
Multiclass Logistic Regression
An example
Summary
5. Data Preparation
Numeric feature scaling
Mean centering
Standardization
Normalization
The logistic regression case
Qualitative feature encoding
Dummy coding with Pandas
DictVectorizer and one-hot encoding
Feature hasher
Numeric feature transformation
Observing residuals
Summarizations by binning
Missing data
Missing data imputation
Keeping track of missing values
Outliers
Outliers on the response
Outliers among the predictors
Removing or replacing outliers
Summary
6. Achieving Generalization
Checking on out-of-sample data
Testing by sample split
Cross-validation
Bootstrapping
Greedy selection of features
The Madelon dataset
Univariate selection of features
Recursive feature selection
Regularization optimized by grid-search
Ridge (L2 regularization)
Grid search for optimal parameters
Random grid search
Lasso (L1 regularization)
Elastic net
Stability selection
Experimenting with the Madelon
Summary
7. Online and Batch Learning
Batch learning
Online mini-batch learning
A real example
Streaming scenario without a test set
Summary
8. Advanced Regression Methods
Least Angle Regression
Visual showcase of LARS
A code example
LARS wrap up
Bayesian regression
Bayesian regression wrap up
SGD classification with hinge loss
Comparison with logistic regression
SVR
SVM wrap up
Regression trees (CART)
Regression tree wrap up
Bagging and boosting
Bagging
Boosting
Ensemble wrap up
Gradient Boosting Regressor with LAD
GBM with LAD wrap up
Summary
9. Real-world Applications for Regression Models
Downloading the datasets
Time series problem dataset
Regression problem dataset
Multiclass classification problem dataset
Ranking problem dataset
A regression problem
Testing a classifier instead of a regressor
An imbalanced and multiclass classification problem
A ranking problem
A time series problem
Open questions
Summary
Index
Regression Analysis with Python
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2016
Production reference: 1250216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-631-5
www.packtpub.com
Credits
Authors
Luca Massaron
Alberto Boschetti
Reviewers
Giuliano Janson
Zacharias Voulgaris
Commissioning Editor
Kunal Parikh
Acquisition Editor
Sonali Vernekar
Content Development Editor
Siddhesh Salvi
Technical Editor
Shivani Kiran Mistry
Copy Editor
Stephen Copestake
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Nilesh Mohite
Cover Work
Nilesh Mohite
About the Authors
Luca Massaron is a data scientist and marketing research director who specializes in multivariate statistical analysis, machine learning, and customer insight. He has over a decade of experience in solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From pioneering Web audience analysis in Italy to achieving the rank of a top ten Kaggler, he has always been passionate about data and its analysis, and about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, he believes that a lot can be achieved in data science just by doing the essentials.
Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces daily challenges that range from natural language processing (NLP) and machine learning to distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending m...