Table of Contents
Data Analysis with R
Credits
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. RefresheR
Navigating the basics
Arithmetic and assignment
Logicals and characters
Flow of control
Getting help in R
Vectors
Subsetting
Vectorized functions
Advanced subsetting
Recycling
Functions
Matrices
Loading data into R
Working with packages
Exercises
Summary
2. The Shape of Data
Univariate data
Frequency distributions
Central tendency
Spread
Populations, samples, and estimation
Probability distributions
Visualization methods
Exercises
Summary
3. Describing Relationships
Multivariate data
Relationships between a categorical and a continuous variable
Relationships between two categorical variables
The relationship between two continuous variables
Covariance
Correlation coefficients
Comparing multiple correlations
Visualization methods
Categorical and continuous variables
Two categorical variables
Two continuous variables
More than two continuous variables
Exercises
Summary
4. Probability
Basic probability
A tale of two interpretations
Sampling from distributions
Parameters
The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
Exercises
Summary
5. Using Data to Reason About the World
Estimating means
The sampling distribution
Interval estimation
How did we get 1.96?
Smaller samples
Exercises
Summary
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests
When things go wrong
A warning about significance
A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled!
Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions
What if my assumptions are unfounded?
Exercises
Summary
7. Bayesian Methods
The big idea behind Bayesian analysis
Choosing a prior
Who cares about coin flips
Enter MCMC – stage left
Using JAGS and runjags
Fitting distributions the Bayesian way
The Bayesian independent samples t-test
Exercises
Summary
8. Predicting Continuous Variables
Linear models
Simple linear regression
Simple linear regression with a binary predictor
A word of warning
Multiple regression
Regression with a non-binary predictor
Kitchen sink regression
The bias-variance trade-off
Cross-validation
Striking a balance
Linear regression diagnostics
Second Anscombe relationship
Third Anscombe relationship
Fourth Anscombe relationship
Advanced topics
Exercises
Summary
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices
Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees
Random forests
Choosing a classifier
The vertical decision boundary
The diagonal decision boundary
The crescent decision boundary
The circular decision boundary
Exercises
Summary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON
XML
Other data formats
Online repositories
Exercises
Summary
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data
Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis
Pairwise deletion
Mean substitution
Hot deck imputation
Regression imputation
Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data
Checking the data type of a column
Checking for unexpected categories
Checking for outliers, entry errors, or unlikely data points
Chaining assertions
Other messiness
OpenRefine
Regular expressions
tidyr
Exercises
Summary
12. Dealing with Large Data
Wait to optimize
Using a bigger and faster machine
Be smart about your code
Allocation of memory
Vectorization
Using optimized packages
Using another R implementation
Use parallelization
Getting started with parallel R
An example of (some) substance
Using Rcpp
Be smarter about your code
Exercises
Summary
13. Reproducibility and Best Practices
R Scripting
RStudio
Running R scripts
An example script
Scripting and reproducibility
R projects
Version control
Communicating results
Exercises
Summary
Index
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, nor its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2015
Production reference: 1171215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-814-2
www.packtpub.com
Author
Tony Fischetti
Reviewer
Dipanjan Sarkar
Commissioning Editor
Akram Hussain
Acquisition Editor
Meeta Rajani
Content Development Editor
Anish Dhurat
Technical Editor
Siddhesh Patil
Copy Editor
Sonia Mathur
Project Coordinator
Bijal Patel
Proofreader
Safis Editing
Indexer
Monica Ajmera Mehta
Graphics
Disha Haria
Production Coordinator
Conidon Miranda
Cover Work
Conidon Miranda
Tony Fischetti is a data scientist at College Factual, where he gets to use R every day to build personalized rankings and recommender systems. He graduated in cognitive science from Rensselaer Polytechnic Institute, and his thesis was strongly focused on using statistics to study visual short-term memory.
Tony enjoys writing and contributing to open source software, blogging at http://www.onthelambda.com, writing about himself in the third person, and sharing his knowledge using simple, approachable language and engaging examples.
The more traditionally exciting of his daily activities include listening to records, playing the guitar and bass (poorly), weight training, and helping others.