Data Analysis with R
eBook - ePub

  • 388 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About this book

Load, wrangle, and analyze your data using the world's most powerful statistical programming language

About This Book

  • Load, manipulate and analyze data from different sources
  • Gain a deeper understanding of the fundamentals of applied statistics
  • A practical guide to performing data analysis

Who This Book Is For

Whether you are learning data analysis for the first time or want to deepen the understanding you already have, this book will prove to be an invaluable resource. If you are looking for a book to bring you all the way through the fundamentals to the application of advanced and effective analytics methodologies, and have some prior programming experience and a mathematical background, then this is for you.

What You Will Learn

  • Navigate the R environment
  • Describe and visualize the behavior of data and relationships between data
  • Gain a thorough understanding of statistical reasoning and sampling
  • Employ hypothesis tests to draw inferences from your data
  • Learn Bayesian methods for estimating parameters
  • Perform regression to predict continuous variables
  • Apply powerful classification methods to predict categorical data
  • Handle missing data gracefully using multiple imputation
  • Identify and manage problematic data points
  • Employ parallelization and Rcpp to scale your analyses to larger data
  • Put best practices into effect to make your job easier and facilitate reproducibility

In Detail

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. With over 7,000 user-contributed packages, it's easy to find support for the latest and greatest algorithms and techniques.

Starting with the basics of R and statistical reasoning, Data Analysis with R dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples.

Packed with engaging problems and exercises, this book begins with a review of R and its syntax. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties relating to performing data analysis in practice and find solutions to working with "messy data", large data, communicating results, and facilitating reproducibility.

This book is engineered to be an invaluable resource through many stages of anyone's career as a data analyst.

Style and approach

Learn data analysis using engaging examples and fun exercises, and with a gentle and friendly but comprehensive "learn-by-doing" approach.

Table of Contents

Data Analysis with R
Credits
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. RefresheR
Navigating the basics
Arithmetic and assignment
Logicals and characters
Flow of control
Getting help in R
Vectors
Subsetting
Vectorized functions
Advanced subsetting
Recycling
Functions
Matrices
Loading data into R
Working with packages
Exercises
Summary
2. The Shape of Data
Univariate data
Frequency distributions
Central tendency
Spread
Populations, samples, and estimation
Probability distributions
Visualization methods
Exercises
Summary
3. Describing Relationships
Multivariate data
Relationships between a categorical and a continuous variable
Relationships between two categorical variables
The relationship between two continuous variables
Covariance
Correlation coefficients
Comparing multiple correlations
Visualization methods
Categorical and continuous variables
Two categorical variables
Two continuous variables
More than two continuous variables
Exercises
Summary
4. Probability
Basic probability
A tale of two interpretations
Sampling from distributions
Parameters
The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
Exercises
Summary
5. Using Data to Reason About the World
Estimating means
The sampling distribution
Interval estimation
How did we get 1.96?
Smaller samples
Exercises
Summary
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests
When things go wrong
A warning about significance
A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled!
Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions
What if my assumptions are unfounded?
Exercises
Summary
7. Bayesian Methods
The big idea behind Bayesian analysis
Choosing a prior
Who cares about coin flips
Enter MCMC – stage left
Using JAGS and runjags
Fitting distributions the Bayesian way
The Bayesian independent samples t-test
Exercises
Summary
8. Predicting Continuous Variables
Linear models
Simple linear regression
Simple linear regression with a binary predictor
A word of warning
Multiple regression
Regression with a non-binary predictor
Kitchen sink regression
The bias-variance trade-off
Cross-validation
Striking a balance
Linear regression diagnostics
Second Anscombe relationship
Third Anscombe relationship
Fourth Anscombe relationship
Advanced topics
Exercises
Summary
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices
Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees
Random forests
Choosing a classifier
The vertical decision boundary
The diagonal decision boundary
The crescent decision boundary
The circular decision boundary
Exercises
Summary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON
XML
Other data formats
Online repositories
Exercises
Summary
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data
Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis
Pairwise deletion
Mean substitution
Hot deck imputation
Regression imputation
Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data
Checking the data type of a column
Checking for unexpected categories
Checking for outliers, entry errors, or unlikely data points
Chaining assertions
Other messiness
OpenRefine
Regular expressions
tidyr
Exercises
Summary
12. Dealing with Large Data
Wait to optimize
Using a bigger and faster machine
Be smart about your code
Allocation of memory
Vectorization
Using optimized packages
Using another R implementation
Use parallelization
Getting started with parallel R
An example of (some) substance
Using Rcpp
Be smarter about your code
Exercises
Summary
13. Reproducibility and Best Practices
R Scripting
RStudio
Running R scripts
An example script
Scripting and reproducibility
R projects
Version control
Communicating results
Exercises
Summary
Index

Data Analysis with R

Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2015
Production reference: 1171215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-814-2
www.packtpub.com

Credits

Author
Tony Fischetti
Reviewer
Dipanjan Sarkar
Commissioning Editor
Akram Hussain
Acquisition Editor
Meeta Rajani
Content Development Editor
Anish Dhurat
Technical Editor
Siddhesh Patil
Copy Editor
Sonia Mathur
Project Coordinator
Bijal Patel
Proofreader
Safis Editing
Indexer
Monica Ajmera Mehta
Graphics
Disha Haria
Production Coordinator
Conidon Miranda
Cover Work
Conidon Miranda

About the Author

Tony Fischetti is a data scientist at College Factual, where he gets to use R every day to build personalized rankings and recommender systems. He graduated in cognitive science from Rensselaer Polytechnic Institute, and his thesis was strongly focused on using statistics to study visual short-term memory.
Tony enjoys writing and contributing to open source software, blogging at http://www.onthelambda.com, writing about himself in third person, and sharing his knowledge using simple, approachable language and engaging examples.
The more traditionally exciting of his daily activities include listening to records, playing the guitar and bass (poorly), weight training, and helping others.
