eBook - ePub

Data Analysis with R

Name: Data Analysis with R
ISBN: 9781785288142

Tony Fischetti

388 pages
English
ePUB (mobile friendly)
Available on iOS & Android

About this book

Load, wrangle, and analyze your data using the world's most powerful statistical programming language

About This Book

Load, manipulate and analyze data from different sources
Gain a deeper understanding of fundamentals of applied statistics
A practical guide to performing data analysis

Who This Book Is For

Whether you are learning data analysis for the first time, or you want to deepen the understanding you already have, this book will prove to be an invaluable resource. If you are looking for a book to bring you all the way through the fundamentals to the application of advanced and effective analytics methodologies, and have some prior programming experience and a mathematical background, then this is for you.

What You Will Learn

Navigate the R environment
Describe and visualize the behavior of data and relationships between data
Gain a thorough understanding of statistical reasoning and sampling
Employ hypothesis tests to draw inferences from your data
Learn Bayesian methods for estimating parameters
Perform regression to predict continuous variables
Apply powerful classification methods to predict categorical data
Handle missing data gracefully using multiple imputation
Identify and manage problematic data points
Employ parallelization and Rcpp to scale your analyses to larger data
Put best practices into effect to make your job easier and facilitate reproducibility

In Detail

Frequently the tool of choice for academics, R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. The power and domain-specificity of R allow the user to express complex analytics easily, quickly, and succinctly. With over 7,000 user-contributed packages, it's easy to find support for the latest and greatest algorithms and techniques.

Starting with the basics of R and statistical reasoning, Data Analysis with R dives into advanced predictive analytics, showing how to apply those techniques to real-world data through real-world examples.

Packed with engaging problems and exercises, this book begins with a review of R and its syntax. From there, get to grips with the fundamentals of applied statistics and build on this knowledge to perform sophisticated and powerful analytics. Solve the difficulties relating to performing data analysis in practice and find solutions to working with "messy data", large data, communicating results, and facilitating reproducibility.

This book is engineered to be an invaluable resource through many stages of anyone's career as a data analyst.

Style and approach

Learn data analysis using engaging examples and fun exercises, and with a gentle and friendly but comprehensive "learn-by-doing" approach.

Publisher: Packt Publishing
Year: 2015
eBook ISBN: 9781785288142
Topic: Computer Science
Subtopic: Computer Science General

Table of contents

Data Analysis with R

Credits

About the Author

About the Reviewer

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. RefresheR

Navigating the basics

Arithmetic and assignment

Logicals and characters

Flow of control

Getting help in R

Vectors

Subsetting

Vectorized functions

Advanced subsetting

Recycling

Functions

Matrices

Loading data into R

Working with packages

Exercises

Summary

2. The Shape of Data

Univariate data

Frequency distributions

Central tendency

Spread

Populations, samples, and estimation

Probability distributions

Visualization methods

Exercises

Summary

3. Describing Relationships

Multivariate data

Relationships between a categorical and a continuous variable

Relationships between two categorical variables

The relationship between two continuous variables

Covariance

Correlation coefficients

Comparing multiple correlations

Visualization methods

Categorical and continuous variables

Two categorical variables

Two continuous variables

More than two continuous variables

Exercises

Summary

4. Probability

Basic probability

A tale of two interpretations

Sampling from distributions

Parameters

The binomial distribution

The normal distribution

The three-sigma rule and using z-tables

Exercises

Summary

5. Using Data to Reason About the World

Estimating means

The sampling distribution

Interval estimation

How did we get 1.96?

Smaller samples

Exercises

Summary

6. Testing Hypotheses

Null Hypothesis Significance Testing

One and two-tailed tests

When things go wrong

A warning about significance

A warning about p-values

Testing the mean of one sample

Assumptions of the one sample t-test

Testing two means

Don't be fooled!

Assumptions of the independent samples t-test

Testing more than two means

Assumptions of ANOVA

Testing independence of proportions

What if my assumptions are unfounded?

Exercises

Summary

7. Bayesian Methods

The big idea behind Bayesian analysis

Choosing a prior

Who cares about coin flips

Enter MCMC – stage left

Using JAGS and runjags

Fitting distributions the Bayesian way

The Bayesian independent samples t-test

Exercises

Summary

8. Predicting Continuous Variables

Linear models

Simple linear regression

Simple linear regression with a binary predictor

A word of warning

Multiple regression

Regression with a non-binary predictor

Kitchen sink regression

The bias-variance trade-off

Cross-validation

Striking a balance

Linear regression diagnostics

Second Anscombe relationship

Third Anscombe relationship

Fourth Anscombe relationship

Advanced topics

Exercises

Summary

9. Predicting Categorical Variables

k-Nearest Neighbors

Using k-NN in R

Confusion matrices

Limitations of k-NN

Logistic regression

Using logistic regression in R

Decision trees

Random forests

Choosing a classifier

The vertical decision boundary

The diagonal decision boundary

The crescent decision boundary

The circular decision boundary

Exercises

Summary

10. Sources of Data

Relational Databases

Why didn't we just do that in SQL?

Using JSON

XML

Other data formats

Online repositories

Exercises

Summary

11. Dealing with Messy Data

Analysis with missing data

Visualizing missing data

Types of missing data

So which one is it?

Unsophisticated methods for dealing with missing data

Complete case analysis

Pairwise deletion

Mean substitution

Hot deck imputation

Regression imputation

Stochastic regression imputation

Multiple imputation

So how does mice come up with the imputed values?

Methods of imputation

Multiple imputation in practice

Analysis with unsanitized data

Checking for out-of-bounds data

Checking the data type of a column

Checking for unexpected categories

Checking for outliers, entry errors, or unlikely data points

Chaining assertions

Other messiness

OpenRefine

Regular expressions

tidyr

Exercises

Summary

12. Dealing with Large Data

Wait to optimize

Using a bigger and faster machine

Be smart about your code

Allocation of memory

Vectorization

Using optimized packages

Using another R implementation

Use parallelization

Getting started with parallel R

An example of (some) substance

Using Rcpp

Be smarter about your code

Exercises

Summary

13. Reproducibility and Best Practices

R Scripting

RStudio

Running R scripts

An example script

Scripting and reproducibility

R projects

Version control

Communicating results

Exercises

Summary

Index

Data Analysis with R

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2015

Production reference: 1171215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-814-2

www.packtpub.com

Credits

Author

Tony Fischetti

Reviewer

Dipanjan Sarkar

Commissioning Editor

Akram Hussain

Acquisition Editor

Meeta Rajani

Content Development Editor

Anish Dhurat

Technical Editor

Siddhesh Patil

Copy Editor

Sonia Mathur

Project Coordinator

Bijal Patel

Proofreader

Safis Editing

Indexer

Monica Ajmera Mehta

Graphics

Disha Haria

Production Coordinator

Conidon Miranda

Cover Work

Conidon Miranda

About the Author

Tony Fischetti is a data scientist at College Factual, where he gets to use R every day to build personalized rankings and recommender systems. He graduated in cognitive science from Rensselaer Polytechnic Institute, and his thesis was strongly focused on using statistics to study visual short-term memory.

Tony enjoys writing and contributing to open source software, blogging at http://www.onthelambda.com, writing about himself in third person, and sharing his knowledge using simple, approachable language and engaging examples.

The more traditionally exciting of his daily activities include listening to records, playing the guitar and bass (poorly), weight training, and helping others.

Because I'm aware of how incredibly lucky I am, it's really hard to express all the gratitude I have for everyone in my life that helped me—either directly, or indirectly—in completing this book. The following (partial) list is my best attempt at balancing thoroughness whilst also maximizing the number of people who will read this section by keeping it to a manageable length.

First, I'd like to thank all of my educators. In particular, I'd like to thank the Bronx High School of Science and Rensselaer Polytechnic Institute. More specifically, I'd like to thank the Bronx Science Robotics Team, all its members, its team moms, the wonderful Dena Ford and Cherrie Fleisher-Strauss; and Justin Fox. From the latter instituti...
