Introduction to Data Science
eBook - ePub

Introduction to Data Science

Data Analysis and Prediction Algorithms with R

Rafael A. Irizarry

  1. 713 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Introduction to Data Science: Data Analysis and Prediction Algorithms with R introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling, data visualization, predictive algorithm building, file organization with UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation.

This book is a textbook for a first course in data science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The book is divided into six parts: R, data visualization, statistics with R, data wrangling, machine learning, and productivity tools. Each part has several chapters, each meant to be presented as one lecture.

The author uses motivating case studies that realistically mimic a data scientist's experience. He starts by asking specific questions and answers these through data analysis so concepts are learned as a means to answering the questions. Examples of the case studies included are: US murder rates by state, self-reported student heights, trends in world health and economics, the impact of vaccines on infectious disease rates, the financial crisis of 2007-2008, election forecasting, building a baseball team, image processing of hand-written digits, and movie recommendation systems.

The statistical concepts used to answer the case study questions are only briefly introduced, so complementing with a probability and statistics textbook is highly recommended for in-depth understanding of these concepts. If you read and understand the chapters and complete the exercises, you will be prepared to learn the more advanced concepts and skills needed to become an expert.

A complete solutions manual is available to registered instructors who require the text for a course.

Information

Year
2019
ISBN
9781000708035

1 Getting started with R and RStudio

_______________

1.1 Why R?

R is not a programming language like C or Java. It was not created by software engineers for software development. Instead, it was developed by statisticians as an interactive environment for data analysis. You can read the full history in the paper A Brief History of S [1]. Interactivity is an indispensable feature in data science because, as you will soon learn, the ability to quickly explore data is a necessity for success in this field. However, as in other programming languages, you can save your work as scripts that can be easily executed at any moment. These scripts serve as a record of the analysis you performed, a key feature that facilitates reproducible work. If you are an expert programmer, do not expect R to follow the conventions you are used to, or you will be disappointed. If you are patient, you will come to appreciate the unequaled power of R when it comes to data analysis and, specifically, data visualization.
Other attractive features of R are:
  1. R is free and open source [2].
  2. It runs on all major platforms: Windows, macOS, and UNIX/Linux.
  3. Scripts and data objects can be shared seamlessly across platforms.
  4. There is a large, growing, and active community of R users and, as a result, there are numerous resources for learning and asking questions [3] [4] [5].
  5. It is easy for others to contribute add-ons, which enables developers to share software implementations of new data science methodologies. This gives R users early access to the latest methods and to tools developed for a wide variety of disciplines, including ecology, molecular biology, social sciences, and geography, to name just a few.

1.2 The R console

Interactive data analysis usually occurs on the R console, which executes commands as you type them. There are several ways to gain access to an R console. One way is to simply start R on your computer. The console looks something like this:
____________________________________
1 https://pdfs.semanticscholar.org/9b48/46f192aa37ca122cfabb1ed1b59866d8bfda.pdf
2 https://opensource.org/history
3 https://stats.stackexchange.com/questions/138/free-resources-for-learning-r
4 https://www.r-project.org/help.html
5 https://stackoverflow.com/documentation/r/topics
[Image: the R console]
As a quick example, try using the console to calculate a 15% tip on a meal that cost $19.71:
0.15 * 19.71
#> [1] 2.96
Note that in this book, grey boxes are used to show R code typed into the R console. The symbol #> is used to denote what the R console outputs.
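The same calculation can also be written with named values, which makes it easier to change the inputs later. The following is a minimal sketch; the names meal_cost, tip_rate, tip, and total are illustrative and do not come from the book.

```r
# Illustrative names, not taken from the book
meal_cost <- 19.71   # cost of the meal in dollars
tip_rate <- 0.15     # a 15% tip

# Compute the tip and round it to cents
tip <- tip_rate * meal_cost
round(tip, 2)
#> [1] 2.96

# Total amount to pay, rounded to cents
total <- meal_cost + tip
round(total, 2)
#> [1] 22.67
```

Changing meal_cost or tip_rate and rerunning the remaining lines updates both results.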

1.3 Scripts

One of the great advantages of R over point-and-click analysis software is that you can save your work as scripts. You can edit and save these scripts using a text editor. The material in this book was developed using the interactive integrated development environment (IDE) RStudio [6]. RStudio includes an editor with many R-specific features, a console to execute your code, and other useful panes, including one to show figures.
____________________________________
6 https://www.rstudio.com/
Most web-based R consoles also provide a pane to edit scripts, but not all permit you to save the scripts for later use.
All the R scripts used to generate this book can be found on GitHub [7].
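As a rough illustration of how a script serves as a record of an analysis, the sketch below writes a two-line script to a temporary file and then runs it with source(). The file path and variable names are hypothetical; in practice you would write and save the script in an editor such as RStudio rather than creating it from code.

```r
# Hypothetical example: create a small script file, then run it.
script_path <- tempfile(fileext = ".R")

# The contents of the script: the tip calculation from the console example
writeLines(c(
  "meal_cost <- 19.71",
  "tip <- 0.15 * meal_cost"
), script_path)

# source() executes the lines of the script in order
source(script_path)

# The objects created by the script are now available in the session
round(tip, 2)
```

Because the script is stored in a file, the same analysis can be rerun later or shared with others.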

1.4 RStudio

RStudio will be our launching pad for data science projects. It not only provides an editor for us to create and edit our scripts but also provides many other useful tools. In this section, we go over some of the basics.

1.4.1 The panes

When you start RStudio for the first time, you will see three panes. The left pane shows the R console. On the right, the top pane includes tabs such as Environment and History, while the bottom pane shows five tabs: Files, Plots, Packages, Help, and Viewer (these tabs may change in new versions). You can click on each tab to move across the different features.
____________________________________
7 https://github.com/rafalab/dsbook
To start a new script, you can click on File, then New File, then R Script.
This opens a new pane on the left, and this is where you can start writing your script.

1.4.2 Key bindings

Many tasks we perform with the mouse can be achieved with a combination of keystrokes instead. These keyboard versions for performing tasks are referred to as key bindings. For example, we just showed how to use the mouse to start a new script, but you can also use a key binding: Ctrl+Shift+N on Windows and Command+Shift+N on the Mac.
Although in this tutorial we often show how to use the mouse, we highly recommend that you memorize key bindings for the operations you use most. RStudio provides a useful cheat sheet with the most widely used commands. You can get it from RStudio directly:
[Image: where to find the cheat sheet in RStudio]
You might want to keep this handy so you can look up key bindings when you find yourself repeatedly pointing and clicking.

1.4.3 Running commands while editing scripts

There are many editors specifically made for coding. These are useful because color and indentation are automatically added to make code more readable. RStudio is one of these editors, and it was specifically developed for...

Citation styles for Introduction to Data Science

APA 6 Citation

Irizarry, R. (2019). Introduction to Data Science (1st ed.). CRC Press. Retrieved from https://www.perlego.com/book/1520484/introduction-to-data-science-data-analysis-and-prediction-algorithms-with-r-pdf (Original work published 2019)

Chicago Citation

Irizarry, Rafael. (2019) 2019. Introduction to Data Science. 1st ed. CRC Press. https://www.perlego.com/book/1520484/introduction-to-data-science-data-analysis-and-prediction-algorithms-with-r-pdf.

Harvard Citation

Irizarry, R. (2019) Introduction to Data Science. 1st edn. CRC Press. Available at: https://www.perlego.com/book/1520484/introduction-to-data-science-data-analysis-and-prediction-algorithms-with-r-pdf (Accessed: 14 October 2022).

MLA 7 Citation

Irizarry, Rafael. Introduction to Data Science. 1st ed. CRC Press, 2019. Web. 14 Oct. 2022.