eBook - ePub

Introduction to R for Social Scientists

Author: Ryan Kennedy, Philip D. Waggoner

A Tidy Programming Approach

198 pages
English

About This Book

Introduction to R for Social Scientists: A Tidy Programming Approach introduces the Tidy approach to programming in R for social science research to help quantitative researchers develop a modern technical toolbox. The Tidy approach is built around consistent syntax, common grammar, and stacked code, which contribute to clear, efficient programming. The authors include hundreds of lines of code to demonstrate a suite of techniques for developing and debugging an efficient social science research workflow. To reinforce Tidy best practices for conducting social science research in R, the authors include numerous examples using real-world data, including the American National Election Study and the World Indicators Data. While no prior experience in R is assumed, readers are expected to be acquainted with common social science research designs and terminology.

Whether they use the book as a reference manual or read it from cover to cover, readers will be equipped with a deeper understanding of R and the Tidyverse, as well as a framework for how best to leverage these powerful tools to write tidy, efficient code for solving problems. To this end, the authors provide many suggestions for additional readings and tools to build on the concepts covered. They use all covered techniques in their own work as scholars and practitioners.

Information

Publisher: Chapman and Hall/CRC
Year: 2021
ISBN: 9781000353877
Topic: Computer Science
Subtopic: Data Processing

R is a widely used statistical environment that has become very popular in the social sciences because of its power and extensibility. However, the way that R is taught to many social scientists is, we think, less than ideal. Many social scientists come to R after learning another statistical program (e.g., SAS, SPSS, or Stata). They do this for a variety of reasons: they find there are tasks they cannot do in these other programs, they collaborate with colleagues who work in R, or they are told that they need to learn R. For others, R may be the first statistical program they encounter, but they come to it without any kind of experience with programming (or even, increasingly, with using a text interface).

This is part of why “learning R” can be frustrating. When learning R for the first time, most students are shown how to undertake particular tasks in the style of a cookbook (i.e., here is how you conduct a regression analysis in R), with little effort dedicated to developing an underlying intuition of how R works as a language. As a result, for those who have experience with other statistical programs, R comes across as a harder way to do the same things they can do more easily in another program. This cookbook approach can also produce frustration for those who are coming to R as their first statistical analysis environment. Working with R in such a way becomes a process of copying and pasting, with only a shallow understanding of why things have a particular structure and, thus, difficulty moving beyond the demonstrated examples.

Finally, the cookbook approach is, in many ways, a holdover from the pre-internet era, when large coding manuals were a critical reference for finding out how to do anything in a complex program. These books had to be exhaustive, since they were needed as much for reference as for learning the environment. Today, however, there is a plethora of online materials to demonstrate how to perform specific tasks in R, and exhaustiveness can come at a cost to comprehension. What most beginners with R need is a concrete introduction to the fundamentals, which will allow them to fully leverage the tools available online.

This book is focused on equipping readers with the tools and knowledge to overcome their initial frustration and fully engage with R. We introduce a modern approach to programming in R — the Tidyverse. This set of tools introduces a consistent grammar for working with R that allows users to quickly develop intuitions of how their code works and how to conduct new tasks. We have found this increases the speed of learning and encourages creativity in programming.

This book is based on an intensive 3-day workshop introducing R, taught by one of the authors at the Inter-University Consortium for Political and Social Research (ICPSR), as well as numerous workshops and classes (at both the undergraduate and graduate levels) conducted by both authors. The goal is to have the reader: (1) understand and feel comfortable using R for data analysis tasks, (2) have the skills necessary to approach just about any task or program in R with confidence, and (3) appreciate what R allows a researcher to do and want to further their knowledge.

1.1 Why R?

If you have picked up this book, chances are that you already have a reason for learning R. But let’s go through some of the more common reasons why conducting your research in R is a good idea.

One of the major attractions of R is that it is free and open source. R was created by Ross Ihaka and Robert Gentleman, of the Department of Statistics at the University of Auckland, in the early 1990s (Ihaka and Gentleman, 1996). It was designed to be a dialect of the S statistical language, which was developed at Bell Labs and later sold commercially as S-PLUS. Unlike S-PLUS, however, R was released under the GNU General Public License, which allows users to freely download, alter, and redistribute it.

The result of this open source license is that R is accessible to everyone, without exorbitant licensing fees. It is also regularly updated and maintained, with frequent releases that allow for quick fixing of bugs and the addition of new features.¹ Perhaps most importantly, the open source nature allows users to contribute their own additions to R in the form of “packages.” You will often hear R users say, in response to a question about how to do something in R, “There is a package for that.” From running advanced statistical models to ordering an Uber (the ubeR package) or making a scatterplot with cats instead of points (the CatterPlots package), it is likely that someone has developed a way to do it in R. As of 2015, there were over 10,000 packages on the Comprehensive R Archive Network (CRAN), with scores more being created all the time. Indeed, the book you are reading now was originally written completely in R using R Markdown and the bookdown package (Xie, 2019).

_________________

¹ The major new release usually comes in the spring (around April), so you should, at a minimum, update your R system around this time.
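
To make the package mechanism described above concrete, here is a minimal sketch of installing and loading a package. The helper name `load_or_install` and the choice of CRAN mirror are assumptions made for illustration; they do not come from the book. The calls themselves (`requireNamespace()`, `install.packages()`, `library()`) are standard R functions.

```r
# Hypothetical helper: load a package, installing it from CRAN first if needed.
load_or_install <- function(pkg, repos = "https://cloud.r-project.org") {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # character.only = TRUE lets library() use the name stored in `pkg`
  library(pkg, character.only = TRUE)
  # Report (invisibly) whether the package is now loaded
  invisible(pkg %in% loadedNamespaces())
}

# "stats" ships with R, so this call downloads nothing;
# replace it with the name of any CRAN package you need.
ok <- load_or_install("stats")
```

In everyday use, most people simply run `install.packages("name")` once and then call `library(name)` at the top of each script; the helper only bundles those two steps.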

Another reason for learning R is flexibility. R is both a language and an “environment” where users can do statistics and analysis. This covers a lot of ground, from data visualization and exploratory data analysis to complex modeling, advanced programming, and computation. R allows you to scrape data from websites, interact with APIs, and even create your own online (“Shiny”) applications. This flexibility, in turn, allows you as a researcher to undertake a wider variety of research tasks, some of which you might not even have considered previously.

Though R is wonderfully flexible, fast, and efficient, the learning curve can be quite steep, as users must learn to write code. For example, in some other popular statistics programs, users can point-and-click on the models they want with little to no interaction with the mechanics behind what is going on. This is both good and bad. It is good in that the learning curve in point-and-click interfaces is much gentler and more accommodating. It is bad in that it restricts the user's interaction with the process of coding and statistical analysis. Point-and-click encourages minimal interaction with the data and tasks, and ultimately encourages following the well-trod path of others rather than creating your own.

The coding process required by R is also increasingly becoming the standard in the social sciences. The “replication revolution” in the social sciences has encouraged, and in some cases required, scholars to think not only about how they will share their results, but also about how they will share the way they got those results (King, 1995; Collaboration et al., 2015; Freese and Peterson, 2017). Indeed, several of the top social science journals, including American Economic Review, Journal of Political Economy, PLOS ONE, American Journal of Political Science, and Sociological Methods and Research, now require submission of replication code and/or data prior to publication. Still others strongly encourage the submission of replication code. R code is ideal for this purpose: there are almost no obstacles to other scholars downloading and running your R code. The same cannot be said about programs that require licenses and point-and-click interaction.

This replication process can also be useful for your own work. There is a common refrain among computer programmers that, “If you do not look at your code for a month, and have not included enough comments to explain what the commands do, it might as well have been written by someone else.” The same is true of point-and-click software. If you have a process that is reasonably complex and you do not work with it for a while, you might completely forget how to do it. By writing an R script, you have a written record of how you did each task, which you can easily execute again.
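
As an illustration of a script serving as a written record, the following minimal sketch uses only base R and the built-in `mtcars` data set. The file name and the particular analysis steps are hypothetical, chosen just to show how comments document each step.

```r
# analysis.R (hypothetical file name): a small, fully scripted analysis.

# Step 1: use a data set that ships with R, so the script runs anywhere.
cars_data <- mtcars

# Step 2: create a new variable; `wt` is recorded in thousands of pounds.
cars_data$weight_lbs <- cars_data$wt * 1000

# Step 3: compute mean miles per gallon for each number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars_data, FUN = mean)

# Step 4: print the result; rerunning the script repeats every step above.
print(mpg_by_cyl)
```

Because every step is written down, returning to the analysis after a month requires only rereading the comments and rerunning the file.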

Additionally, we recommend the use of R in a variety of applied research settings because of the high-quality options for visualization. Broadly, R uses layers to build plots. This layering provides many flexible options for users to interact directly with their visual tools to produce high-quality graphical depictions of quantities of interest. Further, some packages, e.g., ggplot2, use something called the “grammar of graphics” (Wilkinson, 2012), which is a process of streamlining the building of sophisticated plots and figures in R (Wickham et al., 2019b; Wickham, 2009; Healy, 2018). This and other similar packages offer users even more advanced tools for generating high-quality, publication-ready visualizations (Lüdecke et al., 2020).
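
The layering idea can be sketched with ggplot2 as follows. This is an illustrative example rather than one taken from the book; it assumes the ggplot2 package is installed and uses the built-in `mtcars` data.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Start with the data and the aesthetic mapping (which variable goes where).
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add layers one at a time with `+`.
p <- p +
  geom_point() +                                           # layer 1: raw observations
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # layer 2: linear trend
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")    # axis labels

print(p)  # draw the plot
```

Each `geom_*()` call adds one layer on top of the previous ones, so removing or swapping a single line changes only that part of the figure.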

Finally, we highly recommend R because of its community. From blogs and local “R User” community groups in cities throughout the world to a host of conferences (e.g., UseR, EARL, rstudio::conf), the R community is a welcoming place. Further, the open source nature of R contributes to a communal atmosphere, where innovation and sophistication in programming and practice are highly prioritized. Put simply, R users want R to be the best it can be. The result is an inclusive community filled with creative programmers and applied users all contributing to this broader goal of a superior computing platform and language. And in the words of one of the most influential modern R developers, Hadley Wickham (a name you will see a lot in this book) (Waggoner, 2018a),

… When you talk about choosing programming languages, I always say you shouldn’t pick them based on technical merits, but rather pick them based on the community. And I think the R community is like really, really strong, vibrant, free, welcoming, and embraces a wide range of domains. So, if there are people like you using R, then your life is going to be much easier.

Therefore, although R is tricky to learn, users who are engaged with data in any way, whether working for an NGO, attending graduate school, or even doing legal work, will be glad they opted to begin with R and endured the hard but vastly rewarding work up front.

1.2 Why This Book?

There are many good introductions to R (Monogan III, 2015; Li, 2018; Wickham and Grolemund, 2017), and we will point you towards several of them throughout. Yet, this book provides a unique and beneficial starting place, particularly for social scientists. There are several features of this book that lead us to this conclusion.

First, it is written specifically for social scientists. Many of the best introductions to R are written for those who are coming from other programming languages (e.g., Python, C++, Java) or from database design (e.g., Spark, SQL). The assumption is that the reader will already be familiar with programming concepts, like objects, functions, and scope, or even with R itself. This, however, does not apply to most social scientists, who usually do not come in with experience in either programming or database management, and will, therefore, find these concepts unfamiliar and often quite vexing. We also include details that are likely to be particularly relevant to social scientists, such as how to automatically generate tables using R.

Second, we write this as a genuine introduction course, not as a cookbook. Cookbooks have their place for learning R. They provide handy guides to completing particular tasks, and are indispensable as you go through your work. But, just as following the steps to make bread is not the same as understanding how bread is made, copying code from a book or online resource is not the same as developing the skill base to flourish as a data analyst who uses R. For a similar reason, unlike some other introductions, we do not create any special software specifically for this book: you are here to learn R, not a piece of software we designed. This book concentrates on helping you to understand what you are doing and why. After working your way through this book, you should be able to undertake a range of tasks in R, more easily learn new ones, and even troubleshoot your own errors.

Third, we provide a thoroughly modern introduction to R. While using the word “modern” in any book is a risky proposition, we mean this in terms of using the latest tools as of this writing to help you be as productive as possible. This means using the RStudio integrated development environment (IDE) to assist you in writing and running code, R projects to keep track of and organize your work, and the Tidyverse set of tools to make your code more modular and comprehensible.
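
To give a flavor of what modular, comprehensible Tidyverse code looks like, here is a minimal sketch of a pipeline. It is an illustration under stated assumptions, not an example from the book: it assumes the dplyr package is installed, and the object name `heavy_by_cyl` and the weight cutoff are arbitrary.

```r
library(dplyr)  # assumes dplyr (part of the Tidyverse) is installed

# Each verb does one job, and the pipe (%>%) passes the result to the next step.
heavy_by_cyl <- mtcars %>%
  filter(wt > 3) %>%                  # keep cars heavier than 3,000 lbs
  group_by(cyl) %>%                   # form one group per cylinder count
  summarise(mean_mpg = mean(mpg),     # average fuel economy in each group
            n_cars = n())             # number of cars in each group

print(heavy_by_cyl)
```

Read top to bottom, the pipeline states what happens to the data in order, which is the sense in which the code is "stacked."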

Fourth, we concentrate on the areas of learning R that you will use the most often and that are typically the most frustrating for beginners. Many people have heard of the “Dunning-Kruger effect,” which is the tendency for people with low ability to overestimate their ability (Kruger and Dunning, 1999). Many people forget about the inverse part of the Dunning-Kruger effect: the tendency for experts to underestimate the difficulty of tasks in which they are expert. This sometimes exhibits itself in R introductions that attempt to introduce quite advanced statistical models, but give little to no attention to issues like file systems and data management. Yet things like setting working directories are some of the most common stumbling blocks for students, and data scientists will often say that 80% of their job is managing and shaping data, but this is almost never reflected in introductory texts...