
Text as Data

A New Framework for Machine Learning and the Social Sciences

Justin Grimmer, Margaret E. Roberts, Brandon M. Stewart


About This Book

A guide for using computational text analysis to learn about the social world.

From social media posts and text messages to digital government documents and archives, researchers are bombarded with a deluge of text reflecting the social world. This textual data gives unprecedented insights into fundamental questions in the social sciences, humanities, and industry. Meanwhile, new machine learning tools are rapidly transforming the way science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights.

Text as Data is organized around the core tasks in research projects using text—representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented complete with real-world applications, example methods, and a distinct style of task-focused research.

Bridging many divides—computer science and social science, the qualitative and the quantitative, and industry and academia—Text as Data is an ideal resource for anyone wanting to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain.

  • Overview of how to use text as data
  • Research design for a world of data deluge
  • Examples from across the social sciences and industry


PART I

Preliminaries

CHAPTER 1

Introduction

This is a book about the use of texts and language to make inferences about human behavior. Our framework for using text as data is aimed at a wide variety of audiences—informing social science research, offering guidance for researchers in the digital humanities, providing solutions to problems in industry, and addressing issues faced in government. This book is relevant to such a wide range of scholars and practitioners because language is an important component of social interaction—it is how laws are recorded, religious beliefs articulated, and historical events reported. Language is also how individuals voice complaints to representatives, organizers appeal to their fellow citizens to join in protest, and advertisers persuade consumers to buy their product. And yet, quantitative social science research has made surprisingly little use of texts—until recently.
Texts were used sparingly because they were cumbersome to work with at scale. It was difficult to acquire documents because there was no clear way to collect and transcribe all the things people had written and said. Even if the texts could be acquired, it was impossibly time-consuming to read collections of documents filled with billions of words. And even if the reading were possible, organizing the texts into relevant categories, or measuring the presence of concepts of interest, was often perceived to be an impossible task. Not surprisingly, texts did not play a central role in the evidence base of the social sciences. And when texts were used, it was either in small datasets or by massive, well-funded teams of researchers.
Recently, there has been a dramatic change in the cost of analyzing large collections of text. Social scientists, digital humanities scholars, and industry professionals are now routinely making use of document collections. It has become common to see papers that use millions of social media messages, billions of words, and collections of books larger than the world’s largest physical libraries. Part of this change has been technological. With the rapid expansion of the internet, texts became much easier to acquire. At the same time, computational power increased—laptop computers could handle computations that previously would require servers. And part of the change was also methodological. A burgeoning literature—first in computer science and computational linguistics, and later in the social sciences and digital humanities—developed tools, models, and software that facilitated the analysis and organization of texts at scale.
Almost all of the applications of large-scale text analysis in the social sciences use algorithms either first developed in computer science or built closely on those developments. For example, numerous papers within political science—including many of our own—build on topic models (Blei, Ng, and Jordan, 2003; Quinn et al., 2010; Grimmer, 2010; Roberts et al., 2013) or use supervised learning algorithms for document classification (Joachims, 1998; Jones, Wilkerson, and Baumgartner, 2009; Stewart and Zhukov, 2009; Pan and Chen, 2018; Barberá et al., 2021). Social scientists have also made methodological contributions themselves, and in this book we will showcase many of these new models designed to accomplish new types of tasks. Many of these contributions have even flowed from the social sciences to computer science. Statistical models used to analyze roll call votes, such as Item Response Theory models, are now used in several computer science articles (Clinton, Jackman, and Rivers, 2004; Gerrish and Blei, 2011; Nguyen et al., 2015). Social scientists have broadly adapted the tools and techniques of computer scientists to social science questions.
However, the knowledge transfer from computer science and related fields has created confusion about how text as data models are applied, how they are validated, and how their output is interpreted. This confusion emerges because tasks in academic computer science are different from tasks in social science, the digital humanities, and even parts of industry. While computer scientists are often (but not exclusively!) interested in information retrieval, recommendation systems, and benchmark linguistic tasks, a different community is interested in using “text as data” to learn about previously studied phenomena in social science, literature, and history. Despite these differences of purpose, text as data practitioners have tended to reflexively adopt the guidance of the computer science literature in their own work. This blind importing of the default methods and practices used to select, evaluate, and validate models in the computer science literature can lead to unintended consequences.
This book will demonstrate how to treat “text as data” for social science tasks and social science problems. We think this perspective can be useful beyond just the social sciences in the digital humanities, industry, and even mainstream computer science. We organize our argument around the core tasks of social science research: discovery, measurement, prediction, and causal inference. Discovery is the process of creating new conceptualizations or ways to organize the world. Measurement is the process where concepts are connected to data, allowing us to describe the prevalence of those concepts in the real world. These measures are then used to make a causal inference about the effect of some intervention or to predict values in the future. These tasks are sometimes related to computer science tasks that define the usual way to organize machine learning books. But as we will see, the usual distinctions made between particular types of algorithms—such as supervised and unsupervised—can obscure the ways these tools are employed to accomplish social science tasks.
Building on our experience developing and applying text as data methods in the social sciences, we emphasize a sequential, iterative, and inductive approach to research. Our experience has been that we learn the most in social science when we refine our concepts and measurements iteratively, improving our own understanding of definitions as we are exposed to new data. We also learn the most when we consider our evidence sequentially, confirming the results of prior work, then testing new hypotheses, and, finally, generating hypotheses for future work. Future studies continue the pattern, confirming the findings from prior studies, testing prior speculations, and generating new hypotheses. At the end of the process, the evidence is aggregated to summarize the results and to clarify what was learned. Importantly, this process doesn’t happen within the context of a single article or book, but across a community of collaborators.
This inductive method provides a principled way to approach research that places a strong emphasis on an evolving understanding of the process under study. We call this understanding theory—explanations of the systematic facets of social process. This is an intentionally broad definition encompassing formal theory, political/sociological theory, and general subject-area expertise. At the core of this book is an argument that scholars can learn a great deal about human behavior from texts but that to do so requires an engagement with the context in which those texts are produced. A deep understanding of the social science context will enable researchers to ask more important and impactful questions, ensure that the measures they extract are valid, and be more attentive to the practical and ethical implications of their work.
We write this book now because the use of text data is at a critical point. As more scholars adopt text as data methods for their research, a guide is essential to explain how text as data work in the social sciences differs from such work in computer science. Without such a guide, researchers outside of computer science run the risk of applying the wrong algorithms, validating the wrong quantities, and ultimately making inferences that are not justified by the evidence they have acquired.
We also focus on texts because they are an excellent vehicle for learning about recent advances in machine learning. The argument that we make in this book about how to organize social science research applies beyond texts. Indeed, we view our approach as useful for social science generally, but particularly for any application where researchers are using large-scale data to discover new categories, measure their prevalence, and then assess their relationships in the world.

1.1 How This Book Informs the Social Sciences

A central argument of this book is that the goal of text as data research differs from the goals of computer science work. Fortunately, this difference is not so great that many of the tools and ideas first developed in other fields cannot be applied to text as data problems. It does imply, however, that we have to think more carefully about what we learn from applying those models.
To help us make our case, consider the use of texts by political scientist Amy Catalinac (Catalinac, 2016a)—a path-breaking demonstration of how electoral district structure affects political candidates’ behavior. We focus on this book because the texts are used clearly, precisely, and effectively to make a social science point, even though the algorithm used to conduct the analysis comes from a different discipline. And importantly, the method for validation used is distinctively social scientific and thorough.
Catalinac’s work begins with a puzzle: why have Japanese politicians allocated so much more attention to national security and foreign policy after 1997, despite significant social, political, and governmental constraints on discussion of the military and foreign policy put in place after World War II? Catalinac (2016a) argues that a 1994 reform in how Japanese legislators are elected explains the change because it fundamentally altered the incentives that politicians face. Before the 1994 reform, Japanese legislators were elected through a system where each district was represented by multiple candidates and each party would run several candidates in each district, trying to win a majority of the seats. Because multiple candidates from the same party couldn’t effectively compete with their co-partisans on ideological issues, representatives tried to secure votes by delivering as much pork—spending that has only local impact, such as for building a bridge—to the district as possible. The post-1994 system eliminated multi-member districts and replaced them with a parallel system: single-member districts, where voters cast their ballot for a candidate, and seats representing the whole country, where voters cast their ballot for a party and the elected officials are chosen from the party’s list. This new system allowed the parties to impose stricter ideological discipline on their members, and the choices of voters became less about individual personalities and more about party platforms. Thus, the argument goes, the reform changed legislators’ incentives: focusing on local issues like pork was now less advantageous than focusing on national issues like foreign policy.
Figure 1.1. An example of a candidate manifesto of Kanezo Muraoka from 2003 (Figure 3.7 from Catalinac, 2016a).
The argument proceeds through iteration and induction. To begin understanding the effect of the change in electoral rules on electoral strategy, Catalinac collected an original dataset of 7,497 Japanese Diet candidate manifestos. The manifestos are nearly ideal data for her study: they are important to candidates and voters, under the control of candidates, and available for all candidates for all elections for a period before and after the shift in electoral rules. We discuss the principles for data collection in Chapter 4, but Catalinac’s exemplary work shows that working with text data does not mean that we must opt for the most convenient data. Rather, Catalinac engaged in a painstaking data collection process to find the manifestos through archival visits and digitize them through manual transcription. This process alone took years.
With the data in hand, Catalinac uses an inductive approach to learn the categories in her data that she needs to investigate her empirical puzzle: what elected officials discuss when they run for office. Catalinac uses a well-known statistical model, Latent Dirichlet Allocation (LDA)—which we return to in Chapter 13—to discover an underlying set of topics and to measure the proportion of each manifesto that belongs to each topic. As Catalinac describes,
Typically, the model is fit iteratively. The researcher sets some number of topics; runs the model; ascertains the nature of the topics outputted by reading the words and documents identified as having high probabilities of belonging to each of the topics; and decides whether or not those topics are substantively meaningful.… My approach was also iterative and guided by my hypotheses.
(Catalinac, 2016a, p. 84)
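To make the iterative workflow in this passage concrete, the sketch below shows one round of the fit-inspect-refit loop using scikit-learn’s LDA implementation. Everything here is a hypothetical placeholder (the corpus, the choice of 20 topics, the vectorizer settings), not Catalinac’s actual pipeline.

```python
# A minimal sketch of the fit-inspect-refit loop described above, using
# scikit-learn's LDA. The corpus and settings are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["..."]  # one string per manifesto (hypothetical corpus)

# Represent each document as a vector of word counts.
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
dtm = vectorizer.fit_transform(documents)

# Step 1: set some number of topics and fit the model.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
theta = lda.fit_transform(dtm)  # theta[d, k]: share of document d in topic k

# Step 2: read the highest-probability words in each topic to judge whether
# the topics are substantively meaningful; if not, change the number of
# topics (or the preprocessing) and refit.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {k}: {', '.join(top_words)}")
```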
As we describe in Chapter 4, discovery with text data does not mean that we begin with a blank slate. Catalinac’s prior work, qualitative interviews, and expertise in Japanese politics helped to shape the discoveries she made in the text. We can bring this prior knowledge to bear in discovery; theory and hunches play a role in defining our categories, but so too does the data itself.
Catalinac uses the model fit from LDA to measure the prevalence of candidates’ discussions of pork, policy, and other categories of interest. To establish which topics capture these categories, Catalinac engages in extensive validation. Importantly, her validations are not the validations most commonly conducted in computer science, where LDA originated. Those validations tend to focus on how LDA functions as a language model—that is, how well it is able to predict unseen words in a document. For Catalinac’s purposes, it isn’t important that the model can predict unseen words—she has all the words! Instead, her validations are designed to demonstrate that her model has uncovered an organization that is interesting and useful for her particular social scientific task: assessing how a change in the structure of districts affected the behavior of candidates and elected officials. Catalinac engages in two broad kinds of validation. First, she does an in-depth analysis of the particular topics that the model automatically discovers, reading both the high probability words the model assigns to the topic and the manifestos the model indicates are most aligned with each topic. This analysis assures the reader that her labels and interpretations of the computer-discovered topics are both valid and helpful for her social scientific task. Second, she shows that her measures align with well-known facts about Japanese politics. This step ensures that the measures that come from the manifestos are not idiosyncratic or reflecting a wildly different process than that studied in other work. It also provides further evidence that the labels Catalinac assigns to texts are valid reflections of the content of those texts.
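As an illustration of that first kind of validation, one might read the manifestos the model assigns most heavily to each topic. Continuing the hypothetical sketch above (theta and documents are the placeholders defined there):

```python
import numpy as np

# For a topic of interest, pull the documents with the largest estimated
# topic share and read them to check that the topic label is a valid
# description of their content. The topic index is a placeholder.
k = 3
top_docs = np.argsort(theta[:, k])[::-1][:5]
for d in top_docs:
    print(f"doc {d} (topic share {theta[d, k]:.2f}): {documents[d][:200]}")
```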
Of course, Catalinac is not interested in just categorizing the texts for their own sake—she wants to use the categories assigned to the texts as a source of data to learn about the world. In particular, she wants to estimate the causal effect of the 1994 electoral reform on the shift in issues discussed by candidates when they are running. To do this, she uses her validated model and careful research design to pursue her claim that the e...
