"Big data" has entered the academic lexicon as a new buzzword. Although there are no clear guidelines for what dataset size qualifies as "big," there is widespread recognition that the availability of massive digital datasets provides a novel opportunity for scholars. By using traces of data left behind by people as they navigate their digital environments (the sites they peruse, the social media posts they make, the way they interact with sites), scholars can analyze people's expressed attitudes and behaviors. In this volume, we focus on what political communication scholars can learn by studying digital trace data: the transmission of information and opinions in public, digital spaces. The messages left in comment sections, posted on social media sites, and tweeted by bloggers provide the raw data for new understandings of how citizens, elites, and journalists make sense of the political world. This book aims to examine the theoretical and methodological implications of big data, and to provide new empirical research that makes use of big data.
There are intriguing possibilities from working with these data. Unlike traditional survey and experimental datasets, big data (at least as conceptualized here) are not created under contrived circumstances. And, unlike in-depth interviews or ethnographies, big data are available on a much larger scale. Of course, the datasets have limitations. Big data come from self-selected participants: only those who have a Twitter account and want to tweet about politics, for instance, will be included in a political Twitter dataset. This is a substantial weakness only if one is looking to make inferences about the broader population. Further, the data are constrained by technology. Algorithmic changes, for instance, can affect the data, as can the availability of digital archives.
Nonetheless, big data present major research possibilities for political communication scholars who are interested in how citizens, elites, and journalists interact. Political discussions, for instance, have long been of interest to communication scholars (e.g. Katz & Lazarsfeld, 1955; Mutz, 2006; Price & Cappella, 2002). With the availability of social media data, academics can observe, on a large scale, how people talk about and interact with politics. The opportunity to study political discussions is also available to media organizations and political elites: examining how they make use of big data represents another fruitful scholarly trajectory. The scholars involved in this book represent forward thinkers who aim to inform the study of political communication by analyzing the behavior of and messages left by citizens, elites, and journalists in digital spaces. Using a variety of methodological approaches and bringing diverse theoretical perspectives, this group is poised to shed light on how big data can inform political communication scholarship.
Big Data and Related Terms
Electing to use the term "big data" to describe this book was not an easy choice. It is fraught with complication because there is no definition of what makes data "big." The best definitions offered by contributors to this volume sidestep this issue. Bode, for instance, defines big data as "information that is (1) created digitally and (2) collected in large numbers to facilitate analysis." Guo identifies big data as "any large-scaled numerical, textual, visual, or geographic data, which can be analyzed to reveal patterns and trends of human behavior." She goes further, saying that the size and complexity of big data are beyond traditional tools for gathering and analyzing data.
We tend to agree that there is no bright line distinguishing big data from medium, or small, data. Nonetheless, the term is useful because it conveys the advanced tools required for gathering and analyzing this form of data. Some traditional statistical programs are unable to accommodate datasets of this size. Further, these datasets tax traditional computers' storage and processing capacities. As technology improves, however, this definition of big data seems less relevant (boyd & Crawford, 2012).
We considered other terms that also seem to capture the phenomenon of interest. Most of the authors in this book are interested in a particular type of big data: digital trace data. In this volume, Jungherr, drawing on work from Howison, Wiggins, and Crowston (2011), defines digital trace data as "data documenting the interactions of users with digital devices or services." These data are, quite literally, the traces that people leave behind when they have engaged in digital spaces. This could be browser history, comments, or social media posts, and the list could continue indefinitely. Of course, there are other types of big data beyond digital trace data; you could think about big datasets with relevance to medicine or engineering. For communication scholars, however, digital trace datasets are often of primary interest.
We also considered using the term "computational social science," which captures a method frequently employed by those using big data. As Shah, Cappella, and Neuman (2015) explain, computational social science involves:
(1) the use of large, complex datasets, often, though not always, measured in terabytes or petabytes; (2) the frequent involvement of "naturally occurring" social and digital media sources and other electronic databases; (3) the use of computational or algorithmic solutions to generate patterns and inferences from these data; and (4) the applicability to social theory in a variety of domains from the study of mass opinion to public health, from examinations of political events to social movements (p. 7).
This form of analysis is at the intersection of computer and social science, and can require collaborations with computer scientists, as Guo notes in her chapter.
Acknowledging that the work here involves both digital trace data and computational social science, we nonetheless opted for the term "big data." We did so for several reasons. First, "big data" has gained traction in academic communities, and is now widely discussed in popular and scholarly contexts. Second, we wanted to focus on the data in this volume, rather than the method. The term data, we felt, lent itself to more diverse analyses, such as Baldwin-Philippi's qualitative work on how campaigns are using "big data." So, with an acknowledgment of the complexities of the term, we adopted it as a defining feature of the chapters that follow.
Big Data and Political Communication
Political communication scholars aim to look at how elites, the media, and the public interact around political topics. Big data allow many opportunities to do precisely this work, as all three entities leave volumes of trace data. Research to date has used big data approaches to examine how political elites communicate (McGregor, Lawrence, & Cardona, 2017), how agenda setting occurs across traditional and social media (Neuman, Guggenheim, Jang, & Bae, 2014), and how norms regarding incivility and partisanship are rewarded and punished in news comment sections (Muddiman & Stroud, 2017). Studies like these demonstrate the utility of this approach for answering questions of theoretical interest to political communication researchers.
Methodologically, political communication scholars should be especially well poised to make contributions to the study of big data. Political content is widely distributed on such platforms as Twitter, and political news garners extensive comments on news sites (Coe, Kenski, & Rains, 2014). Communication scholars have been pioneers in the analysis of texts and the methods of content analysis (e.g. Krippendorff, 2012); political communication scholars, in particular, have been developing computerized content-analysis programs that can be used to analyze large corpuses of text (e.g. Hart, 1985; Young & Soroka, 2012). The availability of content and methods relevant to political communication makes this volume particularly apropos.
With that said, the explosion of research related to big data means that this volume will not be comprehensive. Several aspects of big data, such as the analysis of networks and the use of algorithms and recommender systems, are not covered in these chapters but can be found in other places (e.g. Beam, 2014; Colleoni, Rozza, & Arvidsson, 2014; Flaxman, Goel, & Rao, 2016). We also focus on U.S.-based big data analyses, although the methodological issues raised and the theoretical lessons drawn from the chapters will have relevance to political communication scholars regardless of their country of residence. Finally, there has been an overarching use of big data to analyze textual content, and more development is needed to bring this approach to images and video. This gap in our technical abilities is apparent in this volume as well.
Organization of the Book
The book is organized into three sections: the first examines the benefits and drawbacks of political communication researchers' use of big data; the second evaluates the reliability and validity of our uses of these datasets; and the third demonstrates the ways in which we can gain new insights by using big data.
The first section of this book offers competing takes on the benefits and drawbacks of the use of big data within the social sciences. While Bode is optimistic, Jungherr is less so. Putting them into these camps is, of course, an oversimplification of their positions, but their chapters do have decidedly different tones, which serve to provide an overview of the complexities of using big data. Bode offers a hopeful take on the effects of big data on academic scholarship. She sees big data as being able to answer new communication questions, to push us to consider our methodological choices more deeply, and to offer stronger justifications for our work. She also believes that big data findings are more easily understandable, which represents an opportunity to better engage students and the public.
Jungherr, taking a different tack, is critical of contemporary scholarship that uses digital trace data. He identifies two fallacies that frequently crop up. First, people treat the data as though they have every possible data point (the n=all fallacy). But, often, scholars do not have complete data. Platforms may not store all the data, or may have service agreements that prevent scholars from accessing all available data. Second, people see online data as acting as a mirror of some social phenomenon, but it may not be. Twitter data may simply be statements that people were willing to make on Twitter and nothing more; perhaps they do not capture underlying social maladies. Jungherr urges researchers to think much more carefully about what the data actually capture and to subject digital trace data to rigorous validity testing. When the field was in its infancy, studies using digital trace data were accepted merely on the grounds that they were methodologically innovative. As we enter an era of more normalized use of these datasets, Jungherr makes a compelling case for better conceptualization.
The second section of the book expands upon questions about what big data can tell political communication researchers. The three chapters push researchers to think carefully about the validity of the inferences they can draw from big data. Freelon discusses the technical and social aspects of social media platforms and how they can constrain our ability to draw valid inferences. Guo looks at analytic strategies for dealing with big data and how they can be more or less valid. Pasek and Dailey test how well Twitter sentiment can predict candidate preferences. Each of the three chapters models the care researchers should take in thinking through the validity of any inferences drawn from big data.
Freelon takes a close look at the construct validity of social media trace data. He recommends that researchers consider four factors: the technical design and affordances of a social media platform; the terms of service that govern how people act on a social media platform; the context of how people use social media; and the potential for misrepresentation. The chapter, then, takes a critical look at how people can disclose their gender, race/ethnicity, or location based on these four factors. Facebook, for instance, requires users to indicate their gender when they sign up for an account. Other methods of determining gender, such as inferring it from someone's name, have questionable validity on some platforms and among some sub-populations. As Freelon aptly notes, digital trace data were not created for the purpose of research and, because of this, researchers must carefully consider the limitations of any inferences made.
Guo offers several cautionary tales about how we analyze big data. She points out that researchers must make numerous choices when deciding how to analyze the data, and each choice can affect the conclusions reached. By sharing the results of several reliability and validity tests, Guo illustrates the extent to which human decision-making can change the results of big data analyses. Although she shows that changes do occur, the examples she shares do not seem to result in dramatic overturns of the reached conclusions. Productively, Guo offers recommendations to researchers, urging them to work with computer scientists and to ensure that they test the results of any "out-of-the-box" big data analytics packages.
Pasek and Dailey undertake precisely the sort of analysis that Jungherr recommends, seeking to analyze the correspondence between Twitter data and survey data regarding electoral preferences and candidate favorability. They find little evidence that sentiment expressed toward the candidates on Twitter corresponds with survey measures. This is true regardless of whether they look at (a) candidate favorability or electoral preferences; (b) changes in sentiment or absolute levels of sentiment; and (c) survey data corresponding to demographic attributes of Twitter users. There is some suggestion that the Twitter data more closely conformed to survey data about candidate preferences later in the 2008 presidential campaign, but the authors are rightly cautious about how far they would push this conclusion. Pasek and Dailey's chapter suggests that Twitter data is what it is: public expressions among a distinct group, as opposed to a proxy for something else.
The third, and final, section of this book provides examples of big data analysis with relevance to political communication scholars. These demonstrations illustrate how big data can be used to answer important questions for political communication scholars and offer both methodological and theoretical insights. The four chapters in this section each examine different sources of big data, whether Yik Yak, comments from The New York Times, campaign uses of big data, or tweets. Each demonstrates the new ways in which scholars must justify the methods that they use to analyze datasets of this size; using the same techniques that communication scholars typically employ when analyzing survey or experimental data is not always possible. This collection of chapters is also particularly important because it analyzes the intersections among media, elites, and the public in their communication practices regarding politics.
In their chapter, Vargo and Hopp analyze the use of Yik Yak among college students. Politics, they find, comes up infrequently. Yet, major political events, such as the State of the Union address, yield an uptick in political posts on the platform. Interestingly, political comments on Yik Yak are particularly unlikely at large universities, universities with a higher percentage of large classes, and universities with more fraternities and sororities; perhaps the heterogeneity of these contexts depresses political talk, but more research is needed.
Muddiman looks at comments left on The New York Times website to understand how other people and ...