Introduction
Big data burst onto the scene of social science nearly a decade ago. Coined by Manovich (2011) to describe datasets too large to be stored and analyzed by conventional software and personal computers, the term has become a data-sensitive meme in fields as varied as business, sports, journalism, science, and public health, entailing a near-universal pivot toward data-driven research, business, and governance (Edelmann et al., 2020; Langlois, Redden & Elmer, 2015; Mayer-Schönberger & Cukier, 2013; Veltri, 2017). The unprecedented scope and scale of big data and the range of qualities (including variety, velocity, volume, and values) that it can sort in the process of digitally recording the traces of social transactional activities make it a compelling subject for research into the “social world” (Kitchin & McArdle, 2016; Savage & Burrows, 2007).
In the field of sociology, big data brings with it both high expectations and heated debate. On the one hand, it represents an enormous new source of “digital footprints” comprising individual actions and social transactions among billions of people in real and historic time, along with a battery of new approaches to collect, describe, and analyze them (Halford & Savage, 2017; McFarland, Lewis & Goldberg, 2016; Watts, 2012). This unprecedented wealth of information greatly raised expectations for its potential application to social science research and scholarship, suggesting that the very foundation of empirical studies in social science would be reconstructed (King, 2014).
Many scholars have pointed to the significance of big data in arming sociologists with access to new research resources and opportunities. For example, Lazer and Radford (2017) summarized five opportunities that big data can offer sociologists, namely, accessing meaningful social behavior, monitoring social phenomena, analyzing data on social systems, providing data for experiments, and supporting data heterogeneity. Evans and Aceves (2016) surveyed computational approaches for large-scale analyses on textual data, highlighting the use of machine learning for theorizing the nature of collective attention, social relationships, and communication lurking in enormous volumes of archives. Many robust big data analyses have emerged in recent years, focusing on the application of multiple-source big data to diverse topics in core areas of contemporary sociology. Overall, as Burrows and Savage (2014, p. 5) pointed out, “sociologists need to be prepared to intervene in the world of Big Data in order to ensure we command a voice in this new terrain.”
On the other hand, despite its promise, big data analytics in sociology has two key limitations. One is that without the theoretically informed and context-driven research that comes from domain expertise, the purely computational approaches of big data analytics can cause research to devolve into speculative data mining. For sociology, big data applications relying on black-box tools conflict with the hermeneutic tradition that is at the core of the discipline (Kitchin, 2014a; Pasquale, 2015).
The other limitation of big data analytics is that despite its size, big data can still be biased; the agents, applications, and devices producing and collecting the data can themselves be either selective or manipulated. This points to the paradox that despite its name, big data is likely to be either “small,” representing only a subset of social transactions among particular demographics and thereby capturing partial and/or fragmented information (McFarland & McFarland, 2015; O’Brien, 2016; Park & Macy, 2015; Shaw, 2015); or “artifactual,” whereby social forces, including censorship, political robots, and system errors, manipulate the process of information production, leading to the proliferation of artifacts, errors, and anomalies (see Lazer & Radford, 2017).
Sociology is now at a crossroads. Although pressured by burgeoning intellectual forces, in particular those harnessing computational approaches and engaging with big data, sociologists still lack a clear road map leading to effective integration of big data analytics with contemporary sociology. Their resistance has much to do with skepticism born of the deficiencies in approaches to big data. More importantly, sociologists need to find some mode of study that can lead to something more than mere fancy analytical tools and exciting results; we need tools that lead to clear solutions, and we need templates for research that formally link data, theory, and methodology in more robust, scientific, and sociological ways. Put simply, we need to choose precisely where to insert big data into a range of key facets of empirical sociology: whether it should best be used to portray big pictures, unveil hidden structures, verify null hypotheses, or infer causality.
The answer is first to turn back to the data themselves and to ask not what makes big data exciting, but rather which dimensions of sociology big data is most aligned with. More precisely, can big data be a kind of macro-data? What is big data’s advantage when compared with other solutions in sociological inquiry, such as assembling survey data? In this chapter, we will address these concerns and show that the empirical strength of big data can be expected to elicit the emergence of a new type of research that we have so far largely ignored in the territory of empirical sociology: theory-guided quantitative macrosociology.
For sociology, despite an initial surge of interest and a powerful residual skepticism, big data has been expected to offer insights into each subfield of the discipline, not only because each facet of our daily lives has been penetrated in real time and over time by sophisticated big data apparatuses, but also because the recorded social environment (the entirety of human behavior, interaction, and thought) constitutes a panoramic data repertory that offers us a rare opportunity to inspect society in an entirely new way. It is important to note that big data is a composite of myriad transactions of myriad individuals. This reminds us that despite early claims that the sheer size of big data can attenuate many of its cons and biases (Mayer-Schönberger & Cukier, 2013), ultimately it is not the size of big data that matters but the ontological level of information that we can extract from it. That is, we should critically interrogate available big data to harness its strength at the macro-level and from a macro-perspective.
Theory-guided quantitative macrosociology has made notable inroads in its integration of big data in macro-level analysis. This novel approach has the potential to contribute to sociological studies by exploiting distant reading to get a big picture of the sizable unread portions of the corpus, which cannot be achieved either by traditional qualitative approaches featuring close reading of selected archives or by quantitative methods defined by model regressions on limited surveyed samples. The rich spatial and temporal dynamics available through this line of research are extremely promising.
Data assemblage versus big data
Sociologists today are daunted by the same big questions that consumed sociologists in the mid-twentieth century, including the relation between economy and culture, the factors that lead to social inequality, and whether and why social behaviors can be contagious. This is because when focusing on society from an ecological or systematic perspective, no single information package is sufficiently informative to capture the big picture over large temporal and spatial scales. Consequently, to explore the configuration and regulation of sociocultural environments, macro-sociologists tend to bypass quantitative methodology and resort to abstract theory constructs, which in turn often invite criticism for inducing tautology and ambiguity. While there are certainly some exceptional macro-analyses using quantitative approaches, particularly some transnational analyses in the traditional fields of sociology such as social stratification and inequality, macro-analyses remain relatively rare compared with individual or micro-level regressions, which are predominant in the arena of quantitative sociology, thanks to the availability of a vast number of well-designed social surveys and the lack of data about macro-social indicators. This has cast a shadow across the entire realm of macrosociology, despite the claim of self-sufficiency that macrosociology shares with philosophy and the humanities.
There are two ways to tackle this problem. One, proposed by Halford and Savage (2017), is “symphonic social science,” a label for a new methodology that uses data assemblage to test big theories. The other is big data itself, some inspiring empirical applications of which have been introduced in sociological areas.
Because accessing and deploying various sources of surveyed sample data is easier than harnessing big data, assemblage of survey data has a distinct advantage; in fact, it can even be seen as a type of comparative analysis. Halford and Savage (2017) argued that the symphonic research paradigm in effect combines micro- and macro-level research and integrates information from conventional surveys, regression statistics, and ethnographic and interview data under the same framework. By exploring the contradictions and complementarities of findings from diverse datasets, sociologists can pursue the understanding of major social questions in a symphonic way.
Specifically, Halford and Savage (2017) used three well-known books to illustrate symphonic social science research: Thomas Piketty’s Capital in the Twenty-First Century (2013), Robert Putnam’s Bowling Alone (2000), and Richard Wilkinson and Kate Pickett’s The Spirit Level (2011). The three works similarly deployed large-scale heterogeneous data assemblages and repurposed findings from multiple data sources instead of representative samples or ethnographic case studies. The three books thus “relied on the deployment of repeated ‘refrains,’ just as classical music symphonies introduce and return to recurring themes, with subtle modifications, so that the symphony as a whole is more than its specific themes” (Halford and Savage, 2017, p. 4). Compared to conventional sociology using formal models and championing parsimony, symphonic social science draws on a more aesthetic repertoire and sets more store by prolixity.
Still, Halford and Savage (2017) conceded that symphonic projects are time-consuming and that they require significant workload and resources. The scope of those projects also demands long-form presentation, such as books rather than shorter works such as articles, to allow for the derivation of argument from empirical and theoretical resources. More importantly, assembling conventional survey data can only construct a data repertory containing information from surveyed samples. This suggests that data assemblage improves merely the scale of data, not the informativity of data. In this regard, key factors of a macro-analysis of interest, often characterized by large temporal and spatial scales, are very likely to be unavailable in conventional survey datasets. Big data therefore matters more for macrosociology.
Putting big data at the heart of macrosociology
Sociologists have long recognized the enormous potential of using big data to dissect social processes and phenomena. In the last decade, especially over the past five years, pioneering sociologists have endeavored to link theory, data, and computational algorithms as a composite whole to gain sociological insight (Berman & Hirschman, 2018). In this section, we group reviews of works empirically exploring two aspects of big data applications: how to operationalize core theoretical constructs and map a big picture for sociocultural structures and trends; and how to quantify a certain variable that is hard to measure using survey data, for the sake of testing theories using conventional regression models. Although these two tasks are both big-data-driven and theory-guided, the respective studies are organized and presented in different ways. This divergence has largely been ignored in present debates about big data’s application in social science.
Charting the sociocultural milieu for theorizing
For scholars and researchers determined to systematically examine the sociocultural milieu as a composite whole, big data is an uncontested resource. Almost all core constructs of macrosociology, such as social system, collective action, discourse, field, expression, and contagion, lurk in colossal volumes of digital archives, and many scholars have advocated mobilizing big data to help uncover and measure sociocultural meaning in digitalized and semantic archives (Bail, 2014; DiMaggio, 2015; Frade, 2016; Halford, Pope & Weal, 2012; Halford & Savage, 2017; Lee & Martin, 2015; Mützel, 2015). For example, a special issue in the journal Poetics was devoted to the theme of applying an array of topic models in cultural sociology, tracing the ontological tradition back to content analyses pioneered in the 1950s (Mohr & Bogdanov, 2013). The essence and strength of large-scale textual analysis lie in the synthesis of conventional qualitative methods and novel computational techniques for big data analytics (Bail, 2014; Nelson, 2019), which can be counted on to advance our understanding of sociocultural processes.
As a result, cultural sociology is among the first sociology subfields to engage with big data, and it has made substantial progress in harnessing several computational approaches, ranging from accessing huge unstructured data to measure sociological meaning, to lifting the methodological capacity to empirically develop, derive, refine, and test sophisticated theories of the social origins of meaning, and to explore important theoretical constructs. Some have used a range of topic models to reveal how social position and structure (e.g., gender, organizations, and identities) work in shaping cognitive frames, discourse, and social logics in cultural archives, including organizational publications, governmental documents, academic journals, newspapers, and literature (Bail, 2012; DiMaggio, Nag & Blei, 2013; Jockers & Mimno, 2013; Mohr et al., 2013). Some have used large book corpora to map the temporal trends of tangible and intangible sociocultural phenomena and entities over a period of hundreds of years for a distant reading and comparison (Chen & Yan, 2016a, 2016b, 2018; Chen, Yan & Zhang, 2017; Chen, Yan et al., 2020; Chen, He et al., 2020; Guggenheim, 2014; Kozlowski, Taddy & Evans, 2019; Michel et al., 2011). Others have uncovered the hidden links among cultural products, such as published academic articles or music videos on YouTube or Twitter, to explore the evolution of networks as a whole and to extend relevant theories (Airoldi, Beraldo & Gandini, 2016; Foster, Rzhetsky & Evans, 2015; Goldenstein & Poschmann, 2019; Rzhetsky et al., 2015; Tangherlini & Leonard, 2013; Tinati et al., 2014).
These studies tend to provide an overview of social processes of interest in which operationalizing theory constructs serves to chart the milieu for theorizing. We know sociological theory can be divided into two subsets: concepts that trace social entities, and relationships that link and structure social entities. Although theory testing, especially testing the relationship between two social entities, remains central to quantitative research, big data can augment this line of analytical focus and clarify social concepts and structures by also “figuring out how to structure a mountain of data into meaningful categories of knowledge” (Goldberg, 2015, p. 3). In this mode of sociological investigation, sociologists with methodological expertise employ theorized concepts and structures to direct the process of exploiting the richness of big data. In turn, data directs the further investigations and the process of interpretation and theoretical derivation, just as Kitchin (2014b, p. 6) proposed: “Many supposed relationships within data sets can be quickly dismissed as trivial or absurd by domain experts, with others flagged as deserving more attention.”
Quantifying elusive indicators for theory testing
Two studies using textual analysis tools merit close inspection to show how big data analysis can help theory testing. One is Jockers and Mimno’s (2013) study on themes of 3,000 nineteenth-century works of fiction from the United Kingdom and the United States, using a topic model to reveal the topics of historic literature. The other is Bail’s (2012) investigation of how fringe anti-Muslim organizations influenced media discourse and became part of mainstream media, using discourse frames in the news media to quantify certain variables for further theory testing after a distant reading of the meaning of the large volumes of text. In both studies, textual analysis served merely as an instrument to quantify variables that are essential for model regression as the primary analysis.
Jockers and Mimno (2013) investigated the relationship between literary themes and sociodemographic attributes, such as authors’ gender, using an assembled corpus containing 3,279 works of fiction from the United States and Great Britain (including Ireland, Scotland, and Wales) from 1750 to 1899. They found that when themes had been identified through topic-modeling technology and assigned to each work, some themes exhibited a one-gender-dominant feature of the authors, suggesting that men and women might have chosen different themes in composing their fiction. For example, the authors of works categorized under the theme “female fashion” were mostly women, while the authors of works categorized as “enemies” were mainly men (the gender ratio of a given theme can be computed by comparing the proportions of words written by female and male authors that are assigned to the same theme).
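The gender ratio described in parentheses above can be illustrated with a minimal sketch. It rests on one reading of that description: for each gender, take the share of all words written by authors of that gender that the topic model assigns to a theme, then divide the female share by the male share. The function names, the "F"/"M" labels, and the word counts are hypothetical and purely illustrative; they are not Jockers and Mimno's code or data.

```python
from collections import defaultdict


def theme_proportions_by_gender(works):
    """Share of each gender's words assigned to each theme.

    `works` is an iterable of (author_gender, {theme: word_count}) pairs,
    a hypothetical stand-in for the per-work output of a topic model.
    """
    theme_words = defaultdict(lambda: defaultdict(int))  # gender -> theme -> words
    total_words = defaultdict(int)                       # gender -> all words
    for gender, counts in works:
        for theme, n_words in counts.items():
            theme_words[gender][theme] += n_words
            total_words[gender] += n_words
    return {
        gender: {theme: n / total_words[gender] for theme, n in themes.items()}
        for gender, themes in theme_words.items()
        if total_words[gender] > 0
    }


def gender_ratio(proportions, theme, numerator="F", denominator="M"):
    """Ratio of the two genders' shares for one theme (inf if the denominator is 0)."""
    top = proportions.get(numerator, {}).get(theme, 0.0)
    bottom = proportions.get(denominator, {}).get(theme, 0.0)
    return top / bottom if bottom > 0 else float("inf")


# Invented toy counts, not taken from the study.
toy_works = [
    ("F", {"female fashion": 300, "enemies": 100}),
    ("F", {"female fashion": 200, "enemies": 200}),
    ("M", {"female fashion": 100, "enemies": 400}),
    ("M", {"female fashion": 100, "enemies": 200}),
]

proportions = theme_proportions_by_gender(toy_works)
fashion_ratio = gender_ratio(proportions, "female fashion")
enemies_ratio = gender_ratio(proportions, "enemies")
print(fashion_ratio, enemies_ratio)
```

A ratio above 1 means the theme takes up a larger share of female-authored text than of male-authored text in the toy data. The ratio alone says nothing about whether such a skew could arise by chance.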
However, to assert the presence of a skewed gender ratio for a certain theme, one needs more information about the range of proportions of male and female authors for this theme, because even if there were no underlying gender difference in topic use, one would still be unlikely to observe an evenly divided (50:50) distribution. In the language of statistics, one needs to test the null hypothesis that there is no gender distinction by estimating the probability of observing a gender difference under the framework of randomness. Therefore, having identified a range of topics as themes of the works of fiction on the c...