In this first part of the book we bring together five chapters that consider big data: definitions, epistemological issues, application and analysis.
In the opening chapter Evelyn Ruppert offers an ecology of big data. She notes the debates on definitions, and economies, arguing that a key ethical challenge is to develop accountable, and responsible, ways of working with big data. A central premise of her argument is the notion of agential realism (agency is considered as a relationship and not as something that a person ‘has’ with ‘objects’ emerging through particular intra-actions) and the ethical responsibilities that accompany the possibilities of collating and working with big data. Ruppert asks us to reflect on the provenance of big data, for while we may not be able to know all about the generation of big data, as researchers we need to accept responsibility as we use these data. She calls for ethical guidelines to include process, procedural and ontological questions.
The following chapter by Webber and Phillips offers an example of how big data might be used to address a potentially highly sensitive topic, namely, the pathways to success among members of different minority groups. How have groups achieved material security, power and prestige? Webber and Phillips have analysed data on surnames using the Origins reference files and illuminated the differences that exist between communities. Big data sets provide sufficient detail to demonstrate the differences between, but also variation within, ethnic groupings. They also note that while this analysis offers findings on difference, the study of variations also calls for a greater role of qualitative methods to examine why and how.
Miller and Dinan explore how mapping power structures and networks can be achieved through the analysis of big data. They illustrate how disparate sources of big data can be brought together to examine power elites, their interconnections and thus structures of power. They call for ‘studying up’ and a renewed tradition of power structure research on political and corporate elites. As with Ruppert, they note the ethical issues in terms of informed consent and privacy, given these data are not necessarily collected for secondary analysis. That said, they are upbeat about the possibilities to engage with civil society organisations and social movements to study the meshing of power elites.
The possibilities of social media data to inform sociological work are the topic of the chapter by Murthy. As social media technologies such as Twitter, Instagram and YouTube have become ubiquitous, social life itself has been reconfigured. Many people are rarely offline, and the boundaries between media and everyday life are increasingly blurred. In this chapter, Murthy considers how Twitter provides opportunities for mixed qualitative and quantitative social analysis. He argues that the understanding of large social questions is increasingly contingent on us deciphering and understanding how social knowledge is created and evolves within social media platforms.
What is big data?
Not so long ago, to talk about data would inspire little interest outside of governments and the academy, but now its proliferation has become part of social worlds and relations and of consequence to many people. While various terms are used to describe this proliferation, such as the data deluge or data revolution, the one that is most common within and outside the academy is ‘big data’. While its definition is a matter of some debate and controversy, I take it up for two key reasons.
One reason is that it is active and controversial across myriad communities of practice, including the computing industry, popular media, businesses, governments and, of course, many of the disciplines of the academy. And while there is some hype to be found, there is also a lot of healthy scepticism. By taking up the term I seek to engage in debates across these communities, especially those outside of the academy where much advanced generation and analysis of big data is happening. Second, I also use the term to suggest that there is something ‘big’ about data today and it is to be found in our changing practices and relations to it. This specification is necessary to challenge predominant definitions that seek to capture the unique qualities of big data, and perhaps one of the most repeated is the so-called 3Vs: volume, velocity and variety (Stapleton, 2011). But, as many scholars have noted, the existence and processing of large volumes of data are not new. In the 1980s, when social scientists gained access to the entire 1980 US Census database, some 100 GB of data drawn from data sets of varying sizes, this certainly constituted big data (Jacobs, 2009). Beyond volume, the velocity of data generation and collection is advanced as another distinguishing quality. Finally, the variety of sources and formats from audio, video and image data, and the mixing and linking of these, also adds to the complexity of big data.
Taken together, these qualities demand new data structures, computational capacities, and processing tools and analytics, which are often argued to be critical aspects of the distinctiveness of big data. As Kitchin (2014) notes, a number of researchers have elaborated this most-cited definition to include additional qualities: exhaustive in scope (e.g. covering ‘whole populations’); fine-grained in resolution and uniquely indexical; relational by being made up of common fields that enable linking; and flexible and scalable.3 He describes each in detail and argues that they constitute the ‘seven essential characteristics’ that ‘make them qualitatively different to previous forms of data’ (2014: 79). The growing list of qualities attests to the diversity of what is being defined as big data, but also that the relevance and degree of each is highly variable depending on the particular data in question. Kitchin, for example, includes emails, text messages, sensor data, retail transactions and pre-paid travel cards as examples of big data, yet each of these varies considerably across these qualities. Indeed, many of these qualities could also be said to apply to data not typically considered as big data, such as surveys, which can be fine grained, indexical and relational.
Be that as it may, the 3Vs and their extension by Kitchin are useful for bringing attention to how ‘bigness’ is not simply about volume. But my main reason for introducing these definitions is to argue that these qualities are unhelpful in accounting for what it is about data that is changing and what is at stake. Instead, I suggest that these are qualities rather than definitions and that they are the outcomes of specific and changing data practices. Big data of varying volumes, formats, speeds, granularity and flexibility and so on are generated and sustained through multiple and selective sociotechnical practices that include not only technologies and people but also norms, values, conventions and rules. Instead of how big, how fast, how detailed, this approach to thinking about big data draws attention to common practices across diverse contexts, such as the digitisation and linking of content, interactive and recursive formats, or the digital tracking of conduct by governments or businesses. It has affinities with what Burrows and Savage (2014) note when they reflect on whether the data generated by the Great British Class Survey (GBCS) can be considered big data.4 While 300,000 survey responses can be considered a small data set, they note that the data were generated by interactive, dynamic, recursive and performative practices that are usually understood as key to the making of big data.
An orientation to data practices thus enlarges what ‘is’ big data to include not only ‘natively’ digital content generated through the Internet but also digitised surveys and censuses, corporate transactional data, government administrative registers, open and crowdsourced data, digital data repositories, the curated data sets of genomic and biological sciences, digitised journals and books, historic census records, and so on. But at the same time, data practices draw attention to another order of significance: changing relations to data that cut across different digital contexts and are configured by similar technologies (devices, hardware, software, algorithms, etc.) and modalities (interactivity, recursivity, intensity, etc.). These relations are of four kinds. One is social relations. The digital actions that are generative of much of what is considered big data are also inventive of new forms of sociality. From social networking sites, search engines, blogs and wikis to digital purchasing, crowdsourcing, citizen science and self-tracking apps, all of these can be understood as social and technical arrangements that instantiate social relations that are part of who we are as individuals and collectives in novel ways.
But at the same time, while making up selves and social relations, digital actions are materialising massive quantities of data and giving rise to new method relations. Not only are the data they generate materially implicated in the performance of contemporary sociality but so too are methods, theories and knowledge of it (Ruppert et al., 2013). Various actors are inventing different methods, such as social network analysis, that assemble various technologies and expertise to reuse and repurpose this data through practices that format, clean, link, mine, correlate, visualise, infer and model the data to represent and enact social worlds.
A third set of relations captures that people are ever more aware of how they are being made into ‘data subjects’, analysed and known. Data relations are thus part of everyday lives and vocabularies, and thanks to the exposé of the deep surveillance data practices of the NSA and GCHQ, many people are now familiar with terms such as metadata and aware that their conversations are perhaps of less interest than data about who is talking to whom, when, how much, and by what mode of communication.5 Data are also objects of interest to subjects who engage with tracking devices and apps to quantify, analyse, visualise and act upon their own conduct.
My point is that big data practices are active in social worlds and are remaking social, method and data relations involving various combinations of technical and social actors, from algorithms to data subjects and data cleaners. Finally, data practices are also changing our research relations as social scientists. Our academic craft is generating big data through digital media such as journals, websites and blogs, whereby we are digitally re-versioning and multiplying our research outputs. Additionally, we are participating in defining the themes, concepts and concerns that make up big data as a field. This includes institutionalising practices and the economic, cultural, social and symbolic investments in the term. That is, big data is being defined by innumerable practices and investments, in infrastructures such as technologies, research funding programmes and projects, university curricula, and so on, as well as journals such as Big Data & Society.6
A final reason for adopting the term is strategic. It is to problematise what is otherwise left unquestioned: how specific practices are involved in the valuation (economies) and the ordering and dependencies (ecologies) of big data. Through this problematisation, I argue that one of our ethical challenges as social scientists is to find ways of being accountable, answerable and responsible for the effects of our methods that take up big data and the worlds and ways of being they elevate and promote. That is the issue I take up in the final section, but after first outlining what I mean by economies and ecologies.