The production of new masses of digital data is a reality that continues to prompt a great deal of reflection among a multitude of actors and positions: researchers, engineers, journalists, business leaders, etc. Big data presents itself at first sight as a technological solution, put forward by digital companies and computer research laboratories, to a problem that is not always clearly stated. It takes the form of media and commercial discourses, largely prospective in nature, about what the abundance of digital data could change. One specificity of these discourses, compared with those surrounding other technological and social changes, is that they are de facto discourses on knowledge. They frequently adopt a system of enunciation and legitimation inspired by scientific research, and more specifically by the natural sciences, from which they borrow the notions of data, model, hypothesis and method. In an article emblematic of the rhetoric of big data, and since refuted many times,1 Chris Anderson (2008), then editor-in-chief of the magazine Wired, wrote:
In a similar vein, Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013), by Oxford law professor Viktor Mayer-Schönberger and journalist Kenneth Cukier, states that:
In these two examples, the effort to legitimize big data is built on an opposition between traditional science and the new practices of digital data analysis. To do this, the discourses rely on a sophism (known as the "straw man argument"), which consists of presenting a simplified vision of scientific habits and principles in order to set off the solution proposed by big data. In practice, the natural sciences are not systematically threatened or called into question by the promises of big data. As Leonelli (2014) points out, for example:
Similarly, Lagoze (2014) analyzes the arrival of large volumes of data within existing practices and attempts to draw a distinction between "lots of data" and "big data" per se. The presence of "lots of data", he demonstrates, does not by itself amount to a "big data" configuration. In the first case, there has been an essentially quantitative increase in the volume of data, raising technical and methodological issues, but one that is handled within the continuity of the existing epistemological framework. This includes contextualizing and documenting data, especially as they flow from one researcher to another, to clarify their meaning and how they can be analyzed. In the second case, the change is qualitative and challenges the scientific framework; it breaks with the existing paradigm, with mainly negative consequences. From the point of view of classical epistemology, this break induces a loss of epistemic control and of confidence in the integrity of data, which is not acceptable for the traditional sciences. Through the prism of existing practice, big data therefore represent not so much progress as a crisis in knowledge production. Manipulating "a lot of data" is primarily a technological issue that suppliers like Microsoft have grasped. By publishing The Fourth Paradigm (Hey et al. 2009) through its research division, the company enters the scientific discussion on big data, proposing the notion of data-intensive science and claiming the advent of a new paradigm, a term anchored in the discourse regime of the philosophy of science since Kuhn. For these suppliers, the rhetoric of big data serves as an accompanying discourse for the hardware and software needed to process large volumes of scientific data. The processing of big data also raises a number of methodological issues related to the contextualization of the data and the necessary collaboration between researchers who share the skills required to process, manipulate and analyze these data (Leonelli 2016).
These new practices are emerging alongside the traditional practices of the natural sciences and the Galilean paradigm that characterizes them. In astronomy or particle physics, the computerization of measuring instruments generates considerable masses of data, but within a theoretical framework that remains largely unchanged. In this context, big data tend rather to reinforce traditional regimes of scientificity by providing new observables and new tools for analysis. On the other hand, there is a whole field of knowledge production, scientific or not, ranging from amateur practices to large genomics projects, web data analysis software and computer-assisted literary analysis, which is transformed by big data. It is this new field that we are going to analyze, giving it two characterizations: new observables, whose singularity lies not so much in their volume as in their very nature, and new tools that induce a specific configuration of players. In terms of the typology of players, the discourse on big data is driven more by the arguments of IT companies, which market big data processing solutions, than by players in natural science research. Relayed by the media, these discourses are aimed not so much at researchers in the natural sciences as at the ecosystem of these software companies: customers, partners and job candidates. They are particularly flattering for computer scientists (researchers, professionals or amateurs) who have to manipulate large amounts of data. Thus, whether developers or data scientists, they are presented and publicized, claims Pedro Domingos (2015), professor of computer science at the University of Washington, as gods who create universes:
In these terms, it is understandable that this type of player is quick to relay and take on board the big data rhetoric. While they are not necessarily the primary audience of the mythology of big data, those expected to adhere to it, they are its heroes. This heroic status reflects a state of affairs in which they are the population most capable of manipulating digital data. Their technical skills thus make them de facto key players in big data, players whose practices are influenced by the corresponding rhetoric. Nevertheless, we will see in the following chapters that this configuration of players is incomplete. Technical skill alone is not enough to produce knowledge, which emerges from a double constitution, at once technical and epistemic.
Compare this, for example, with a classical configuration in the natural sciences, where researchers, who possess the theoretical concepts and methodological knowledge of their disciplines, rely on other players with the technical knowledge needed to operate measuring instruments. The exploitation of big data by computer scientists alone is an incomplete configuration in which technical skills are not complemented by theoretical knowledge of the object under study. There is no epistemic or methodological continuity between theory, models and tools, simply because there is, in this configuration, no theoretical framework at all. In terms of the classical functioning of science, this configuration is not a "new paradigm" as the rhetoric of big data would have it, but a problematic situation in which no valid new knowledge can be generated.
From this perspective, the challenge of this book is to evaluate the role of these technical skills, but also to try to place them within a methodological continuity that integrates a theoretical framework, a conceptualization of data, and standards for validating knowledge. First, in the absence of the sampling and representativeness standards of inferential statistics, the question is how an epistemic value can be assigned to big data.
We can consider that the singularity of big data in relation to previous epistemic practices arises essentially from the data themselves. As a first approximation, they can indeed be regarded as new data that do not come from scientific measuring instruments and are produced outside the framework of the natural sciences. In the typologies of "big data" outlined in the academic literature, astronomical or genomic data, for example, are absent or anecdotal. The analysis of 26 "types" of big data proposed by Kitchin and McArdle (2016) excludes these kinds of data. The sources listed are mobile communications, the web and social networks, sensors and cameras, transactions (such as scanning a barcode or paying with a credit card) and, finally, administrative records. What all these data have in common is that they are produced by activities of human origin, and they are therefore difficult to place within the field of the natural sciences. It is not mistaken, then, to consider that big data are very often new observables for the cultural sciences.
We will indeed rely on the distinction between the natural and the cultural sciences, but we must now qualify it. The life and health sciences, themselves data-intensive sciences, are part of the natural sciences. The specificity of big data therefore lies not in their object but in the status of their observables, from which derives a methodological framework entirely different from that of the Galilean sciences. The measurement of objects in the world, directly analyzed within a theoretical continuity running from instrument to publication, is replaced by an a posteriori exploitation of data that are always already secondary, and almost always already materialized by the time one considers processing them. It is therefore a framework in which the influence of a certain conception of the cultural sciences dominates, but in which any object can be mobilized.
Based on this conception, which we will develop further, we will examine in the following chapters the conditions of possibility of hypothetical computational sciences of culture. This is a formula we propose to designate a set of epistemic practices combining a certain theoretical and methodological framework developed from the cultural sciences with the capacity to mobilize massive digital data and computational processing tools. In the mythology we have analyzed, one element deserves to be retained, because of how self-evident it is for practitioners: big data create a technical complexity that requires specific skills and does not allow existing tools to be mobilized as they stand. The two essential components of these hypothetical sciences are as follows: (1) the epistemic culture of the cultural sciences (in the sense we will give this term), with their capacity to conceptualize a relationship with reality, and (2) the technical culture of computer scientists capable of translating concepts and methods into concrete tools, compatible with the previously defined conceptual framework. We will show, on the one hand, that these components exist, but, on the other hand, that they almost never manage to come together, and that they therefore lack epistemic continuity between them.
On the one hand, the problematization of the status of data and of their epistemic value is emerging in several research communities. Although these researchers are technically limited with respect to the new modes of access to the real that might be available to them, because they do not have the skills available to computer scientists, they are sensitive to the epistemological problems these new modes pose. While they do not necessarily try to solve them, they are at least theoretically convinced that a problem exists. If these players were able to acquire concrete technical means to process data, this could be done within a homogeneous epistemic culture articulating research problems, data, tools and methods.
On the other hand, computer scientists have the practical skills to handle large volumes of heterogeneous data, or to develop the required objects, but they do not, by definition, fit into the epistemology of the cultural sciences. Their intervention takes the form of manipulations governed not by an epistemic project relating to the human fact, but by the systematic exploration of the space of manipulability provided by computer science. We will return to this point and confirm it in detail in Chapter 3.
Before that, we will explain what status can be given to observables in the cultural sciences, and how these sciences construct a relationship with reality. From this perspective, we will then see what conceptualization of digital data can be proposed in the field of the cultural sciences, and what redefinition of the cultural sciences themselves is induced by the irruption of these new observables.