Big Data

An Art of Decision Making

Eglantine Schmitt
About this book

Manipulating and processing masses of digital data is never a purely technical activity. It requires an interpretative and exploratory outlook - already well known in the social sciences and the humanities - to convey intelligible results from data analysis algorithms and create new knowledge.

Big Data is based on an inquiry of several years within Proxem, a software publisher specializing in big data processing. The book examines how data scientists explore, interpret and visualize our digital traces to make sense of them, and to produce new knowledge. Grounded in epistemology and science and technology studies, Big Data offers a reflection on data in general, and on how they help us to better understand reality and decide on our daily actions.

1
From Trace to Web Data: An Ontology of the Digital Footprint

The development of new masses of digital data is a reality that continues to prompt a great deal of reflection among a multitude of actors and positions: researchers, engineers, journalists, business leaders, etc. Big data presents itself at first sight as a technological solution, brought by digital companies and computer research laboratories to a problem that is not always clearly stated. It takes the form of media and commercial discourses, rather prospective in nature, about what the abundance of digital data could change. One specificity of these discourses in relation to other technological and social changes is that they are de facto discourses on knowledge. They frequently adopt a system of enunciation and legitimation inspired by scientific research, and more specifically by the natural sciences, from which they borrow the notions of data, model, hypothesis and method. In an article emblematic of the rhetoric of big data, and now refuted many times,1 Chris Anderson (2008), then editor-in-chief of the trade magazine Wired, wrote:
“There is now a better way. Petabytes allow us to say: ‘Correlation is enough’. We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot”.
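The rhetoric can be made concrete. The following minimal sketch (ours, in Python with NumPy; not from the book) performs the hypothesis-free pattern hunting Anderson describes, scanning every pair of variables for strong correlations. Run on pure random noise, it still surfaces seemingly strong correlations, which is one reason the claim has been refuted so often.

    import numpy as np

    # 100 observations of 200 variables that are, by construction, unrelated.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 200))

    # "Let statistical algorithms find patterns": all pairwise correlations.
    corr = np.corrcoef(data, rowvar=False)      # 200 x 200 correlation matrix
    upper = np.triu_indices_from(corr, k=1)     # each pair counted once
    strongest = np.abs(corr[upper]).max()

    print(f"strongest |correlation| among {len(upper[0])} noise pairs: {strongest:.2f}")
    # Typically around 0.4, despite there being no pattern at all to find.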
In a similar vein, Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013), by Oxford law professor Viktor Mayer-Schönberger and journalist Kenneth Cukier, states that:
“One of the areas that is being most dramatically shaken up by N=all is the social sciences. They have lost their monopoly on making sense of empirical social data, as big-data analysis replaces the highly skilled survey specialists of the past. The social science disciplines largely relied on sampling studies and questionnaires. But when the data is collected passively while people do what they normally do anyway, the old biases associated with sampling and questionnaires disappear. We can now collect information that we couldn’t before, be it relationships revealed via mobile phone calls or sentiments unveiled through tweets. More important, the need to sample disappears”.
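The "N = all" argument can likewise be made concrete. In this minimal sketch (ours, not the authors'), a quantity of interest is computed directly over an exhaustively recorded population rather than estimated from a survey sample, so sampling error, though not every other source of bias, disappears.

    import numpy as np

    rng = np.random.default_rng(1)
    # An exhaustively recorded population, e.g. durations of all phone calls.
    population = rng.exponential(scale=3.0, size=1_000_000)

    true_mean = population.mean()               # "N = all": no estimation step
    sample = rng.choice(population, size=500)   # the classical survey approach
    sample_mean = sample.mean()

    print(f"N = all mean: {true_mean:.3f}")
    print(f"sample mean:  {sample_mean:.3f} (error {abs(sample_mean - true_mean):.3f})")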
In these two examples, the effort to legitimize big data is built on an opposition between traditional science and the new practices of digital data analysis. To do so, the discourses rely on a sophism (known as the “straw man argument”), which consists of presenting a simplified vision of scientific habits and principles in order to make the solution proposed by big data stand out. In practice, the natural sciences are not systematically threatened or called into question by the promises of big data. As Leonelli (2014) points out, for example:
“[…] data quantity can indeed be said to make a difference to biology, but in ways that are not as revolutionary as many big data advocates would advocate. There is strong continuity with practices of large data collection and assemblage conducted since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of inquiry in this area of science […]”.
Similarly, Lagoze (2014) analyzed the arrival of large volumes of data within existing practices and sought to draw a distinction between “lots of data” and “big data” per se. It is not, he demonstrates, because we are in the presence of “lots of data” that we are in a “big data” configuration. In the first case, the increase in data volume is essentially quantitative: it raises technical and methodological issues, but is dealt with in continuity with the existing epistemological framework. This includes contextualizing and documenting data, especially as they flow from one researcher to another, to clarify their meaning and how they can be analyzed. In the second case, the change is qualitative in nature and challenges the scientific framework; it breaks with the existing paradigm, with mainly negative consequences. From the point of view of classical epistemology, this break induces a loss of epistemic control and of confidence in the integrity of data, which is not acceptable for the traditional sciences. Seen through the prism of existing practice, big data therefore represent not so much progress as a crisis in knowledge production. Manipulating “lots of data” is primarily a technological issue that suppliers like Microsoft have grasped. By publishing The Fourth Paradigm (Hey et al. 2009) through its research division, the publisher entered the scientific discussion on big data, proposing the notion of data-intensive science and claiming the advent of a new paradigm, a term anchored in the discourse regime of the philosophy of science since Kuhn. For these suppliers, the rhetoric of big data serves as an accompanying discourse for the hardware and software needed to process large volumes of scientific data. The processing of big data also raises a number of methodological issues related to the contextualization of the data and the necessary collaboration between researchers who share the skills required to process, manipulate and analyze these data (Leonelli 2016).
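As an illustration of what "contextualizing and documenting data" can mean in practice, here is a minimal sketch (ours, not from Lagoze or Leonelli; all field names are hypothetical) in which a dataset travels between researchers together with a record of how it was produced.

    from dataclasses import dataclass, field

    @dataclass
    class DocumentedDataset:
        values: list[float]
        instrument: str                 # how the data were produced
        collected_by: str               # who produced them
        collection_date: str            # when
        known_biases: list[str] = field(default_factory=list)  # caveats for reuse

    readings = DocumentedDataset(
        values=[0.12, 0.09, 0.15],
        instrument="pH probe, hypothetical model X",
        collected_by="field team A",
        collection_date="2020-06-01",
        known_biases=["summer sampling only", "urban sites overrepresented"],
    )

    # A downstream researcher checks the context before analyzing the values.
    for caveat in readings.known_biases:
        print("caveat:", caveat)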
These new practices are emerging alongside the traditional practices of the natural sciences and the Galilean paradigm that characterizes them. In astronomy and particle physics, the computerization of measuring instruments generates considerable masses of data, but within a theoretical framework that remains globally unchanged. In this context, big data tend rather to reinforce traditional regimes of scientificity by providing new observables and new tools for analysis. On the other hand, there is a whole field of knowledge production, scientific or not, ranging from amateur practices to large genomics projects, web data analysis software and computer-assisted literary analysis, which is transformed by big data. It is this new field that we are going to analyze by giving it two characterizations: new observables, which are singular not so much by their volume as by their very nature, and new tools that induce a specific configuration of players. In terms of the typology of players, the discourse on big data is driven more by the arguments of IT companies, which market big data processing solutions, than by players in natural science research. Relayed by the media, these discourses are aimed not so much at researchers in the natural sciences as at the ecosystem of these software companies: customers, partners and candidates. They are particularly flattering for computer scientists (researchers, professionals or amateurs) who have to manipulate large amounts of data. Thus, whether they are developers or data scientists, they are presented, claims Pedro Domingos (2015), professor of computer science at the University of Washington, as gods who create universes:
“A programmer – someone who creates algorithms and codes them up – is a minor god, creating universes at will. You could even say that the God of Genesis himself is a programmer: language, not manipulation, is his tool of creation. Words become worlds. Today, sitting on the couch with your laptop, you too can be a god. Imagine a universe and make it real. The laws of physics are optional”.
In these terms, it is understandable that this type of player is quick to relay and take on board big data rhetoric. While they are not necessarily the primary audience of the big data mythology, those who must subscribe to it, they are its heroes. This heroic status reflects a state of affairs in which they are the population most capable of manipulating digital data. Their technical skills thus make them de facto key players in big data, players whose practices are influenced by the corresponding rhetoric. Nevertheless, we will see in the following chapters that this configuration of players is incomplete: it is not sufficient, on its own, to produce knowledge, which emerges from a dual technical and epistemic constitution.
Compared, for example, with a classical configuration in the natural sciences, where researchers who possess the theoretical concepts and methodological knowledge of their disciplines rely on other players with the technical knowledge needed to operate measuring instruments, the exploitation of big data by computer scientists alone is an incomplete configuration, in which technical skills are not complemented by theoretical knowledge relating to the object under study. There is no epistemic or methodological continuity between theory, models and tools, simply because there is, in this configuration, no theoretical framework. In terms of the classical functioning of science, this configuration is not a “new paradigm”, as the rhetoric of big data would have it, but a problematic situation in which it is not possible to generate valid new knowledge.
From this perspective, the challenge of this book is to evaluate the role of these technical skills, but also to try to place them within a methodological continuity that integrates a theoretical framework, a conceptualization of data and standards for validating knowledge. First, in the absence of the sampling and representativeness standards of inferential statistics, the question is how to assign an epistemic value to big data.
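A small sketch (ours, not from the book) can show exactly which standard goes missing. A classical confidence interval presupposes a random sample; applied to "found" data, here traces that only exist for the most extreme part of a population, the same formula returns a precise-looking but badly biased answer.

    import numpy as np

    rng = np.random.default_rng(2)
    population = rng.normal(loc=50.0, scale=10.0, size=100_000)

    random_sample = rng.choice(population, size=1000)   # the premise of inferential statistics
    found_data = np.sort(population)[-1000:]            # traces recorded only for outliers

    for name, x in (("random sample", random_sample), ("found data", found_data)):
        half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))   # 95% CI half-width
        print(f"{name}: {x.mean():.1f} +/- {half_width:.1f} (true mean {population.mean():.1f})")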
We can consider that the singularity of big data in relation to previous epistemic practices arises essentially from the data themselves. At first approach, they can indeed be considered new data that do not come from scientific measuring instruments and are produced outside the framework of the natural sciences. In the typologies of “big data” outlined in the academic literature, astronomical or genomic data, for example, are absent or anecdotal. The analysis of 26 “types” of big data proposed by Kitchin and McArdle (2016) excludes these types of data. The sources listed are mobile communications, the web and social networks, sensors and cameras, transactions (such as scanning a barcode or making a payment with a credit card) and, finally, administrations. All these data have in common that they are produced by activities of human origin, and are therefore difficult to include in the field of the natural sciences. It is therefore not erroneous to consider that big data are very often new observables for the cultural sciences.
We will indeed rely on the distinction between the natural and cultural sciences, but we must now qualify it: the life and health sciences and the data-intensive sciences are part of the natural sciences. The specificity of big data is therefore not their object but the status of their observables, from which derives a methodological framework completely different from that of the Galilean sciences. The measurement of the objects of the world, directly analyzed in a theoretical continuity running from the instrument to the publication, is replaced by an a posteriori exploitation of data that are always already secondary, almost always already materialized by the time one considers processing them. It is therefore a framework in which the influence of a certain conception of the cultural sciences dominates, but in which any object can be mobilized.
Based on this conception, which we will develop further, we will examine in the following chapters the conditions of possibility of hypothetical computational sciences of culture. This is a formula we propose to designate a set of epistemic practices combining a certain theoretical and methodological framework developed from the cultural sciences with the capacity to mobilize massive digital data and computational processing tools. In the mythology we have analyzed, one element deserves to be taken into account for how obvious it is to practitioners: big data create a technical complexity that requires specific skills and does not allow existing tools to be mobilized as they stand. The two essential components of these hypothetical sciences are as follows: (1) the epistemic culture of the cultural sciences (in the sense we are going to give them), with their capacity to conceptualize a relationship with reality, and (2) the technical culture of computer scientists capable of translating concepts and methods into concrete tools, compatible with the previously defined conceptual framework. We are going to show, on the one hand, that these components exist, but, on the other hand, that they almost never manage to articulate with one another, and that they therefore lack epistemic continuity between them.
On the one hand, the problematization of the status of data and their epistemic value is emerging in several research communities. Technically limited in the new modes of access to the real world that might be available to them, because they lack the skills available to computer scientists, these researchers are nonetheless sensitive to the epistemological problems these data pose. While they do not necessarily try to solve them, they are at least theoretically convinced that a problem exists. If these players were able to acquire concrete technical means to process data, this could be done within a homogeneous epistemic culture articulating research problems, data, tools and methods.
On the other hand, computer scientists have the practical skills to handle large volumes of heterogeneous data, or to develop the required objects, but do not, by definition, fit into the epistemology of the cultural sciences. Their intervention takes the form of manipulations governed not by an epistemic project relating to the human fact, but by the systematic exploration of the space of manipulability provided by computer science. We will come back to this and confirm it in detail in Chapter 3.
Before that, we will explain what status can be given to observables in the cultural sciences, and how these sciences construct a relationship with reality. From this perspective, we will see what conceptualization of digital data can be proposed in the field of the cultural sciences, and what redefinition of the cultural sciences themselves is induced by the irruption of these new observables.

1.1. The epistemology of the cultural sciences

Before developing what relationship with reality and what norms govern the computational sciences of culture, we need to clarify the origin and meaning of this term, particularly in relation to other disciplinary divisions. The notion of “cultural sciences” thus comes to us from the neo-Kantian school of Heidelberg...

Table of contents

  1. Cover
  2. Table of Contents
  3. Title Page
  4. Copyright Page
  5. Introduction
  6. 1 From Trace to Web Data: An Ontology of the Digital Footprint
  7. 2 Toward an Epistemic Continuity Anchored in the Cultural Sciences
  8. 3 The Status of Computation in Data Sciences
  9. 4 A Practical Big Data Use Case
  10. 5 From Narratives to Systems: How to Shape and Share Data Analysis
  11. 6 The Art of Data Visualization
  12. 7 Knowledge and Decision
  13. Conclusion
  14. References
  15. Index
  16. Other titles from ISTE in Information Systems, Web and Pervasive Computing
  17. End User License Agreement