Introduction
The advent of the Web as a primary source will dramatically affect the practice of researching, writing, and thinking about history. Historians are entering into an era where we will have more information than ever before, left behind by people who rarely before entered the historical record. Web archives will fundamentally transform much of what a historian does, requiring a move towards computational methodologies and the digital humanities.
Web archives matter. One cannot write most histories of the 1990s or later without reference to web archives; at the very least, to do so would be to neglect a major medium of the period. Web archivists and institutions are today engaged in a collaborative effort to ensure that people in the future know what happened in 1996, or 2001, or 2006, or today. This effort ensures that we as a society will have the information that we need to make arguments for justice, for equality, for policy, for a better understanding of ourselves, and beyond. Web archives will be a foundation for history.
Crucially, historians need to be ready for this shift. They will soon be writing histories of the 1990s that require web archives to do justice to their topics. While there is no exact metric for when past events become fodder for historical interpretation, it is worth noting that the first historical narratives of the 1960s in the United States and Canada, for example, began to appear in the 1980s; by the 1990s, established monographs and doctoral studies could be undertaken (Gitlin, 1987; Isserman, 1987; Kostash, 1980; Levitt, 1984; Owram, 1997). The 1990s are now history in the same way. As the Web is well over 25 years old, with widespread web archiving having begun over two decades ago with the Internet Archive in 1996, we are roughly at the time when the first serious historical studies will begin to be undertaken, and it is likely that many of the field's future trailblazing doctoral students are now beginning to contemplate their first degrees. To not use web archives would run the very real risk of fundamentally misrepresenting the topics of this period. This will happen sooner than we think.
This chapter explores what the changing nature of historical scale will mean for historians. It begins by discussing how web archives will become increasingly central to the historical profession. Following this, drawing upon Franco Moretti's concepts of close reading versus distant reading, it advances a typology of research projects carried out to date. The chapter then discusses the next directions for the field, especially the growing importance of analysing metadata rather than exploring content itself. It concludes by situating the contemporary trend within a ‘third wave of computational history’, suggesting how historians could profit by understanding themselves in their own historical context.
The Growing Centrality of Web Archives to the Historical Profession
To reinforce the importance of web archives, consider the topics whose histories could not be written without them. Without web archives, one could not write histories of the late 1990s Tamagotchi trend, figuring out what that meant about our relationship to animals, each other, and technology; or political histories of late 1990s Internet censorship, from the V-Chip to the Communications Decency Act in the United States, critical moments in our early Web history that might have fundamentally changed how we interacted with the medium; or economic and business histories of the 1990s dot-com bubble; or even histories of events of pivotal significance like the attacks of September 11th, 2001. Each of the above would be immeasurably enhanced by the use of web archives, and considerably diminished without them. This is not a niche area. Crucially, these examples underscore that Web histories will not just be histories of the World Wide Web (although those are important and well represented in this Handbook), but histories that happen to use the Web as a primary source because of its significant role in knowledge production and communication.
The novelty of these web archives can be seen in two respects: that of scale, in that we have more data than ever before, and that of scope, in that kinds of sources that were rarely preserved before are now being kept. In this section, I will explore scale and scope in turn.
We are now working with sources that are being preserved on a different scale than historians have previously been used to. In this we are seeing the insights of the late American historian Roy Rosenzweig borne out: he foresaw in a 2003 American Historical Review article that historians were shifting from an environment of scarcity to one of abundance (Rosenzweig, 2003). In other words, historians have traditionally wished for more information about the past – now, when working with web archives, historians are threatened by having too many sources to parse and explore.
Some examples can bear out the sheer scale of born-digital content being generated every day. A constantly updated page, ‘My Data is Bigger than Your Data’, published by University of Waterloo computer scientist Jimmy Lin, gives a tally of the ever-changing boasts of just how big datasets are. A few figures help to bring this into focus. In January 2017, Twitter announced that it was storing over 500 petabytes of information (one petabyte is 1,000 terabytes). The Internet Archive has over 30 petabytes of archives, with approximately 13 to 15 terabytes per day being added to its collections. Spotify, a music streaming service, collects over a terabyte of user data every day from its over 75 million users and one and a half billion playlists. YouTube sees a petabyte of data uploaded every single day (Lin, n.d.). Not all of this will be kept. Snapchat, for example, is filled with content that is largely intended to be transient; similarly, Facebook and many corporate databases will likely not be archived for historical consumption. Given ethical concerns and rights to privacy, this is not necessarily a bad thing! But even if a fraction of the above is kept, historians will be challenged to no end. In particular, the Internet Archive's activities are of interest, as it is collecting with an eye to future research access. In short, the amount of information generated on the Web means that our historical record is dramatically changing.
The expansive scope of web archives, too, has the prospect of bringing more people into the historical record. Much of what can be found in the Internet Archive consists of primary sources authored by people who would never before have been part of the historical record. It is not simply that instead of learning about Tamagotchis from The New York Times or The Guardian we can learn about them from Web-based sources, but that we can begin to work with the pages of people who actually used Tamagotchis. Young kids and their parents created sites in the child-focused section of GeoCities, for example, allowing us to work with this innovative primary source (see my chapter on GeoCities in this Handbook). It is emblematic of a broader shift ...