TECHNOLOGY, PRESERVATION, AND MANAGEMENT ISSUES
Archiving Web Sites for Preservation and Access: MODS, METS and MINERVA
Rebecca Guenther
Leslie Myrick
BACKGROUND: ISSUES IN WEB ARCHIVING
As the world becomes increasingly dependent upon the Web as a medium for disseminating government information, scientific and academic research, news, and any variety of general information, archivists and digital librarians are no longer asking, even rhetorically, "Why Archive the Web?"1 but rather: how best and most expediently can we capture and manage the vast store of Web-based materials to assure preservation and access? The urgency of this question is reflected in the emergence of two broad initiatives, the International Internet Preservation Consortium (IIPC)2 and the National Digital Information Infrastructure and Preservation Program (NDIIPP),3 half of whose eight recent project grants involve a Web archiving component.
Characterized by its volatility and ephemerality, Web-based material has been rightly labeled a moving target. The BBC Web site, whose banner claims that the site is updated every minute of every day,4 is an example. Although the entry page for the BBC site may change by the minute yet continue to disseminate news for decades, the average lifespan of a Web page, according to Peter Lyman in the article cited above, is forty-four days.5 Entire Web sites disappear at an alarming rate. A case in point is a campaign Web site for a candidate in the Nigerian elections of April 2003,6 or any of the 135 candidates' Web sites in the California Recall campaign of 2003.7 We will focus here on the domain of political or governmental Web sites, especially campaign and election sites, whose ephemeral nature, paired with their vital import, inspires a particular urgency. This important set of born-digital materials has attracted a number of recent Web archiving endeavors, including the Library of Congress's MINERVA Project, the CRL-sponsored Political Communications Web site Archiving Project (PCWA), the California Digital Library's Web-based Government Information Project, the UCLA Online Campaign Literature archive, and, covering another genre of lost or defunct political materials, the CyberCemetery at the University of North Texas.8
At the same time that we, as Web site archivists, ask ourselves, "How do we collect it before it disappears or radically changes?" we also face the question, "How do we define what we are collecting?" Add to the mercurial disposition of Web materials the conundrum of delineating the boundaries of any given Web site that we choose to collect, or of fully articulating a Web site's structure: a tangle of hyperlinks cross-referenced to an essentially simple logical tree, delivered from a heteromorphic physical file-server structure and offering up a plethora of associated MIME types.9 Then, once the material is collected, we must ask, "How do we manage and provide access to it, especially if we have harvested versions of the same site many times a week, or even a day?" What emerges is that a Web site is one of the most complex and challenging digital objects to capture, describe, manage, and preserve.
This article examines some of the vexing technical issues surrounding preservation and access in the archiving of Web sites, as identified in our respective work on the MINERVA and PCWA projects. The overlapping concerns of that work led to a fruitful collaboration between the two groups, partnering with the Internet Archive as the supplier of our Web content and intent upon exploring the use of MODS and METS10 as metadata strategies. We argue that, among the proliferation of schemas available for packaging and managing complex digital objects (DIDL, METS, and IMS-CP, to name a few),11 METS is uniquely suited to encapsulate a Web site object as a Submission Information Package (SIP), Archival Information Package (AIP), or Dissemination Information Package (DIP) for use in an OAIS-compliant repository.12 We also consider how the experience gained from these two endeavors can be used to promulgate a set of METS Profiles for Web sites.
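To make the packaging idea concrete, the following sketch builds a skeletal METS document for a two-page harvested site, with a fileSec inventorying the captured files and a logical structMap recording the site's tree. The element and attribute names come from the METS schema; the file paths, IDs, and label values are hypothetical.

```python
# A minimal sketch of a METS wrapper for a harvested Web site.
# Element names follow the METS schema; paths, IDs, and labels
# below are invented for illustration.
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def q(tag):
    return f"{{{METS_NS}}}{tag}"

def build_mets(pages):
    """pages: list of (file_id, path, label) tuples for harvested files."""
    mets = ET.Element(q("mets"))
    # fileSec inventories every captured file in the package
    filesec = ET.SubElement(mets, q("fileSec"))
    grp = ET.SubElement(filesec, q("fileGrp"), {"USE": "harvested"})
    for fid, path, _ in pages:
        f = ET.SubElement(grp, q("file"), {"ID": fid, "MIMETYPE": "text/html"})
        ET.SubElement(f, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path})
    # structMap records the site's logical tree: entry page, then children
    smap = ET.SubElement(mets, q("structMap"), {"TYPE": "logical"})
    root_div = ET.SubElement(smap, q("div"), {"LABEL": "Web site"})
    for fid, _, label in pages:
        d = ET.SubElement(root_div, q("div"), {"LABEL": label})
        ET.SubElement(d, q("fptr"), {"FILEID": fid})
    return mets

site = build_mets([
    ("F1", "archive/index.html", "Entry page"),
    ("F2", "archive/platform.html", "Platform statement"),
])
xml_str = ET.tostring(site, encoding="unicode")
```

A real package would add dmdSec (e.g., MODS) and amdSec sections for descriptive and preservation metadata; the point here is only that the fileSec/structMap pair can articulate a site's files and their logical arrangement in one document.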
PROTOTYPES
Access to Web resources has evolved from Web indexing in situ for discovery to a model of curation, in which repositories built on the OAIS model undertake responsibility for the preservation of, and access to, specialized collection domains. Crawlers, agents, and robots (Excite, Yahoo, Lycos) and their comically monikered search-engine predecessors (Archie, Gopher, Veronica, and Jughead) began scouring the Web for indexing purposes in the early 1990s. A later development out of Stanford, Google's indexing with limited caching, was not intended to archive pages per se, but the Google cache has occasionally served as a ready source for a defunct page.
The notion of depositing into a continuous archive as much of the World Wide Web as a voracious crawler could harvest was the brainchild of Brewster Kahle, co-founder and president of Alexa and the force behind the not-for-profit Internet Archive,13 whose Wayback Machine has served since 1996 as a historical record of changes to extant Web sites as well as a museum of extinct ones. The quality of deposited materials culled from an eight-week broad-swath crawl by Alexa, coupled with a very limited search interface (exclusively by URL and, more recently, by date), has led to the indictment that the Wayback Machine achieves neither archival preservation nor access standards.14 To the Internet Archive's credit, it responded by developing a truly archival crawler, Heritrix,15 and partnering with the IIPC to marry Heritrix to a robust repository infrastructure.
The early impetus to construct truly archival Web crawlers and repositories came, not surprisingly, from the realm of national deposit libraries. Groundbreaking work was undertaken in the last decade, primarily by the national libraries of Australia, Sweden, and France, to build preservation and access infrastructures for archiving Web materials. Two early prototypes standing at the antipodes, the National Library of Australia's (NLA) PANDORA and the National Library of Sweden's Kulturarw3 Project,16 are interesting as paradigms of diametrically opposed approaches to capture, metadata creation, storage, and access, approaches that can be broadly construed as library-centric versus IT-centric. Both projects were and are dedicated to the long-term preservation of national digital assets: not simply the bytes, but also the original look and feel of the object.
The NLA uses a combination of push and pull technology17 and negotiates a relationship with every publisher whose works, a limited number preselected by a curator from the Australian national domain, are to be archived. PANDORA continues to use and improve upon PANDAS, an in-house Java-based application wrapped around HTTrack,18 for selection, capture, and management. The project is characterized by selective capture, MARC-centered cataloging, storage of Web site mirrors, and access primarily through the NLA's integrated library system, with a new full-text search mechanism soon to be released.
In its early stages the Kulturarw3 Project used strictly pull processing in a broad-swath harvest of what it labeled the Swedish Web, using an altered version of the Combine harvester.19 It depended on crawler-generated metadata for discovery and management; stored its files in multi-MIME-type archive files similar to the Alexa .arc format (i.e., crawler headers, HTTP headers,20 and file content packed into an aggregate); and depended on full-text search for access, available only on dedicated terminals in the National Library of Sweden. A pioneer in its time, this project has since been integrated into the larger sphere of Nordic Web initiatives, specifically the NWA, or Nordic Web Archive.21
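The aggregate layout described above can be sketched as follows: a simplified, ARC-like record consisting of a crawler header line, the captured HTTP headers, and the page content, concatenated into a single archive file. The field layout here is illustrative only, not the exact Alexa .arc or Kulturarw3 specification, and the example URL and payload are invented.

```python
# Sketch of an ARC-like aggregate: each record is a crawler header line
# (URL and payload length), then the captured HTTP headers and body.
import io

def write_record(archive, url, http_headers, body):
    # Payload preserves the HTTP response as fetched: headers, blank
    # line, then content bytes.
    payload = http_headers + b"\r\n\r\n" + body
    archive.write(f"{url} {len(payload)}\n".encode("utf-8"))
    archive.write(payload)
    archive.write(b"\n")  # record separator

def read_records(data):
    records = []
    pos = 0
    while pos < len(data):
        nl = data.index(b"\n", pos)
        url, length = data[pos:nl].decode("utf-8").rsplit(" ", 1)
        start = nl + 1
        end = start + int(length)
        records.append((url, data[start:end]))
        pos = end + 1  # skip the record separator
    return records

buf = io.BytesIO()
write_record(buf, "http://example.se/",
             b"HTTP/1.0 200 OK\r\nContent-Type: text/html",
             b"<html>Valresultat</html>")
records = read_records(buf.getvalue())
```

The design point is that many fetches of many MIME types can be appended to one file, so the archive scales to broad-swath crawls without creating millions of small files, at the cost of needing an index or full scan for retrieval.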
The NWA is a collaborative effort among the national libraries of Norway, Denmark, Finland, Iceland, and Sweden to explore issues involved in harvesting Web materials into national deposit libraries. Since 2000, the group has been developing an application, written in various programming languages, to provide access to harvested Web-based materials. This NWA Toolset22 opened the way to exposing archived Web site metadata in XML in the form of its own NWA Document Format schema,23 which comprises descriptive and preservation metadata for a single Web page along with a list of links parsed from the page. The XML is then indexed and can be searched using either of two supported search engines: the proprietary FAST Search & Transfer ASA24 or Jakarta's open-source Lucene.25 Although a step in the right direction, the NWA Toolset's use of XML arguably does not go far enough, on two counts: first, it is a proprietary schema; second, it catalogs a discrete sub-object, i.e., it encompasses neither the structure of the Web site as a complex, articulated digital object nor the interrelations among its components. When indexing is done at the page level only, queries will inevitably return perhaps thousands of URLs from various sites, with no clear indication of where to find the entry point of the site that contains a given page and would thus contextualize it. In this article we suggest that indexing and navigating METS for an entire Web site may resolve this particular access puzzle.
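A hypothetical per-page record in the spirit of the NWA Document Format might look like the following sketch: descriptive metadata for one fetched page plus its parsed outlinks. The element names, URL, and values are invented for illustration; the actual schema is the NWA's own.

```python
# Build a per-page XML record: metadata for a single page plus the
# links parsed from it. All names and values here are hypothetical.
import xml.etree.ElementTree as ET

def page_record(url, title, fetched, outlinks):
    doc = ET.Element("document")
    ET.SubElement(doc, "url").text = url
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "fetched").text = fetched  # harvest timestamp
    links = ET.SubElement(doc, "outlinks")
    for href in outlinks:
        ET.SubElement(links, "link").text = href
    return ET.tostring(doc, encoding="unicode")

record = page_record(
    "http://example.no/valg/index.html",
    "Valg 2001",
    "2001-09-01",
    ["http://example.no/valg/partier.html"],
)
```

Because each such record describes a single page in isolation, a full-text hit against an index of them carries no pointer back to the entry page of the site that contains the page, which is precisely the contextualization gap that a site-level METS document could close.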
HARVESTING AND TECHNICAL ISSUES
In response to the need to archive, back up, or mirror Web sites, Web capture has been taking place in various scenarios, using various tools. On a large scale, a crawler such as Alexa's can harvest many terabytes of files during its eight-week crawl and deposit them into a repository infrastructure such as the Wayback Machine. On a smaller scale, any like-minded Web site peruser, librarian, archivist, or area studies specialist can use an offline mirroring tool (e.g., HTTrack) to download mirrors of sites onto a local PC. HTTrack was used for early prototypes such as the MINERVA testbed and is the kernel around which the PANDORA Project's PANDAS application was built. Other tools used in early Web site archiving projects are the freely available GNU tool wget26 and the Mercator crawler (until recently sponsored by Compaq and then HP),27 both of which can mirror sites locally onto a UNIX server.
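Mirroring tools such as wget and HTTrack implement, in miniature, the same recursive fetch-and-parse loop as large-scale crawlers. The following minimal sketch shows that loop with network I/O stubbed out by an in-memory link graph, so that the control flow (seed list, frontier, visited set, depth cutoff) is the point rather than the HTTP plumbing; the function names and toy graph are hypothetical.

```python
# Minimal breadth-first crawl loop: seed URLs, a frontier queue, a
# visited set for deduplication, and a depth limit to bound the crawl.
from collections import deque

def crawl(seeds, fetch_links, max_depth):
    """fetch_links(url) -> list of URLs parsed from the fetched page."""
    visited = set()
    frontier = deque((url, 0) for url in seeds)
    while frontier:
        url, depth = frontier.popleft()
        if url in visited or depth > max_depth:
            continue  # dedup and depth cutoff prevent loops and runaway crawls
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append((link, depth + 1))
    return visited

# Toy link graph standing in for real HTTP fetches and link parsing
graph = {
    "/": ["/a", "/b"],
    "/a": ["/", "/deep"],
    "/b": [],
    "/deep": ["/deeper"],
    "/deeper": [],
}
pages = crawl(["/"], lambda u: graph.get(u, []), max_depth=2)
```

A production crawler adds what this sketch omits: politeness delays, robots.txt handling, URL canonicalization, and link extraction from formats far messier than a clean list, but the frontier/visited-set skeleton is the same.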
Web crawlers are applications, usually written in C or Java, that work from a seed or seed list of URLs, sending HTTP requests to a Web server and then parsing any hypertext links found in a given fetched page that point to other pages, recursively parsing the links on those pages until a specified limit is reached (perhaps a depth or breadth limit placed on the crawl, or a time or bandwidth limit). Most industrial-strength crawlers use a sophisticated system that can parse links out of various troublesome formats, such as Flash and JavaScript,28 apply rules to the newly discovered links, and schedule subsequent parsing to prevent duplication of effort, endless loops, or a crawl into oblivion. The crawler ...