TECHNOLOGY, PRESERVATION, AND MANAGEMENT ISSUES
Archiving Web Sites for Preservation and Access: MODS, METS and MINERVA
Rebecca Guenther
Leslie Myrick
BACKGROUND: ISSUES IN WEB ARCHIVING
As the world becomes increasingly dependent upon the Web as a medium for disseminating government information, scientific and academic research, news, and any variety of general information, archivists and digital librarians are no longer asking, even rhetorically, "Why Archive the Web?"1 but rather: how best and most expediently can we capture and manage the vast store of Web-based materials to assure preservation and access? The urgency of this question is reflected in the emergence of two broad initiatives, the International Internet Preservation Consortium (IIPC)2 and the National Digital Information Infrastructure and Preservation Program (NDIIPP),3 half of whose eight recent project grants involve a Web archiving component.
Characterized by its volatility and ephemerality, Web-based material has been rightly labeled a moving target. The BBC Web site, whose banner claims that the site is updated every minute of every day,4 is an example. Although the entry page for the BBC site may change by the minute yet continue to disseminate news for decades, the average lifespan of a Web page, according to Peter Lyman in the article cited above, is forty-four days.5 Entire Web sites disappear at an alarming rate. A case in point is a campaign Web site for a candidate in the Nigerian elections of April 2003,6 or any of the 135 candidates' Web sites in the California Recall campaign of 2003.7 We will focus here on the domain of political or governmental Web sites, especially campaign and election sites, whose ephemeral nature, paired with their vital import, inspires a particular urgency. This important set of born-digital materials has attracted a number of recent Web archiving endeavors, including the Library of Congress's MINERVA Project, the CRL-sponsored Political Communications Web site Archiving Project (PCWA), the California Digital Library's Web-based Government Information Project, the UCLA Online Campaign Literature archive, and, covering another genre of lost or defunct political materials, the CyberCemetery at the University of North Texas.8
At the same time that we, as Web site archivists, ask ourselves, "How do we collect it before it disappears or radically changes?" we also face the question, "How do we define what we are collecting?" Add to the mercurial disposition of Web materials the conundrum of delineating the boundaries of any given Web site that we choose to collect, or of fully articulating a Web site's structure: a tangle of hyperlinks cross-referenced to an essentially simple logical tree, delivered from a heteromorphic physical file-server structure and offering up a plethora of associated MIME types.9 Then, once the material is collected, we must ask, "How do we manage and provide access to it, especially if we have harvested versions of the same site many times a week, or even a day?" What emerges is that a Web site is one of the most complex and challenging digital objects to capture, describe, manage, and preserve.
This article examines some of the vexing technical issues surrounding preservation and access in the archiving of Web sites, as identified in our respective work on the MINERVA and PCWA projects. The overlapping concerns of that work led to a fruitful collaboration between the two groups, partnering with the Internet Archive as the supplier of our Web content and intent upon exploring the use of MODS and METS10 as metadata strategies. We argue that, among the proliferation of schemas available for packaging and managing complex digital objects (DIDL, METS, and IMS-CP, to name a few),11 METS is uniquely suited to encapsulate a Web site object as a Submission Information Package (SIP), Archival Information Package (AIP), or Dissemination Information Package (DIP) for use in an OAIS-compliant repository.12 We also consider how the experience gained from these two endeavors can be used to promulgate a set of METS Profiles for Web sites.
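To make the packaging idea concrete, the following sketch builds a skeletal METS document for a two-page harvested site, with a fileSec inventorying the captured files and a logical structMap recording the site's tree. The element and attribute names come from the METS schema; the file paths, IDs, and label values are hypothetical.

```python
# A minimal sketch of a METS wrapper for a harvested Web site.
# Element names follow the METS schema; paths, IDs, and labels
# below are invented for illustration.
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def q(tag):
    return f"{{{METS_NS}}}{tag}"

def build_mets(pages):
    """pages: list of (file_id, path, label) tuples for harvested files."""
    mets = ET.Element(q("mets"))
    # fileSec inventories every captured file in the package
    filesec = ET.SubElement(mets, q("fileSec"))
    grp = ET.SubElement(filesec, q("fileGrp"), {"USE": "harvested"})
    for fid, path, _ in pages:
        f = ET.SubElement(grp, q("file"), {"ID": fid, "MIMETYPE": "text/html"})
        ET.SubElement(f, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path})
    # structMap records the site's logical tree: entry page, then children
    smap = ET.SubElement(mets, q("structMap"), {"TYPE": "logical"})
    root_div = ET.SubElement(smap, q("div"), {"LABEL": "Web site"})
    for fid, _, label in pages:
        d = ET.SubElement(root_div, q("div"), {"LABEL": label})
        ET.SubElement(d, q("fptr"), {"FILEID": fid})
    return mets

site = build_mets([
    ("F1", "archive/index.html", "Entry page"),
    ("F2", "archive/platform.html", "Platform statement"),
])
xml_str = ET.tostring(site, encoding="unicode")
```

A real package would add dmdSec (e.g., MODS) and amdSec sections for descriptive and preservation metadata; the point here is only that the fileSec/structMap pair can articulate a site's files and their logical arrangement in one document.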
PROTOTYPES
Access to Web resources has evolved from Web indexing in situ for discovery to a model of curation, in which repositories built on the OAIS model undertake responsibility for the preservation of, and access to, specialized collection domains. Crawlers, agents, and robots (Excite, Yahoo, Lycos) and their comically monikered search-engine predecessors (Archie, Gopher, Veronica, and Jughead) began scouring the Web for indexing purposes in the early 1990s. A later development out of Stanford, Google's indexing with limited caching, was not intended to archive pages per se, but the Google cache has occasionally served as a ready source for a defunct page.
The notion of depositing into a continuous archive as much of the World Wide Web as a voracious crawler could harvest was the brainchild of Brewster Kahle, co-founder and president of Alexa and the force behind the not-for-profit Internet Archive,13 whose Wayback Machine has served since 1996 as a historical record of changes to extant Web sites as well as a museum of extinct ones. The quality of deposited materials culled from an eight-week broad-swath crawl by Alexa, coupled with a very limited search interface (exclusively by URL and, more recently, by date), has led to the indictment that the Wayback Machine achieves neither archival preservation nor access standards.14 To the Internet Archive's credit, it responded by developing a truly archival crawler, Heritrix,15 and partnering with the IIPC to marry Heritrix to a robust repository infrastructure.
The early impetus to construct truly archival Web crawlers and repositories came, not surprisingly, from the realm of national deposit libraries. Groundbreaking work was undertaken in the last decade, primarily by the national libraries of Australia, Sweden, and France, to build preservation and access infrastructures for archiving Web materials. Two early prototypes standing at the antipodes, the National Library of Australia's (NLA) PANDORA and the National Library of Sweden's Kulturarw3 Project,16 are interesting as paradigms of diametrically opposed approaches to capture, metadata creation, storage, and access, approaches that can be broadly construed as library-centric versus IT-centric. Both projects were and are dedicated to the long-term preservation of national digital assets: not simply the bytes, but also the original look and feel of the object.
The NLA uses a combination of push and pull technology17 and negotiates a relationship with every publisher whose works, a limited number preselected by a curator from the Australian national domain, are to be archived. PANDORA continues to use and improve upon PANDAS, an in-house Java-based application wrapped around HTTrack,18 for selection, capture, and management. The project is characterized by selective capture, MARC-centered cataloging, storage of Web site mirrors, and access primarily through the NLA's integrated library system, with a new full-text search mechanism soon to be released.
In its early stages the Kulturarw3 Project used strictly pull processing in a broad-swath harvest of what it labeled the Swedish Web, using an altered version of the Combine harvester.19 It depended on crawler-generated metadata for discovery and management; stored its files in multi-MIME-type archive files similar to the Alexa .arc format (i.e., crawler headers, HTTP headers,20 and file content packed into an aggregate); and depended on full-text search for access, available only on dedicated terminals in the National Library of Sweden. A pioneer in its time, this project has since been integrated into the larger sphere of Nordic Web initiatives, specifically the NWA, or Nordic Web Archive.21
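The aggregate layout described above can be sketched as follows: a simplified, ARC-like record consisting of a crawler header line, the captured HTTP headers, and the page content, concatenated into a single archive file. The field layout here is illustrative only, not the exact Alexa .arc or Kulturarw3 specification, and the example URL and payload are invented.

```python
# Sketch of an ARC-like aggregate: each record is a crawler header line
# (URL and payload length), then the captured HTTP headers and body.
import io

def write_record(archive, url, http_headers, body):
    # Payload preserves the HTTP response as fetched: headers, blank
    # line, then content bytes.
    payload = http_headers + b"\r\n\r\n" + body
    archive.write(f"{url} {len(payload)}\n".encode("utf-8"))
    archive.write(payload)
    archive.write(b"\n")  # record separator

def read_records(data):
    records = []
    pos = 0
    while pos < len(data):
        nl = data.index(b"\n", pos)
        url, length = data[pos:nl].decode("utf-8").rsplit(" ", 1)
        start = nl + 1
        end = start + int(length)
        records.append((url, data[start:end]))
        pos = end + 1  # skip the record separator
    return records

buf = io.BytesIO()
write_record(buf, "http://example.se/",
             b"HTTP/1.0 200 OK\r\nContent-Type: text/html",
             b"<html>Valresultat</html>")
records = read_records(buf.getvalue())
```

The design point is that many fetches of many MIME types can be appended to one file, so the archive scales to broad-swath crawls without creating millions of small files, at the cost of needing an index or full scan for retrieval.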
The NWA is a collaborative effort among the national libraries of Norway, Denmark, Finland, Iceland, and Sweden to explore issues involved in harvesting Web materials into national deposit libraries. Since 2000, the group has been developing an application, written in various programming languages, to provide access to harvested Web-based materials. This NWA Toolset22 opened the way to exposing archived Web site metadata in XML in the form of its own NWA Document Format schema,23 which comprises descriptive and preservation metadata for a single Web page along with a list of links parsed from the page. The XML is then indexed and can be searched using either of two supported search engines: the proprietary FAST Search & Transfer ASA24 or Jakarta's open-source Lucene.25 Although a step in the right direction, the NWA Toolset's use of XML arguably does not go far enough, on two counts: first, it is a proprietary schema; second, it catalogs a discrete sub-object, i.e., it encompasses neither the structure of the Web site as a complex, articulated digital object nor the interrelations among its components. When indexing is done at the page level only, queries will inevitably return perhaps thousands of URLs from various sites, with no clear indication of where to find the entry point of the site that contains a given page and would thus contextualize it. In this article we suggest that indexing and navigating METS for an entire Web site may resolve this particular access puzzle.
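A hypothetical per-page record in the spirit of the NWA Document Format might look like the following sketch: descriptive metadata for one fetched page plus its parsed outlinks. The element names, URL, and values are invented for illustration; the actual schema is the NWA's own.

```python
# Build a per-page XML record: metadata for a single page plus the
# links parsed from it. All names and values here are hypothetical.
import xml.etree.ElementTree as ET

def page_record(url, title, fetched, outlinks):
    doc = ET.Element("document")
    ET.SubElement(doc, "url").text = url
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "fetched").text = fetched  # harvest timestamp
    links = ET.SubElement(doc, "outlinks")
    for href in outlinks:
        ET.SubElement(links, "link").text = href
    return ET.tostring(doc, encoding="unicode")

record = page_record(
    "http://example.no/valg/index.html",
    "Valg 2001",
    "2001-09-01",
    ["http://example.no/valg/partier.html"],
)
```

Because each such record describes a single page in isolation, a full-text hit against an index of them carries no pointer back to the entry page of the site that contains the page, which is precisely the contextualization gap that a site-level METS document could close.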
HARVESTING AND TECHNICAL ISSUES
In response to the need to archive, back up, or mirror Web sites, Web capture has been taking place in various scenarios, using various tools. On a large scale, a crawler such as Alexa's can harvest many terabytes of files during its eight-week crawl and deposit them into a repository infrastructure such as the Wayback Machine. On a smaller scale, any like-minded Web site peruser, librarian, archivist, or area studies specialist can use an offline mirroring tool (e.g., HTTrack) to download mirrors of sites onto a local PC. HTTrack was used for early prototypes such as the MINERVA testbed and is the kernel around which the PANDORA Project's PANDAS application was built. Other tools used in early Web site archiving projects are the freely available GNU tool wget26 and the Mercator crawler (until recently sponsored by Compaq and then HP),27 both of which can mirror sites locally onto a UNIX server.
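Mirroring tools such as wget and HTTrack implement, in miniature, the same recursive fetch-and-parse loop as large-scale crawlers. The following minimal sketch shows that loop with network I/O stubbed out by an in-memory link graph, so that the control flow (seed list, frontier, visited set, depth cutoff) is the point rather than the HTTP plumbing; the function names and toy graph are hypothetical.

```python
# Minimal breadth-first crawl loop: seed URLs, a frontier queue, a
# visited set for deduplication, and a depth limit to bound the crawl.
from collections import deque

def crawl(seeds, fetch_links, max_depth):
    """fetch_links(url) -> list of URLs parsed from the fetched page."""
    visited = set()
    frontier = deque((url, 0) for url in seeds)
    while frontier:
        url, depth = frontier.popleft()
        if url in visited or depth > max_depth:
            continue  # dedup and depth cutoff prevent loops and runaway crawls
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append((link, depth + 1))
    return visited

# Toy link graph standing in for real HTTP fetches and link parsing
graph = {
    "/": ["/a", "/b"],
    "/a": ["/", "/deep"],
    "/b": [],
    "/deep": ["/deeper"],
    "/deeper": [],
}
pages = crawl(["/"], lambda u: graph.get(u, []), max_depth=2)
```

A production crawler adds what this sketch omits: politeness delays, robots.txt handling, URL canonicalization, and link extraction from formats far messier than a clean list, but the frontier/visited-set skeleton is the same.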
Web crawlers are applications, usually written in C or Java, that work from a seed or seed list of URLs, sending HTTP requests to a Web server and then parsing any hypertext links found in a given fetched page that point to other pages, recursively parsing the links on those pages until a specified limit is reached (perhaps a depth or breadth limit placed on the crawl, or a time or bandwidth limit). Most industrial-strength crawlers use a sophisticated system that can parse links out of various troublesome formats, such as Flash and JavaScript,28 apply rules to the newly discovered links, and schedule subsequent parsing to prevent duplication of effort, endless loops, or a crawl into oblivion. The crawler ...