CHAPTER I.1
METADATA RESEARCH: MAKING DIGITAL RESOURCES USEFUL AGAIN?
Miguel-Angel Sicilia
Department of Computer Science
University of Alcalá, Polytechnic building
Ctra. Barcelona km. 33.6
Alcalá de Henares (Madrid), Spain
[email protected]
Keywords: Metadata, Linked Data, microdata, terminologies
1. Introduction
“Metadata” has become a term frequently used both in academia and in professional contexts. As an indicator of its growing acceptance as a common concept, the Google Scholar service1 estimates more than 1 million results for a query consisting only of the term. The estimate rises to more than 30 million if we instead use the general-purpose search service provided by Google. While the common usage of the term seems to be uncontroversial, there is an increasing heterogeneity in the ways metadata is defined, created, managed, and stored. This heterogeneity probably comes from the lack of a precise definition of metadata that captures the main elements behind the application of metadata technologies.
Metadata is commonly defined as “data about data”, according to its etymology. While this definition cannot be considered false, it covers too many things and at the same time captures only part of the aspects that researchers and practitioners working with metadata consider important. Following such a definition, if I write some data on a piece of paper about an interesting book from my local library, that is a piece of metadata. This naïve example can actually be used to raise several of the important questions revolving around metadata research. For example, for some people metadata only applies to digital information (and this is precisely the focus of interest we take here). For others, metadata needs to be formulated with some form of schema or structure that brings a level of standardization or homogeneous use across Web sites and systems.
Another problem with metadata as a concept is that the term that has reached widespread use is “metadata” and not “meta-information”. There is a conceptual distinction according to which data, information and knowledge are different but interrelated things (Zins, 2007). However, to follow the common use of the term, we use metadata here as a generic term that also covers any kind of meta-information.
Metadata existed many years before the Web was even conceived. However, with the Web, metadata has been brought to the heart of the architecture of cyberspace. Originally the Web was made up only of HTML pages following a simple interlinked structure. But it has evolved into something much more complex, in which metadata mixes with the contents of the pages or is arranged as a layer of information that “points” to the resources described via URIs.2 Also, HTML is no longer the only way of describing information on the Web. XML first and RDF later, along with some microformats, are the main means of expressing metadata today.
Understanding what metadata is and how it manifests today on the Web is a key skill for practitioners and researchers in a variety of domains. Here we attempt to succinctly delineate the main characteristics of metadata and the way metadata nowadays forms a space of information that surrounds the Web.
The rest of this chapter is structured as follows. Section 2 briefly discusses the emergence of metadata as a differentiated area of inquiry. Then, in Section 3, a definition of metadata is provided with the aim of covering that area of inquiry in a broad sense. Section 4 then discusses some particular kinds of metadata as illustrations of its diversity. Finally, conclusions and outlook are provided in Section 5.
2. Metadata as a research discipline
In recent years we have started to speak of “metadata research”. People have started to define themselves as “metadata specialists” and there have been international projects that were basically “metadata aggregation” projects. But is there such a thing as a discipline or area of “metadata research”? This is difficult to say, as the discipline is not defined by any society or professional organization to our knowledge. While there exist a few scholarly journals that have “metadata” in the title, and conferences that explicitly deal with metadata, delineating the boundaries of the topic is challenging.
It is also difficult to clearly define the object of metadata research. A tentative approach would be to define that object as one would for an engineering discipline. Engineering is the science of design and production, and in this case we aim at devising information mechanisms for better access to information resources.
An information mechanism can be broadly defined as any technique or method (or set of them) that organizes other information resources. Having databases of XML records with Dublin Core metadata3 is such an information mechanism. It has a defined schema, a format of expression and a way to point to the original resources, e.g. using <dc:identifier>. The Linked Open Data approach in DBpedia is another example (Morsey et al., 2012). In this case it is based on the RDF standard,4 and also follows a set of conventions that make it available via dereferenceable URLs. Many different information mechanisms can be devised for the same or different purposes, and metadata research is about how to make these more effective and efficient for particular purposes.
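The three ingredients named above (a schema, a format of expression and a pointer to the original resource) can be illustrated with a minimal Python sketch. The record, its wrapper element and the example.org URL are invented for illustration; only the Dublin Core namespace URI and the element names come from the element set itself.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# A made-up record used only for illustration (hypothetical wrapper and values).
RECORD_XML = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example report</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:identifier>http://example.org/reports/42</dc:identifier>
</record>
"""


def parse_dc_record(xml_text):
    """Return a dict mapping each Dublin Core element name to its list of values."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for element in root.iter():
        # ElementTree writes qualified tags as "{namespace}localname".
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


record = parse_dc_record(RECORD_XML)
# The identifier is the part of the record that "points" to the described resource.
resource_uri = record["identifier"][0]
```

Values are kept as lists because Dublin Core elements are repeatable; a real system would also have to decide how to treat records with several identifiers or none.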
The purposes are the “better access” part of the definition. For example, the Europeana digital library5 is essentially a system built on top of the aggregation of metadata using primarily harvesting mechanisms, starting from the OAI-PMH protocol. Here the “better access” means several things, including (a) homogeneous presentation of cultural resource descriptions, (b) a single point of interaction for a large mass of content and (c) some form of quality control in the ingestion process. In this example, it becomes evident that information mechanisms encompass not only formats, schemas and database technologies but also approaches to quality and organizational issues. In general, they involve a socio-technical system with procedures, technologies, tools and people.
A further characteristic of metadata research that makes it challenging is that the evolution of the field takes place in the context of the social phenomenon of adopting particular schemas and practices. In that direction, the survival and spread of a particular metadata schema arguably depends to a large extent on how readily it can be implemented by the community of practitioners and researchers in that area. As a consequence, there may be metadata schemas for a given purpose that are richer than others, but that are also accepted and used more slowly. This may be attributed to different causes, such as the difficulty of implementation, how hard it is to transition legacy metadata and the degree of openness and transparency of their curators, to name a few. This is a sort of “natural evolution” of schemas and practices that in some cases cannot be directly related to the technical merits of the different approaches. It is related to the social nature of the Web (Berners-Lee et al., 2006).
As a consequence, it is difficult to say whether metadata research is a scientific discipline in itself, with its own theories, assumptions and corpus of commonly accepted knowledge. However, it is clear that metadata research is a field of inquiry that is evolving and growing, and that its concepts and practice are consolidating over the years. It is therefore worth the effort to look at the evolution of metadata research and to attempt to identify its foundations.
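To make the harvesting step more concrete, the following Python sketch builds the request URL that an OAI-PMH harvester would send to a repository. The verb and argument names (ListRecords, metadataPrefix, resumptionToken) and the oai_dc prefix are defined by the protocol; the function name and the example.org base URL are assumptions made for illustration, and nothing here is meant to describe how Europeana itself is implemented.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument in OAI-PMH:
        # it is sent together with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# First request, then a follow-up request for the next batch of records.
first = list_records_url("http://example.org/oai")
following = list_records_url("http://example.org/oai", resumption_token="abc")
```

A harvester would fetch the first URL, ingest the returned records, and keep issuing follow-up requests for as long as the response carries a non-empty resumption token. Quality control, point (c) above, would sit between fetching and ingestion.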
3. Defining metadata
Greenberg (2003) defines metadata as “structured data about an object that supports functions associated with the designated object”. Structure in metadata entails that information is organised systematically, and this is nowadays primarily achieved by the use of metadata schemas. The functions enabled can be diverse, but in many cases they are related to facilitating discovery or search, to restricting access (e.g. in the case of licensing information) or to combining meta-information in order to relate resources described separately.
The main characteristic of metadata is its referential nature, i.e., metadata makes statements about some other thing (possibly even another metadata record). Such an ‘other thing’ can be considered to be ‘anything’ from the broadest perspective, but such a view could hardly be useful for bringing semantics to current information systems such as the Web. We will therefore restrict our discussion to digital resources of diverse kinds. In the scope of the current Web, resources can be unambiguously identified by means of URIs.
For metadata to become an object of scientific inquiry there is a need to make it measurable in its core attributes, beyond measures related to size or availability. Metadata then should be considered to be subject to assessment in several dimensions. They include at least the following:
—Quality
—Richness
—Interoperability
While these three aspects are not completely independent, they look at the problem of having better metadata systems from different angles. Current studies on metadata quality mainly deal with the completeness of metadata records and in some cases with the degree of use of controlled vocabularies. However, there is little research on richness, i.e. the amount of useful information or the possibilities for interlinking that metadata collections or systems offer. The problem of richness should be approached at two levels. At the schema level, there are still no metrics for assessing and comparing metadata schemas according to their expressivity and their capacity to convey more detailed information. At the record level, the problem becomes even more challenging, as the final richness depends on the schema, the completeness of the records and also some other aspects that are in many cases domain-dependent.
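As a rough illustration of the two quality indicators just mentioned, the Python sketch below computes a naive completeness score and the share of values drawn from a controlled vocabulary. Both definitions, the field list and the sample record are assumptions made for this example; published quality studies use more elaborate, often weighted, measures.

```python
def completeness(record, schema_fields):
    """Fraction of schema fields that carry at least one non-empty value."""
    if not schema_fields:
        raise ValueError("schema_fields must not be empty")
    filled = sum(
        1
        for field in schema_fields
        if any(str(value).strip() for value in record.get(field, []))
    )
    return filled / len(schema_fields)


def vocabulary_use(record, field, vocabulary):
    """Fraction of the values of one field that belong to a controlled vocabulary."""
    values = [v for v in record.get(field, []) if str(v).strip()]
    if not values:
        return 0.0
    return sum(1 for v in values if v in vocabulary) / len(values)


# Hypothetical schema fields, vocabulary and record, used only for illustration.
SCHEMA_FIELDS = ["title", "creator", "subject", "identifier"]
SUBJECT_TERMS = {"metadata", "linked data"}
sample = {
    "title": ["An example report"],
    "creator": [""],  # present but empty, so it does not count as filled
    "subject": ["metadata", "miscellaneous"],
    "identifier": ["http://example.org/reports/42"],
}

score = completeness(sample, SCHEMA_FIELDS)
controlled = vocabulary_use(sample, "subject", SUBJECT_TERMS)
```

Neither number says anything about richness: a record can be complete with respect to a very poor schema, which is why the schema level and the record level have to be assessed separately.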
Interoperability should in theory be taken for granted in metadata systems; in practice, however, there are differences. The problem of interoperability obviously starts at the syntactic level. In common, general-purpose metadata schemas, simplicity comes at the cost of reducing possibilities to integrate information. There is a sort of trade-off between using highly generic metadata schemas as Dublin C...