This introductory chapter offers both a discussion of definitions and underlying concepts in this book and a brief overview of the development of multilayer corpora in the last few decades. Section 1.1 starts out by delineating the topic and scope of the following chapters, and Section 1.2 discusses the notion of annotation layers in corpora, leading up to a discussion of what exactly constitutes a multilayer corpus in Section 1.3. In Section 1.4, early and prominent multilayer corpora are introduced and discussed, and Section 1.5 lays out the structure of the remaining chapters in this book.
1.1. What This Book Is About
This book is about using systematically annotated collections of running language data that contain a large number of different types of information at the same time, discussing methodological issues in building such resources and gaining insight from them, particularly in the areas of discourse structure and referentiality. The main objective of the book is to present the current landscape of multilayer corpus work as a research paradigm and practical framework for theoretical and computational corpus linguistics, with its own principles, advantages and pitfalls. To realize this goal, the following chapters will give a detailed overview of the construction and use of corpora with morphological, syntactic, semantic and pragmatic annotations, as well as a range of more specific annotation types. Although the research presented here will focus mainly on discovering discourse-level connections and interactions between different levels of linguistic description, an attempt will be made to present the multilayer paradigm as a generally applicable tool and way of thinking about corpus data that is accessible and usable to researchers working in a variety of areas using annotated textual data.
The term "multilayer corpora" and the special properties that distinguish them from other types of corpora require some clarification. Multiple layers of annotation can, in principle, simply mean that a corpus resource contains two or more analyses for the same fragment of data. For example, if each word in a corpus is annotated with its part of speech and dictionary entry (i.e. its lemma), we can already speak of multiple layers. However, part of speech tagging and lemmatization are intricately intertwined in important ways: they both apply to the exact same units ("word forms" or, more precisely, tokens, see Chapter 2); determining one often constrains the other (bent as a noun has a different lemma than as a verb); in many cases one can be derived from the other (the lemma materialize is enough to know that the word was a verb, according to most annotation schemes for English); and consequently it makes sense to let one and the same person or program try to determine both at the same time (otherwise, they may conflict, e.g. annotating bent as a noun with the lemma "bend"). Multilayer corpora are ones that contain mutually independent forms of information, which cannot be derived from one another reliably and can be created independently for the same text by different people in different times and places, a fact that presents a number of opportunities and pitfalls (see Chapter 3). The discussion of what exactly constitutes a multilayer corpus is postponed until Section 1.3.
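The dependency between part of speech and lemma described above can be made concrete with a minimal sketch. The lexicon below is a hypothetical toy table invented for illustration (it is not drawn from any real resource), and the tags are Penn Treebank-style labels assumed for the example. The check flags exactly the kind of conflict mentioned above, in which bent is tagged as a noun but given the lemma "bend".

```python
# Hypothetical toy lexicon (an assumption for illustration only):
# maps (word form, POS tag) to the lemma expected under that tag.
LEXICON = {
    ("bent", "NN"): "bent",    # noun reading: "a bent for music"
    ("bent", "VBD"): "bend",   # verb reading, past tense
    ("bent", "VBN"): "bend",   # verb reading, past participle
    ("materialized", "VBD"): "materialize",
}


def find_conflicts(tokens):
    """Return tokens whose lemma contradicts their POS tag.

    Each token is a (form, pos, lemma) triple. Tokens that are not
    covered by the toy lexicon are skipped rather than flagged.
    """
    conflicts = []
    for index, (form, pos, lemma) in enumerate(tokens):
        expected = LEXICON.get((form.lower(), pos))
        if expected is not None and expected != lemma:
            conflicts.append((index, form, pos, lemma, expected))
    return conflicts


# Two annotations of the same word form: the first pairs the noun tag
# with the verb's lemma, which the check reports as inconsistent.
tokens = [("bent", "NN", "bend"), ("bent", "VBD", "bend")]
conflicts = find_conflicts(tokens)
for index, form, pos, lemma, expected in conflicts:
    print(f"token {index}: {form}/{pos} has lemma {lemma!r}, expected {expected!r}")
```

A check of this kind is only possible because the two analyses constrain each other; for the mutually independent layers that characterize multilayer corpora, no such lookup table can be assumed.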
Multilayer corpora bring with them a range of typical, if not always unique, constraints that deserve attention in contemporary corpus and computational linguistics. Management of parallel, independent annotation projects on the same underlying textual data leads to "ground truth data" errors: what happens when disagreements arise? How can projects prepare for the increasingly likely circumstance that open-access data will be annotated further in the future by teams not in close contact with the creators of the corpus? How can the corpus design and creation plan ensure sufficiently detailed guidelines and data models to encode a resource at a reasonable cost and accuracy? How can strategies such as crowdsourcing or outsourcing with minimal training, gamification, student involvement in research and classroom annotation projects be combined into a high-quality, maintainable resource? Management plans for long-term multilayer projects need to consider many aspects that are under far less control than when a corpus is created from start to finish by one team at one time within one project and location.
What about the information that makes these corpora so valuable: what kinds of annotation can be carried out and how? For many individual layers of annotation, even in complex corpora such as syntactically annotated treebanks or corpora with intricate forms of discourse analysis, a good deal of information can be found in contemporary work (see e.g. Kübler and Zinsmeister 2015). There is also by now an established methodology of multifactorial models for the description of language data on many levels (Gries 2003, 2009; Szmrecsanyi 2006; Baayen 2008), usually based on manually or automatically annotated tables derived from less richly annotated corpora for a particular study. However, there is a significant gap in the description of corpus work with resources that contain such multiple layers of analysis for the entirety of running texts: What tools are needed for such resources? How can we acquire and process data for a language of interest efficiently? What are the benefits of a multilayer approach as compared to annotating subsets of data with pertinent features? What can we learn about language that we wouldn't know by looking at single layers or very narrowly targeted studies of multiple features?
In order to understand what characterizes multilayer corpora as a methodological approach to doing corpus-based linguistics, it is necessary to consider the context in which multilayer corpus studies have developed within linguistics and extract working definitions that result from these developments. The next section therefore gives a brief historical overview of corpus terminology leading up to the development of multilayer approaches, and the following section discusses issues and definitions specific to multilayer corpora to delimit the scope of this book. Section 1.4 offers a brief survey of major contemporary resources, and Section 1.5 lays out the roadmap for the rest of the book.
1.2. Corpora and Annotation Layers
Although a full review of what corpora are and aren't is beyond the scope of this book, some basic previous definitions and their historical development will be briefly outlined here, with the intention of serving as a background against which to delineate multilayer corpora. In the most general terms, corpora have been defined as "a collection of texts or parts of texts upon which some general linguistic analysis can be conducted" (Meyer 2002: xi). This definition and others like it (see Meyer 2008 for discussion) are framed in functional terms, where the intent to perform linguistic analysis is paramount. More specifically, the idea that specific criteria must be involved in the selection of the texts, making them a purposeful sample of some type of language, is often cited: Sinclair (1991: 171), for example, defines a corpus as a "collection of naturally occurring language text, chosen to characterize a state or variety of a language". The idea of characterizing or representing a specific language variety as a kind of sample was later echoed in the formulation that Sinclair proposed for the definition advocated by EAGLES (Expert Advisory Group on Language Engineering Standards) in the "Preliminary Recommendations on Corpus Typology", which maintains a status as an international standard. There, a corpus is seen as "a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language" (see also McEnery et al. 2006: 4–5 for discussion).
In the electronic format, ordering is often flexible, but the initial choice of corpus design is given special prominence: Results based on corpus research will always apply in the first instance to "more data of the same kind" (Zeldes 2016a: 111). What the sample should be representative of has been debated extensively (see Biber 1993; Hunston 2008), but generally it is understood that the research question or purpose envisioned for a corpus will play a pivotal role in deciding its composition (see Hunston 2008; Lüdeling 2012). As we will see in the next section, design considerations such as these require special care in multilayer resources, but remain relevant as in all corpora.
Annotation layers are often one of the main "value propositions" or points of attraction for the study of language using empirical data (Leech 1997). Although we can learn a great deal from ostensibly unannotated text, even just having tokenization, the identification of meaningful basic segments such as words in a corpus (see Section 2.1 in the following chapter), is of immense use, and in fact constitutes a form of analysis, which may be defined as a type of annotation. Formally, we can define corpus annotations in the most general way as follows:
An annotation is a consistent type of analysis, with its own guidelines for the assignment of values in individual cases.
This definition is meant to include forms of tokenization (assigning boundaries consistently based on guidelines, with values such as "boundary" or "no boundary" between characters), metadata (annotating the genre of a text out of a set of possible values) or what is perhaps most often meant, the addition of labels (tags from a tag set, numerical values and more) to some parts of corpus data (tokens, sentences or even higher-level constructs, such as adding grammatical functions to syntactic constituents, which are themselves a type of annotation). The stipulation of consistency in the analysis implies that the same analysis should be assigned to cases which are, as far as the guidelines can distinguish, "the same".
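One way to picture this definition is as a simple data model in which tokenization, metadata and labels are all represented in the same way: a named layer assigns a value to a span of the underlying text. The following is a minimal sketch under stated assumptions, not a description of any existing corpus format. The class names, the whitespace-based tokenization guideline, the genre value and the example sentence are all hypothetical; the part-of-speech values are Penn Treebank-style tags.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Annotation:
    layer: str   # name of the analysis type, e.g. "token", "pos", "genre"
    start: int   # character offset where the annotated span begins (inclusive)
    end: int     # character offset where the annotated span ends (exclusive)
    value: str   # value assigned according to the layer's guidelines


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, layer: str, start: int, end: int, value: str) -> None:
        self.annotations.append(Annotation(layer, start, end, value))

    def layer(self, name: str) -> List[Annotation]:
        """Return all annotations of one layer, in insertion order."""
        return [a for a in self.annotations if a.layer == name]

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.start:ann.end]


def tokenize_on_whitespace(doc: Document) -> None:
    """Toy guideline: a token is a maximal run of non-space characters."""
    start = None
    # A trailing space is appended so that the final token is closed too.
    for i, ch in enumerate(doc.text + " "):
        if ch.isspace():
            if start is not None:
                doc.annotate("token", start, i, "token")
                start = None
        elif start is None:
            start = i


doc = Document("She bent the rules")
tokenize_on_whitespace(doc)                          # tokenization as annotation
doc.annotate("genre", 0, len(doc.text), "fiction")   # metadata on the whole text
for tok, tag in zip(doc.layer("token"), ["PRP", "VBD", "DT", "NNS"]):
    doc.annotate("pos", tok.start, tok.end, tag)     # labels on existing units

for ann in doc.layer("pos"):
    print(doc.covered_text(ann), ann.value)
```

Because every layer refers only to offsets in the text, layers in this sketch can be added independently of one another, while a layer such as part of speech may still reuse the spans established by another layer, here the tokens.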
Some types of annotation layers are very common across corpora, with tag sets being subsequently reused and, ideally, the same guidelines observed. The classic example of this situation is part-of-speech (POS) tagging: Although many languages have a few commonly used tag sets (for English, primarily variants of the Penn Treebank tag set [Santorini 1990] and the CLAWS tag sets; see Garside and Smith 1997), no language has dozens of POS tag sets. Other types of annotations are very specific, with different studies using different schemes depending on a particular research question. For example, a comprehensive annotation scheme for coreference and referentiality which codes, among other things, ambiguity in the reference of anaphora was used in the ARRAU corpus (Poesio and Artstein 2008) but subsequently not widely adopted by other projects (in fact, coreference annotation is a field with particularly diverse guidelines, see Poesio et al. 2016 and Chapter 5). Often to study very specific phenomena, new annotation schemes must be created that cater to a specific research question, and these are regularly combined with more widespread types, resulting in the development of multilayer corpora. As Leech (2005: 20) points out, there is an argument "that the annotations are more useful, the more they are designed to be specific to a parti...