eBook - ePub

Multilayer Corpus Studies

Name: Multilayer Corpus Studies
ISBN: 9781351622134

Amir Zeldes

274 pages
English

About this book

This volume explores the opportunities afforded by the construction and evaluation of multilayer corpora, an emerging methodology within corpus linguistics that brings about multiple independent parallel analyses of the same linguistic phenomena, and how the interplay of these concurrent analyses can help to push the field into new frontiers. The first part of the book surveys the theoretical and methodological underpinnings of multilayer corpus work, including an exploration of various technical and data collection issues. The second part builds on the groundwork of the first half to show multilayer corpora applied to different subfields of linguistic study, including information structure research, referentiality, discourse models, and functional theories of discourse analysis, synthesizing these different discussions in a detailed case study of non-standard language in its concluding chapter. Advancing the multilayer corpus linguistic research paradigm into new and different directions, this volume is an indispensable resource for graduate students and researchers in corpus linguistics, syntax, semantics, construction studies, and cognitive grammar.

Information

Topic

Languages & Linguistics

Subtopic

Linguistics

Part I
The Multilayer Approach

1
Introduction

This introductory chapter offers both a discussion of definitions and underlying concepts in this book and a brief overview of the development of multilayer corpora in the last few decades. Section 1.1 starts out by delineating the topic and scope of the following chapters, and Section 1.2 discusses the notion of annotation layers in corpora, leading up to a discussion of what exactly constitutes a multilayer corpus in Section 1.3. In Section 1.4, early and prominent multilayer corpora are introduced and discussed, and Section 1.5 lays out the structure of the remaining chapters in this book.

1.1. What This Book Is About

This book is about using systematically annotated collections of running language data that contain a large number of different types of information at the same time, discussing methodological issues in building such resources and gaining insight from them, particularly in the areas of discourse structure and referentiality. The main objective of the book is to present the current landscape of multilayer corpus work as a research paradigm and practical framework for theoretical and computational corpus linguistics, with its own principles, advantages and pitfalls. To realize this goal, the following chapters will give a detailed overview of the construction and use of corpora with morphological, syntactic, semantic and pragmatic annotations, as well as a range of more specific annotation types. Although the research presented here will focus mainly on discovering discourse-level connections and interactions between different levels of linguistic description, an attempt will be made to present the multilayer paradigm as a generally applicable tool and way of thinking about corpus data, in a form that is accessible and usable to researchers working in a variety of areas using annotated textual data.

The term ‘multilayer corpora’ and the special properties that distinguish them from other types of corpora require some clarification. Multiple layers of annotation can, in principle, simply mean that a corpus resource contains two or more analyses for the same fragment of data. For example, if each word in a corpus is annotated with its part of speech and dictionary entry (i.e. its lemma), we can already speak of multiple layers. However, part of speech tagging and lemmatization are intricately intertwined in important ways: they both apply to the exact same units (‘word forms’ or, more precisely, tokens, see Chapter 2); determining one often constrains the other (bent as a noun has a different lemma than as a verb); in many cases one can be derived from the other (the lemma materialize is enough to know that the word was a verb, according to most annotation schemes for English); and consequently it makes sense to let one and the same person or program try to determine both at the same time (otherwise, they may conflict, e.g. annotating bent as a noun with the lemma ‘bend’). Multilayer corpora are ones that contain mutually independent forms of information, which cannot be derived from one another reliably and can be created independently for the same text by different people in different times and places, a fact that presents a number of opportunities and pitfalls (see Chapter 3). The discussion of what exactly constitutes a multilayer corpus is postponed until Section 1.3.
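
The interdependence of part of speech tags and lemmas described above can be illustrated with a minimal sketch. The mini-lexicon, the function name `find_conflicts` and the token triples are hypothetical and serve only as an illustration; the tags follow Penn Treebank conventions (NN for a singular noun, VBD and VBN for past tense and past participle verbs). The sketch flags cases in which one layer contradicts the other, such as bent tagged as a noun but lemmatized as ‘bend’.

```python
# Hypothetical mini-lexicon: (word form, POS tag) -> expected lemma.
# Tags follow Penn Treebank conventions; the entries are illustrative only.
LEXICON = {
    ("bent", "NN"): "bent",   # noun reading: the lemma is 'bent'
    ("bent", "VBD"): "bend",  # past tense verb: the lemma is 'bend'
    ("bent", "VBN"): "bend",  # past participle: the lemma is 'bend'
}


def find_conflicts(tokens):
    """Return (form, pos, lemma, expected) for tokens whose lemma
    contradicts their POS tag according to LEXICON."""
    conflicts = []
    for form, pos, lemma in tokens:
        expected = LEXICON.get((form.lower(), pos))
        # Unknown (form, tag) pairs are skipped rather than flagged.
        if expected is not None and expected != lemma:
            conflicts.append((form, pos, lemma, expected))
    return conflicts


# Two annotated tokens as (form, POS, lemma) triples.
annotated = [
    ("bent", "NN", "bend"),   # conflict: noun tag with a verbal lemma
    ("bent", "VBD", "bend"),  # consistent
]

print(find_conflicts(annotated))
```

A check of this kind is only possible because the two layers constrain each other. For mutually independent layers, no comparable lookup exists, which is part of what sets multilayer corpora apart.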

Multilayer corpora bring with them a range of typical, if not always unique, constraints that deserve attention in contemporary corpus and computational linguistics. Management of parallel, independent annotation projects on the same underlying textual data leads to ‘ground truth data’ errors – what happens when disagreements arise? How can projects prepare for the increasingly likely circumstance that open-access data will be annotated further in the future by teams not in close contact with the creators of the corpus? How can the corpus design and creation plan ensure sufficiently detailed guidelines and data models to encode a resource at a reasonable cost and accuracy? How can strategies such as crowdsourcing or outsourcing with minimal training, gamification, student involvement in research and classroom annotation projects be combined into a high-quality, maintainable resource? Management plans for long-term multilayer projects need to consider many aspects that are under far less control than when a corpus is created from start to finish by one team at one time within one project and location.

What about the information that makes these corpora so valuable – what kinds of annotation can be carried out and how? For many individual layers of annotation, even in complex corpora such as syntactically annotated treebanks or corpora with intricate forms of discourse analysis, a good deal of information can be found in contemporary work (see e.g. Kübler and Zinsmeister 2015). There is also by now an established methodology of multifactorial models for the description of language data on many levels (Gries 2003, 2009; Szmrecsanyi 2006; Baayen 2008), usually based on manually or automatically annotated tables derived from less richly annotated corpora for a particular study. However, there is a significant gap in the description of corpus work with resources that contain such multiple layers of analysis for the entirety of running texts: What tools are needed for such resources? How can we acquire and process data for a language of interest efficiently? What are the benefits of a multilayer approach as compared to annotating subsets of data with pertinent features? What can we learn about language that we wouldn’t know by looking at single layers or very narrowly targeted studies of multiple features?

In order to understand what it is that characterizes multilayer corpora as a methodological approach to doing corpus-based linguistics, it is necessary to consider the context in which multilayer corpus studies have developed within linguistics and extract working definitions that result from these developments. The next section therefore gives a brief historical overview of corpus terminology leading up to the development of multilayer approaches, and the following section discusses issues and definitions specific to multilayer corpora to delimit the scope of this book. Section 1.4 offers a brief survey of major contemporary resources, and Section 1.5 lays out the roadmap for the rest of the book.

1.2. Corpora and Annotation Layers

Although a full review of what corpora are and aren’t is beyond the scope of this book, some basic previous definitions and their historical development will be briefly outlined here, with the intention of serving as a background against which to delineate multilayer corpora. In the most general terms, corpora have been defined as “a collection of texts or parts of texts upon which some general linguistic analysis can be conducted” (Meyer 2002: xi). This definition and others like it (see Meyer 2008 for discussion) are framed in functional terms, where the intent to perform linguistic analysis is paramount. More specifically, the idea that specific criteria must be involved in the selection of the texts, making them a purposeful sample of some type of language, is often cited: Sinclair (1991: 171), for example, defines a corpus as a “collection of naturally occurring language text, chosen to characterize a state or variety of a language”. The idea of characterizing or representing a specific language variety as a kind of sample was later echoed in the formulation that Sinclair proposed for the definition advocated by EAGLES (Expert Advisory Group on Language Engineering Standards) in the ‘Preliminary Recommendations on Corpus Typology’, which maintains a status as an international standard. There, a corpus is seen as “a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (see also McEnery et al. 2006: 4–5 for discussion).

In the electronic format, ordering is often flexible, but the initial choice of corpus design is given special prominence: Results based on corpus research will always apply in the first instance to ‘more data of the same kind’ (Zeldes 2016a: 111). What the sample should be representative of has been debated extensively (see Biber 1993; Hunston 2008), but generally it is understood that the research question or purpose envisioned for a corpus will play a pivotal role in deciding its composition (see Hunston 2008; Lüdeling 2012). As we will see in the next section, design considerations such as these require special care in multilayer resources, but remain relevant as in all corpora.

Annotation layers are often one of the main ‘value propositions’ or points of attraction for the study of language using empirical data (Leech 1997). Although we can learn a great deal from ostensibly unannotated text, even just having tokenization, the identification of meaningful basic segments such as words in a corpus (see Section 2.1 in the following chapter), is of immense use, and in fact constitutes a form of analysis, which may be defined as a type of annotation. Formally, we can define corpus annotations in the most general way as follows:

An annotation is a consistent type of analysis, with its own guidelines for the assignment of values in individual cases.

This definition is meant to include forms of tokenization (assigning boundaries consistently based on guidelines, with values such as ‘boundary’ or ‘no boundary’ between characters), metadata (annotating the genre of a text out of a set of possible values) or what is perhaps most often meant, the addition of labels (tags from a tag set, numerical values and more) to some parts of corpus data (tokens, sentences or even higher-level constructs, such as adding grammatical functions to syntactic constituents, which are themselves a type of annotation). The stipulation of consistency in the analysis implies that the same analysis should be assigned to cases which are, as far as the guidelines can distinguish, ‘the same’.
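
The three kinds of annotation named here (tokenization as boundary decisions, document-level metadata, and labels attached to parts of the data) can be sketched in a small data model. This is a minimal sketch under stated assumptions: the names `split_at`, `Span` and `Document`, the layer names, and the example sentence with its tags are hypothetical illustrations, not the data model of any particular corpus tool.

```python
from dataclasses import dataclass, field


def split_at(text, boundaries):
    """Tokenize by cutting text at the given character offsets
    ('boundary' decisions); whitespace-only pieces are dropped."""
    cuts = [0] + sorted(boundaries) + [len(text)]
    pieces = [text[a:b].strip() for a, b in zip(cuts, cuts[1:])]
    return [p for p in pieces if p]


@dataclass
class Span:
    layer: str   # name of the annotation layer, e.g. "pos" or "syntax"
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    value: str   # label assigned according to that layer's guidelines


@dataclass
class Document:
    tokens: list
    metadata: dict = field(default_factory=dict)  # e.g. genre
    spans: list = field(default_factory=list)     # all layers together

    def annotate(self, layer, start, end, value):
        self.spans.append(Span(layer, start, end, value))

    def layer(self, name):
        """Return (covered text, value) pairs for one layer."""
        return [(" ".join(self.tokens[s.start:s.end]), s.value)
                for s in self.spans if s.layer == name]


# Boundary offsets chosen so that "don't" is split as do + n't.
tokens = split_at("I don't know", {1, 4, 7})
doc = Document(tokens, metadata={"genre": "conversation"})

# One label per token on the "pos" layer (Penn Treebank style tags).
for i, tag in enumerate(["PRP", "VBP", "RB", "VB"]):
    doc.annotate("pos", i, i + 1, tag)

# A higher-level label on a different layer, covering several tokens.
doc.annotate("syntax", 1, 4, "VP")

print(doc.tokens)
print(doc.layer("syntax"))
```

Keeping every layer as a list of spans over shared token indices means that layers can be added later without touching existing ones, which is one simple way to model annotations produced independently of each other.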

Some types of annotation layers are very common across corpora, with tag sets being subsequently reused and, ideally, the same guidelines observed. The classic example of this situation is part-of-speech (POS) tagging: Although many languages have a few commonly used tag sets (for English, primarily variants of the Penn Treebank tag set [Santorini 1990] and the CLAWS tag sets; see Garside and Smith 1997), no language has dozens of POS tag sets. Other types of annotations are very specific, with different studies using different schemes depending on a particular research question. For example, a comprehensive annotation scheme for coreference and referentiality which codes, among other things, ambiguity in the reference of anaphora was used in the ARRAU corpus (Poesio and Artstein 2008) but subsequently not widely adopted by other projects (in fact, coreference annotation is a field with particularly diverse guidelines, see Poesio et al. 2016 and Chapter 5). Often to study very specific phenomena, new annotation schemes must be created that cater to a specific research question, and these are regularly combined with more widespread types, resulting in the development of multilayer corpora. As Leech (2005: 20) points out, there is an argument “that the annotations are more useful, the more they are designed to be specific to a parti...

Cover
Title
Copyright
Dedication
Contents
Preface
PART I The Multilayer Approach
PART II Leveraging Multilayer Data
Bibliography
Index
