Overcoming Challenges in Corpus Construction

The Spoken British National Corpus 2014

  1. 202 pages
  2. English

About this book

This volume offers a critical examination of the construction of the Spoken British National Corpus 2014 (Spoken BNC2014) and points the way forward toward a more informed understanding of corpus linguistic methodology more broadly. The book begins by situating the creation of this second corpus, a compilation of new, publicly-accessible Spoken British English from the 2010s, within the context of the first, created in 1994, talking through the need to balance backward compatibility and optimal practice for today's users. Chapters subsequently use the Spoken BNC2014 as a focal point around which to discuss the various considerations taken into account in corpus construction, including design, data collection, transcription, and annotation. The volume concludes by reflecting on the successes and limitations of the project, as well as the broader utility of the corpus in linguistic research, both in current examples and future possibilities. This exciting new contribution to the literature on linguistic methodology is a valuable resource for students and researchers in corpus linguistics, applied linguistics, and English language teaching.

Overcoming Challenges in Corpus Construction by Robbie Love is available in PDF and/or ePUB format.

1 Introduction

What’s This Book About?

In this book, my objective is to describe and justify the decisions made with regards to each major step in the construction of the spoken component of the new ‘British National Corpus 2014’ (the Spoken BNC2014, Love, Dembry, Hardie, Brezina, & McEnery, 2017), linking each step to the aims of its compilers for the wide utility of the corpus in linguistic research and English language teaching. Rarely are such detailed accounts of the story of the construction of a corpus – providing the ‘why’ as well as the ‘what’ – published, and the methodological discussions presented in this book are intended to help the Spoken BNC2014 to be as useful to as many people, and for as many purposes, as possible. I aim to reflect critically on the major considerations of corpus construction, including design, data collection, metadata, transcription and processing, using the Spoken BNC2014 as a detailed and transparent case study about how the challenges encountered at each step may be overcome.
The original British National Corpus (Leech, 1993) – which I call the BNC1994 for clarity – is one of the most widely known and used corpora of all. It contains approximately 90 million words of writing and ten million words of transcribed speech. When I was an undergraduate at Lancaster University, it was introduced in my very first lecture on corpus linguistics, and discussions since with many academics in the field have suggested that this is common for students at many other institutions, too. There are many reasons for the broad and long-lasting utility of the BNC1994, which I will discuss in subsequent chapters, but suffice it to say for now that its relative size (especially that of the ten-million-word spoken component) and low-cost availability have allowed it to be used as the ‘go-to’ corpus for the study of British English for nearly three decades. While impressive, the fact that the corpus linguistics community has used the BNC1994 as a proxy for ‘present-day’ British English for so long is a problem for current and future research that needed to be addressed with increasing urgency.
Since 2014, the ESRC-funded Centre for Corpus Approaches to Social Science (CASS) at Lancaster University has been building a new British National Corpus (the BNC2014) with the aim of creating an updated ‘go-to’ dataset for the study of British English, as well as a platform for comparison with the original corpus. Like its predecessor, the BNC2014 is divided into two separate components (written and spoken), which have been compiled simultaneously by separate teams of researchers. At the time of writing, the written component is due to be released in 2019 (Hawtin, 2019; Brezina et al., 2019).
This book focuses on the spoken component, which Lancaster compiled in collaboration with the Language Research Team at Cambridge University Press (CUP). In 2017, it was completed and released publicly on Lancaster University’s CQPweb server (Hardie, 2012), with the underlying XML files downloadable the following year. The Spoken BNC2014 is a new, publicly-accessible corpus of present-day spoken ‘English English’, gathered in informal contexts. This is the first publicly-accessible corpus of its kind since the spoken component of the BNC1994, which, as mentioned, was still being used as a proxy for present-day English in research long after its release (e.g. Hadikin, 2014; Rühlemann & Gries, 2015). The Spoken BNC2014 contains 11,422,617 words of transcribed content, featuring 668 speakers in 1,251 recordings, which were produced in the years 2012 to 2016.
The collaboration between Lancaster and CUP to build the Spoken BNC2014 came about after some years of both centres working individually on the idea of compiling a new corpus of spoken English which could match up to the Spoken BNC1994. Claire Dembry at CUP had collected two million words of new spoken data for the Cambridge English Corpus (see Note 1) in 2012, trialling the public participation method which was used, along with the data itself, in the Spoken BNC2014. Meanwhile, in 2013, Tony McEnery and Andrew Hardie at Lancaster launched the Centre for Corpus Approaches to Social Science; one of its proposed projects was the compilation of a new BNC. Early in 2014, both Lancaster and CUP agreed, upon learning of each other’s work, to pool resources and work together to build what was originally named the ‘Lancaster/Cambridge Corpus of Speech’ (LCCS). Within a few months and with the blessing of Martin Wynne at the University of Oxford, this was renamed the Spoken BNC2014.
The Spoken BNC2014 project was led by Tony McEnery (Lancaster) and Claire Dembry (CUP). Apart from me, the other members of the main research team were Andrew Hardie and Vaclav Brezina (Lancaster), and Laura Grimes and Olivia Goodman (CUP). My role in the project was to contribute to every stage of the development of the corpus and to document the entire process so that users of the corpus could be well-informed about the important methodological choices we had made. The evidence of my documentation, which I submitted to Lancaster in completion of my PhD, is this book. Thus, in addition to my own work, this book accounts for decisions made and work completed in collaboration with a team of researchers of which I was a member. In this book, I use singular and plural pronouns systematically: first person singular pronouns are used when discussing work which was conducted solely by me, while first person plural pronouns and third person reference to ‘the Spoken BNC2014 research team’ etc. are used when reporting on decisions I made with the research team.
The aims of the Spoken BNC2014 project were:
  1. to compile a corpus of informal British English conversation from the 2010s which is comparable to the Spoken BNC1994’s demographic component;
  2. to compile the corpus in a manner which reflects, as much as possible, the state of the art with regards to methodological approach; and, in achieving aims (1) and (2),
  3. to provide a fresh data source for (a) a new series of wide-ranging studies in linguistics and the social sciences and (b) English language teaching materials development.
This book focuses on goals (1), (2) and (3a). While goals (1) and (2) were shared by Lancaster and CUP, goals (3a) and (3b) represent the differing research objectives of Lancaster and CUP respectively. As such, this book is written primarily with the interests of linguistic research at the fore, as opposed to language pedagogy. Other outputs (e.g. Goodman & Love, 2019; Curry, Love, & Goodman, in prep) discuss the use of the Spoken BNC2014 for ELT materials development.
The aim of this book is to present a thorough account of the design and construction of the Spoken BNC2014, making clear the most important decisions the research team made as we collected, transcribed and processed the data, as well as to evaluate the representativeness and research potential of the corpus. The underlying theme of this book is the maximisation of the efficiency of spoken corpus creation in view of practical constraints, with a focus on theoretical principles of design as well as data and metadata collection, transcription and processing. As is not unusual in corpus construction, compromises had to be made throughout the compilation of this corpus, and I endeavour to discuss these transparently. Furthermore, this book describes the innovative aspects of the Spoken BNC2014 project – notably including the use of PPSR (public participation in scientific research) (Shirk et al., 2012) and the introduction of new speaker metadata categorisation schemes, among others.

Overview of the Contents

I Before Corpus Construction: Theory and Design

In Part I of the book, I start by contextualising the situation that has arisen whereby the collection of a second Spoken British National Corpus is necessary (Chapter 2). I introduce the Spoken British National Corpus 1994 and discuss its uses in the field of linguistics, reviewing a broad range of publications in the field. I also present the case for compiling a second edition now.
In Chapter 3, I introduce the concept of representativeness in corpus design, drawing upon Biber (1993), and discuss the major theoretical considerations necessary when planning to construct a national spoken corpus like the Spoken BNC2014. I discuss how previous spoken corpora have been designed and evaluate the representativeness of the Spoken BNC1994, before justifying the decision to sample from only one register – informal conversation – for the Spoken BNC2014. I go on to describe the design of the corpus.

II During Corpus Construction: Theory Meets Practice

In Part II of the book, I discuss each major stage of the construction of the Spoken BNC2014. Chapter 4 covers prominent aspects of spoken corpus data collection including recruitment, metadata and audio data. A major theme of this chapter is the extent to which the Spoken BNC1994 and other relevant corpora had been compiled using a principled as opposed to opportunistic approach, and our decision to embrace opportunism (supplemented by targeted interventions) in the compilation of the Spoken BNC2014.
In Chapter 5, the first of two chapters about transcription, I discuss the development of a bespoke transcription scheme for the Spoken BNC2014. I justify the rejection of automated transcription before describing how the Spoken BNC2014 transcription scheme elaborates and improves upon that of its predecessor.
Chapter 6 goes on to investigate the accuracy with which transcribers were able to assign speaker ID codes to the utterances transcribed in the Spoken BNC2014 – i.e. ‘speaker identification’. Its aim is to draw attention to the difficulty of this task for recordings which contain several speakers, and to propose ways in which users can avoid having potentially inaccurately assigned speaker ID codes affect their research.
In Chapter 7, I discuss the final stages of the compilation of the Spoken BNC2014, describing the conversion of transcripts into XML; the annotation of the corpus texts for part-of-speech, lemma and semantic category; and the public dissemination of the corpus.

III After Corpus Construction: Evaluating the Corpus

To begin Part III of the book, I review the ‘finished product’: the completed Spoken BNC2014 corpus (Chapter 8). I describe how various speaker and text metadata categories are populated, and evaluate the representativeness of the corpus. I build the argument that the corpus best represents informal spoken English, produced by L1 speakers of British English, in England, in the mid-2010s. I make clear that the corpus should not be considered representative of (a) English as spoken across the whole of the UK or (b) any type of spoken register beyond informal conversation.
Finally, Chapter 9 summarises the book and discusses the major successes and limitations of my work on the project, before suggesting future work that could extend the research capability of the corpus. I provide evidence of the usefulness of the corpus by citing several early examples of research that have successfully used the Spoken BNC2014 to push forward knowledge in the field of linguistics.

Note

1. Accessible at: www.cambridge.org/us/cambridgeenglish/better-learning/deeper-insights/linguistics-pedagogy/cambridge-english-corpus (last accessed September 2017).

References

Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257. doi:10.1093/llc/8.4.243
Brezina, V., Dayrell, C., Gillings, M., Hawtin, A., McEnery, T., Gomide, A. R., … van Dorst, I. (2019). Building and analysing a large general corpus: Exploring the written BNC2014 with #LancsBox. Workshop presented at the international corpus linguistics conference 2019. Cardiff University, UK. Retrieved July 2019, from www.cl2019.org/wp-content/uploads/2019/02/Workshop-3-Brezina-et-al.pdf
Curry, N., Love, R., & Goodman, O. (in prep). Keeping up with language change: Using the spoken BNC2014 in ELT materials development. International Journal of Corpus Linguistics.
Goodman, O., & Love, R. (2019, April). 1000 hours of conversations: What does it mean for ELT? 53rd annual IATEFL conference & exhibition. Liverpool, UK.
Hadikin, G. (2014). A, an and the environments in spoken Korean English. Corpora, 9(1), 1–28. doi:10.3366/cor.2014.0049
Hardie, A. (2012). CQPweb: Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409. doi:10.1075/ijcl.17.3.04har
Hawtin, A. (2019). The written British national corpus 2014: Design, compilation and analysis (Unpublished doctoral thesis). Lancaster University.
Leech, G. (1993). 100 million words of English. English Today, 9(1), 9–15. doi:10.1017/S0266078400006854
Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319–344. doi:10.1075/ijcl.22.3.02lov
Rühlemann, C., & Gries, S. (2015). Turn order and turn distribution in multi-party storytelling. Journal of Pragmatics, 87, 171–191. doi:10.1016/j.pragma.2015.08.003
Shirk, J. L., Ballard, H. L., Wilderman, C. C., Phillips, T., Wiggins, A., Jordan, R., … Bonney, R. (2012). Public participation in scientific research: A framework for deliberate design. Ecology and Society, 17(2), 29. doi:10.5751/ES-04705-170229

Table of contents

  1. Cover
  2. Half Title
  3. Series
  4. Title
  5. Copyright
  6. Dedication
  7. Contents
  8. List of Tables
  9. List of Figures
  10. Foreword
  11. Preface
  12. Acknowledgments
  13. 1 Introduction
  14. Part I Before Corpus Construction: Theory and Design
  15. Part II During Corpus Construction: Theory Meets Practice
  16. Part III After Corpus Construction: Evaluating the Corpus
  17. Index