What’s This Book About?
In this book, my objective is to describe and justify the decisions made with regard to each major step in the construction of the spoken component of the new ‘British National Corpus 2014’ (the Spoken BNC2014, Love, Dembry, Hardie, Brezina, & McEnery, 2017), linking each step to the aims of its compilers for the wide utility of the corpus in linguistic research and English language teaching. Rarely are such detailed accounts of the story of the construction of a corpus – providing the ‘why’ as well as the ‘what’ – published, and the methodological discussions presented in this book are intended to help the Spoken BNC2014 to be as useful to as many people, and for as many purposes, as possible. I aim to reflect critically on the major considerations of corpus construction, including design, data collection, metadata, transcription and processing, using the Spoken BNC2014 as a detailed and transparent case study of how the challenges encountered at each step may be overcome.
The original British National Corpus (Leech, 1993) – which I call the BNC1994 for clarity – is one of the most widely known and used corpora of all. It contains approximately 90 million words of writing and ten million words of transcribed speech. When I was an undergraduate at Lancaster University, it was introduced in my very first lecture on corpus linguistics, and discussions since with many academics in the field have suggested that this is common for students at many other institutions, too. There are many reasons for the broad and long-lasting utility of the BNC1994, which I will discuss in subsequent chapters, but suffice it to say for now that its relative size (especially that of the ten-million-word spoken component) and low-cost availability have allowed it to be used as the ‘go-to’ corpus for the study of British English for nearly three decades. While impressive, the fact that the corpus linguistics community has used the BNC1994 as a proxy for ‘present-day’ British English for so long is a problem for current and future research that needed to be addressed with increasing urgency.
Since 2014, the ESRC-funded Centre for Corpus Approaches to Social Science (CASS) at Lancaster University has been building a new British National Corpus (the BNC2014) with the aim of creating an updated ‘go-to’ dataset for the study of British English, as well as a platform for comparison with the original corpus. Like its predecessor, the BNC2014 is divided into two separate components (written and spoken), which have been compiled simultaneously by separate teams of researchers. At the time of writing, the written component is due to be released in 2019 (Hawtin, 2019; Brezina et al., 2019).
This book focuses on the spoken component, which Lancaster compiled in collaboration with the Language Research Team at Cambridge University Press (CUP). In 2017, it was completed and released publicly on Lancaster University’s CQPweb server (Hardie, 2012), with the underlying XML files downloadable the following year. The Spoken BNC2014 is a new, publicly-accessible corpus of present-day spoken ‘English English’, gathered in informal contexts. This is the first publicly-accessible corpus of its kind since the spoken component of the BNC1994, which, as mentioned, was still being used as a proxy for present-day English in research long after its release (e.g. Hadikin, 2014; Rühlemann & Gries, 2015). The Spoken BNC2014 contains 11,422,617 words of transcribed content, featuring 668 speakers in 1,251 recordings, which were produced in the years 2012 to 2016.
The collaboration between Lancaster and CUP to build the Spoken BNC2014 came about after some years of both centres working individually on the idea of compiling a new corpus of spoken English which could match up to the Spoken BNC1994. Claire Dembry at CUP had collected two million words of new spoken data for the Cambridge English Corpus in 2012, trialling the public participation method which was used, along with the data itself, in the Spoken BNC2014. Meanwhile, in 2013, Tony McEnery and Andrew Hardie at Lancaster launched the Centre for Corpus Approaches to Social Science; one of its proposed projects was the compilation of a new BNC. Early in 2014, both Lancaster and CUP agreed, upon learning of each other’s work, to pool resources and work together to build what was originally named the ‘Lancaster/Cambridge Corpus of Speech’ (LCCS). Within a few months and with the blessing of Martin Wynne at the University of Oxford, this was renamed the Spoken BNC2014.
The Spoken BNC2014 project was led by Tony McEnery (Lancaster) and Claire Dembry (CUP). Apart from me, the other members of the main research team were Andrew Hardie and Vaclav Brezina (Lancaster), and Laura Grimes and Olivia Goodman (CUP). My role in the project was to contribute to every stage of the development of the corpus and to document the entire process so that users of the corpus could be well-informed about the important methodological choices we had made. The evidence of my documentation, which I submitted to Lancaster in completion of my PhD, is this book. Thus, in addition to my own work, this book accounts for decisions made and work completed in collaboration with a team of researchers of which I was a member. In this book, I use singular and plural pronouns systematically: first person singular pronouns are used when discussing work which was conducted solely by me, while first person plural pronouns and third person reference to ‘the Spoken BNC2014 research team’ etc. are used when reporting on decisions I made with the research team.
The aims of the Spoken BNC2014 project were:
- (1) to compile a corpus of informal British English conversation from the 2010s which is comparable to the Spoken BNC1994’s demographic component;
- (2) to compile the corpus in a manner which reflects, as much as possible, the state of the art with regard to methodological approach; and, in achieving steps (1) and (2),
- (3) to provide a fresh data source for (a) a new series of wide-ranging studies in linguistics and the social sciences and (b) English language teaching materials development.
This book focuses on goals (1), (2) and (3a). While goals (1) and (2) were shared by Lancaster and CUP, goals (3a) and (3b) represent the differing research objectives of Lancaster and CUP respectively. As such, this book is written primarily with the interests of linguistic research at the fore, as opposed to language pedagogy. Other outputs (e.g. Goodman & Love, 2019; Curry, Love, & Goodman, in prep) discuss the use of the Spoken BNC2014 for ELT materials development.
The aim of this book is to present a thorough account of the design and construction of the Spoken BNC2014, making clear the most important decisions the research team made as we collected, transcribed and processed the data, as well as to evaluate the representativeness and research potential of the corpus. The underlying theme of this book is the maximisation of the efficiency of spoken corpus creation in view of practical constraints, with a focus on theoretical principles of design as well as data and metadata collection, transcription and processing. As is not unusual in corpus construction, compromises had to be made throughout the compilation of this corpus, and I endeavour to discuss these transparently. Furthermore, this book describes the innovative aspects of the Spoken BNC2014 project – notably including the use of PPSR (public participation in scientific research) (Shirk et al., 2012) and the introduction of new speaker metadata categorisation schemes, among others.
II During Corpus Construction: Theory Meets Practice
In Part II of the book, I discuss each major stage of the construction of the Spoken BNC2014. Chapter 4 covers prominent aspects of spoken corpus data collection including recruitment, metadata and audio data. A major theme of this chapter is the extent to which the Spoken BNC1994 and other relevant corpora had been compiled using a principled as opposed to opportunistic approach, and our decision to embrace opportunism (supplemented by targeted interventions) in the compilation of the Spoken BNC2014.
In Chapter 5, the first of two chapters about transcription, I discuss the development of a bespoke transcription scheme for the Spoken BNC2014. I justify the rejection of automated transcription before describing how the Spoken BNC2014 transcription scheme elaborates and improves upon that of its predecessor.
Chapter 6 goes on to investigate the accuracy with which transcribers were able to assign speaker ID codes to the utterances transcribed in the Spoken BNC2014 – i.e. ‘speaker identification’. Its aim is to draw attention to the difficulty of this task for recordings which contain several speakers, and to propose ways in which users can avoid having potentially inaccurately assigned speaker ID codes affect their research.
In Chapter 7, I discuss the final stages of the compilation of the Spoken BNC2014, describing the conversion of transcripts into XML; the annotation of the corpus texts for part-of-speech, lemma and semantic category; and the public dissemination of the corpus.
References
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257. doi:10.1093/llc/8.4.243
Brezina, V., Dayrell, C., Gillings, M., Hawtin, A., McEnery, T., Gomide, A. R., … van Dorst, I. (2019). Building and analysing a large general corpus: Exploring the written BNC2014 with #LancsBox. Workshop presented at the international corpus linguistics conference 2019. Cardiff University, UK. Retrieved July 2019, from www.cl2019.org/wp-content/uploads/2019/02/Workshop-3-Brezina-et-al.pdf
Curry, N., Love, R., & Goodman, O. (in prep). Keeping up with language change: Using the spoken BNC2014 in ELT materials development. International Journal of Corpus Linguistics.
Goodman, O., & Love, R. (2019, April). 1000 hours of conversations: What does it mean for ELT? 53rd annual IATEFL conference & exhibition. Liverpool, UK.
Hadikin, G. (2014). A, an and the environments in spoken Korean English. Corpora, 9(1), 1–28. doi:10.3366/cor.2014.0049
Hardie, A. (2012). CQPweb: Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409. doi:10.1075/ijcl.17.3.04har
Hawtin, A. (2019). The written British national corpus 2014: Design, compilation and analysis (Unpublished doctoral thesis). Lancaster University.
Leech, G. (1993). 100 million words of English. English Today, 9–15. doi:10.1017/S0266078400006854
Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319–344. doi:10.1075/ijcl.22.3.02lov
Rühlemann, C., & Gries, S. (2015). Turn order and turn distribution in multi-party storytelling. Journal of Pragmatics, 87, 171–191. doi:10.1016/j.pragma.2015.08.003
Shirk, J. L., Ballard, H. L., Wilderman, C. C., Phillips, T., Wiggins, A., Jordan, R., … Bonney, R. (2012). Public participation in scientific research: A framework for deliberate design. Ecology and Society, 17(2), 29. doi:10.5751/ES-04705-170229