1.1 Introduction
Corpus linguistics has become popular. Many linguists who would not otherwise consider themselves to be corpus linguists have started to apply corpus linguistics methods to their linguistic problems, in part due to the increasing availability of corpora and tools. In this chapter, we consider some kinds of research that can be done with corpora, and the types of corpora and methods that might yield useful results.1 Corpora are also found outside of linguistics, in social sciences and digital humanities.
In this book, we argue against a simplistic ‘bigger is best’ approach to data analysis and for the centrality of underlying models, theories of what might be happening linguistically ‘behind the scenes’, when we carry out research. More data is an advantage, but there is a trade-off between large corpora with limited annotation and small ones with rich annotation. Our perspective relates theory-rich linguistics with corpus linguistics, implying that we need corpora with rich annotation.
Yet as corpus linguistics has developed as a discipline, the dominant trend has been to build ever larger lexical corpora with very limited annotation: typically structural annotation (speaker turns, overlaps, sentence breaks in spoken data and formatting in writing), wordclass or ‘part-of-speech’ tagging (identifying nouns, verbs, and so on) and lemmas. Crucially, with large ‘mega’ corpora, annotation must be automatically produced without human intervention. The multi-billion-word iWeb corpus built by Mark Davies from 22 million web pages (at the time of writing) is at the frontier of this trend.
Not every linguist is in favour of a methodological ‘turn to corpora’. Some theoretical linguists, including Noam Chomsky, have argued that, at best, collections of language data merely provide researchers with examples of actual external linguistic performance of human beings in a given context (see, e.g., Aarts, 2001). We refer to this type of evidence as ‘factual evidence’ (see Section 1.2). From this perspective, corpora do not provide insight into internal language or how it is produced in the human mind. However, Chomsky’s position raises questions about what data, if any, could be used to evaluate ‘deep’ theories.2
Nevertheless, this contrary position represents a serious challenge to corpus researchers. Is corpus research doomed to investigate surface phenomena? At the end of this chapter, and as a motivation for what follows, we will return to the question of the potential relevance of corpus linguistics for the study of language production by reporting on a recent study.
Indeed, in recent years this ‘turn to corpora’ has begun to influence generative linguists. Take language change: a systematic evaluation of how language has changed over time must rely on data. An old antipathy is being replaced by engagement. Large historical corpora such as the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME, Kroch, Santorini & Delfs, 2004) are inspiring a new generation of linguistics researchers to approach corpora in new and more sophisticated ways. Similarly, it is our contention that corpora can benefit psycholinguistics, not as a substitute for laboratory experiments but as a complementary source of evidence.
What do we mean by ‘a corpus’? In the most general sense, corpora are simply collections of language data processed to make them accessible for research purposes. In contrast to experimental datasets, sampled to answer a specific research question, corpora are sampled in a manner that – as far as possible – permits many different types of research question to be posed. Datasets extracted from corpora are not obtained under controlled conditions but under ‘naturalistic’ or ‘ecological’ ones. We discuss some implications of this statement in Part 2.
Corpora also typically contain substantive passages of text, rather than, say, a series of random sentences produced by random speakers or writers.3
However, the majority of corpora available today have one major drawback for the study of language production. Most data are written. Texts are generated by authors at keyboards, on screens or paper. Writing is rarely spontaneously produced, may be edited by others, and is often included in databases due to availability. Like this book, texts are usually written for an imagined audience, in contrast to spoken utterances that are typically produced – scripted performances and monologues aside – on-the-spot for a present and interacting audience.
In the era of the internet, written data are easy to obtain, so large corpora of writing may be rapidly compiled. But if ‘language’ is sampled from writing (inevitable in historical corpora), we can only draw inferences about written language. Far better to be able to test hypotheses against spontaneously produced linguistic utterances that are unmediated, or, to be more precise, that are minimally affected by processes of articulation and transmission.
Not all corpora are drawn from written sources, and nothing inherent in corpus linguistics limits it to the study of written data. Even if we had no option but to use written sources, this would still be better than relying on intuition.
But a better option is a corpus of spoken data, ideally in the form of recordings aligned with orthographic transcriptions. Transcriptions of this kind should record the output word-for-word, including false starts and self-correction, overlapping speech, speaker turns, and so on. The transcription should be a coded record of the audio stream. Faithfully transcribed speech data from an uncued and unrehearsed context is arguably the closest source to genuinely ‘spontaneous’ naturalistic language output that it is possible to find.
A transcription can be richer than a written text. It may be time-aligned with the original audio or video recording, contain prosodic and meta-linguistic information, gestural signals, and so on. The value of these additional layers of annotation will depend on the research aims of users. Researchers interested in language production and syntax are less concerned whether transcriptions are time-aligned than whether they are accurate. But if pause duration or words per minute is considered a proxy for mental processing, then timing data are essential.
Although we refer to ‘speech’ here, we are really referring to unmediated spontaneously produced language, the majority of which will be speech. For example, we might justifiably include sign language corpora under the category of ‘speech corpora’. It may be attractive to stretch this definition to include conversational text data (e.g., online ‘chat’), but usually, a user interface will allow the language producer to edit utterances as they type. If we wish to study unmediated language production, authentic data from spoken sources seem the best option.
Prioritising speech over writing in linguistics research has other justifications aside from mere spontaneity. The most obvious is historical primacy. Hunter-gatherer societies had an oral tradition long before writing was systematised. When writing developed, it was first limited to scribes, and gradually spread through social development and education. In 1820, around 12% of the world’s population could read and write. Even today that figure is only around 83% (Roser & Ortiz-Ospina, 2018). So the first reason for studying speech is its near-universality. By contrast, historical corpus linguistics – which of necessity can only study written texts prior to the invention of the phonograph – is limited to the language of the literate population of the age, and their region, social class and gender distribution.
There are other important motivations. In child development, children usually express themselves through the spoken word before they master putting words on a page, and many writers are aware that their writing requires a more-or-less internal speech act. Which comes first, speech or writing? The answer is speech.
Then there is the question of representativeness. In a corpus of British English speech, participants produce approximately 2,000 words every quarter of an hour, or some 8,000 words an hour. The author Stephen King (2002) recommends that aspiring writers write 1,000 words a day. Allowing for individual variation – and excepting isolated individuals or those physiologically unable to produce speech – it seems likely that human beings produce, and are exposed to, an order of magnitude more speech than writing.
Of course, not all speech data are the same. Speech data may be collected for a variety of purposes, some of which are more representative and ‘natural’ than others. One of the first treebanks containing spoken data, the Penn Treebank (Marcus, Marcinkiewicz & Santorini, 1993), included parliamentary language, telephone calls and air traffic control data. Other spoken data might be captured in the laboratory: collected in controlled conditions, but unnatural, potentially psychologically stressed and not particularly representative.
Scripting and rehearsal are a feature of many text...