1.1.1 The basic idea of corpus linguistics
Corpus linguistics is essentially a specific way of studying language and languages by systematically investigating how language is used in context. A major concern for corpus linguists is that language use is massively variable. As language users, we are often aware of at least some forms of variation: we know that we use a language differently when we talk or sign to a friend face-to-face, or to our boss or colleagues during a work meeting, or when we write a text message to our partner or an email to a government agency. And we also expect to receive language with variable structure, for example, during a phone conversation or when reading a newspaper article. However, a major focus in corpus linguistics is on those forms of variation that speakers are typically not aware of, for instance, where the form and choice of expressions can be influenced in subtle ways by the structural contexts they occur in. Corpus linguists will ask questions concerning the choice of words or morphosyntactic construction, the reduction of some words (e.g. going to vs. gonna) or other variation in the sound shape of words, and so forth, depending on their context of use. The answers to these questions establish new facts about language and thus further our understanding of how human languages are used.
The corpus linguistic approach, and the kind of insights it yields, contrasts with structuralist approaches that focus entirely on languages as abstract systems of linguistic knowledge. It is thus closely linked with the traditions of functionalist and cognitive linguistics, which have stressed that abstract representation, whatever its exact nature, is more strongly intertwined with usage than was assumed in the structuralist tradition, and that the langue-parole division is much less clear-cut (cf. e.g. Bybee 2006; Diessel 2019). Scholars in other usage-oriented areas of linguistics, in particular anthropological linguistics, sociolinguistics, and psycholinguistics, have stressed the importance of knowledge about language use and variation therein. Corpus linguistics ties in with these latter approaches: it is concerned not just with what expressions exist in a given language and can be produced, but with what specific expressions language users are most likely to produce on any given occasion, like the ones mentioned above. Like other usage-oriented researchers, corpus linguists see language production as the result of a multi-layered decision-making process (cf. Diessel 2019:24–25) during which language users choose between different ways of expressing approximately the same thing. The systematic study of usage data as represented in corpora aims at discovering the rules that govern these decisions and at understanding what ramifications usage patterns have for linguistic systems (e.g. Bybee 2006).
1.1.2 Corpus linguistics in contrast to other approaches
This distinguishes corpus linguistics from a number of other approaches to human languages, for example, those delimiting the range of possible structures through acceptability judgements (as done in much theoretical and descriptive work on grammar), those comparing languages on the basis of grammars (as done in classic typology), or those investigating how language users react during comprehension when exposed to language in experimental setups (as done in psycho- and neurolinguistics), and many more.
The contrast to judgement-based linguistic research is particularly prominent in the literature. The focus of such research is on determining the system of all possible structures in a language, as one would find described in a grammar, for instance. A major concern here is that, contrary to judgements, where user-judges can reject a structure as impossible, corpora can never provide this kind of so-called negative evidence, which calls corpora into question as a reliable empirical basis for grammar writing and other descriptive and analytic statements about abstract representations. Despite their focus on usage and the variation therein, corpus linguists have developed criteria for evaluating the relationship between the coverage of possible structures in a corpus and the range of possible structures in a language system. We will discuss these in Chapter 3. Roughly, we believe that if a structure is not attested in a sufficiently large and varied corpus, it is not possible. Yet, as we will see, we can never be sure what constitutes a sufficiently rich corpus to cover all possible structures. While this leaves real uncertainty, it should nonetheless be noted that alternatives like judgement tasks are not in a much better position: it has by now frequently been pointed out that judgements are by no means generally reliable (Gibson & Fedorenko 2013), and people can reject structures in a judgement elicitation session that they would produce and/or encounter themselves and apparently have no difficulty processing and interpreting. There are two major reasons for this: first, user-judges may often be led by what they assume to be "correct" language as a kind of ideal, which may or may not in fact coincide with a prescriptive standard. Second, the acceptability of a given structure often depends on the specific context, and someone may simply fail to come up with that context during such a session, hence rejecting the structure.
Conversely, it is quite possible that user-judges will accept structures proposed by a researcher-linguist that they would never produce, for example, because they think that the expert must be right or that one should not correct a community outsider. Hence, judgements are not necessarily more reliable than corpus data. Finally, it needs to be stressed that different people in a community may provide divergent judgements (in the same way that they produce different structures), which means that judgements would need to be based on a sample of language users that is as representative as possible. This is typically not done in judgement-based research, as is criticised by Schütze (2016).
A similar type of argument among linguists working on lesser-studied languages relates to the distinction between "corpus data and elicitations". An example is Evans' (2008) criticism that targeted elicitations of specific (often rare) structures are excluded from mainstream documentary linguistics. Evans (2008) stresses the importance of elicited data to capture rare, yet possible structures in a given language (the Australian language Dalabon in his example) that may be unlikely to crop up in more authentic text data (texts that are more common in the community) that would be part of a corpus (his examples are very complex noun phrases). Documenters should, therefore, not only record the verbal behaviour characteristic of a speech community, but also collect elicited data. As pointed out by Himmelmann (2012), however, the reliability of elicitations depends to a large degree on the experience of speakers with the structures in question. Rare structures are for this reason problematic to elicit, since speakers lack routines for producing and interpreting them. Moreover, we have no reason to accept the construct of an ideal language user representing a homogeneous language community, so that, again, targeted elicitations would need to be conducted with a representative sample of speakers to attain some degree of reliability. We should point out here that we agree with Evans' (2008) view that elicitations are a useful source of information, and we likewise reject the view that corpora should only include "real life language use", as McEnery and Wilson (2001) call it in their textbook. Elicitations can form part of corpus data, then, but they are not a quick and easy alternative to other procedures of data collection for filling in the gaps. And like corpus data, elicitations are subject to the same considerations of representativeness and saturation that we will discuss in Chapter 3.
We end this section with an anecdote that underscores the particular value of corpus linguistics even for system-oriented descriptive linguistics. It is reported, again by Nick Evans, in Meakins et al. (2018:13–16). At the request of the community, Evans had been engaged in a Bible translation project in the community of Nen speakers in southern Papua New Guinea, and while he saw this project more as a sideline of his fieldwork on the language in a spirit of "giving back", the Nen Bible text revealed expressions corresponding to so-called "free-selection" pronouns in English, like "anyone", "whoever", etc. These had, despite the Nen speakers' overall profound proficiency in English, been virtually impossible to elicit, but in the Bible texts, they were there, all of a sudden. This shows that language use can reveal structures in specific contexts that linguists may have a hard time imagining, and corpus linguistics also has this kind of explorative, data-driven facet. With regard to targeted elicitations, it underscores how difficult it can be for speakers to imagine usages of some structure out of context, as is the case for out-of-context elicitations and judgements, but that the relevant forms may come up promptly once the relevant context has been brought up. It is in this way that the corpus linguistic approach bears great potential not only for the study of language use but also for the demarcation of possible structures.
1.1.3 Corpus linguistics and usage-oriented linguistics
The core concern of corpus linguistics is with patterns of language use and their variation. Language use involves numerous decision-making processes whereby users choose between alternative ways of expressing the same thing during text production and recipients choose between different ways of interpreting the structures they perceive. The more specific concern of corpus linguistics is to account for these decisions by systematically investigating related variants and the conditions on their choice. For instance, whether a copular verb like is is realised in its full form or appears as a clitic 's is subject to numerous factors, and corpus linguists seek to identify these and relate them to one another in modelling the variation at hand (cf. Barth 2015 for an in-depth study of such reductions in spoken English texts). In other words, what is of particular interest to corpus linguistics is not only the presence or availability of a given structure in a given language but especially the factors that govern its choice in actual language use.
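The logic of relating variants to the conditions on their choice can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the observations are invented toy data rather than corpus findings, the single conditioning factor (the word class of the preceding word) merely stands in for the many factors examined in actual studies, and the function name variant_proportions is hypothetical. The sketch only tabulates how often each variant of the copula occurs in each context; it is not the modelling procedure of Barth (2015) or of any other study cited here.

```python
from collections import Counter, defaultdict

# Invented toy observations (NOT real corpus data): each pair records the
# word class of the word preceding the copula and the variant that was used.
observations = [
    ("pronoun", "clitic"),
    ("pronoun", "clitic"),
    ("pronoun", "clitic"),
    ("pronoun", "full"),
    ("noun", "full"),
    ("noun", "full"),
    ("noun", "full"),
    ("noun", "clitic"),
]


def variant_proportions(obs):
    """Return, for each context, the share of each variant in that context."""
    counts = defaultdict(Counter)
    for context, variant in obs:
        counts[context][variant] += 1
    proportions = {}
    for context, variant_counts in counts.items():
        total = sum(variant_counts.values())
        proportions[context] = {
            variant: n / total for variant, n in variant_counts.items()
        }
    return proportions


if __name__ == "__main__":
    for context, shares in sorted(variant_proportions(observations).items()):
        for variant, share in sorted(shares.items()):
            print(f"{context:8s} {variant:7s} {share:.2f}")
```

In this toy sample the clitic accounts for three out of four cases after a pronoun but only one out of four after a noun. With real corpus data, differences of this kind would be the starting point for relating several factors to one another in a statistical model.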
A major concern with language use is shared by a range of sub-disciplines in linguistics. One of these is sociolinguistics. Sociolinguistics is concerned with the variability of language use and seeks to correlate it with the social features of language users and their interlocutors. For instance, the choice between the two variants of the copula, is and 's, in spoken discourse is related to the preceding and the following words, speech rate, and other aspects of discourse context. But it is also influenced by demographic characteristics of speakers and their audiences, the social and physical setting, and other general aspects of the communicative situation. We will turn to the role of corpus linguistics in sociolinguistics in Chapter 9.
Other areas of linguistics where details of language use are of central concern are psycho- and neurolinguistics. These fields are interested in how language is processed, for example, how language users encode and decode discourse and what structures pose particular problems, reflected in processing delays. For the most part, these fields of linguistics target processing during perception and deploy various methods of measuring aspects of processing, for example, neurological EEG measures of processing delays (N400, P600) (cf. Brown & Hagoort 1993; Gouvea et al. 2010 inter alia). However, there are also strands, in particular in recent years, that pay attention to discourse production. Corpus linguistic approaches are relevant here: in one type of production-oriented research, discourse production is examined within a controlled experiment, with stimuli intended to control for various aspects of processing. From a corpus linguistic perspective, these will simply be one set among the many factors that influence the choice of structures during discourse production. In more recent work (Barth 2019a; Bell et al. 2009; Jaeger 2010; Jurafsky 2003; McDonald & Shillcock 2003; Seyfarth 2014), even free text production is investigated from a psycholinguistic perspective. A general idea here is that more frequent structures in similar contexts ...