Part I
Problems and Practices
1 Introduction
This chapter briefly introduces the reader to corpus linguistics by answering two basic questions and explaining related concepts. The questions addressed are:
- What is a corpus?
- What is corpus linguistics?
What is a Corpus?
A corpus is a collection of texts that has been compiled for a particular reason. In other words, a corpus is not just any collection of texts, assembled without regard to the types of texts collected or, if a variety of text types (i.e., genres) is included, to the relative weightings assigned to each text type. A corpus, then, is a collection of texts based on a set of design criteria, one of which is that the corpus aims to be representative. These design criteria are discussed in detail in Chapter 4, and so here we examine some of the wider issues that have to be thought about and decided upon when building a corpus. In this book, we are interested in how corpus linguists use a corpus, or more than one corpus (i.e., ‘corpora’), in their research. This is not to say that only corpus linguists have corpora, or that only corpus linguists use corpora in their research. Corpora have been around for a long time, but in the past they could only be searched manually, and so the fact that corpora are now machine-readable has had a tremendous impact on the field.
Corpora are becoming ever larger thanks to the ready availability of electronic texts and more powerful computing resources. For example, the Corpus of Contemporary American English (COCA) contains 410 million words (see http://corpus.byu.edu/coca/) and the British National Corpus (BNC) over 100 million words (see www.natcorp.ox.ac.uk/ or http://corpus.byu.edu/bnc/). Corpora are usually studied by means of computers, although some corpora are designed to allow users to also access individual texts for more qualitative analyses. It would be impossible to search today’s large corpora manually, and so the development of fast and reliable corpus linguistic software has gone hand in hand with the growth in corpora. The software can do many things, such as generate word and phrase frequency lists, identify words that tend to be selected with each other, such as brother + sister and black + white (termed ‘collocates’), and provide a variety of statistical functions that assist the user in interpreting the results of searches. You do not have to compile your own corpus. A number of corpora are available online, or commercially, with built-in software and user-friendly instructions.
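To make the idea of frequency lists and collocates concrete, the following is a minimal sketch in Python rather than a description of how any particular corpus package works. The two-sentence sample, the function name `collocates` and the span of two words either side of the search word are all assumptions made for illustration; real software works with far larger corpora and usually adds statistical measures of association.

```python
from collections import Counter

# Hypothetical miniature "corpus", already split into words.
tokens = ("my brother and sister wore black and white . "
          "her sister and brother like black and white films").split()

# A word frequency list: each word with its number of occurrences.
frequency_list = Counter(tokens).most_common()

def collocates(node, words, span=2):
    """Count the words found within `span` words either side of `node`."""
    counts = Counter()
    for i, word in enumerate(words):
        if word == node:
            left = words[max(0, i - span):i]
            right = words[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

print(frequency_list[:3])
print(collocates("brother", tokens).most_common(3))
print(collocates("black", tokens)["white"])
```

In this sample, sister turns up near both occurrences of brother, and white near both occurrences of black, which is the kind of pattern a collocate search is designed to reveal.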
Corpus linguists are researchers who derive their theories of language from, or base their theories of language on, corpus studies. As a result, one basic consideration when collecting spoken or written texts for a corpus is whether or not the texts should be naturally occurring. Most corpus linguists are only interested in corpora containing texts that have been spoken or written in real-world contexts. This, therefore, excludes contrived or fabricated texts, and texts spoken or written under experimental conditions. The reason for this preference is that corpus linguists want to describe language use and/or propose language theories that are grounded in actual language use. They see no benefit in examining invented texts or texts that have been manipulated by the researcher. Another consideration when collecting texts for a corpus is whether only complete texts should be included or whether it is acceptable to include parts of texts. This can become an issue if, for example, the corpus compiler wants each text to be of equal length, which almost certainly means that some texts in the corpus are incomplete. Some argue that there are advantages in having texts of the same size when comparing them, while others argue that cutting texts to fit a size requirement impairs their authenticity and possibly removes important elements, such as how a particular text type ends. The consensus, therefore, is to try to collect naturally occurring texts in their entirety. Another reason for carefully planning what goes into a corpus is to maintain a detailed record of each text and its context of use: when it happened, what kind of text it is, who the participants are, what the communicative purposes are and so on. This information is then available to users of the corpus, and is very useful in helping to interpret and explain the findings.
There are many different kinds of corpora. Some attempt to be representative of a language as a whole and are termed ‘general corpora’ or ‘reference corpora’, while others attempt to represent a particular kind of language use and are termed ‘specialised corpora’. For example, the 100 million-word British National Corpus (BNC, see http://corpus.byu.edu/bnc/) contains a wide range of texts which the compilers took to be representative of British English generally, whereas the Michigan Corpus of Academic Spoken English (MICASE, see http://micase.elicorpora.info/) is a specialised corpus representing a particular register (spoken academic English) that can also be searched based on more specific text types (genres) such as lectures or seminars. The latter corpus is also special in the sense that it consists only of spoken language. Spoken language is generally massively underrepresented in corpora, which is a particular problem for those corpora that aim to represent general language use. The logistics and costs of collecting and transcribing naturally occurring spoken data are the reasons for this, whereas the sheer ease and convenience of collecting electronic written texts has led to the compilation of numerous written corpora. This imbalance needs to be borne in mind by users of corpora because what one finds in spoken and written corpora may differ in all kinds of ways.
Corpora are typically described in terms of the number of words that they contain, and this raises another set of considerations because of the basic question: what is a word? When you count the number of words you have typed on your computer, the count is based not on words as such, but on the number of spaces in the text, and this is also how some corpus linguistic software packages arrive at the number of words in a corpus. However, what about something such as haven’t? Should this be counted as one word or two (have + n’t)? Or what about PC (as in ‘personal computer’)? Is this one word, two words or something else? All of these issues, of course, have to be resolved and made clear to the users of the corpus. The words in a corpus are often further categorised into ‘types’ and ‘tokens’. The former comprise all of the unique words in a corpus, with repetitions of the same word excluded, and the latter are made up of all the words in a corpus, including all repetitions.
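The effect of these decisions on a word count can be illustrated with a short Python sketch. The sample sentence, the function name `split_clitics` and the rule of separating n't from its host word are assumptions chosen for illustration only; they are one possible set of choices, not the method used by any particular software package.

```python
import re

# Hypothetical sample sentence.
sample = "I haven't seen the PC, but the PC hasn't moved."

# Choice 1: a "word" is whatever stands between spaces.
# Punctuation stays attached, so "PC," and "PC" count as different words.
whitespace_tokens = sample.split()

def split_clitics(text):
    """Choice 2: lowercase, drop punctuation, and treat n't as its own word."""
    lowered = text.lower()
    separated = re.sub(r"n't\b", " n't", lowered)
    return re.findall(r"n't|[a-z]+", separated)

clitic_tokens = split_clitics(sample)

print(len(whitespace_tokens))   # tokens under choice 1: 10
print(len(clitic_tokens))       # tokens under choice 2: 12
print(len(set(clitic_tokens)))  # types under choice 2: 9
```

The same sentence therefore contains 10 or 12 tokens depending on the definition of a word, which is why the definition adopted needs to be stated for the users of a corpus.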
The ‘type’ category raises yet another issue. What constitutes a type? Take, for example, do, does, doing and did. Each of these words shares the same ‘lemma’ (i.e., they are all derived from the same root form: DO), but should they be counted as four different words (i.e., four ‘types’) in a word frequency list, or as one word based on the lemma and not listed separately? Most corpus linguistic software lists them as separate types. Similarly, if you search for one of these four words, do you want the search to include all the other forms as well? Some software packages allow the user to choose. Again, these are things to think about for corpus compilers, corpus linguistic software writers and corpus users. Counting words, categorising words and searching for words in a corpus all raise issues that corpus linguists have to address. An option for corpus compilers is to add information to the corpus, such as identifying clauses or word classes (e.g., nouns and verbs), by means of annotation (i.e., the insertion of additional information into a corpus), which enables the corpus linguistic software to find particular language features.
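The difference between counting by word form and counting by lemma, and between the two kinds of search, can be sketched in Python as follows. The lemma table `LEMMA_OF`, the sample sentence and the `search` function with its `by_lemma` option are hypothetical and cover only the forms of DO discussed here; actual software relies on much fuller lemma lists or automatic lemmatisation.

```python
from collections import Counter

# Hypothetical lemma table covering only the forms of DO.
LEMMA_OF = {"do": "DO", "does": "DO", "doing": "DO", "did": "DO"}

# Hypothetical sample text, already split into words.
tokens = "she did what he does and they do what we did".split()

# Frequency list by word form: do, does and did are separate types.
form_counts = Counter(tokens)

# Frequency list by lemma: all forms of DO are counted together.
lemma_counts = Counter(LEMMA_OF.get(tok, tok) for tok in tokens)

def search(word, by_lemma=False):
    """Return the positions of `word`, or of every form sharing its lemma."""
    if not by_lemma:
        return [i for i, tok in enumerate(tokens) if tok == word]
    target = LEMMA_OF.get(word, word)
    return [i for i, tok in enumerate(tokens)
            if LEMMA_OF.get(tok, tok) == target]

print(form_counts["did"], lemma_counts["DO"])  # 2 4
print(search("do"))                            # [7]
print(search("do", by_lemma=True))             # [1, 4, 7, 10]
```

A search for do by form finds a single instance, whereas the lemma-based search also retrieves does and did, which is the choice that some software packages leave to the user.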
To summarise, a corpus is a collection of texts that has been compiled to represent a particular use of a language and it is made accessible by means of corpus linguistic software that allows the user to search for a variety of language features. The role of corpora means that corpus linguistics is evidence-based and computer-mediated. While not unique to corpus linguistics, these attributes are central to this field of study. Corpus linguistics is concerned not just with describing patterns of form, but also with how form and meaning are inseparable, and this notion is returned to throughout this book. The centrality of corpora-derived evidence is perhaps best encapsulated in the phrase ‘trust the text’ (see, for example, Sinclair 2004), which underscores the empirical nature of this field of language study.
What is Corpus Linguistics?
Corpus linguists compile and investigate corpora, and so corpus linguistics is the compilation and analysis of corpora. This all seems reasonably straightforward, but not everyone engaged in corpus linguistics would agree on whether corpus linguistics is a methodology for enhancing research into linguistic disciplines such as lexicography, lexicology, grammar, discourse and pragmatics, or whether it is more than that and is, in effect, a discipline in its own right. This debate is explored later in this book, and is covered elsewhere by, for example, Tognini-Bonelli (2001) and McEnery et al. (2006). The distinction is not unimportant because, as we shall see, the position one takes is likely to influence the approach adopted in a corpus linguistic study. Simply put, those who see corpus linguistics as a methodology (e.g., McEnery et al., 2006, 7–11) use what is termed the ‘corpus-based approach’ whereby they use corpus linguistics to test existing theories or frameworks against evidence in the corpus. Those who view corpus linguistics as a discipline (e.g., Tognini-Bonelli, 2001; Biber, 2009) use the corpus as the starting point for developing theories about language, and they describe their approach as ‘corpus-driven’. These approaches and their differences are examined in detail later in this book. For now, it is sufficient to understand that there is not one shared view of exactly what corpus linguistics is and what its aims are. In other words, even though the two main groupings both compile and investigate corpora, they adopt very different approaches in their studies because one sees corpus linguistics as a tool and the other as a theory of language. The author, it should be noted, subscribes to the latter view, and this will be foregrounded as the book unfolds.
As mentioned above, the fact that corpora are machine-readable opens up the possibility for users to search them for a mu...