Exploring Corpus Linguistics

Language in Action

  1. 246 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Routledge Introductions to Applied Linguistics consists of introductory-level textbooks covering the core topics in Applied Linguistics, designed for those entering postgraduate studies and language professionals returning to academic study. The books take an innovative 'practice to theory' approach, with a 'back to front' structure that starts from real-life problems and issues in the field and then moves into a discussion of intervention and how to engage with these concerns. The final section concludes by tying the practical issues to theoretical foundations. Additional features include tasks with commentaries, a glossary of key terms, and an annotated further reading section.

Corpus linguistics is a key area of applied linguistics and one of the most rapidly developing. Winnie Cheng's practical approach guides readers in acquiring the relevant knowledge and theories to enable the analysis, explanation and interpretation of language using corpus methods.

Throughout the book, practical classroom examples, concordance-based analyses and tasks such as designing and conducting mini-projects are used to connect and explain the conceptual and practical aspects of corpus linguistics.

Exploring Corpus Linguistics is an essential textbook for post-graduate/graduate students new to the field and for advanced undergraduates studying English Language and Applied Linguistics.

Part I
Problems and Practices
1 Introduction
This chapter briefly introduces the reader to corpus linguistics by answering two basic questions and explaining related concepts. The questions addressed are:
  • What is a corpus?
  • What is corpus linguistics?
What is a Corpus?
A corpus is a collection of texts that has been compiled for a particular reason. In other words, a corpus is not just any collection of texts, assembled without regard to the types of texts collected or, if a variety of text types (i.e., genres) is included, to the relative weighting assigned to each text type. A corpus, then, is a collection of texts based on a set of design criteria, one of which is that the corpus aims to be representative. These design criteria are discussed in detail in Chapter 4, and so here we examine some of the wider issues that have to be thought about and decided upon when building a corpus. In this book, we are interested in how corpus linguists use a corpus, or more than one corpus (i.e., ‘corpora’), in their research. This is not to say that only corpus linguists have corpora, or that only corpus linguists use corpora in their research. Corpora have been around for a long time, but in the past they could only be searched manually, and so the fact that corpora are now machine-readable has had a tremendous impact on the field.
Corpora are becoming ever larger thanks to the ready availability of electronic texts and more powerful computing resources. For example, the Corpus of Contemporary American English (COCA) contains 410 million words (see http://corpus.byu.edu/coca/) and the British National Corpus (BNC) over 100 million words (see www.natcorp.ox.ac.uk/ or http://corpus.byu.edu/bnc/). Corpora are usually studied by means of computers, although some corpora are designed to allow users to also access individual texts for more qualitative analyses. It would be impossible to search today’s large corpora manually, and so the development of fast and reliable corpus linguistic software has gone hand in hand with the growth in corpora. The software can do many things, such as generate word and phrase frequency lists, identify words that tend to be selected with each other such as brother + sister and black + white (termed ‘collocates’), and provide a variety of statistical functions that assist the user in interpreting the results of searches. You do not have to compile your own corpus. A number of corpora are available online, or commercially, with built-in software and user-friendly instructions.
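To make the idea of frequency lists and collocates concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any particular corpus package works: the function names, the sample sentences and the four-word span are all invented for this example, and the collocate count is a raw co-occurrence tally that ignores sentence boundaries and applies none of the statistical measures real software offers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract simple alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def collocates(tokens, node, span=4):
    """Count words found within `span` tokens either side of `node`.

    `span` is an assumed window size; sentence boundaries are ignored.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Invented sample text, standing in for a real corpus.
sample = (
    "My brother and sister live nearby. Her brother met my sister. "
    "The film was black and white, not simply black or white."
)
tokens = tokenize(sample)

for word, count in frequency_list(tokens)[:5]:
    print(word, count)

print(collocates(tokens, "brother").most_common(3))
print(collocates(tokens, "black").most_common(3))
```

On a corpus of realistic size, the raw counts would normally be compared against each word's overall frequency before a pairing is treated as a collocation.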
Corpus linguists are researchers who derive their theories of language from, or base their theories of language on, corpus studies. As a result, one basic consideration when collecting spoken or written texts for a corpus is whether or not the texts should be naturally occurring. Most corpus linguists are only interested in corpora containing texts that have been spoken or written in real-world contexts. This, therefore, excludes contrived or fabricated texts, and texts spoken or written under experimental conditions. The reason for this preference is that corpus linguists want to describe language use and/or propose language theories that are grounded in actual language use. They see no benefit in examining invented texts or texts that have been manipulated by the researcher. Another consideration when collecting texts for a corpus is whether only complete texts should be included or if it is acceptable to include parts of texts. This can become an issue if, for example, the corpus compiler wants each text to be of equal length, which almost certainly means that some texts in the corpus are incomplete. Some argue that texts are easier to compare when they are all the same size, while others argue that cutting texts to fit a size requirement impairs their authenticity and possibly removes important elements, such as how a particular text type ends. The consensus, therefore, is to try to collect naturally occurring texts in their entirety. Another reason for carefully planning what goes into a corpus is to maintain a detailed record of each text and its context of use: when it happened, what kind of text it is, who the participants are, what the communicative purposes are and so on. This information is then available to users of the corpus, and is very useful in helping to interpret and explain the findings.
There are many different kinds of corpora. Some attempt to be representative of a language as a whole and are termed ‘general corpora’ or ‘reference corpora’, while others attempt to represent a particular kind of language use and are termed ‘specialised corpora’. For example, the 100 million-word British National Corpus (BNC, see http://corpus.byu.edu/bnc/) contains a wide range of texts which the compilers took to be representative of British English generally, whereas the Michigan Corpus of Academic Spoken English (MICASE, see http://micase.elicorpora.info/) is a specialised corpus representing a particular register (spoken academic English) that can also be searched based on more specific text types (genres) such as lectures or seminars. The latter corpus is also special in the sense that it consists only of spoken language. Spoken language is generally massively underrepresented in corpora, which is a particular problem for corpora that aim to represent general language use. The logistics and costs of collecting and transcribing naturally occurring spoken data are the reasons for this, whereas the sheer ease and convenience of collecting electronic written texts has led to the compilation of numerous written corpora. This imbalance needs to be borne in mind by users of corpora because what one finds in spoken and written corpora may differ in all kinds of ways.
Corpora are typically described in terms of the number of words that they contain, and this raises another set of considerations because of the basic question: what is a word? When you count the number of words you have typed on your computer, the count is based not on words as such but on the number of spaces in the text, and this is also how some corpus linguistic software packages arrive at the number of words in a corpus. However, what about something such as haven’t? Should this be counted as one word or two (have + n’t)? Or what about PC (as in ‘personal computer’)? Is this one word, two words or something else? All of these issues, of course, have to be resolved and made clear to the users of the corpus. The words in a corpus are often further categorised into ‘types’ and ‘tokens’. The former comprise all of the unique word types in a corpus, excluding repetitions of the same word, and the latter are made up of all the words in a corpus, including all repetitions.
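The effect of these decisions on word counts can be shown with a small Python sketch. Everything in it is an assumption made for illustration: the sample sentence is invented, the splitting rule handles only the n’t contraction, and lowercasing the text (so that The and the count as one type) is itself a design decision that a corpus compiler would need to document.

```python
import re

# Matches a word ending in n't (straight or curly apostrophe).
CONTRACTION = re.compile(r"^(.+)(n['’]t)$", re.IGNORECASE)


def tokenize(text, split_contractions=False):
    """Split on whitespace, strip edge punctuation, lowercase.

    If split_contractions is True, haven't becomes have + n't.
    """
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"()").lower()
        if not word:
            continue
        match = CONTRACTION.match(word)
        if split_contractions and match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(word)
    return tokens


def type_token_counts(tokens):
    """Return (number of types, number of tokens)."""
    return len(set(tokens)), len(tokens)


# Invented sample sentence.
sample = "I haven't seen the PC. The PC is new, and the old PC is gone."

as_one_word = tokenize(sample)
as_two_words = tokenize(sample, split_contractions=True)

print(type_token_counts(as_one_word))
print(type_token_counts(as_two_words))
```

The same text yields different type and token totals depending on a single tokenisation choice, which is why such choices need to be made explicit to corpus users.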
The ‘type’ category raises yet another issue. What constitutes a type? Consider, for example, do, does, doing and did. Each of these words shares the same ‘lemma’ (i.e., they are all derived from the same root form: DO), but should they be counted as four different words (i.e., four ‘types’) in a word frequency list, or as one word based on the lemma and not listed separately? Most corpus linguistic software lists them as separate types. Similarly, if you search for one of these four words, do you want the search to include all the other forms as well? Some software packages allow the user to choose. Again, these are things to think about for corpus compilers, corpus linguistic software writers and corpus users. Counting words, categorising words and searching for words in a corpus all raise issues that corpus linguists have to address. An option for corpus compilers is to annotate the corpus (i.e., to insert additional information into it), for example by identifying clauses or word classes such as nouns and verbs, which enables the corpus linguistic software to find particular language features.
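The difference between counting by word form and counting by lemma can be sketched in a few lines of Python. The lookup table below is a hand-made, hypothetical stand-in for a real lemmatiser and covers only the four forms of DO mentioned above; the sample sentence is likewise invented.

```python
from collections import Counter

# Hypothetical lemma table; a real lemmatiser would cover the whole lexicon.
LEMMAS = {"do": "DO", "does": "DO", "doing": "DO", "did": "DO"}


def form_frequencies(tokens):
    """Count each distinct word form as its own type."""
    return Counter(tokens)


def lemma_frequencies(tokens, lemma_table):
    """Count by lemma, leaving words that are not in the table unchanged."""
    return Counter(lemma_table.get(tok, tok) for tok in tokens)


def search_lemma(tokens, lemma, lemma_table):
    """Return the positions of every form that belongs to `lemma`."""
    return [i for i, tok in enumerate(tokens)
            if lemma_table.get(tok, tok) == lemma]


tokens = "she did what they do and he does what she is doing".split()

by_form = form_frequencies(tokens)
by_lemma = lemma_frequencies(tokens, LEMMAS)

print(by_form["did"], by_form["do"], by_form["does"], by_form["doing"])
print(by_lemma["DO"])
print(search_lemma(tokens, "DO", LEMMAS))
```

Counted by form, the four words appear as four separate types with one occurrence each; counted by lemma, they collapse into a single entry, and a lemma-based search retrieves all four positions.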
To summarise, a corpus is a collection of texts that has been compiled to represent a particular use of a language and it is made accessible by means of corpus linguistic software that allows the user to search for a variety of language features. The role of corpora means that corpus linguistics is evidence-based and computer-mediated. While not unique to corpus linguistics, these attributes are central to this field of study. Corpus linguistics is concerned not just with describing patterns of form, but also with how form and meaning are inseparable, and this notion is returned to throughout this book. The centrality of corpora-derived evidence is perhaps best encapsulated in the phrase ‘trust the text’ (see, for example, Sinclair 2004), which underscores the empirical nature of this field of language study.
What is Corpus Linguistics?
Corpus linguists compile and investigate corpora, and so corpus linguistics is the compilation and analysis of corpora. This all seems reasonably straightforward, but not everyone engaged in corpus linguistics would agree on whether corpus linguistics is a methodology for enhancing research into linguistic disciplines such as lexicography, lexicology, grammar, discourse and pragmatics, or whether it is more than that and is, in effect, a discipline in its own right. This debate is explored later in this book, and is covered elsewhere by, for example, Tognini-Bonelli (2001) and McEnery et al. (2006). The distinction is not unimportant because, as we shall see, the position one takes is likely to influence the approach adopted in a corpus linguistic study. Simply put, those who see corpus linguistics as a methodology (e.g., McEnery et al., 2006, 7–11) use what is termed the ‘corpus-based approach’ whereby they use corpus linguistics to test existing theories or frameworks against evidence in the corpus. Those who view corpus linguistics as a discipline (e.g., Tognini-Bonelli, 2001; Biber, 2009) use the corpus as the starting point for developing theories about language, and they describe their approach as ‘corpus-driven’. These approaches and their differences are examined in detail later in this book. For now, it is sufficient to understand that there is not one shared view of exactly what corpus linguistics is and what its aims are. In other words, even though the two main groupings both compile and investigate corpora, they adopt very different approaches in their studies because one sees corpus linguistics as a tool and the other as a theory of language. The author, it should be noted, subscribes to the latter view, and this will be foregrounded as the book unfolds.
As mentioned above, the fact that corpora are machine-readable opens up the possibility for users to search them for a mu...

Table of contents

  1. Cover
  2. Half title
  3. Title page
  4. Copyright
  5. Contents
  6. Series editors’ introduction
  7. Acknowledgement
  8. Part I: Problems and Practices
  9. Part II: Interventions
  10. Part III: Approaches to and Models of Corpus Linguistic Studies
  11. Commentary on selected tasks
  12. Glossary
  13. Further reading
  14. References
  15. Index