An Introduction to Corpus Linguistics
About this book

The use of large, computerized bodies of text for linguistic analysis and description has emerged in recent years as one of the most significant and rapidly developing fields of activity in the study of language. This book provides a comprehensive introduction and guide to corpus linguistics. All aspects of the field are explored, from the various types of electronic corpora that are available to instructions on how to design and compile a corpus. Graeme Kennedy surveys the development of corpora for use in linguistic research, looking back to the pre-electronic age as well as to the massive growth of computer corpora in the electronic age.

CHAPTER ONE
Introduction
In the language sciences a corpus is a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description. Over the last three decades the compilation and analysis of corpora stored in computerized databases has led to a new scholarly enterprise known as corpus linguistics. The purpose of this book is to introduce the various activities which come within the scope of corpus linguistics, and to set current work within its historical context. It brings together some of the findings of corpus-based studies of English, the language which has so far received the most attention from corpus linguists, and shows how quantitative analysis can contribute to linguistic description. It is hoped that, by concentrating in particular on some of the results of corpus analysis, the book will whet the appetites of the growing body of teachers and students with access to corpora to discover more for themselves about how languages work in all their variety. The book is intended primarily for those who are already familiar with general linguistic concepts but who want to know more of what can be done with a corpus and why corpus linguistics may be relevant in research on language. Corpus linguistics is not an end in itself but is one source of evidence for improving descriptions of the structure and use of languages, and for various applications, including the processing of natural language by machine and understanding how to learn or teach a language.
The main focus of this book is on four major areas of activity in corpus linguistics:
• corpus design and development (Chapter 2)
• corpus-based descriptions of aspects of English structure and use (Chapter 3)
• the particular techniques and tools used in corpus analysis (Chapter 4)
• applications of corpus-based linguistic description (Chapter 5)
Readers may choose to work through the book in the above order or to begin with the sections dealing with corpus-based descriptions of English (Chapter 3) in order first to become more familiar with some of the results of corpus analysis. In focusing on the contribution of corpus linguistics to the description of English and on some of the central issues and problems which are being addressed within corpus linguistics, the book also attempts to bring together disparate work which is often hard to get hold of. However, such is the speed of development and change in corpus linguistics at the present time that anyone writing about it must be conscious that it would be easy to produce a Ptolemaic picture of the field – with the world distorted and with Terra Australis Incognita, the Great Southern Continent, both misconceived and misplaced. Work relevant for corpus linguistics is being done in many fields, including computer science and artificial intelligence, as well as in various branches of descriptive and applied linguistics. It would not be surprising if some of the scholars contributing to corpus linguistics from these and other perspectives found that their work is inadequately represented here. However, they can be assured that such neglect is not intended.
Because corpus linguistics is a field where activity is increasing very rapidly and where there is as yet no magisterial perspective, even the very notion of what constitutes a valid corpus can still be controversial. It also needs to be understood at the outset that not every use of computers with bodies of text is part of corpus linguistics. For example, the aim of Project Gutenberg to distribute 10,000 texts to 100 million computer users by the year 2001 is not in itself part of corpus linguistics although texts included in this ambitious project may conceivably provide textual data for corpus analysis. Similarly, contemporary reviews of computing in the humanities show the enormous extent of corpus-based work in literary studies. While some of the methodology used in literary studies resembles some of the activity being undertaken in corpus linguistics, research on authorial attribution or thematic structure, for example, does not come within the scope of this book. Nor does the book attempt to cover systematically the wide range of corpus-based work being undertaken in computational linguistics in such areas of natural language processing as speech recognition and machine translation.
Although there have been spectacular advances in the development and use of electronic corpora, the essential nature of text-based linguistic studies has not necessarily changed as much as is sometimes suggested. In this book, reference is made to corpus studies which were undertaken manually before computers were available. Corpus linguistics did not begin with the development of computers but there is no doubt that computers have given corpus linguistics a huge boost by reducing much of the drudgery of text-based linguistic description and vastly increasing the size of the databases used for analysis. It should be made clear, however, that corpus linguistics is not a mindless process of automatic language description. Linguists use corpora to answer questions and solve problems. Some of the most revealing insights on language and language use have come from a blend of manual and computer analysis. It is now possible for researchers with access to a personal computer and off-the-shelf software to do linguistic analysis using a corpus, and to discover facts about a language which have never been noticed or written about previously. The most important skill is not to be able to program a computer or even to manipulate available software (which, in any case, is increasingly user-friendly). Rather, it is to be able to ask insightful questions which address real issues and problems in theoretical, descriptive and applied language studies. Many of the key problems and challenges in corpus linguistics are associated with the following questions:
• How can we best exploit the opportunities which arise from having texts stored in machine-retrievable form?
• What linguistic theories will best help structure corpus-based research?
• What linguistic phenomena should we look for?
• What applications can make use of the insights and improved descriptions of languages which come out of this research?
In answering these and other questions, corpus linguistics has the potential to provide solutions and new directions for some of the major issues and problems in the study of human communication.

1.1 Corpora

The definition of a corpus as a collection of texts in an electronic database can beg many questions, for there are many different kinds of corpora. Some dictionary definitions suggest that corpora necessarily consist of structured collections of text specifically compiled for linguistic analysis, that they are large, or that they attempt to be representative of a language as a whole. This is not necessarily so. Not all corpora which can be used for linguistic research were originally compiled for that purpose. Nor, historically, have corpora necessarily been stored electronically in machine-readable form, although this is nowadays the norm. As discussed in Section 2.2, electronic corpora can consist of whole texts or collections of whole texts. They can consist of continuous text samples taken from whole texts; they can even be made up of collections of citations. At one extreme an electronic dictionary may serve as a kind of corpus for certain types of linguistic research, while at the other extreme a huge unstructured archive of texts may be used for similar purposes by corpus linguists.
Corpora have been compiled for many different purposes, which in turn influence the design, size and nature of the individual corpus. Some current corpora intended for linguistic research have been designed for general descriptive purposes – that is, they have been designed so that they can be examined or trawled to answer questions at various linguistic levels on the prosody, lexis, grammar, discourse patterns or pragmatics of the language. Other corpora have been designed for specialized purposes such as discovering which words and word meanings should be included in a learners’ dictionary; which words or meanings are most frequently used by workers in the oil industry or economics; or what differences there are between uses of a language in different geographical, social, historical or work-related contexts.
A distinction is sometimes made between a corpus and a text archive or text database. Whereas a corpus designed for linguistic analysis is normally a systematic, planned and structured compilation of text, an archive is a text repository, often huge and opportunistically collected, and normally not structured. It is generally the case, as Leech (1991: 11) suggested, that ‘the difference between an archive and a corpus must be that the latter is designed or required for a particular ā€œrepresentativeā€ function’. It is nevertheless not always easy to see unequivocally what a corpus is representing, in terms of language variety.
Databases which are made up not of samples, but which constitute an entire population of data, may consist of a single book (e.g. George Eliot’s Middlemarch) or of a number of works. These corpora may be the work of a single author (e.g. the complete works of Jane Austen) or of several authors (e.g. medieval lyrics), or all the editions of a particular newspaper in a given year. Some projects have assembled all the known available texts in a particular genre or from a particular historical period. Some of these databases or text archives described in Section 2.4 are very large indeed, and although they have rarely yet been used as corpora for linguistic research, there is no reason why they should not be in the future. In many respects it is thus the use to which the body of textual material is put, rather than its design features, which defines what a corpus is.
A corpus constitutes an empirical basis not only for identifying the elements and structural patterns which make up the systems we use in a language, but also for mapping out our use of these systems. A corpus can be analysed and compared with other corpora or parts of corpora to study variation. Most importantly, it can be analysed distributionally to show how often particular phonological, lexical, grammatical, discoursal or pragmatic features occur, and also where they occur.
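To make the idea of distributional analysis concrete, the following sketch (not from Kennedy's text) shows, in Python, one way such a count might be carried out: it compares the frequency of a single word form in two hypothetical sub-corpus files, normalizing raw counts to occurrences per million words so that samples of different sizes can be compared. The file names and the choice of word form are illustrative assumptions only.

```python
import re
from pathlib import Path

def tokenize(text):
    """Very rough tokenizer: lower-cased alphabetic word forms only."""
    return re.findall(r"[a-z]+", text.lower())

def per_million(count, total):
    """Normalize a raw count to occurrences per million words."""
    return (count / total) * 1_000_000 if total else 0.0

def word_frequency(path, target):
    """Return (raw count of target, total tokens) for one sub-corpus file."""
    tokens = tokenize(Path(path).read_text(encoding="utf-8"))
    return tokens.count(target), len(tokens)

if __name__ == "__main__":
    # Hypothetical sub-corpora: one of written text, one of transcribed speech.
    for label, path in [("written", "written_sample.txt"),
                        ("spoken", "spoken_sample.txt")]:
        raw, total = word_frequency(path, "whom")
        print(f"{label}: {raw} occurrences of 'whom' in {total} words "
              f"({per_million(raw, total):.1f} per million)")
```

Normalized figures of this kind are what allow a feature's distribution to be compared across corpora, or across parts of a corpus, of different sizes.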
In the early 1980s it was possible to list on a few fingers the main electronic corpora which a small band of devotees had put together over the previous two decades for linguistic research. These corpora were available to researchers on a non-profit basis, and were initially available for processing only on mainframe computers. The development of more powerful microcomputers from the mid-1970s and the advent of CD-ROM in the 1980s made corpus-based research more accessible to a much wider range of participants.
By the 1990s there were many corpus-making projects in various parts of the world. Lancashire (1991) shows the huge range of corpora, archives and other electronic databases available or being compiled for a wide variety of purposes. Some of the largest corpus projects have been undertaken for commercial purposes, by dictionary publishers. Other projects in corpus compilation or analysis are on a smaller scale and do not necessarily become well known; undertaken as part of graduate theses or undergraduate projects, they enable students to gain original insights into the structure and use of language.

1.2 The role of computers in corpus linguistics

The analysis of huge bodies of text ‘by hand’ can be prone to error and is not always exhaustive or easily replicable. Although manual analysis has made an important contribution over the centuries, especially in lexicography, it was the availability of digital computers from the middle of the 20th century which brought about a radical change in text-based scholarship. Rather than initiating corpus research, developments in information technology changed the way we work with corpora. Instead of using index cards and dictionary ‘slips’, lexicographers and grammarians could use computers to store huge amounts of text and retrieve particular words, phrases or whole chunks of text in context, quickly and exhaustively, on their screens. Furthermore the linguistic items could be sorted in many different ways, for example, taking account of the items they collocate with and their typical grammatical behaviour.
Corpus linguistics is thus now inextricably linked to the computer, which has introduced incredible speed, total accountability, accurate replicability, statistical reliability and the ability to handle huge amounts of data. With modern software, computer-based corpora are easily accessible, greatly reducing the drudgery and sheer bureaucracy of dealing with the increasingly large amounts of data used for compiling dictionaries and other information sources. In addition to greatly increased reliability in such basic tasks as searching, counting and sorting linguistic items, computers can show accurately the probability of occurrence of linguistic items in text. They have thus facilitated the development of mathematical bases for automatic natural language processing, and brought to linguistic studies a high degree of accuracy of measurement which is important in all science. Computers have permitted linguists to work with a large variety of texts and thus to seek generalizations about language and language use which can go beyond particular texts or the intuitions of particular linguists. The quantification of language use through corpus-based studies has led to scientifically interesting generalizations and has helped renew or strengthen links between linguistic description and various applications. Machine translation, text-to-speech synthesis, content analysis and language teaching have been among the beneficiaries.
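As an illustration of the kinds of searching, counting and sorting described here, the short Python sketch below (an assumption-laden example, not drawn from the book) builds a simple key-word-in-context (KWIC) concordance for a node word and counts the word forms that co-occur with it within a small window. The toy text and the function names are hypothetical; real concordancing software adds sorting, part-of-speech filtering and statistical measures of collocation strength on top of this basic machinery.

```python
import re
from collections import Counter

def tokenize(text):
    """Rough tokenizer: lower-cased alphabetic word forms."""
    return re.findall(r"[a-z]+", text.lower())

def kwic(tokens, node, span=4):
    """Yield key-word-in-context lines: the node word with `span` words either side."""
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            yield f"{left:>40}  [{node}]  {right}"

def collocates(tokens, node, span=4):
    """Count word forms occurring within `span` words of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

if __name__ == "__main__":
    # A toy 'corpus'; in practice this would be read from corpus files.
    text = ("The committee reached a decision. A decision of this kind "
            "is rarely reached quickly, and the decision was contested.")
    tokens = tokenize(text)
    for line in kwic(tokens, "decision", span=3):
        print(line)
    print(collocates(tokens, "decision", span=3).most_common(5))
```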
Some idea of the changes which the computer has made possible in text studies can be gauged from a report in an early issue of the ALLC Bulletin, the forerunner of the journal Literary and Linguistic Computing. A brief report by Govindankutty (1973) on the coming of the computer to Dravidian linguistics captures the moment of transition between manual and electronic databases. The 300,000-word text he was working with is small by today’s standards, but what took the researcher and his long-suffering colleagues nearly six years of data management and analysis could, 20 years later, be carried out in minutes.
It took nearly six years’ hard labour and the co-operation of colleagues and students to complete the Index of Kamparāmāyaņam, the longest middle Tamil text, in the Kerala University under the supervision of Professor V. I. Subramoniam. The text consists of nearly 12,500 stanzas and each stanza has four lines; each line has an average of six words. All the words and some of the suffixes were listed on small cards by the late Mr. T. Velaven who is the architect of this voluminous index. Later, the cards were sorted into alphabetical order and each item was again arranged according to the ascending order of the stanza and line. Finally, each entry was checked with the text and the meaning and grammatical category were noted. The completed index consists of about 3,500 typed pages (28 Ɨ 20 cm).
While indexing, some suffixes such as case were listed separately. This posed some problems when I started to work on the grammar of the language of the text. When it was necessary to find out after what kind of words and after which phonemes and morphemes the alternants of a suffix occur, it became necessary again to go through all the entries. Though I have tried to work out the frequency of all the suffixes, for want of time it was not completely possible. However, the frequency study helped to unearth different strata in the linguistic excavation and indirectly emphasized that it is a sine qua non, at least, for such a descriptive and historical study.
Though it took a lot of time, energy and patience, the birth of an index brought with it an unknown optimism in the grammatical description. After completing the index and the grammatical study of Kamparāmāyaņam, three months ago I started indexing Rāmacaritam, an early Malayalam text, using small cards. This project is being carried out in the Leiden University with the guidance of Professor F. B. J. Kuiper. While I was half my way through the indexing, Dr. B. J. Hoff of the Linguistics Department informed me of the work done in the Institute for Dutch Lexicology with the help of a computer. When I discussed the problems with Dr. F. de Tollenaere, who is the head of this institute, he outlined with great enthusiasm how a computer can be utilized for this purpose. Immediately, I started transcribing the text and now it is being punched on ...

Table of contents

  1. Cover
  2. Half Title
  3. Studies in language and linguistics
  4. Title Page
  5. Copyright Page
  6. Table of Contents
  7. Author's acknowledgements
  8. Publisher's acknowledgements
  9. Chapter One: Introduction
  10. Chapter Two: The design and development of corpora
  11. Chapter Three: Corpus-based descriptions of English
  12. Chapter Four: Corpus analysis
  13. Chapter Five: Implications and applications of corpus-based analysis
  14. References
  15. Index