Introduction to Corpus Linguistics
eBook - ePub

Introduction to Corpus Linguistics

Sandrine Zufferey

Share book
  1. English
  2. ePUB (mobile friendly)
  3. Available on iOS & Android
eBook - ePub

Introduction to Corpus Linguistics

Sandrine Zufferey

Book details
Book preview
Table of contents

About This Book

Over the past decades, the use of quantitative methods has become almost generalized in all domains of linguistics. However, using these methods requires a thorough understanding of the principles underlying them. Introduction to quantitative methods in linguistics aims at providing students with an up-to-date and accessible guide to both corpus linguistics and experimental linguistics. The objectives are to help students developing critical thinking about the way these methods are used in the literature and helping them to devise their own research projects using quantitative data analysis.

Frequently asked questions

How do I cancel my subscription?
Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.
Can/how do I download books?
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
What is the difference between the pricing plans?
Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.
What is Perlego?
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Do you support text-to-speech?
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Is Introduction to Corpus Linguistics an online PDF/ePUB?
Yes, you can access Introduction to Corpus Linguistics by Sandrine Zufferey in PDF and/or ePUB format, as well as other popular books in Psychology & Cognitive Science. We have over one million books available in our catalogue for you to explore.



How to Define Corpus Linguistics

This chapter aims to offer the main defining elements of corpus linguistics in order to understand what this field includes. It also aims to lay the theoretical and methodological bases on which the discipline is based. In particular, we will introduce the difference between empirical and rationalist methodologies in linguistics, the important role of computer science for corpus linguistics, the difference between quantitative and qualitative studies, as well as the differences between corpus linguistics and experimental linguistics. In conclusion, we will briefly review the different types of corpora. In the upcoming chapters, this introduction will help us to tackle the research questions that can be answered by means of a corpus study.

1.1. Defining elements

The term corpus has a Latin origin and means “body”. A text corpus literally embodies a set of texts, a collection of a certain number of texts for study. For example, it is possible to collect a series of newspaper articles and make a corpus of them in order to study the specificities of the journalistic genre. In the field of language teaching, it is also possible to collect texts written by students having different levels, and to build a corpus of these writings in order to study the typical errors that students produce at different learning stages. A methodology using data from the outside world rather than using one’s own knowledge of the language is called an empirical methodology. Corpus linguistics can be defined as an empirical discipline par excellence, since it aims to draw conclusions based on the analysis of external data, rather than on the linguistic knowledge pertaining to researchers.
Working with corpus linguistics therefore implies being in contact with linguistic data in the form of texts, and also in the form of recordings, videos or any other sample containing language. Most of the time, these samples are collected in a computerized format, which makes it possible to study them more effectively than if they were on paper. Let us imagine, for example, we wish to know how many times and in what passages Flaubert evokes the feeling of love in his novel Madame Bovary. If we have a paper version of that book, finding these passages will be a long and tedious task, which will require going through the entire text. However, having a computerized version would make the task much easier. We simply need to look up for the terms love, in love or the verb to love in its different forms with the search function of the word processor so as to locate the appearances and easily count them. For most of the questions addressed by corpus linguistics, it would be impossible to search through a paper database, and that is why having computerized corpora becomes essential.
The problem of manual tracking and counting of occurrences is all the more acute since corpus linguistics is often based on large amounts of data which have not been drawn from a single book, in view of observing the multiple occurrences of a certain linguistic phenomenon and thus apprehending its specificities. For example, let us suppose that we wish to know whether Flaubert talks about love in his work. In this case, focusing solely on Madame Bovary would induce a bias, because this novel is not representative of the whole of his work. So, in order to be able to answer this question, it is necessary to go through the entirety of his novels, making the task even more complex to perform manually. Let us now imagine that this time we want to know whether the French authors of the 19th Century all deal with the question of love as much as Flaubert does. In this case, it would be impossible for us to look up the occurrence of terms related to love in all of the novels written by French authors in the 19th Century. In order to avoid this problem, it would be necessary to collect a sample of texts, representative of the works of this period. We will discuss this topic in Chapter 6, which is devoted to the methodological principles underlying the construction of a corpus. For the moment, the important point to bear in mind is that corpus linguistics often resorts to a quantitative methodology (see section 1.5) so as to be able to generalize the conclusions observed on the basis of a linguistic sample to the whole of the language, or belonging to a particular language register.
As we will see in the following chapters, corpus linguistics may be of use in all areas of linguistics, for instance in fundamental (see Chapter 2) or applied (see Chapter 3) linguistics. For example, it is crucial in lexicography, since it makes it possible to make an exhaustive inventory of a language’s lexicon. It also makes it easy to find examples of uses in different types of sources (literary, journalistic and others), while bringing to light the expressions in which a word is frequently used. In other words, it makes it possible to establish very useful phraseology elements for dictionaries. For example, it is useful to know what the word “knowledge” means, but it is just as important to know that this word is frequently used in phrases such as “acquire knowledge” or “having good knowledge of”, etc. Corpus linguistics is a particularly effective method for establishing the frequent contexts in which a word or an expression is used. But corpus linguistics is also used for conducting research in fundamental areas of linguistics such as the study of syntax, since it makes it possible to identify the types of syntactic structures used in different languages. For example, by making a corpus study, it is possible to determine in which textual genres the passive voice is most commonly used. Finally, thanks to the existence of a corpus of oral data, corpus linguistics also makes it possible to answer questions related to phonology and sociolinguistics. For instance, it makes it possible to establish the area of geographical distribution of certain pronunciation traits, such as differentiating the short /a/ form in the French word “patte” (paw), from the long /ɑ/ form in the word “pâte” (pastry). Answering these different questions requires the use of different types of corpora, as well as having available data regarding their contents. For example, in order to determine the geographical area of diffusion of a certain pronunciation trait, it is necessary to know where each speaker having contributed to the corpus came from. This type of information is called corpus metadata. We will review the main types of existing corpora at the end of this chapter, and discuss the issue of metadata in Chapter 6.
To sum up, in this section, we have defined corpus linguistics as an empirical discipline, which observes and analyzes quantitative language samples gathered in a computerized format. In the following sections, we will discuss in depth the different central points of the definition, indicated in bold, in order to better understand the theoretical and methodological anchoring of corpus linguistics.

1.2. Empiricism versus rationalism in linguistics

Corpus linguistics is an empirical discipline, which means that it uses data produced by speakers in order to study language. This methodology is opposed to the rationalist method, which functions by looking for answers by relying on one’s own linguistic knowledge, rather than looking for it in external data. Let us take an example. In order to determine whether the phrase “When do you think he will prepare which cake?” is grammatically correct or not, the use of empirical methodology would go through large corpora to find whether this syntactic structure is used by English speakers or not.
If sentences following such a syntactic structure never or almost never appear in the corpus, linguists might conclude that this sentence is only rarely used in English. Rationalist methodology, on the contrary, might respond to the same issue by relying on the intuitions of linguists. In this particular case, they might wonder whether they could produce such a sentence or not, whether it seems correct or incorrect depending on their knowledge of the language and might infer a grammaticality judgment from it. Grammaticality judgments are often classified into three types: correct, incorrect or marked, in the event that a sentence may seem possible, but sounds unnatural.
This example illustrates a fundamental difference between empirical and rationalist methodology. While the rationalist methodology leads to the formulation of categorical judgments, the empirical methodology provides a more refined answer to this question, since the observation of corpus data offers a precise indication of frequency, rather than a result in terms of absence or presence. This is one of the reasons why many linguists currently consider that the empirical methodology better matches a scientific approach (in the sense of confrontation against the facts) than a purely rationalist method for studying language.
Nonetheless, the choice between the use of empirical or rationalist methods is not limited to the field of linguistics. Certain scientific branches such as physics, chemistry, as well as sociology and history are essentially empirical disciplines. In fact, both physicists and historians base their insights on external data, which they collect in the world, in order to build a theory, test it and draw conclusions from it. On the other hand, other disciplines such as mathematics or philosophy are traditionally based on a rationalist approach, since mathematicians and philosophers use their own reasoning to build theories and to draw conclusions, rather than from the collection and observation of external data. Philosophers often resort to thought experiments, but these are not experiments in the empirical sense of the term, because they are based on the reflective abilities of researchers.

1.3. Chomsky’s arguments against empiricism in linguistics

Although corpus linguistics has experienced a strong growth over the past 20 years, the empirical grounding of linguistics is not new. Linguists have long used observational data. In the 19th Century, for example, linguists used to work on the comparison of Indo-European languages in an attempt to reconstruct their common origin. Research was based on existing data about the languages spoken in Europe such as German, French and English. Similarly, in the first half of the 20th Century in the United States, the so-called distributionist approach to syntax focused on the study of sentence formation in syntactic structures as they appeared in text corpora, and from there, tried to infer language’s general functioning. Around the late 1950s, the use of corpora in linguistics was almost completely interrupted in certain fields such as syntax, following the works of the American linguist Noam Chomsky. In fact, Chomsky defended a strictly rationalist methodological approach to linguistics, and fiercely opposed any use of external data. The objections made by Chomsky against the use of external data in linguistics have been numerous. We will briefly review them, to show in what ways most of them have lost their raison d’être in the context of current research.
Chomsky’s first objection to the use of corpora, which is also the most fundamental one, is that corpora contain language samples produced by speakers. According to him, linguistics should not focus on the linguistic performance of speakers, but on the competence they have in their mother tongue, something he calls their internal language. Now, here is the problem. When people speak, what they produce (their performance) does not necessarily reflect what they know about their language (their competence). For example, under the effect of stress or fatigue, speakers sometimes produce verbal slip-ups or make language mistakes. From time to time, almost everybody happens to badly conjugate an irregular verb and mistakenly produce the form “he eated” instead of “he ate”. However, if the person who produced this wrong form were recorded, and then asked whether he or she thought he or she had spoken correctly or not, we can almost be sure that he or she would realize his or her mistake and would be able to state the correct form, “he ate”. Conversely, a speaker could pronounce a word like “serendipity” after having heard it from somebody else’s lips, but without really knowing its meaning. These examples illustrate the fact that the words speakers “utter” are not always a true reflection of their linguistic competence. In this way, according to Chomsky, the fact of studying corpora places linguists on the wrong track, because they lead them to consider language from the point of view of “production”, which merely represents a biased reflection of the rules of language.
According to Chomsky, another problem related to corpus linguistics stems from the fact that corpora are not representative of the language as a whole. He illustrates this problem in an extreme way, by picking the case of an aphasic speaker recorded in a corpus. Linguists analyzing this corpus would draw totally incorrect conclusions about the language in question, since this person does not represent the linguistic competence of a typical speaker. Furthermore, even if we were not to include an atypical speaker, a corpus could never represent more than a tiny language sample when compared to all the oral and written productions in any language. It is for this very same reason that it is impossible to conclude that a word simply does not exist in a language just...

Table of contents