Corpus Linguistics

Corpus linguistics is a branch of linguistics that involves the analysis of large collections of written or spoken texts (corpora) to study language patterns and usage. It uses computational tools and statistical methods to identify linguistic patterns, frequencies, and relationships within a language. Corpus linguistics provides valuable insights into language structure, usage, and variation.

Written by Perlego with AI-assistance

12 Key excerpts on "Corpus Linguistics"

  • Web As Corpus
    Theory and Practice
    eBook - ePub

    2. Key issues in Corpus Linguistics
    Before plunging directly into the main topic of this book, it is perhaps useful to provide some introductory information on Corpus Linguistics as a research field. Today Corpus Linguistics could be very loosely defined as an exploration of language based on a set of authentic texts in machine-readable format. The set of texts, namely the corpus, is usually of a size which would not allow manual investigation but requires the use of specific tools to perform a quantitative and qualitative analysis of the data, through such tasks as producing frequency lists of all the words appearing in the corpus, providing data concerning recurring patterns, and computing statistics about relative frequency by comparing different data sets.
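The frequency lists and relative-frequency comparisons mentioned above can be illustrated with a minimal Python sketch. It is not taken from the book: the tokenizer, the helper names (`tokenize`, `frequency_list`, `per_million`) and the two tiny sample texts are assumptions made purely for illustration, and real corpus tools handle tokenization far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract simple word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens,
    so that corpora of different sizes can be compared."""
    return count * 1_000_000 / corpus_size

# Two invented samples standing in for real corpora.
corpus_a = tokenize("The cat sat on the mat. The dog sat too.")
corpus_b = tokenize("A dog barked. A cat ran. The end.")

freq_a = Counter(corpus_a)
freq_b = Counter(corpus_b)

print(frequency_list(corpus_a)[:3])
print(per_million(freq_a["the"], len(corpus_a)))
print(per_million(freq_b["the"], len(corpus_b)))
```

Normalising to a common base is what makes the comparison meaningful: a raw count of 3 in one sample and 1 in another says little until both are expressed relative to corpus size.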
    On the basis of such a loose definition, the idea of what constitutes a corpus might appear to be more inclusive than it is generally perceived to be in contemporary Corpus Linguistics. Indeed, as McEnery and Wilson (2001) suggest, drawing on the Latin etymology of the word, virtually any collection of more than one text could, in principle, be called a ‘corpus’. The Latin word corpus means ‘body’, and hence a corpus should be by definition any body of text. In language studies, however, this term has acquired more specific connotations than this simple definition implies. Even though some branches of linguistics have always been to some extent ‘corpus-based’ (i.e. based on the study of a number of authentic texts), and concepts such as corpus and concordance have been for many years the daily bread of scholars studying the Bible or Shakespeare’s works (Kennedy 1998: 14), Corpus Linguistics as a distinct research field is a relatively recent phenomenon which rests on certain basic assumptions about what a corpus is, and also – perhaps more crucially for the purpose of the present book – what a corpus is not. A corpus, according to Sinclair’s definition, ‘is a collection of naturally-occurring language chosen to characterize a state or variety of a language’ (Sinclair 1991: 171, my italics). In modern linguistics this also entails such basic standards as finite size, sampling and representativeness (McEnery and Wilson 2001: 29). Thus, as Francis observes, an anthology cannot be properly considered as a corpus; neither can, despite its name, the Corpus Iuris Civilis instigated by the Emperor Justinian in the sixth century (Francis 1992: 17), because their purpose is other than linguistic. And while it is doubtful whether a collection of proverbs can be considered as a corpus in its own right, it may eventually be considered as such if it is the object of linguistic research carried out using Corpus Linguistics tools (Tognini Bonelli 2001: 53).
Thus the notion of what may constitute a corpus seems to defy simple definitions based on the corpus-as-object alone and is best approached in a wider perspective including considerations on both form and purpose (Hunston 2002
  • Understanding Corpus Linguistics
    • Danielle Barth, Stefan Schnell (Authors)
    • 2021 (Publication Date)
    • Routledge (Publisher)
    Duranti 1997).
    The main concern of Corpus Linguistics, however, is the regularities of language use. Corpus linguists seek to identify patterns of variation in language use and relate these to relevant factors of their context. Instead of asking what is possible to say, sign, or write in a given language – given relevant abstract rules – we are interested in what people have said, signed, or written in specific contexts, as observed in recorded texts contained in a corpus, and what they are therefore most likely to say, sign, or write given the same contextual circumstances.
    Our definition of corpus and characterisation of Corpus Linguistics does not specify any properties of the texts included. Some corpus linguists may restrict definitions of corpora to ‘authentic’ texts, those produced in non-academic contexts (McEnery & Wilson 2001; Stefanowitsch 2020: 23–25). But we would include in our definition also texts that come from specific experimental designs for linguistic or other academic purposes. Corpora containing such experimentally elicited texts are labelled ‘artificial corpora’ (cf. Section 3.3.1). There are specific research contexts – in particular, language documentation to be discussed in Chapter 10 – where corpus compilation is driven by a variety of considerations, some of which necessitate the inclusion of non-authentic texts and text types. The only necessary condition for the inclusion of texts in a corpus is that the expressions in it are used in the sense of constituting some social action, if only in reaction to a stimulus. This makes corpus texts different from the constructed examples abundant in many other strands of linguistics, like Sapir’s ‘the farmer killed the ducklings’ (1921: 94) as an example of a typical simple sentence. A crucial feature of such examples is that they do not represent any language use, but instead merely mention a possible structure. Likewise irrelevant for Corpus Linguistics is any kind of evaluation of language use, for the simple reason that such evaluations do not represent language production at all, but judgements thereof. This applies to grammaticality judgements either by linguists themselves (intuitions) or by informants (Jackendoff 1994: 48–49; Schütze 2016; Stefanowitsch 2020: 8–17), as well as elicited evaluation of language use as is common in perceptual dialectology (Preston 1989
  • Corpus linguistics: A guide to the methodology
    (McEnery & Wilson 2001: 1). This definition is uncontroversial in that any research method that does not fall under it would not be regarded as Corpus Linguistics. However, it is also very broad, covering many methodological approaches that would not be described as Corpus Linguistics even by their own practitioners (such as discourse analysis or citation-based lexicography). Some otherwise similar definitions of corpus linguistics attempt to be more specific in that they define Corpus Linguistics as “the compilation and analysis of corpora” (Cheng 2012: 6, cf. also Meyer 2002: xi), suggesting that there is a particular form of recording “real-life language use” called a corpus. The first chapter of this book started with a similar definition, characterizing Corpus Linguistics as “any form of linguistic inquiry based on data derived from [...] a corpus”, where corpus was defined as “a large collection of authentic text”. In order to distinguish Corpus Linguistics proper from other observational methods in linguistics, we must first refine this definition of a linguistic corpus; this will be our concern in Section 2.1. We must then take a closer look at what it means to study language on the basis of a corpus; this will be our concern in Section 2.2.
    2.1 The linguistic corpus
    The term corpus has slightly different meanings in different academic disciplines. It generally refers to a collection of texts; in literature studies, this collection may consist of the works of a particular author (e.g. all plays by William Shakespeare) or a particular genre and period (e.g. all 18th century novels); in theology, it may be (a particular translation of) the Bible. In field linguistics, it refers to any collection of data (whether narrative texts or individual sentences) elicited for the purpose of linguistic research, frequently with a particular research question in mind (cf. Sebba & Fligelstone 1994: 769).
  • Corpus Linguistics and the Description of English
    • Hans Lindquist, Magnus Levin (Authors)
    • 2018 (Publication Date)
    • EUP (Publisher)
    1 Corpus Linguistics
    1.1 Introducing Corpus Linguistics
    There are many ‘hyphenated branches’ of linguistics, where the first part of the name tells you what particular aspect of language is under study: sociolinguistics (the relation between language and society), psycholinguistics (the relation between language and the mind), neurolinguistics (the relation between language and neurological processes in the brain) and so on. Corpus Linguistics is not a branch of linguistics on a par with these other branches, since ‘corpus’ does not tell you what is studied, but rather that a particular methodology is used. Corpus Linguistics is thus a methodology, comprising a large number of related methods which can be used by scholars of many different theoretical leanings. On the other hand, it cannot be denied that Corpus Linguistics is also frequently associated with a certain outlook on language. Central to this outlook is that the rules of language are usage-based and that changes occur when speakers use language to communicate with each other. The argument is that if you are interested in the workings of a particular language, like English, it is a good idea to study English in use. One efficient way of doing this is to use corpus methodology, and that is what this book is about. We will see how the idea of using electronic corpora began around 1960, fairly soon after computers started becoming reasonably powerful, and how the field has developed over the last sixty-odd years. We will look at different types of corpora, study the techniques involved and look at the results that can be achieved. Since the field has grown phenomenally over the last decades, it is impossible to cover every aspect of it, and what is presented in this book therefore has to be a selection of what we think a student of English should know.
  • New Language Technologies and Research in Linguistics
    A corpus is a collection of machine-readable texts that have been produced in a natural communicative setting. They have been sampled to be representative and balanced with respect to particular factors, for example by genre: newspaper articles, literary fiction, spoken speech, blogs and diaries, and legal documents. This is not as circular as it may sound. Essentially, if the content of the corpus, defined by the specifics of the linguistic phenomena examined or studied, reflects that of the larger population from which it is taken, then we can say that it “represents that language variety.” The idea of a corpus being balanced has been around since the 1980s, yet it is still a rather fuzzy notion and difficult to define strictly. Some authors propose a listing of properties that can be used to define the types of text, and thus contribute to creating a balanced corpus. Corpus Linguistics is the study of language based on large collections of “real-life” language use stored in corpora, electronic databases created for linguistic research. It is seen by some linguists as a research tool or methodology, and by others as a discipline or theory in its own right. It is assumed that “the answer to the question whether Corpus Linguistics is a theory or a tool is simply that it can be both.” Corpus Linguistics is the study of language as expressed in corpora (samples) of “real-world” text. The text-corpus method is a digestive approach to deriving a set of abstract rules, from a text, for governing a natural language, and how that language relates to another language; originally compiled manually, corpora are now usually derived automatically from the source texts.
  • Corpora and Language Education
    While McEnery et al. (ibid.: 6) consider Corpus Linguistics as ‘a new philosophical approach to linguistic enquiry’ with its own theoretical status, they do not view it as a discipline in its own right with its own theory. Teubert (2005: 2) describes the field as ‘a theoretical approach to the study of language’.
    Quote 4.2 McEnery, Xiao and Tono on status of Corpus Linguistics
    As Corpus Linguistics is a whole system of methods and principles of how to apply corpora in language studies and teaching/learning, it certainly has a theoretical status. Yet theoretical status is not theory itself. The qualitative methodology used in social sciences also has a theoretical basis and a set of rules relating to, for example, how to conduct an interview, or how to design a questionnaire, yet it is still labelled as a methodology upon which theories may be built. The same is true of Corpus Linguistics. (McEnery et al. 2006: 7–8)
    In sum, there are many competing viewpoints as to whether Corpus Linguistics should be considered a methodology, theory or approach. It is probably best regarded, in essence, as a methodology along the continuum (rather than divide) of the corpus-driven vs corpus-based approaches. Although the research results are being increasingly interpreted with reference to other linguistic (e.g. systemic-functional linguistics, see Section 3.2) or cognitive theories such as those embraced by usage-based models of language (see Section 3.1.3), this does not make Corpus Linguistics a theory in itself. For this reason it may be more appropriate to refer to this field as ‘corpus-based linguistics’, as Lee (2008) does, to clarify its status.
    4.2 Corpus analysis vs discourse analysis
    We have seen in Section 4.1 that Corpus Linguistics is a somewhat slippery term to define and that there has been much debate over corpus-driven vs corpus-based investigations (see Section 3.1).
  • Language in Context in TESOL
    • Joan Cutting (Author)
    • 2014 (Publication Date)
    • EUP (Publisher)
    2 Corpus Linguistics
    INTRODUCTION
    This chapter describes a technique that can be used to analyse the sociolinguistic dimensions of English in context, and combined with approaches to data analysis. Corpus Linguistics (henceforth CL) studies corpora, which are electronic databases of authentic texts selected according to defined research purposes and stored on computers. It studies them using specialised software, which provides lists of word frequencies and typical grammatical patterns contained within the corpus. Corpora are widely used in the world of language teaching and research. Most dictionaries, grammar reference books, coursebooks and tests nowadays are based on them, for example COBUILD Dictionaries, Longman Grammar of Spoken and Written English (Biber et al. 1999), Cambridge Grammar of English (Carter and McCarthy 2006) and Touchstone (McCarthy et al. 2005).
    Why would you as an English teacher want to study a corpus?
    • You may be curious about how language is used in daily life, and muse along the lines of ‘I wonder whether “raise” or “increase” occurs most with the noun “awareness”?’
    • It could be that, in class, your learners ask you questions such as ‘What’s the difference between “think about” and “think of”?’ You are not entirely sure how most English native speakers use ‘think’ grammatically, and you wonder if it depends on whether they are from the UK, Australia, New Zealand or the US.
    • You may be interested in the language used in your classroom, and wonder about issues such as ‘In pairwork, are the boys using the politeness expressions I just taught them as much as the girls are?’
    • You may have a concern that the language in your EFL coursebook dialogues is not very realistic: ‘Do people really say “Have you ever been to Paris?” in casual conversation, or is ellipsis as in “Ever been to Paris?” more common?’ You wonder if it might depend on how old the speaker is or how well interlocutors know each other.
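A question such as whether “raise” or “increase” occurs more often with “awareness” can be explored with a very small script. The following Python sketch is only an illustration under stated assumptions: the function name `left_collocates`, the three-token window and the invented sample sentences are all hypothetical, and the matching is on plain word forms, so “raised” or “increases” would not be counted.

```python
import re
from collections import Counter

def left_collocates(text, node, candidates, window=3):
    """Count how often each candidate word appears within `window`
    tokens to the left of `node` (word-form matching only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Look only at the few tokens immediately before the node word.
            for left in tokens[max(0, i - window):i]:
                if left in candidates:
                    counts[left] += 1
    return counts

# Invented sentences standing in for a real corpus.
sample = ("We want to raise awareness of recycling. "
          "The campaign helped raise public awareness. "
          "They hope to increase awareness among students.")

counts = left_collocates(sample, "awareness", {"raise", "increase"})
print(counts)
```

With a real corpus the same idea applies, although dedicated concordancing software would normally be used, and the counts would need to be read against the overall frequency of each verb.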
  • Corpus Linguistics. Volume 1
    • Anke Lüdeling, Merja Kytö (Authors)
    • 2008 (Publication Date)
    As the compilation of historical corpora and corpus-based analysis of the language of the past have so far been most intensive in the field of the English language, the following discussion will primarily focus on English. The variationist approach and the methodological questions discussed can, however, be applied to research on other languages as well. We should also keep in mind that important diachronic corpus projects on German, Spanish, French, Czech, Welsh, the Scandinavian languages, Finnish, and various other languages are either completed or in progress. Useful bibliographical information can be found, for example, in article 52.
    2. Variationist approach to the study of language
    The increasing use and obvious advantages of computerised corpora have led to the adoption of the term “Corpus Linguistics” with reference to linguistic study based on corpora. While this is a useful term for indicating a particular focus on evidence-based linguistic research, which typically combines qualitative and quantitative analysis and pays particular attention to software developments, it should be kept in mind that the use of corpora is a methodological approach rather than an independent branch of linguistics. The aims and goals of corpus-based research are the same as those of all empirical linguistic research: to understand and explain language as a means of communication between people. Using corpora for collecting and analysing material simply helps us approach and appreciate the richness and variability of language and to understand how linguistic change is related to this variability, caused by both internal processes of change and language-external factors, socio-cultural, regional or genre-based. If we wish to define a new branch of linguistics supported by computerised corpora, attention should be called to the variationist approach to the analysis and understanding of language.
  • Corpus Linguistics. Volume 2
    • Anke Lüdeling, Merja Kytö (Authors)
    • 2009 (Publication Date)
    It is hardly surprising that the divorce of theory and empirical data results in either untrue or uninteresting theories because any theory that cannot account for authentic data is a false theory while data without a theory is just a meaningless pile of data. As such, with exceptions of a few extremists from either camp who argue that “Corpus Linguistics doesn’t mean anything” (see Andor 2004, 97), or that nothing meaningful can be done without a corpus (see Murison-Bowie 1996, 182), the majority of linguists (e.g. Leech 1992; Meyer 2002) are aware that the two approaches are complementary to each other. In Fillmore’s (1992, 35) words, “the two kinds of linguists need each other. Or better, […] the two kinds of linguists, wherever possible, should exist in the same body”.
    This article discusses the use of corpus data in developing linguistic theory (section 2) and presents an effort to achieve a marriage between theory-driven and corpus-based approaches to linguistics via a series of case studies of aspect (section 3), which has long been studied, but rarely with recourse to corpus data.
    2. Can corpora contribute to linguistic theory?
    To answer this question, we must first of all find out what linguistics is about. We will then discuss the use of intuitions and corpora as evidence in linguistic theorizing and explore how corpus data can contribute to linguistic theory.
    2.1. What linguistics is about
    It has been argued that linguistics is “the study of abstract systems of knowledge idealized out of language as actually experienced”, i.e. “idealized internalized I-language” (Widdowson 2000, 6). If linguistics is defined as such, we must admit that any linguistic analysis involving performance data (i.e. “E-language”) has nothing to do with “linguistics” and should claim no place in “linguistics” at all (cf. Leech 2000, 685).
  • Directions in Corpus Linguistics
    Proceedings of Nobel Symposium 82 Stockholm, 4-8 August 1991
    eBook - PDF

    I have two main observations to make. The first is that I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate. The second observation is that every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way. My conclusion is that the two kinds of linguists need each other. Or better, that the two kinds of linguists, wherever possible, should exist in the same body.
    During the early decades of my career as a linguist, I thought of myself as fortunate for having escaped Corpus Linguistics. Of course, I wouldn't have used the term Corpus Linguistics in describing my good fortune: maybe I would have called it statistical linguistics. The situation was this. When I showed up as a beginning graduate student at the University of Michigan's linguistics program, a long time ago, the first person I considered as a possible dissertation director was the kind of professor I myself would like to be able to be, namely, someone with a well-articulated research agenda who asked each of the students who came under his wing to take on a predetermined assignment within that agenda. If I wanted him to be my mentor, I was to carry out the following assignment. First, I was to make extensive tape recordings - actually, at the time, it may have been wire recordings - of natural conversations in English and Japanese. After doing that, I was to choose and justify a set of empirical criteria for phonemic analysis that could be applied to each of these languages.
  • Exploring English with Online Corpora
    • Wendy Anderson, John Corbett (Authors)
    • 2017 (Publication Date)
    • Red Globe Press (Publisher)
    One of the incidental pleasures of corpus study is in noticing intriguing patterns that are unrelated to the immediate object of your study, and it is an invaluable aid to memory to note these observations to follow up at a later time. This chapter begins by briefly discussing the notion of a word as it relates to Corpus Linguistics, and exploring how we can retrieve information about words from a corpus. However, words are of limited interest when extracted from their linguistic co-text and wider context, so we will move quickly on to look at various aspects of a word’s linguistic environment, including the concepts of collocation, colligation, multiword units (such as idioms and metaphors) and semantic prosody. Although our focus here is on the words used in such patterns, and how they interact with their environment to create meaning, some of the analysis in this chapter will necessarily anticipate the investigations of later chapters, in particular Chapters 5 and 6, on grammar and discourse, respectively. It is important to understand that although we have chosen to divide language up into different ‘levels’ for the purposes of this book, these levels are for convenience only: meaning is not created at each level independently of the others, but is rather created by choices made at multiple linguistic levels simultaneously.
    What is a word in a corpus?
    The question of what constitutes a word has long been discussed by linguists (see for example Crystal 1997, p. 91). In Corpus Linguistics, ‘word’ may be used to cover the concepts of both word form and lemma. The word form is the easier to define: word forms exist on the surface of language, and are simply sequences of characters occurring between two spaces, or between other characters such as punctuation marks which word list software has been programmed to recognise as boundaries. Is, are, was, were, being and been are therefore all separate word forms.
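The distinction between word form and lemma can be made concrete with a short Python sketch. It is offered only as an illustration: the boundary rule in `word_forms` and the hand-made `LEMMA_TABLE` are assumptions introduced here, whereas real word list software and lemmatisers rely on much richer rules and dictionaries.

```python
import re
from collections import Counter

# Hypothetical hand-made lookup table mapping word forms to a lemma.
LEMMA_TABLE = {form: "be" for form in
               ("am", "is", "are", "was", "were", "being", "been", "be")}

def word_forms(text):
    """Split text into word forms: runs of letters bounded by spaces
    or punctuation (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def lemma_counts(forms):
    """Group word forms under their lemma; unknown forms stand for themselves."""
    return Counter(LEMMA_TABLE.get(form, form) for form in forms)

sample = "She was here. They were late, and we are being patient."
forms = word_forms(sample)

print(Counter(forms))        # each surface form counted separately
print(lemma_counts(forms))   # was, were, are, being grouped under 'be'
```

Counting by word form keeps “was” and “were” apart, while counting by lemma merges them, so the two kinds of list can rank the same vocabulary quite differently.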
  • Doing Linguistics with a Corpus
    Methodological Considerations for the Everyday User
    eBook - PDF

    It turns out that nearly every linguistic feature requires an operational definition before it can be analyzed in a text-linguistic study (see fuller discussion in Biber & Conrad, 2019: 60–2). The issues discussed so far in this section relate to the research methods required to ensure that quantitative variables are fully interpretable in linguistic terms. However, as will be discussed in more detail in Section 5, this interpretability becomes even more important when a researcher relies on measures that are automatically computed by corpus analysis software. To illustrate the points made in this section introduction, we present two short case studies in Sections 4.2 and 4.3. In the first, we discuss corpus-based analyses of collocation, which often rely on complex statistical measures that can be difficult to interpret in linguistic terms. The second case study discusses measures of “keyness,” which can present different types of challenges for meaningful linguistic interpretation.
    4.2 Case Study 1: Measures of Collocation
    The question that we explore in this case study is what quantitative measure is best suited to a particular research goal, using a major application of corpus research, namely the study of “collocation”: “a relationship of habitual co-occurrence between words” (Stubbs, 1995: 1). One primary goal of such research has been to study the extended meanings of words beyond traditional dictionary definitions. For example, the verb cause is traditionally defined in neutral terms as “make something happen”. However, corpus research shows that this verb frequently co-occurs with words referring to negative events, such as trouble or problems, a pattern first observed by Stubbs (1995) (see also, e.g., Hunston, 2007; Xiao & McEnery, 2006). These “collocates” of the word cause lead to the extended meaning of cause: “make something bad happen”.
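The excerpt does not say which collocation measures the authors go on to compare. As one commonly used example, the Python sketch below computes pointwise mutual information (PMI) for adjacent word pairs; the choice of PMI, the function name `pmi`, the restriction to adjacent pairs and the invented token sequence are all assumptions made for illustration.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """PMI of the adjacent pair (w1, w2):
    log2( P(w1, w2) / (P(w1) * P(w2)) ).
    Returns -inf when the pair never occurs."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair = bigrams[(w1, w2)]
    if pair == 0:
        return float("-inf")
    p_pair = pair / (n - 1)      # probability over all adjacent pairs
    p_w1 = unigrams[w1] / n      # probability of each word on its own
    p_w2 = unigrams[w2] / n
    return math.log2(p_pair / (p_w1 * p_w2))

# Invented token sequence standing in for a real corpus.
tokens = ("storms cause problems and delays cause trouble "
          "but rain can cause growth").split()

print(pmi(tokens, "cause", "problems"))
print(pmi(tokens, "cause", "rain"))
```

A positive score means the pair occurs more often than its individual frequencies would predict, but on a sample this small the value says nothing reliable, and PMI is known to favour rare words, which is part of why such measures need careful linguistic interpretation.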
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.