Statistics in Corpus Linguistics Research

A New Approach

  1. 358 pages
  2. English
  3. ePUB (mobile friendly)

About this book

Traditional approaches focused on significance tests have often been difficult for linguistics researchers to visualise. Statistics in Corpus Linguistics Research: A New Approach breaks these significance tests down for researchers in corpus linguistics and linguistic analysis, promoting a visual approach to understanding the performance of tests with real data, and demonstrating how to derive new intervals and tests.

Accessibly written, this book discusses the 'why' behind the statistical model, allowing readers a greater facility for choosing their own methodologies. Aimed at those with little to no mathematical or statistical background, it explains the mathematical fundamentals of simple significance tests by relating them to confidence intervals. With sample datasets and easy-to-read visuals, this book focuses on practical issues, such as how to:

• pose research questions in terms of choice and constraint;

• employ confidence intervals correctly (including in graph plots);

• select optimal significance tests (and what results mean);

• measure the size of the effect of one variable on another;

• estimate the similarity of distribution patterns; and

• evaluate whether the results of two experiments significantly differ.

Appropriate for anyone from the student just beginning their career to the seasoned researcher, this book is both a practical overview and valuable resource.

Information

Publisher: Routledge
Year: 2020
Print ISBN: 9781138589384
eBook ISBN: 9780429958663

PART 1
Motivations

1

What Might Corpora Tell Us About Language?

1.1 Introduction

Corpus linguistics has become popular. Many linguists who would not otherwise consider themselves to be corpus linguists have started to apply corpus linguistics methods to their linguistic problems, in part due to the increasing availability of corpora and tools. In this chapter, we consider some kinds of research that can be done with corpora, and the types of corpora and methods that might yield useful results. Corpora are also found outside of linguistics, in social sciences and digital humanities.
In this book, we argue against a simplistic ‘bigger is best’ approach to data analysis and for the centrality of underlying models, theories of what might be happening linguistically ‘behind the scenes’, when we carry out research. More data is an advantage, but there is a trade-off between large corpora with limited annotation and small ones with rich annotation. Our perspective relates theory-rich linguistics with corpus linguistics, implying that we need corpora with rich annotation.
Yet as corpus linguistics has developed as a discipline, the dominant trend has been to build ever larger lexical corpora with very limited annotation: typically structural annotation (speaker turns, overlaps, sentence breaks in spoken data and formatting in writing), wordclass or ‘part-of-speech’ tagging (identifying nouns, verbs, and so on) and lemmas. Crucially, with large ‘mega’ corpora, annotation must be automatically produced without human intervention. The multi-billion-word iWeb corpus built by Mark Davies from 22 million web pages (at the time of writing) is at the frontier of this trend.
Not every linguist is in favour of a methodological ‘turn to corpora’. Some theoretical linguists, including Noam Chomsky, have argued that, at best, collections of language data merely provide researchers with examples of actual external linguistic performance of human beings in a given context (see, e.g., Aarts, 2001). We refer to this type of evidence as ‘factual evidence’ (see Section 1.2). From this perspective, corpora do not provide insight into internal language or how it is produced in the human mind. However, Chomsky’s position raises questions about what data, if any, could be used to evaluate ‘deep’ theories.
Nevertheless, this contrary position represents a serious challenge to corpus researchers. Is corpus research doomed to investigate surface phenomena? At the end of this chapter, and as a motivation for what follows, we will return to the question of the potential relevance of corpus linguistics for the study of language production by reporting on a recent study.
Indeed, in recent years this ‘turn to corpora’ has begun to influence generative linguists. Take language change: a systematic evaluation of how language has changed over time must rely on data. An old antipathy is replaced by engagement. Large historical corpora such as the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME, Kroch, Santorini & Delfs, 2004) are inspiring a new generation of linguistics researchers to approach corpora in new and more sophisticated ways. Similarly, it is our contention that corpora can benefit psycholinguistics, not as a substitute for laboratory experiments but as a complementary source of evidence.
What do we mean by ‘a corpus’? In the most general sense, corpora are simply collections of language data processed to make them accessible for research purposes. In contrast to experimental datasets, sampled to answer a specific research question, corpora are sampled in a manner that – as far as possible – permits many different types of research question to be posed. Datasets extracted from corpora are not obtained under controlled conditions but under ‘naturalistic’ or ‘ecological’ ones. We discuss some implications of this statement in Part 2.
Corpora also typically contain substantive passages of text, rather than, say, a series of random sentences produced by random speakers or writers.
However, the majority of corpora available today have one major drawback for the study of language production. Most data are written. Texts are generated by authors at keyboards, on screens or paper. Writing is rarely spontaneously produced, may be edited by others, and is often included in databases due to availability. Like this book, texts are usually written for an imagined audience, in contrast to spoken utterances that are typically produced – scripted performances and monologues aside – on-the-spot for a present and interacting audience.
In the era of the internet, written data are easy to obtain, so large corpora of writing may be rapidly compiled. But if ‘language’ is sampled from writing (inevitable in historical corpora), we can only draw inferences about written language. Far better to be able to test hypotheses against spontaneously produced linguistic utterances that are unmediated, or, to be more precise, that are minimally affected by processes of articulation and transmission.
Not all corpora are drawn from written sources, and nothing inherent in corpus linguistics limits it to the study of written data. If we had no option but to use written sources, then this would still be better than relying on intuition.
But a better option is a corpus of spoken data, ideally in the form of recordings aligned with orthographic transcriptions. Transcriptions of this kind should record the output word-for-word, including false starts and self-correction, overlapping speech, speaker turns, and so on. The transcription should be a coded record of the audio stream. Faithfully transcribed speech data from an uncued and unrehearsed context is arguably as close a source of genuinely ‘spontaneous’ naturalistic language output as it is possible to find.
A transcription can be richer than a written text. It may be time-aligned with the original audio or video recording, contain prosodic and meta-linguistic information, gestural signals, and so on. The value of these additional layers of annotation will depend on the research aims of users. Researchers interested in language production and syntax are less concerned whether transcriptions are time-aligned than whether they are accurate. But if pause duration or words per minute is considered a proxy for mental processing, then timing data are essential.
Although we refer to ‘speech’ here, we are really referring to unmediated spontaneously produced language, the majority of which will be speech. For example, we might justifiably include sign language corpora under the category of ‘speech corpora’. It may be attractive to stretch this definition to include conversational text data (e.g., online ‘chat’), but usually, a user interface will allow the language producer to edit utterances as they type. If we wish to study unmediated language production, authentic data from spoken sources seems the best option.
Prioritising speech over writing in linguistics research has other justifications aside from mere spontaneity. The most obvious is historical primacy. Hunter-gatherer societies had an oral tradition long before writing was systematised. When writing developed, it was first limited to scribes, and gradually spread through social development and education. In 1820, around 12% of the world’s population could read and write. Even today that figure is around 83% (Roser & Ortiz-Ospina, 2018). So the first reason for studying speech is its near-universality. By contrast, historical corpus linguistics – which of necessity can only study written texts prior to the invention of the phonograph – is limited to the language of the literate population of the age, and their region, social class and gender distribution.
There are other important motivations. In child development, children usually express themselves through the spoken word before they master putting words on a page, and many writers are aware that their writing requires a more-or-less internal speech act. Which comes first, speech or writing? The answer is speech.
Then there is the question of representativeness. A corpus of British English speech has approximately 2,000 words spoken by participants every quarter of an hour. The author Stephen King (2002) recommends aspiring writers write 1,000 words a day. Allowing for individual variation – and excepting isolated individuals or those physiologically unable to produce speech – it seems likely that human beings produce, and are exposed to, an order of magnitude more speech than writing.
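The comparison above can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not a calculation taken from the book: it uses only the two figures quoted in the paragraph, and it assumes (for illustration) that the quoted speech rate holds steadily and counts the words of all participants together, so that it covers both speech produced and speech heard. The function names are hypothetical.

```python
# Figures quoted in the text above.
SPEECH_WORDS_PER_QUARTER_HOUR = 2_000  # all participants combined
WRITING_WORDS_PER_DAY = 1_000          # King's (2002) target for aspiring writers


def speech_words(minutes: float) -> float:
    """Words spoken in `minutes` of conversation at the quoted rate.

    Assumption: the rate is constant over the whole period.
    """
    return SPEECH_WORDS_PER_QUARTER_HOUR * minutes / 15


def minutes_for_ratio(ratio: float) -> float:
    """Minutes of conversation per day needed for speech to reach
    `ratio` times the daily writing target."""
    return ratio * WRITING_WORDS_PER_DAY * 15 / SPEECH_WORDS_PER_QUARTER_HOUR


if __name__ == "__main__":
    print(speech_words(60))       # words in one hour of conversation: 8000.0
    print(minutes_for_ratio(10))  # minutes for a tenfold ratio: 75.0
```

Under these assumptions, an hour and a quarter of conversation a day is enough for a person to be exposed to ten times as many spoken words as a committed writer produces on the page, which is consistent with the 'order of magnitude' estimate.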
Of course, not all speech data are the same. Speech data may be collected for a variety of purposes, some of which are more representative and ‘natural’ than others. One of the first treebanks containing spoken data, the Penn Treebank (Marcus, Marcinkiewicz & Santorini, 1993), included parliamentary language, telephone calls and air traffic control data. Other spoken data might be captured in the laboratory: collected in controlled conditions, but unnatural, potentially psychologically stressed and not particularly representative.
Scripting and rehearsal are a feature of many text...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Table of Contents
  6. Preface
  7. Part 1 Motivations
  8. Part 2 Designing Experiments with Corpora
  9. Part 3 Confidence Intervals and Significance Tests
  10. Part 4 Effect Sizes and Meta-Tests
  11. Part 5 Statistical Solutions for Corpus Samples
  12. Part 6 Concluding Remarks
  13. Appendices
  14. Glossary
  15. References
  16. Index

Statistics in Corpus Linguistics Research by Sean Wallis is available in PDF and/or ePUB format, and is catalogued under Computer Science & General Computer Science.