This book unites a range of approaches to the collection and digitization of diverse language corpora. Its specific focus is on best practices identified in the exploitation of these resources in landmark impact initiatives across different parts of the globe. The development of increasingly accessible digital corpora has coincided with improvements in the standards governing the collection, encoding and archiving of 'Big Data'. Less attention has been paid to developing standards for enriching and preserving other types of corpus data, such as data that captures the nuances of regional dialects. This book takes these best practices another step forward by addressing innovative methods for enhancing and exploiting specialized corpora so that they become accessible to wider audiences beyond the academy.

Topic: Languages & Linguistics
Subtopic: Sociolinguistics

Karen P. Corrigan and Adam Mearns (eds.), Creating and Digitizing Language Corpora, DOI: 10.1057/978-1-137-38645-8_1

1. Taming Digital Texts, Voices and Images for the Wild: Models and Methods for Handling Unconventional Corpora to Engage the Public

Karen P. Corrigan and Adam Mearns
Newcastle University, Newcastle upon Tyne, UK

The term ‘unconventional’ here relates to the distinction first articulated in Beal et al. (2007a, b) between large-scale standardized or conventional corpora like the International Corpus of English or COBUILD and smaller more specialized databases. These are often not devised at the outset as corpora strictly speaking since they initially arise from sociolinguistically oriented projects, but such resources can indeed be used as such providing they are ‘tamed’ in particular ways (Beal et al. 2007a: 1). See also D’Arcy (2011: 54–6) and Kendall (2011: 362–3).

1 Stimulus for the Volume and Its Overarching Aim

This volume is the third in a series of books published by Palgrave Macmillan which focus on establishing guidelines for the creation and digitization of language corpora that are unconventional in some respect (see Beal et al. 2007a, b). Volume 3 is dedicated to the issue of public engagement and questions of how linguists can and should make their corpora accessible for a broader range of uses and to a wider audience. Although in this regard the road to building a corpus is often paved with good intentions, as Rickford (1993: 130) observes, these are frequently overtaken by ‘the less escapable commitments’ of teaching and further research. While this may be understandable, it is ‘not a picture, when we step back and view it, with which we can be proud’, since it means that ‘[m]ost of us fall short of paying our debts to the communities whose data have helped to build and advance our careers’ (Rickford 1993: 130). The importance of taking public engagement initiatives more seriously has generated considerable recent scholarly debate (especially amongst researchers in the arts, humanities and social sciences) as the so-called ‘impact agenda’ has taken hold particularly, though not exclusively, in UK higher education institutions (Martin 2011; Samuel and Derrick 2015; Lawson and Sayers 2016).¹ A key objective of this volume is to examine the evidence for the view that despite the new requirements by funding bodies (and ultimately governments) that corpora should have a dual purpose as data that is deployable for engagement as well as research, twenty-first-century corpus linguists who do just that are not following conventional practices within their discipline. A second goal is to demonstrate how the issues that purportedly stand in the way of developing what one might term ‘impactful corpora’ can be circumvented (as our contributors have done) with a little ingenuity and motivation. 
Another objective is to sketch what we consider to be best practices in creating corpora for public engagement by offering guidance on optimal methods by which such data (audio, text and still/moving images) can be created, digitized and subsequently exploited for public engagement projects.

To these ends, all the digital, community-oriented initiatives described in this volume are predicated on the premise that linguists should not engage solely in research that ‘produces or intensifies an unequal relationship between investigator and informants’ (Cameron et al. 1997: 145) but should instead be governed by the principles of ‘linguistic gratuity’ (Wolfram 1993, 2012, 2013, 2016; Reaser and Adger 2007: 168; Wolfram et al. 2008) and ‘debt incurred’ (Labov 1982). Some of the chapters in the volume derive in part from presentations at a peer-reviewed workshop entitled Dialect and Heritage Language Corpora for the Google Generation, which was organized by the editors for the 2011 Methods in Dialectology XIV Conference at the University of Western Ontario. Other papers were delivered at the Corpora Galore: Applications of Digital Corpora in Higher Education Contexts workshop, also organized by Corrigan and Mearns, at Newcastle University in May 2011. These papers are supplemented by invited contributions from key scholars who have an international reputation as creators of divergent digital materials relevant to the preservation, analysis and public dissemination of dialect and heritage language corpora.

1.1 How to Tame Digital Texts, Voices and Images for the Wild

A reviewer for our proposal to Palgrave Macmillan outlining the case for editing a volume on corpus creation that would focus on engagement rightly contended that ‘despite the requirement imposed by funding agencies that corpora should be constructed with public engagement in mind, this proves to be the exception rather than the rule’. There are three principal reasons why we consider this view to be justified: the diversity of aims between one corpus creation project and another; the very understandable desire amongst those who have given their blood, sweat and tears to collect the data and build a corpus from it to keep the resource for private use;² and the extent to which a corpus can ever be effectively anonymized for public access.

Corpora are created and digitized by academics for a wide range of purposes but their primary function is to address the particular hypotheses which underpin their research projects, whatever these may be. Thus, the basic processes of designing a corpus for research into the acquisition and structure of L2 phonetic and phonological systems will be rather different from those one might avail oneself of when building a corpus that will investigate divergence in the discourse marking systems across varieties of French. Although there has been an appetite in several recent publications for normalizing approaches to corpus collection, data management and annotation strategies and for advocating more collaboration amongst researchers in this field (Kretzschmar et al. 2006; Beal et al. 2007a, b; Kendall 2011; Durand et al. 2014), it is difficult to unify an activity which has to satisfy very divergent and often conflicting research goals. Moreover, with very few exceptions (such as Amador et al. and Vine, this volume, as well as some of the projects described in Anderson and Hough) corpora are built first and foremost for mining by academics, and considering how they might be adapted for societal impact uses is generally an afterthought, that is, ex post rather than ex ante in the terms of Samuel and Derrick (2015). With this fact in mind, some resourcefulness is needed on the part of scholars to repurpose their corpora for wider audiences. This can sometimes involve straightforward steps such as the conversion of the Diachronic Electronic Corpus of Tyneside English (DECTE) XML files into plain text so as to make them more accessible, as described in Mearns et al. (this volume). For other data sets, in order to convert one’s corpus to a form that can engage a particular demographic of public end-users, the procedures are considerably more complex (see Norris 2001; Rowlands et al. 2008; Choudhrie et al. 2010). This would be the case, for example, when creating apps for smartphones and tablets that are built on the Survey of English Usage (SEU) corpus, which Mehl et al. (this volume) illustrate.
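To give a sense of how modest the XML-to-plain-text step mentioned above can be, the following Python sketch flattens a small transcript into one line per utterance. It is an illustration only and does not reproduce the procedure reported by Mearns et al.: the sample transcript is invented, and the element and attribute names (u, who, pause, unclear) are assumptions borrowed from common TEI practice rather than taken from the actual DECTE files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element and attribute names (<u>, who) follow
# common TEI usage and are assumptions, not the actual DECTE schema.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#informantA">well <pause/> I was born in   Gateshead</u>
    <u who="#interviewer">and did you <unclear>stay</unclear> there?</u>
  </body></text>
</TEI>"""


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def xml_to_plain_text(xml_source: str) -> str:
    """Return one 'speaker: words' line per utterance element."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "u":
            continue
        # itertext() gathers text from the element and all its children,
        # so inline markup is discarded but its text content is kept.
        words = " ".join("".join(element.itertext()).split())
        speaker = element.get("who", "unknown").lstrip("#")
        lines.append(f"{speaker}: {words}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_to_plain_text(SAMPLE))
```

Keeping the speaker label on each line is one possible design choice; a version aimed at, say, schoolchildren might drop the labels or replace them with pseudonyms, which is where the anonymization concerns raised earlier would need to be addressed.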

Wherever one stands politically on the impact agenda, there is no denying that it presents scholars with opportunities to devote research time to rethinking new end-uses like these for their data sets that may in the past have been considered nothing more than a vanity project on account of the media and wider public attention that doing so might have attracted. It is interesting to note in this regard that the new agenda can be considered a strong motivator for corpus design that can engage the public in this way. This is evidenced by the fact that the chapters in this volume by Anderson and Hough, Cheshire and Fox, Mearns et al. and Mehl et al. are all connected with Impact Case Studies submitted for the 2014 Research Excellence Framework exercise in the UK. In fact, only two of our contributions from UK authors are not linked with Impact Case Study submissions (Amador et al. and Walsh and Knight). In each case this is likely to be due simply to the recent nature of their corpus-building activities which have not yet had the kind of gestation period required for the gathering of evidence to support societal impact claims of the type expected.³ The UK, of course, is not alone in promulgating the view that one’s research cannot simply take an ‘art for art’s sake’ stance. Hence Wolfram (2016: 87) notes that: ‘Every research proposal submitted to the National Science Foundation in the United States requires a narrative section titled “Statement of Broader Impacts”. Under this heading, the principal investigator is obliged to address the project’s “benefits to society” and to “human welfare”’. The contribution to this volume by Barbiers examines how dialect atlas projects in the Dutch-speaking parts of the Netherlands and Belgium can be repurposed to deliver similar goals. 
As such, scholars globally are being encouraged to plan for and integrate outreach activities within their core research agenda, so a volume like this which demonstrates how such obligations can be met even with corpora that were not originally created for these purposes is not only timely but we hope will serve as a ‘go-to guide’ for obtaining awards to fund future impactful corpus-building initiatives.

This orientation of course may conflict with other goals that corpus linguists have, particularly those who have invested considerable time (and money) in creating the kind of smaller-scale unconventional corpora which are the focus of Beal et al. (2007a, b) as well as contributions to this volume. There are many good reasons to keep such initiatives private including ‘scoop’ avoidance, as Childs et al. (2011) put it, which is probably why this is their ‘default’ status (D’Arcy 2011: 55). Another issue, of course, is that granting wider access to one’s hard-won corpus might also lead to original analyses being refuted or otherwise invalidated. These issues aside, there does, however, seem to be a welcome change afoot amongst corpus creators towards widening access—if not to the public then at least to other scholars (as the data sets deposited with the Sociolinguistic Archive and Analysis Project (SLAAP) described in Kendall and Wolfram here testify).

This new mindset is partially fuelled by funding bodies who, naturally, want to add as much value as possible to research they have supported and also because they have readily adopted new policies embracing Open Access (Kendall 2008, 2011; Childs et al. 2011).⁴ Their demands can be met by having principal investigators lodge their corpus data at project end in suitable national repositories (such as Qualidata and the Oxford Text Archive (OTA) in the UK or the Linguistic Data Consortium (LDC) in the US) and indeed research councils in the UK have an expectation that they will do just that. We strongly endorse this kind of good practice amongst corpus linguists not least because we have personal experience of how one of the data sets which eventually formed a cornerstone of the Newcastle Electronic Corpus of Tyneside English (NECTE) initiative, namely the Tyneside Linguistic Survey, was only salvageable on account of the fact that one of the original project team had lodged materials from the project with the OTA in the 1970s (see Allen et al. 2007).

What of the cases, though, where private corpora were never intended to have public uses—an issue which is particularly problematic when the corpus in question derives from legacy materials as NECTE does? As Kretzschmar et al. (2006: 191) advocate there must nevertheless be ‘a clear pathway toward publication of our materials that preserves the rights of both our research participants and of the corpus builders’. That paper demonstrates how this has been done with ‘major success’ for the Linguistic Atlas Project (LAP). It will thus be useful in this context to rehearse the rights management practices of the LAP pr...
