1 Stimulus for the Volume and Its Overarching Aim
This volume is the third in a series of books published by Palgrave Macmillan which focus on establishing guidelines for the creation and digitization of language corpora that are unconventional in some respect (see Beal et al. 2007a, b). Volume 3 is dedicated to the issue of public engagement and questions of how linguists can and should make their corpora accessible for a broader range of uses and to a wider audience. Although in this regard the road to building a corpus is often paved with good intentions, as Rickford (1993: 130) observes, these are frequently overtaken by ‘the less escapable commitments’ of teaching and further research. While this may be understandable, it is ‘not a picture, when we step back and view it, with which we can be proud’, since it means that ‘[m]ost of us fall short of paying our debts to the communities whose data have helped to build and advance our careers’ (Rickford 1993: 130). The importance of taking public engagement initiatives more seriously has generated considerable recent scholarly debate (especially amongst researchers in the arts, humanities and social sciences) as the so-called ‘impact agenda’ has taken hold particularly, though not exclusively, in UK higher education institutions (Martin 2011; Samuel and Derrick 2015; Lawson and Sayers 2016).1 A key objective of this volume is to examine the evidence for the view that despite the new requirements by funding bodies (and ultimately governments) that corpora should have a dual purpose as data that is deployable for engagement as well as research, twenty-first-century corpus linguists who do just that are not following conventional practices within their discipline. A second goal is to demonstrate how the issues that purportedly stand in the way of developing what one might term ‘impactful corpora’ can be circumvented (as our contributors have done) with a little ingenuity and motivation.
Another objective is to sketch what we consider to be best practices in creating corpora for public engagement by offering guidance on optimal methods by which such data (audio, text and still/moving images) can be created, digitized and subsequently exploited for public engagement projects.
To these ends, all the digital, community-oriented initiatives described in this volume are predicated on the premise that linguists should not engage solely in research that ‘produces or intensifies an unequal relationship between investigator and informants’ (Cameron et al. 1997: 145) but should instead be governed by the principles of ‘linguistic gratuity’ (Wolfram 1993, 2012, 2013, 2016; Reaser and Adger 2007: 168; Wolfram et al. 2008) and ‘debt incurred’ (Labov 1982). Some of the chapters in the volume derive in part from presentations at a peer-reviewed workshop entitled Dialect and Heritage Language Corpora for the Google Generation which was organized by the editors for the 2011 Methods in Dialectology XIV Conference at the University of Western Ontario. Other papers were delivered at the Corpora Galore: Applications of Digital Corpora in Higher Education Contexts workshop also organized by Corrigan and Mearns at Newcastle University in May 2011. These papers are supplemented by invited contributions from key scholars who have an international reputation as creators of divergent digital materials relevant to the preservation, analysis and public dissemination of dialect and heritage language corpora.
1.1 How to Tame Digital Texts, Voices and Images for the Wild
A reviewer for our proposal to Palgrave Macmillan outlining the case for editing a volume on corpus creation that would focus on engagement rightly contended that ‘despite the requirement imposed by funding agencies that corpora should be constructed with public engagement in mind, this proves to be the exception rather than the rule’. There are three principal reasons why we consider this view to be justified: the diversity of aims between one corpus creation project and another; the very understandable desire amongst those who have given their blood, sweat and tears to collect the data and build a corpus from it to keep the resource for private use;2 and the extent to which a corpus can ever be effectively anonymized for public access.
Corpora are created and digitized by academics for a wide range of purposes but their primary function is to address the particular hypotheses which underpin their research projects, whatever these may be. Thus, the basic processes of designing a corpus for research into the acquisition and structure of L2 phonetic and phonological systems will be rather different from those one might use when building a corpus that will investigate divergence in the discourse marking systems across varieties of French. Although there has been an appetite in several recent publications for normalizing approaches to corpus collection, data management and annotation strategies and for advocating more collaboration amongst researchers in this field (Kretzschmar et al. 2006; Beal et al. 2007a, b; Kendall 2011; Durand et al. 2014), it is difficult to unify an activity which has to satisfy very divergent and often conflicting research goals. Moreover, with very few exceptions (such as Amador et al. and Vine, this volume, as well as some of the projects described in Anderson and Hough) corpora are built first and foremost for mining by academics, and considering how they might be adapted for societal impact uses is generally an afterthought, that is ex post rather than ex ante in the terms of Samuel and Derrick (2015). With this fact in mind, some resourcefulness is needed on the part of scholars to repurpose their corpora for wider audiences. This can sometimes involve straightforward steps such as the conversion of the Diachronic Electronic Corpus of Tyneside English (DECTE) XML files into plain text so as to make them more accessible, as described in Mearns et al. (this volume). For other data sets, in order to convert one’s corpus to a form that can engage a particular demographic of public end-users, the procedures are considerably more complex (see Norris 2001; Rowlands et al. 2008; Choudhrie et al. 2010).
This would be the case, for example, when creating apps for smartphones and tablets that are built on the Survey of English Usage (SEU) corpus, which Mehl et al. (this volume) illustrate.
Wherever one stands politically on the impact agenda, there is no denying that it presents scholars with opportunities to devote research time to rethinking new end-uses like these for their data sets that may in the past have been considered nothing more than a vanity project on account of the media and wider public attention that doing so might have attracted. It is interesting to note in this regard that the new agenda can be considered a strong motivator for corpus design that can engage the public in this way. This is evidenced by the fact that the chapters in this volume by Anderson and Hough, Cheshire and Fox, Mearns et al. and Mehl et al. are all connected with Impact Case Studies submitted for the 2014 Research Excellence Framework exercise in the UK. In fact, only two of our contributions from UK authors are not linked with Impact Case Study submissions (Amador et al. and Walsh and Knight). In each case this is likely to be due simply to the recent nature of their corpus-building activities which have not yet had the kind of gestation period required for the gathering of evidence to support societal impact claims of the type expected.3 The UK, of course, is not alone in promulgating the view that one’s research cannot simply take an ‘art for art’s sake’ stance. Hence Wolfram (2016: 87) notes that: ‘Every research proposal submitted to the National Science Foundation in the United States requires a narrative section titled “Statement of Broader Impacts”. Under this heading, the principal investigator is obliged to address the project’s “benefits to society” and to “human welfare”’. The contribution to this volume by Barbiers examines how dialect atlas projects in the Dutch-speaking parts of the Netherlands and Belgium can be repurposed to deliver similar goals.
As such, scholars globally are being encouraged to plan for and integrate outreach activities within their core research agenda, so a volume like this which demonstrates how such obligations can be met even with corpora that were not originally created for these purposes is not only timely but we hope will serve as a ‘go-to guide’ for obtaining awards to fund future impactful corpus-building initiatives.
This orientation of course may conflict with other goals that corpus linguists have, particularly those who have invested considerable time (and money) in creating the kind of smaller-scale unconventional corpora which are the focus of Beal et al. (2007a, b) as well as contributions to this volume. There are many good reasons to keep such initiatives private including ‘scoop’ avoidance, as Childs et al. (2011) put it, which is probably why this is their ‘default’ status (D’Arcy 2011: 55). Another issue, of course, is that granting wider access to one’s hard-won corpus might also lead to original analyses being refuted or otherwise invalidated. These issues aside, there does, however, seem to be a welcome change afoot amongst corpus creators towards widening access, if not to the public then at least to other scholars (as the data sets deposited with the Sociolinguistic Archive and Analysis Project (SLAAP) described in Kendall and Wolfram here testify).
This new mindset is partially fuelled by funding bodies, which naturally want to add as much value as possible to the research they have supported and which have readily adopted new policies embracing Open Access (Kendall 2008, 2011; Childs et al. 2011).4 Their demands can be met by having principal investigators lodge their corpus data at project end in suitable national repositories (such as Qualidata and the Oxford Text Archive (OTA) in the UK or the Linguistic Data Consortium (LDC) in the US) and indeed research councils in the UK have an expectation that they will do just that. We strongly endorse this kind of good practice amongst corpus linguists not least because we have personal experience of how one of the data sets which eventually formed a cornerstone of the Newcastle Electronic Corpus of Tyneside English (NECTE) initiative, namely the Tyneside Linguistic Survey, was only salvageable on account of the fact that one of the original project team had lodged materials from the project with the OTA in the 1970s (see Allen et al. 2007).
What of the cases, though, where private corpora were never intended to have public uses, an issue which is particularly problematic when the corpus in question derives from legacy materials as NECTE does? As Kretzschmar et al. (2006: 191) advocate there must nevertheless be ‘a clear pathway toward publication of our materials that preserves the rights of both our research participants and of the corpus builders’. That paper demonstrates how this has been done with ‘major success’ for the Linguistic Atlas Project (LAP). It will thus be useful in this context to rehearse the rights management practices of the LAP pr...
