Corpus Linguistics and Translation Tools for Digital Humanities

Research Methods and Applications

  1. 256 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
About this book

Presenting the digital humanities both as a domain of practice and as a set of methodological approaches to be applied to corpus linguistics and translation, the chapters in this volume provide an original framework for triangulating research in pursuit of both scientific and educational goals within the digital humanities. More broadly, they highlight the importance of data triangulation in corpus linguistics and translation studies.
Putting forward practical applications for digging into data, this book offers a detailed examination of how to integrate quantitative and qualitative approaches through case studies, sample analyses and practical examples.

1 Corpus linguistics and translation tools for digital humanities: An introduction
Michele Sala and Stefania M. Maci
Introduction
Although much research has been carried out on Corpus Linguistics (CL) – its methodologies and applications – and on Digital Humanities (DH) on the one hand, and, on the other, extensive scientific literature is available on translation and Translation Studies (TS), there is still a critical gap as to how the three research domains relate and interact with one another. It is equally unclear how the combination of their specific methods and approaches may be used, firstly, to better understand texts – on the basis of the quantitative and structural distribution of micro- and macro-linguistic elements – and, subsequently, to manage and exploit such understandings effectively for translation purposes.
Another possible gap resides in the fact that, although CL and DH are indeed contiguous domains – both dealing with the investigation of digitalized texts – there is still little consensus as to their relation, interconnections and even possible overlaps (cf. Hockey 2004, Svensson 2010, DHM 2011, Burdick et al. 2012, among others).
This, arguably, has to do with the fact that they pertain to two different macro-domains: CL is a branch of Linguistics, based on ‘data and methodology, analysis, and interpretation […] referring to the use of corpora and computers as tools for analysis and to the use of the results as a basis for interpretation in the study of any aspect of language’ (Kirk 1996: 251), whereas DH is a much broader domain, nested in the Humanities, studying or, more generally, handling digitalized versions of knowledge products ‘in the arts, human and social sciences’ (DHM2 2011: 2) not just for quantifications and frequency counting (which is typical of CL), but also, for instance, for text editing, text mapping, transposition and other applications made possible by the digital format.
Secondly, the difficulty in defining how CL and DH simultaneously differ and relate stems from the fact that the term DH has been taken to mean different things over the last decades. Unlike CL, which is a much older language-based discipline and domain of practice in data retrieval and investigation through digital tools – established in the 1960s and steadily growing with the development of computer technologies (Kirk 1996, Owens 2011, McEnery/Hardie 2012) – what is meant by DH is not discrete and clear-cut: as a matter of fact, as the introductory chapter of this collection will detail, the label DH has been used to refer simultaneously to a field of investigation (i.e. what DH is), a domain of practice (i.e. what DH does) and a set of methodologies (i.e. how DH processes texts). Indeed, most of the literature on the subject was – and still largely is – focused on the definition of DH as a research domain and on outlining its most central concerns and debates (Gold 2012, Terras et al. 2013) – often with the purpose of neutralizing scepticism within the Humanities on the part of the academic establishment and more traditionally oriented scholars (who considered ‘toying’ with machines not proper and solid academic work,1 cf. Leech 1991, Fillmore 1992, Cohen 2010). Other scholars discussing DH focus instead on the practical usefulness of computational methods for a variety of purposes, ranging from pedagogy to text analysis and editing (Cohen/Scheinfeldt 2013, Santos 2019). Finally, for another group of DH scholars the main aim of research is to describe the range of applications and computational resources that can be applied to digitalized texts (Carter 2013). Although these different approaches are coherent in themselves and internally consistent, the drastically different critical angles through which they assess the domain (i.e. analytic, operative and technical) impede a cohesive framing of DH as a whole.
Another element of possible confusion as to how DH and CL relate comes from the quite restricted view that sees them as simply two parallel methodologies for carrying out research on different types of digitalized material. According to this view, for instance, CL appears to be a much more quantity-oriented set of methods, focused on identifying or attesting patterns of language use in extended but structured collections of texts (typically, non-literary ones), whereas DH would appear to be an essentially quality-oriented method meant to locate and identify specific linguistic, discursive or stylistic elements within a single text (typically, a literary one2). This necessarily brings about a set of methodological considerations. In DH, for instance, text digitalization is perceived as a tool for text- and data-mining performed on unstructured data and ‘raw’ material (i.e. text simply used in its electronic format, rather than material collected on the basis of some discursive or pragmatic similarity or family membership, cf. Ebensgaard Jensen 2014), thus being best suited for texts in the humanities (notably, literature, the arts, etc.). In contrast, CL works on structured collections of texts, where corpora are designed on the basis of specific criteria (ranging from text type to genre, domain, content, context, time and place of production, channel, users, length and other extra-linguistic and contextual parameters) so as to be representative of some specific language use and, at the same time, usable. As a matter of fact, in CL approaches, such structuring is organized on the basis of – or allows for – annotation (part-of-speech, syntactic, semantic, pragmatic, etc., cf. McEnery/Wilson 2001, Ide 2004, Baker et al. 2006), in order to make language material easily workable: not only can occurrences be systematically located and quantified, but frequencies can be readily connected to specific patterns, and such patterns associated with recognizable pragmatic functions.
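To make the notion of annotation concrete, the short sketch below shows what minimal part-of-speech tagging of raw text can look like in practice. It is an illustrative example only, assuming a Python environment with the NLTK library and its default English models; it does not reproduce the workflow of any specific tool discussed in this volume.

    # A minimal sketch of part-of-speech annotation using NLTK
    # (illustrative only; assumes NLTK is installed).
    import nltk

    # One-time downloads of the tokenizer and tagger models.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    raw_text = "The corpus was annotated before any frequencies were counted."

    # Tokenization turns the raw string into word tokens ...
    tokens = nltk.word_tokenize(raw_text)

    # ... and tagging attaches a part-of-speech label to each token,
    # turning 'raw' material into structured, searchable data.
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('corpus', 'NN'), ('was', 'VBD'), ...]

Once texts are annotated in this way, a query can target grammatical patterns (say, all verb–noun sequences) rather than bare strings, which is precisely what makes structured corpora ‘workable’ in the sense described above.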
In sum, as will be amply discussed in the introductory chapter, DH is neither just the digital processing of literary texts, nor can it be reduced to the exploitation of computer resources in human science investigations or in the arts; similarly, CL is not purely concerned with quantity-based research at the expense of quality-related analysis, nor is it entirely carried out by machines: it requires human intervention for fine-tuning searches and for fine-grained interpretation of results.
For the purposes of this volume, DH and CL are not parallel, quasi-synonymous or overlapping terms: they have distinctly different referents and are related through a meronymic, part-whole relationship.
DH, in fact, is the extended domain, intended as the ‘digital archiving of a body of human-made artefacts (i.e. the text and the usage-event therein) that are processed and interpreted via a plethora of digital methods’ (Ebensgaard Jensen 2014: 124). In other words, DH is the overarching term for the macro-area of research which analyses texts produced in the humanities and social sciences – either taken singly or collected in databases – by processing their electronic versions through an array of digital tools and for a variety of purposes, ranging from establishing frequencies and locating specific occurrences to finding internal or cross-references, quantifying distribution, and evidencing similarities and differences otherwise difficult to detect through manual text processing.
CL instead refers precisely to the ‘plethora of methods’ mentioned above, namely the set of principled approaches and tools based on or including criteria concerning corpus design (i.e. the selection and organization of texts), corpus collection (i.e. on the basis of text affinity, representativeness, (proto)typicality, family membership, etc., or simply in terms of corpus size and exhaustiveness, especially for extended archives and databases) and corpus annotation (on the basis of linguistic, textual or pragmatic functions identified in connection with specific occurrences, etc.). Such collections – whether structured and purpose-built, or extended and unstructured archives – then allow for data searches via both corpus-based approaches (i.e. from hypothesis to testing, meant ‘to expound, test or exemplify theories and descriptions’, Tognini-Bonelli 2001: 65) and corpus-driven ones (i.e. from quantification to hypotheses, cf. Tognini-Bonelli 2001), both to be carried out through annotation-oriented searches or via data-mining procedures, thanks to a variety of dedicated software and concordancing programs (such as WordSmith Tools, Sketch Engine, Wmatrix, AntConc). In other words, where DH is the territory, CL is the map to chart it and make it manageable; where DH is the field, CL is the trajectory along which to navigate it.
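By way of illustration, the core operation of such concordancing programs – the keyword-in-context (KWIC) search – can be sketched in a few lines of Python. This is a simplified, hypothetical re-implementation, not the actual code of any of the tools named above.

    # A simplified keyword-in-context (KWIC) search, the basic operation
    # of concordancers such as AntConc (hypothetical sketch only).
    def kwic(tokens, keyword, window=4):
        """Return every occurrence of `keyword` with `window` tokens of context."""
        lines = []
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>30} | {token} | {right}")
        return lines

    sample = "The translated text mirrors the source text almost word for word".split()
    for line in kwic(sample, "text"):
        print(line)

Aligning each hit in a central column, as real concordancers do, is what allows the analyst to scan left and right contexts at a glance and thus move from raw frequency to pattern.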
Digital humanities and corpora
The importance of corpora for linguistic analysis in all its related domains and subdomains (i.e. from pragmatics to applied linguistics, from typology- and genre-based studies to cognitive analysis, etc.) can hardly be overstated.
After a long research tradition in which the scholar’s intuition was the sole, or the most relevant, parameter for linguistic investigation and language description – especially before the development of computer science and technologies, but also in the decades between the 1960s and the late 1980s, when Generative Linguistics was the dominant approach to language analysis (with its focus on the native speaker’s intuition as the basis for understanding how natural language occurs) – corpora started to be considered invaluable resources owing to the quality, type and range of evidence they could provide, which could prove more solid, reliable and even quasi-empirical than mere hypotheses and abstractions (cf. Sinclair 1991). As a matter of fact, since ‘language is a human construct and its interpretation involves a degree of subjectivity, whatever the methodology employed’ (Gotti/Giannoni 2014: 10), resorting to corpora and searching large amounts of naturally occurring text in order to test theories may not only keep confirmation bias at bay, but may indeed produce unexpected results which falsify expectations, thus evidencing how ‘our intuition about the patterns of use is often inaccurate’ (Biber 2009: 190).
The awareness of the usefulness of textual evidence for substantiating possible abstractions is not restricted to modern research. In fact, as has been noted:
The study of language use through documentary evidence gleaned from variously large collections of authentic texts pre-dates by centuries the modern science of corpus linguistics. Since Samuel Johnson’s landmark Dictionary of the English Language (1755), lexicographers and the reading public have become aware that in language matters intuition is not enough, for the actual meaning/usage of words varies over time, from place to place and contextually. Driven by a similar interest, medieval scholars pioneered the first bible Concordances and similar concordances were compiled after the advent of print from the works of literary classics such as Chaucer, Shakespeare and Milton, to name but a few.
(Gotti/Giannoni 2014: 9)
This concern acquired relevance and became central to linguistic investigation from the 1950s onwards, when computational techniques started developing and became increasingly reliable and (relatively) easy to use (Sinclair et al. 1970), and especially when they began to be applied to text analysis in the late 1980s, making it possible, firstly, to gather and manage extended amounts of text and, secondly, to use a variety of software for text-processing purposes.
The availability of material that can be accessed both offline and online has made it possible to compile corpora of variable size – from relatively small ones (yet too large to be handled manually) to extended ones, to even larger Big Data archives – depending on the rationale behind their compilation. As briefly anticipated above, the factors that confer structure on corpora and guide their design may include the language of the sample texts included (i.e. national dialects, language used by native vs non-native speakers, etc.), the medium (i.e. written vs spoken), the format and type of the items collected (i.e. genre, register, text type, etc.), the domain (i.e. specialized vs non-specialized, hard sciences vs humanities, etc.) and the type of users (i.e. their gender, profession, role within the community, level of expertise, status and recognizability for the audience, etc.).
Moreover, corpus collection is always necessitated by a very specific purpose (Pearson 1998, Teubert/Cermakova 2004), that is, ‘in view of some kind of benefit’ (cf. Gotti/Giannoni 2014: 10), and this not only influences a corpus’s internal organization, but also determines the way it is going to be used.
On this basis, different types of corpus, and different approaches to their use, can be distinguished. The first distinction is between reference and disposable corpora. The former are extensive – even gigantic – collections of texts meant to provide a testimony to, and a representative sample of, language use either in a given time/context or across times/contexts (e.g. the British National Corpus (BNC), the Corpus of Contemporary American English (COCA), the Cobuild Bank of English (CBE)). Disposable corpora instead contain material collected by analysts in view of the type of research they are carrying out, and therefore on the basis of specific requisites texts need to meet in order to be included (Pearson 1998).
Strictly connected to this is the type of investigation to be carried out with such materials, which, as we have seen above, can be intended either to test hypotheses, substantiate theories and validate assumptions (corpus-based approaches, cf. Tognini-Bonelli 2001) or to observe and quantify patterns of naturally occurring language in order to find trends, commonalities or divergences in language use (corpus-driven approaches) – the former typically being the case with disposable purpose-built corpora, even though reference corpora (or specially selected sections of them) can also be mined for purpose-oriented searches.
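The difference between the two approaches can be reduced, for illustration, to two minimal operations on the same data. In the sketch below – deliberately simplified, and assuming a hypothetical plain-text corpus file named corpus.txt – the corpus-based step tests a pre-formulated hypothesis, while the corpus-driven step lets recurrent patterns emerge from frequency alone.

    from collections import Counter

    # `corpus.txt` is a hypothetical plain-text corpus file.
    tokens = open("corpus.txt", encoding="utf-8").read().lower().split()

    # Corpus-based: start from a hypothesis ("'shall' is rarer than
    # 'will' in this register") and test it against the data.
    counts = Counter(tokens)
    print("will:", counts["will"], "shall:", counts["shall"])

    # Corpus-driven: start from the data and let recurrent patterns
    # emerge, here by ranking the most frequent two-word sequences.
    bigrams = Counter(zip(tokens, tokens[1:]))
    for pair, freq in bigrams.most_common(10):
        print(" ".join(pair), freq)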
Finally, corpora can be investigated for either speculative or operative reasons, that is to say, either to see how language is used with the aim of furthering its understanding, or to find regularities that can be singled out and taught in practical, applied contexts (i.e. for the pedagogy of general language or of specialized discourses, for highlighting terminological preferences, stylistic trends, rhetorical choices, etc., Hyland et al. 2012).
Chapters in the first part of this volume will be devoted to showing how compiling and scanning purpose-built disposable corpora, or resorting to existing ones as reference, may help trace discoursal traits and prosodic or semantic preferences which indicate how meanings are usually codified within specific contexts, how expectations are anticipated and managed and, eventually, how given interpretations may be favoured over others.
Corpora and translation
The potential offered by corpora in terms of locating items, collocations, recurring patterns and typical uses of the language in naturally occurring texts (i.e. texts not crafted ad hoc by analysts) was eventually also appreciated by translation scholars, and even welcomed as a ‘new paradigm in translation studies’ (Laviosa 1998: 1).
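The kind of evidence at stake – recurrent collocations in naturally occurring text – can be extracted automatically. The sketch below uses NLTK’s collocation utilities over a hypothetical corpus file of translated texts; names such as translated_corpus.txt are placeholders, not resources discussed in this volume.

    # Ranking candidate collocations by pointwise mutual information
    # with NLTK (sketch only; the corpus file is hypothetical).
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    tokens = open("translated_corpus.txt", encoding="utf-8").read().lower().split()

    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(3)  # ignore pairs attested fewer than 3 times
    for w1, w2 in finder.nbest(BigramAssocMeasures.pmi, 10):
        print(w1, w2)

Comparing the collocations so obtained in translated texts against those of a reference corpus of non-translated language is one simple way of operationalizing the ‘mediated’ character of translation discussed below.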
One of the earliest applications of corpus-based methods to TS was hypothesized in the mid-1990s (Baker 1993, 1995), primarily as a way of pointing to ‘the nature of translated text as a mediated communicative event’ (Baker 1993: 243), that is, as a way of detecting – with the purpose of becoming aware of their frequencies and, then, of possibly neu...

Table of contents

  1. Cover
  2. Halftitle Page
  3. Title Page
  4. Contents
  5. List of Figures
  6. List of Tables
  7. List of Contributors
  8. Foreword
  9. 1 Corpus linguistics and translation tools for digital humanities: An introduction
  10. Part 1 Corpus linguistics for digital humanities: Research methods and applications
  11. Part 2 Translation for digital humanities: Research methods and applications
  12. Index
  13. Imprint