eBook - ePub

Corpus Linguistics for Translation and Contrastive Studies

Name: Corpus Linguistics for Translation and Contrastive Studies
Author: Mikhail Mikhailov, Robert Cooper

A guide for research

Mikhail Mikhailov, Robert Cooper

234 pages
English

Book information

Corpus Linguistics for Translation and Contrastive Studies provides a clear and practical introduction to using corpora in these fields. Giving special attention to parallel corpora, which are collections of texts in two or more languages, and demonstrating the potential benefits for multilingual corpus linguistics research to both translators and researchers, this book:

explores the different types of parallel corpora available, and shows how to use basic and advanced search procedures to analyse them;
explains how to compile a parallel corpus, and discusses their uses for translation purposes and to research linguistic phenomena across languages;
demonstrates the use of corpus extracts across a wide range of texts, including dictionaries, novels by authors including Jane Austen and Mikhail Bulgakov, and newspapers such as The Sunday Times;
is illustrated with case studies from a range of languages including Finnish, Russian, English and French.

Written by two experienced researchers and practitioners, Corpus Linguistics for Translation and Contrastive Studies is essential reading for postgraduate students and researchers working within the area of translation and contrastive studies.

Information

Publisher: Routledge
Year: 2016
ISBN: 9781317229384
Category: Philology, Linguistics

Chapter 1
Parallel text corpora

A general overview

Nowadays, most linguistic research is based on electronic data. Whether in the field of theoretical linguistic research or in the compilation of grammars and dictionaries, corpora have become a standard tool for studying the structure of different languages, their morphology, syntax and lexis. Indeed, electronic text corpora of all kinds – collections of whole texts, text samples, transcripts of recorded speech, etc – are becoming so common that research that does not use corpus data arouses suspicion. For many languages so-called ‘national corpora’ are being compiled. The trend was started with the British National Corpus, which in turn was followed by the National Corpus of Polish, the Czech National Corpus, the (Open) American National Corpus, the Russian National Corpus, etc.¹ Megacorpora and collections of megacorpora, such as COCA,² Sketch Engine³ and Aranea,⁴ include billions of running words collected by web crawlers from the internet. Indeed, for those who do not have access to suitable text corpora, or do not want to compile a corpus of their own, the internet itself can be used as a corpus. Thus, although the problem of corpus availability is still far from being resolved, monolingual corpus linguistics is progressing rapidly.

Research using multilingual corpora is less encouraging. Multilingual language resources are much more limited and more modest in size. This, in many ways, is rather surprising, because parallel corpora have so many potential uses and applications. The most obvious of these are in the field of translation. Parallel corpora are an invaluable aid to translators in their day-to-day work, and such corpora can obviously be used, therefore, in the training of translators. They are also important for studying the translation process itself: the strategies used by translators, the problem of ‘free’ vs. ‘literal’ translation, the question of style, etc. But parallel corpora are also crucial in more technical applications, especially in the field of machine translation – the development and testing of automatic translation programs. Another major area where parallel corpora are needed is the more theoretical discipline known as contrastive linguistics. This explores the morphological, syntactical and lexical similarities/differences between languages, with a view to compiling contrastive grammars and dictionaries. It is also concerned with the study of language universals, those features which different languages have in common. By extension, the results of contrastive research using parallel corpora will have a bearing on the methods and course materials used in language teaching. Indeed, parallel corpora can even be used in the classroom, both by teachers and the language learners themselves.

Why, then, has the development of parallel corpora lagged behind that of monolingual corpora? The reason, quite simply, is that it is far easier to obtain a large number of texts in one language than to find texts with corresponding versions in several different languages. There is also the problem of text alignment, i.e. linking corresponding sentences in the different languages (see section 2.3 below). Compiling parallel corpora, therefore, is a time-consuming undertaking and this explains why their development has not kept pace with that of monolingual corpora (see also Salkie 2008).

As was mentioned above, multilingual data is needed when writing in a foreign language or when translating. It may be necessary to check terminology, find suitable idiomatic phrasing, locate the standard (or different existing) translations of a well-known quotation, or find out what a quotation was in the original. However, most existing parallel and comparable corpora cannot be used for these purposes because of insufficient size, or because they are compiled from samples, not from whole texts. In theory, many of these tasks can be carried out with conventional internet searches (by using Google or other commercial search engines) or by consulting multilingual resources like Wikipedia, but multilingual internet searches of this kind clearly require much more ingenuity on the part of the user than when searching in one language only.

Similarly, when used in academic research, in the study of the structures of two or more languages, or in the compilation of bilingual dictionaries, parallel corpora need to be large enough to provide the researcher with enough data to draw reliable conclusions. But they must also include a wide variety of text types, to ensure that the languages being studied are covered adequately. Finding such texts in two or more languages is far more difficult than when working with a single language.

Considerations such as these all explain why parallel corpora are far less common than monolingual corpora, and also why the benefits of parallel corpora have not been fully recognized. It is our aim in the present book to help remedy this by presenting the reader with a comprehensive overview of multilingual corpora and thereby reveal their great potential.

1.1 Different types of text corpora

Corpora can be classified according to many different parameters. Some of these are relevant to any corpus, whether multilingual or monolingual, while some only apply to certain types of corpus. In this section we present some of the most important features of text corpora, but especially those that are relevant for multilingual corpora.

1.1.1 Important features of text corpora

Text corpora can consist of extracts or of whole texts. The very first text corpora, the best-known being the Brown University Standard Corpus of Present-Day American English, were of limited size. The Brown Corpus consisted of only 1 million words, and was made up of text extracts or samples, the length of each sample being about 2,000 words (Francis 1992). This was the only reasonable solution in the case of a small corpus (a million words is not a lot today, of course!). Nowadays, many corpora consist of whole texts. Whole-text corpora are faster to compile and they can be used for research both in linguistics and in literary and cultural studies. Their weakness is the possible problem of representativeness and statistical reliability; if a whole-text corpus is relatively small, it will not give a good cross-section of the language generally. A possible workaround is to compile a samples corpus but with longer text extracts, as in the case of the English-Norwegian Parallel Corpus (ENPC), which has a sample size of 10,000–15,000 running words (Johansson 2002).

However, a small corpus can easily be somewhat artificial, because the texts or extracts that are included will depend on the choices of the compilers. When compiling a small corpus of a million or so running words, therefore, it is important to use texts of approximately the same size, whether whole texts or samples, and to ensure that they come from a variety of sources; otherwise the corpus will easily become biased in one direction or another. With a corpus of several hundred million running words, on the other hand, the irregularities that might be caused by size and choice of texts become insignificant: unusual words and structures will only occur rarely, specialist terms will have low frequency, and the stylistic peculiarities of a particular writer will not be misinterpreted as being typical.

To make searches more effective, corpus texts are often marked up, or annotated, i.e. abstract features of words and sentences are marked with special tags. The most common kind of markup is lemmatization, i.e. annotation that indicates the base form of each word (TAKE for the forms take, takes, took, taken). Lemmatization is usually combined with part of speech tagging (NOUN, ADJECTIVE, VERB, etc), and for highly inflected languages it is also desirable to include morphological information as well (ACCUSATIVE, GENITIVE; CONDITIONAL, PERFECTIVE, etc). Corpora with syntactic markup (SUBJECT, OBJECT, ADVERBIAL), which are sometimes called ‘treebanks’, are less common, and semantic markup (ARTEFACT, COLOUR, PLACE-NAME, etc) has so far only been introduced in a few corpora on an experimental basis.
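
As a rough illustration of what such markup makes possible, the following minimal Python sketch stores a lemma, a part-of-speech tag and optional morphological information alongside each word form. The Token class, the find_by_lemma helper, the toy sentence and the tag names PRONOUN, DETERMINER, PAST and PLURAL are assumptions made for this example only; real corpora use their own tagsets and storage formats.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One annotated running word (hypothetical structure for illustration)."""
    form: str                    # the word as it appears in the text
    lemma: str                   # base form, supplied by lemmatization
    pos: str                     # part-of-speech tag
    morph: Optional[str] = None  # optional morphological information


# Hand-annotated toy sentence: "She took the books"
sentence: List[Token] = [
    Token("She", "she", "PRONOUN"),
    Token("took", "take", "VERB", "PAST"),
    Token("the", "the", "DETERMINER"),
    Token("books", "book", "NOUN", "PLURAL"),
]


def find_by_lemma(tokens: List[Token], lemma: str) -> List[Token]:
    """Return every token whose base form is `lemma`, whatever its surface form."""
    return [t for t in tokens if t.lemma == lemma]


# A search for the lemma "take" retrieves the inflected form "took".
print([t.form for t in find_by_lemma(sentence, "take")])
```

A plain string search for "take" would miss "took" entirely, which is the practical gain that lemmatization offers.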

Many corpora, especially in the early phases of their development, consist of collections of unannotated texts. However, corpora without any annotation may sometimes be limited in their usefulness. The absence of annotation does not produce serious problems when searching for basic examples of language usage, although even there, searches are limited to simple string matching. If a corpus is lemmatized, on the other hand, it becomes easier to produce frequency lists, and with a morphologically annotated corpus, it is possible to compile statistics on the use and occurrence of different grammatical forms.
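
The difference can be sketched in a few lines of Python. The token list below is invented toy data, and the tag names (PRESENT, PAST, PARTICIPLE, SINGULAR, PLURAL) are assumptions for the example rather than any particular corpus tagset. Counting surface forms scatters the occurrences of one word across several entries, whereas counting lemmas groups them, and counting tags gives simple statistics on grammatical forms.

```python
from collections import Counter

# (form, lemma, morphological tag) triples: hand-made toy data
tokens = [
    ("take", "take", "PRESENT"),
    ("takes", "take", "PRESENT"),
    ("took", "take", "PAST"),
    ("taken", "take", "PARTICIPLE"),
    ("took", "take", "PAST"),
    ("book", "book", "SINGULAR"),
    ("books", "book", "PLURAL"),
]

# Without annotation only surface strings can be counted.
form_freq = Counter(form for form, _, _ in tokens)

# With lemmatization all inflected forms are counted under one base form.
lemma_freq = Counter(lemma for _, lemma, _ in tokens)

# With morphological annotation grammatical forms can be counted too.
tag_freq = Counter(tag for _, _, tag in tokens)

print(form_freq["took"])         # occurrences of the string "took"
print(lemma_freq.most_common())  # lemma frequency list, most frequent first
print(tag_freq["PAST"])          # number of past-tense forms
```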

Nowadays, most types of annotation are performed automatically, but the results require manual checking, even when sophisticated context-sensitive software is used. With very large corpora, however, manual checking is impossible, and so researchers have to be content with automated annotation, even if there is the possibility of errors. Still, this is better than no annotation at all.

Sometimes, however, there is a need for large collections of unannotated raw data, e.g. for testing software for machine translation (MT). Researchers in the field of information technology and computer science work with huge raw text archives. These researchers hold regular conferences on text processing, e.g. CLEF in Europe, TREC in the USA, ROMIP in Russia, etc.⁵

1.1.2 Text archives and text corpora

Sometimes texts are collected for regular use as a source of information. News agencies, newspapers and magazines assemble huge archives of their published material, which can be later accessed online by the general public. Similarly, government departments, banks, universities and other institutions have archives of publicly available documents, reports, regulations and the like. These are typically produced in one language only, but legislative and judicial documents are sometimes available in several languages (e.g. documents of the United Nations on the UN website, EU legislation at Eur-Lex, etc). There are even newspapers which are published online in two or more languages, not to mention the day-to-day reports of international news agencies like Reuters. Text archives of this kind are a valuable source of multilingual language data, but they are of limited use in linguistic research. This is because the corresponding texts are all stored separately. To access any given text in two or more different language versions it would be necessary to search first one version, then the other, and then align the corresponding segments (paragraphs, sentences). This would clearly be extremely tedious.

Text archives, whether monolingual or multilingual, are designed to help retrieve information. They are not designed for studying languages or for doing language research. Text corpora, on the other hand, are created to enable linguists to study particular linguistic phenomena. They have search engines that are designed specifically to find such phenomena. Text corpora are typically monolingual, but with a multilingual parallel corpus, researchers have ready access to linguistic data in two or more languages. This is because the texts in the corpus are aligned, i.e. the corresponding segments (paragraphs or sentences) of the texts in different languages are linked together and output simultaneously. Such corpora are of little use to a person who requires information, but are invaluable when investigating linguistic phenomena, and in particular, the similarities and differences between different languages.
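
The idea of aligned segments being linked together and output simultaneously can be sketched as follows. This is a minimal illustration, not a description of any existing corpus tool: the English-French sentence pairs are invented, and the parallel_search function is a hypothetical helper that performs simple whole-word matching on the source side.

```python
from typing import List, Tuple

# Each pair links a source segment to its aligned counterpart (invented sample).
aligned: List[Tuple[str, str]] = [
    ("The book is on the table.", "Le livre est sur la table."),
    ("I read the newspaper every day.", "Je lis le journal tous les jours."),
    ("She bought a new book.", "Elle a acheté un nouveau livre."),
]


def parallel_search(pairs: List[Tuple[str, str]], query: str) -> List[Tuple[str, str]]:
    """Return aligned pairs whose source segment contains `query` as a whole word."""
    q = query.lower()
    hits = []
    for source, target in pairs:
        # Crude tokenization: split on whitespace and strip common punctuation.
        words = [w.strip(".,;:!?").lower() for w in source.split()]
        if q in words:
            hits.append((source, target))
    return hits


# Each hit is shown together with its counterpart in the other language.
for source, target in parallel_search(aligned, "book"):
    print(source, "|", target)
```

Because the link between segments is stored in the corpus itself, a single query returns both language versions at once, which is exactly what a text archive with separately stored versions cannot do.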

1.1.3 Monolingual vs. bilingual vs. multilingual corpora

As has already been mentioned, most corpora are monolingual. These also include comparable corpora of different varieties of the same language, e.g. the International Corpus of English (ICE).⁶ As regards parallel text corpora, the commonest type includes only two languages, but there do exist parallel corpora with several languages. However, because it is often difficult to find corresponding texts for a corpus consisting of many different languages, compiling such a corpus can be time-consuming and costly. Inevitably, therefore, multilingual corpora will always be smaller and less comprehensive than bilingual corpora. Nonetheless, in some kinds of research (e.g. studies in language typology) multilingual text collections, however small, can be very useful.

Multilingual data can consist of original texts (i.e. texts originally written in a given language), and/or translations from other languages. The possible combinations are as follows:

(a) original texts in language A vs. (different) authentic texts in language B
(b) original texts in language A vs. their translations in language B
(c) original texts in langua...