Machine Translation and Transliteration involving Related, Low-resource Languages

  1. 208 pages
  2. English

About this book

Machine Translation and Transliteration involving Related, Low-resource Languages discusses an important aspect of natural language processing that has received less attention: translation and transliteration involving related languages in a low-resource setting. This is a very relevant real-world scenario for people living in neighbouring states/provinces/countries who speak similar languages and need to communicate with each other, but for whom training data to build supporting MT systems is limited. The book discusses different characteristics of related languages with rich examples and draws connections between two problems: translation for related languages and transliteration. It shows how linguistic similarities can be utilized to learn MT systems for related languages with limited data. It comprehensively discusses the use of subword-level models and multilinguality to utilize these linguistic similarities. The second part of the book explores methods for machine transliteration involving related languages based on multilingual and unsupervised approaches. Through extensive experiments over a wide variety of languages, the efficacy of these methods is established.

Features

  • Novel methods for machine translation and transliteration between related languages, supported with experiments on a wide variety of languages.
  • An overview of past literature on machine translation for related languages.
  • A case study on machine translation between 10 major related languages of India, which is one of the most linguistically diverse countries in the world.

The book presents important concepts and methods for machine translation involving related languages. In general, it serves as a good reference to NLP for related languages. It is intended for students, researchers and professionals interested in Machine Translation, Translation Studies, Multilingual Computing Machine and Natural Language Processing. It can be used as reference reading for courses in NLP and machine translation.

Anoop Kunchukuttan is a Senior Applied Researcher at Microsoft India. His research spans various areas of multilingual and low-resource NLP. Pushpak Bhattacharyya is a Professor at the Department of Computer Science, IIT Bombay. His research areas are Natural Language Processing, Machine Learning and AI (NLP-ML-AI). Prof. Bhattacharyya has published more than 350 research papers in various areas of NLP.

Chapter 1

Introduction

Language is one of the most remarkable and uniquely human abilities. Natural language is a complex and versatile tool for communication which differentiates Homo sapiens from other species. The manipulation of symbols allows representation of objects, emotions, motivations, abstract thought, etc. Other species communicate in less complex ways. It is widely believed by linguists that full language capacity had evolved by 100,000 B.C.E. [1]. Language is undoubtedly one of the major factors in the rapid progress of humans since it allows faster dissemination of information and knowledge than evolution and genetic mutation would permit. It enables diverse modes of social organization, allowing humans to co-operate to achieve complex goals. As human societies grew and became ever more complex, humans invented writing around the 4th millennium B.C.E. to maintain and organize information. Writing enabled representation and recording of language, making long-distance and long-term dissemination of information and thought possible. It also enabled analysis and manipulation of large amounts of knowledge and information.
Ironically, language can also act as a barrier to communication. Language is not static; it is a system of shared conventions that changes over time. Modern humans first spread all over Africa by around 150,000 B.C.E. and then stepped out of Africa around 70,000 years ago. As humans spread out over the entire planet, different communities became separated and language evolved independently in each of these communities. This led to the creation of multiple languages, resulting in the great diversity we see in human languages. A vast majority of these languages are not mutually intelligible. Languages also became repositories of culture, heritage and identity, something that persists to this day. A variety of writing systems also evolved, making it non-trivial to understand knowledge recorded in multiple writing systems.
We know that, at least in recorded history, different cultures have felt the need to develop mechanisms to overcome language barriers and establish lines of communication. Thus arose the need for translation of languages and transliteration of written text from one script to another. One definition of translation is [2]:
Definition 1.0.1. Translation is the communication of the meaning of a source-language text by means of an equivalent target-language text.
[1] https://blog.linguistlist.org/ll-main/ask-a-linguist-how-old-is-language
[2] https://en.wikipedia.org/wiki/Translation
Traditionally, translation was achieved using human translators who would master more than one language. In addition, most cross-lingual communication was mediated through a few languages which served as lingua francas [3], e.g., Latin and Greek in ancient Europe, Sanskrit in ancient India, Arabic in the Middle East and English at a global level in modern times. Thus, translation played a major role throughout history in connecting peoples and cultures. Given the expertise required to master the nuances of multiple languages, the benefits of translation would have been limited to the section of society that needed these services for the conduct of their professions. Manual translation is not scalable.

1.1 Need for Machine Translation and Transliteration

The modern age heralded the industrial and digital revolutions which transformed means of transportation and communication. We can travel to different parts of the globe in a short time. Information and knowledge from across the globe are available at our fingertips, especially with the advent of the Internet. We can communicate with people across the globe instantaneously. Hence, we have seen an explosion in our communications for administrative, business and cultural purposes. The world is highly interconnected and our interactions have global manifestations and implications.
While advances in transportation and communication technologies have reduced the physical barriers to communication, barriers due to linguistic divergences pose greater challenges. Given the degree of interconnectedness and the resulting human communication needs, manual translation can no longer scale to satisfy these requirements. Hence we need methods to automate the translation of natural language, i.e., machine translation (MT).
Different paradigms of machine translation have been proposed in the 60 years or so since investigations into machine translation began. In the earlier days, rule-based machine translation (Hutchins and Somers, 1992) was the dominant paradigm. This paradigm relied on experts writing intricate and exhaustive rules based on a deep understanding of language structure and language divergence. With the increased availability of translated data, viz. parallel corpora, empirical approaches to translation (e.g., Statistical Machine Translation (SMT) (Koehn et al., 2003) and Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015)), which try to automate the discovery of translation patterns (word translations, phrase translations, translation rules, etc.), were extensively explored. These empirical, data-oriented methods are the state-of-the-art methods and most research in translation has gravitated to such methods. The principal drivers of this shift are: (i) less dependence on expensive and scarce linguistic expertise, (ii) robustness to the diversity of language phenomena and noisy input, (iii) ease of maintenance and (iv) rapid prototyping and development.
[3] A language that is adopted as a common language between speakers whose native languages are different (Oxford Dictionary).
Given this evolutionary trend and its advantages, the work described in this monograph is based on these empirical methods, but draws upon linguistic knowledge to make learning more resource-efficient.
Sometimes, we need to convert text in one script to another script, i.e., machine transliteration. Transliteration is particularly required for reading proper names written in one script in another script. It is a useful component of machine translation, cross-lingual information retrieval and similar multilingual tasks. Li et al. (2009) define transliteration as:
Definition 1.1.1. Transliteration is the conversion of a given name in the source language (a text string in the source writing system or orthography) to a name in the target language (another text string in the target writing system or orthography), such that the target language name is: (i) phonemically equivalent to the source name, (ii) conforms to the phonology of the target language and (iii) matches the user intuition of the equivalent of the source language name in the target language.
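To make the idea of converting a string from one writing system to another concrete, here is a minimal sketch in Python. It is not the method developed in this book. It only illustrates script conversion between two related Indic scripts, relying on the fact that the Unicode blocks for the major Indic scripts share a common layout, so corresponding letters sit at the same offset within each block. The function name and constants are hypothetical, chosen for illustration. A pure script conversion like this addresses only the first criterion of Definition 1.1.1 approximately; it makes no attempt to adapt the name to the phonology of the target language or to user intuition.

```python
import unicodedata

# Start of the Unicode blocks for the two scripts (each block is 0x80 wide).
DEVANAGARI_START = 0x0900
GUJARATI_START = 0x0A80
BLOCK_SIZE = 0x80


def devanagari_to_gujarati(text: str) -> str:
    """Toy script conversion: shift each Devanagari character to the same
    offset in the Gujarati block. Illustrative only, not a full
    transliteration system."""
    converted = []
    for ch in text:
        offset = ord(ch) - DEVANAGARI_START
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(GUJARATI_START + offset)
            # Some offsets have no assigned Gujarati character (for example
            # the danda punctuation mark); keep the original in that case.
            if unicodedata.name(candidate, None) is not None:
                converted.append(candidate)
                continue
        converted.append(ch)
    return "".join(converted)


# The Devanagari spelling of "Bharat", given as explicit code points.
print(devanagari_to_gujarati("\u092d\u093e\u0930\u0924"))
```

Characters outside the Devanagari block (Latin letters, digits, spaces) pass through unchanged. Unrelated script pairs have no such shared layout, so they call for learned character-level or subword-level models instead.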

1.2 Need for Machine Translation involving Related Languages

From a practical standpoint, the demand for translation services is not uniform across all language pairs. There is little demand for translation among many language pairs. For instance, there is little interest in translation between Hindi [4] and Hausa [5]. Building good translation systems needs investment in creation/collection of parallel corpora and other linguistic resources as well as linguistic and machine learning expertise. It would, therefore, be prudent to focus on languages which need translation services.
One major use-case for translation services arises among people living in contiguous areas, speaking related languages. These people have cultural and economic ties and communicate heavily amongst themselves for administrative, business and social needs. For instance, the European Union (EU) is home to around 700 million [6] people speaking a wide variety of Indo-European languages with deep economic ties and substantial political and cultural interactions. Another example is the Indian subcontinent, whose 1.5 billion [7] people predominantly speak various Indo-Aryan and Dravidian languages. Hence, translation services for these languages are an important requirement. Two translation scenarios involving related languages are important:
[4] Hindi is an Indo-Aryan language spoken primarily in North India.
[5] Hausa is an Afro-Asiatic language spoken primarily in Niger and Nigeria.
[6] https://en.wikiped...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Contents
  6. Preface
  7. List of Figures
  8. List of Tables
  9. 1 Introduction
  10. 2 Past Work on MT for Related Languages
  11. I Machine Translation
  12. II Machine Transliteration
  13. Appendices
  14. Bibliography
  15. Index
