Chapter 1
Multilingual Text Analysis: History, Tasks, and Challenges
Natalia Vanetik∗ and Marina Litvak†
Text analytics (TA) is a very broad research area that deals with knowledge discovery in written text. Almost all techniques of machine learning, data mining, and information retrieval are applied to TA tasks, which include text categorization, summarization, question answering, and many more. Among the very large variety of TA methods, multilingual techniques hold a special place. To be deemed multilingual, a system or an algorithm must be able to handle texts in several languages equally well; a very good method should produce good results for languages from different language families. Multilingual techniques and algorithms must therefore apply analysis that is not tied to the linguistic structure of one specific language but instead relies on general statistical and mathematical properties common to many languages.
In this chapter we provide an overview of the field of multilingual text analysis, starting with a description of various TA tasks and the history of TA. We then survey TA challenges specific to the multilingual domain.
1. Introduction
Text analytics is a very broad research area. Its overarching goal is to discover and present knowledge (facts, rules, and relationships) that is otherwise hidden in textual content and unattainable by automated processing. Before analytical methods can be applied, text must be turned into structured data through natural language processing (NLP). Data mining techniques, including link and association analysis, visualization, and predictive analytics, can then be applied to the structured input to produce the requested output. Typical TA tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, question answering, slot filling, and entity relation modeling.
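The pipeline described above (structuring raw text with basic processing, then applying a mining technique to the structured representation) can be illustrated with a minimal sketch of text categorization. The toy corpus, the bag-of-words representation, and the nearest-centroid rule below are illustrative assumptions, not methods prescribed by this chapter:

```python
from collections import Counter
import math

# Toy labeled corpus (an assumption for illustration only).
train = [
    ("the match ended with a late goal", "sports"),
    ("the team won the championship game", "sports"),
    ("the election results were announced", "politics"),
    ("parliament passed the new law", "politics"),
]

def vectorize(text):
    # "Structuring" step: raw text becomes a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Mining" step: build one centroid vector per category by summing
# the vectors of its documents.
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(vectorize(text))

def categorize(text):
    # Assign the category whose centroid is most similar to the input.
    v = vectorize(text)
    return max(centroids, key=lambda c: cosine(v, centroids[c]))

print(categorize("a stunning goal won the game"))  # prints: sports
```

A real system would replace the toy corpus with a large collection, the whitespace tokenizer with proper NLP preprocessing, and the centroid rule with a trained classifier, but the two-stage structure (text to vectors, then analytics over vectors) is the same.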
A list of possible subtasks composing the TA process includes but is not limited to:
• Information retrieval (IR) as a preparatory step: collecting or identifying a set of textual materials for analysis; the set may comprise material found in any number of places, including the Web, a file system, a database, or a content corpus manager;
• Advanced statistical methods, such as computing word frequency distributions;
• Extensive NLP, such as part of speech (POS) tagging, syntactic parsing, and other types of linguistic analysis;
• Named entity recognition (NER) using gazetteers or statistical techniques to identify named text features such as people, organizations, places, or certain abbreviations;
• Disambiguation, which uses contextual clues to decide whether, for instance, “apple” refers to a fruit, a software company, a multimedia corporation, a movie, or some other entity;
• Recognition of pattern-identified entities: features such as telephone numbers, email addresses, or quantities (with units) can be discerned through regular expressions or other pattern matches;
• Coreference resolution involves finding all expressions that refer to the same entity in a text;
• Relationship, fact, concept, and event extraction involve the identification of associations among entities and other information in text;
• Sentiment analysis involves discerning subjective material and extracting various forms of attitudinal information, such as sentiment, opinion, mood, and emotion;
• Topic modeling enables the discovery of the abstract “topics” that occur in a collection of documents;
• Quantitative TA, the process of extracting semantic or grammatical relationships between words.
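Two of the subtasks above, word frequency distributions and recognition of pattern-identified entities, are simple enough to sketch directly. The sample text and the regular expressions below are illustrative assumptions, not patterns taken from any particular system:

```python
import re
from collections import Counter

TEXT = (
    "Contact support@example.com or call +1-555-0100. "
    "Apple released a new product; the apple fell far from the tree. "
    "Email sales@example.com for a quote."
)

# Word frequency distribution over lowercase alphabetic tokens.
tokens = re.findall(r"[a-z]+", TEXT.lower())
freq = Counter(tokens)

# Pattern-identified entities: email addresses and phone numbers are
# discerned by regular expressions rather than linguistic analysis.
emails = re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", TEXT)
phones = re.findall(r"\+?\d[\d-]{7,}\d", TEXT)

print(freq.most_common(3))
print(emails)   # prints: ['support@example.com', 'sales@example.com']
print(phones)   # prints: ['+1-555-0100']
```

Note that the frequency count treats the company “Apple” and the fruit “apple” as the same token; resolving that distinction is exactly the disambiguation subtask listed above, which requires contextual clues rather than surface patterns.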
This chapter introduces the main directions and challenges in TA, both in general and with respect to the multilingual domain. The next section summarizes the history of the TA area. Section 3 describes the primary TA subareas and tasks. Section 4 provides an overview of the challenges of applying TA in the multilingual domain. Section 5 briefly overviews the remaining chapters of the book.
2. TA evolution
The ability to understand the key content of a text has become extremely important in recent years, as more and more sources in different languages have become available on the net. New ideas for interesting, and even crucial, applications arise every day. Extracting the most critical facts and reducing information overload, mining opinions from social media and other domains, predicting important events, and detecting fraud and security threats are just a small sample of current “hot topics” in the TA area. Globalization dictates its own rules: increasingly, text sources are published in their original languages, which often differ from English, the international standard. Therefore, all proposed methodologies must meet an additional requirement: they must be able to process multiple languages.
The idea of using computers to analyze text and search for relevant pieces of information was raised for the first time in an article by Vannevar Bush in 1945.1 In the 1950s, this idea was followed by several works. One of the most influential was the 1957 work of Luhn,2 who proposed using words as indexing units for documents and measuring word overlap as a criterion for retrieval. A year later, Luhn published the first work on automated summarization,3 in which he proposed a statistical method for ranking sentences. Several key developments in the field happened in the 1960s. Most notable was the development of the SMART system by Gerard Salton and his students, first at Harvard University and later at Cornell University.4 The 1970s and 1980s saw many developments built on the advances of the 1960s. An example is the famous Vector Space Model proposed by Salton,5 which remains very powerful in multiple and diverse TA tasks. However, due to the lack of large available text collections, the question of whether the proposed models and techniques would scale to large corpora remained open. This changed in 1992 with the inception of the Text Retrieval Conference (TREC),a followed by the Document Understanding Conference (DUC)b in 2001, which was transformed into the Text Analysis Conference (TAC)c in 2008. Each of these is part of a series of evaluation conferences sponsored by various US government agencies under the auspices of the National Institute of Standards and Technology (NIST) and aimed at encouraging research in different areas of information retrieval (IR) from large text collections. These conferences have branched IR into related and important fields such as retrieval of spoken information, multilingual and cross-language retrieval, information filtering, summarization, information extraction, and automatic evaluation. This book describes multiple approaches to different TA tasks. The main focus is the multilinguality of those approaches, specifically their ability to be applied to multiple languages.
One of the most representative examples of joint international effort in the field of multilingual TA is the series of MultiLing conferences.6 The first MultiLing was organized in 2011 as a summarization track of TAC 2011.7 It gathered scientists from different countries with a joint purpose: to create the first large collection of documents in multiple languages, a collection that would permit scientists around the world to evaluate their summarization systems on different languages. The secondary goal was to encourage work on summarization systems that can be applied to multiple languages. For example, in order to participate in the MultiLing contest, a team was required to apply its system to at least two languages.
3. TA overview
In this section, we describe the main areas of TA, text preprocessing methods, and the process of evaluating TA tasks. A good overview of the main tasks in the field of TA is given in Ref. 8.
3.1. TA areas
Text analysis is roughly divided into several broad areas, as follows.
Text mining (TM) (first mentioned in Ref. 9) is the process of seeking or extracting useful information from textual data. It is an exciting research area, as it tries to discover knowledge from unstructured texts.10 What defines the scope of TM is its treatment of textual data through the application or adaptation of general knowledge discovery in databases11,12 techniques. To apply these techniques, a suitable knowledge discovery procedure is selected, modified to fit and handle text data, and applied to large amounts of text. In general, text data is assumed to be available as character-based data in a standard encoding, although in many cases ...