LMF Lexical Markup Framework
eBook - ePub

Gil Francopoulo
About This Book

The community responsible for developing lexicons for Natural Language Processing (NLP) and Machine Readable Dictionaries (MRDs) started their ISO standardization activities in 2003. These activities resulted in the ISO standard – Lexical Markup Framework (LMF).
After selecting and defining a common terminology, the LMF team had to identify the common notions shared by all lexicons in order to specify a common skeleton (called the core model) and understand the various requirements coming from different groups of users.
The goals of LMF are to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources.
The various types of individual instantiations of LMF can include monolingual, bilingual or multilingual lexical resources. The same specifications can be used for small and large lexicons, both simple and complex, as well as for both written and spoken lexical representations. The descriptions range from morphology, syntax and computational semantics to computer-assisted translation. The coverage is not restricted to European languages, but extends to all natural languages.
The LMF specification is now a success and numerous lexicon managers currently use LMF in different languages and contexts.
This book starts with the historical context of LMF, before providing an overview of the LMF model and the Data Category Registry, which provides a flexible means for applying constants like “grammatical gender” in a variety of different settings. It then presents concrete applications and experiments on real data, which are important for developers who want to learn about the use of LMF.
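To make this concrete, here is a minimal, hypothetical sketch of what a lexical entry can look like when the core model is serialized in XML. The nesting (LexicalResource, Lexicon, LexicalEntry, Lemma, Sense) reflects the core model presented in Chapter 2, and each feat element carries an attribute–value pair whose attribute name, such as “grammaticalGender”, is a constant of the kind registered in the Data Category Registry; the French entry itself is invented for illustration.

<LexicalResource>
  <GlobalInformation>
    <!-- declares how language codes are to be read -->
    <feat att="languageCoding" val="ISO 639-3"/>
  </GlobalInformation>
  <Lexicon>
    <feat att="language" val="fra"/>
    <LexicalEntry>
      <!-- "partOfSpeech" and "grammaticalGender" are Data Category
           Registry constants; the values are illustrative -->
      <feat att="partOfSpeech" val="noun"/>
      <feat att="grammaticalGender" val="feminine"/>
      <Lemma>
        <feat att="writtenForm" val="maison"/>
      </Lemma>
      <Sense>
        <feat att="definition" val="building for human habitation"/>
      </Sense>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>

Because the skeleton is fixed while the data categories are not, the same structure can describe other languages simply by selecting different constants, which is what allows one model to serve the monolingual, bilingual and multilingual cases mentioned above.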

Contents

1. LMF – Historical Context and Perspectives, Nicoletta Calzolari, Monica Monachini and Claudia Soria.
2. Model Description, Gil Francopoulo and Monte George.
3. LMF and the Data Category Registry: Principles and Application, Menzo Windhouwer and Sue Ellen Wright.
4. Wordnet-LMF: A Standard Representation for Multilingual Wordnets, Piek Vossen, Claudia Soria and Monica Monachini.
5. Prolmf: A Multilingual Dictionary of Proper Names and their Relations, Denis Maurel and Béatrice Bouchou-Markhoff.
6. LMF for Arabic, Aida Khemakhem, Bilel Gargouri, Kais Haddar and Abdelmajid Ben Hamadou.
7. LMF for a Selection of African Languages, Chantal Enguehard and Mathieu Mangeot.
8. LMF and its Implementation in Some Asian Languages, Takenobu Tokunaga, Sophia Y.M. Lee, Virach Sornlertlamvanich, Kiyoaki Shirai, Shu-Kai Hsieh and Chu-Ren Huang.
9. DUELME: Dutch Electronic Lexicon of Multiword Expressions, Jan Odijk.
10. UBY-LMF – Exploring the Boundaries of Language-Independent Lexicon Models, Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek and Christian M. Meyer.
11. Conversion of Lexicon-Grammar Tables to LMF: Application to French, Éric Laporte, Elsa Tolone and Matthieu Constant.
12. Collaborative Tools: From Wiktionary to LMF, for Synchronic and Diachronic Language Data, Thierry Declerck, Piroska Lendvai and Karlheinz Mörth.
13. LMF Experiments on Format Conversions for Resource Merging: Converters and Problems, Marta Villegas, Muntsa Padró and Núria Bel.
14. LMF as a Foundation for Servicized Lexical Resources, Yoshihiko Hayashi, Monica Monachini, Bora Savas, Claudia Soria and Nicoletta Calzolari.
15. Creating a Serialization of LMF: The Experience of the RELISH Project, Menzo Windhouwer, Justin Petro, Irina Nevskaya, Sebastian Drude, Helen Aristar-Dry and Jost Gippert.
16. Global Atlas: Proper Nouns, From Wikipedia to LMF, Gil Francopoulo, Frédéric Marcoul, David Causse and Grégory Piparo.
17. LMF in U.S. Government Language Resource Management, Monte George.

About the Authors

Gil Francopoulo works for Tagmatica (www.tagmatica.com), a Paris-based company specializing in software development for linguistics and semantic web documentation, as well as for Spotter (www.spotter.com), a company specializing in media and social media analytics.

Information

Publisher: Wiley-ISTE
Year: 2013
ISBN: 9781118712597

Chapter 1

LMF – Historical Context and Perspectives

1.1. Introduction

The value of agreeing on standards for lexical resources was first recognized in the 1980s, with the pioneering initiatives in the field of machine-readable dictionaries, and afterwards with the EC-sponsored projects ACQUILEX, MULTILEX and GENELEX. Later on, the importance of designing standards for language resources (LRs) was firmly established, starting with the Expert Advisory Group on Language Engineering Standards (EAGLES) and International Standards for Language Engineering (ISLE) initiatives. EAGLES drew inspiration from the results of previous major projects, set up the basic methodological principles for standardization and contributed to advancing the common understanding of harmonization issues. ISLE consolidated the uncontroversial basic notion of a lexical metamodel, that is, an abstract representation format for lexical entries, in the Multilingual ISLE Lexical Entry (MILE). MILE was a general schema for the encoding of multilingual lexical information, and was intended as a common representational layer for multilingual lexical resources. As such, all these initiatives contain the seeds of what later evolved into the Lexical Markup Framework (LMF). From a methodological point of view, MILE was based on a very extensive survey of common practices in lexical encoding, and was the result of cooperative work toward a consensual view, carried out by several groups of experts worldwide. Both EAGLES and ISLE stressed the importance of reaching a consensus on (linguistic and non-linguistic) “content”, in addition to agreement on formats and encoding issues, and also began to address the needs of content processing and Semantic Web technologies. The recommendations for standards and best practices issued within these projects were then brought, through the INTERA and, mainly, the LIRICS projects, to the International Organization for Standardization (ISO), within the ISO TC37/SC4 committee, where LMF was developed. Thanks to the results of these initiatives, which culminated in LMF, there is worldwide recognition that the EU is at the forefront in the areas of LRs and standards. LMF now testifies to the full maturity reached by the field of LRs.

1.2. The context

The 1990s saw a widespread acknowledgment of the crucial role played by LRs in language technology (LT). LRs started to be considered as having an infrastructural role, that is, as enabling components of Human Language Technologies (HLTs). HLTs (i.e. natural language processing tools, systems, applications and evaluations) depend on LRs, which also strongly influence their quality and indirectly generate value for producers and users.
This recognition was also shown through the financial support from the European Commission for projects aiming at designing and building different types of LRs. With the support of US agencies (NSF, DARPA, NSA, etc.) and the EC, LRs were unanimously indicated as themes of utmost priority.
One of the major tenets was the recognition of the essential infrastructural role that LRs play as the necessary common platform on which new technologies and applications must be based. To avoid massive and wasteful duplication of effort, public funding – at least partial – of LR development is critical to ensure public availability (although not necessarily at no cost). A prerequisite to such a publicly funded effort is careful consideration of the needs of the community, in particular the needs of industry. In a multilingual setting such as today’s global economy, the need for standardized wide-coverage LRs is even stronger. Another tenet is the recognition of the need for a global strategic vision, encompassing different types of LRs (and methodologies for building them), for an articulated and coherent development of this field.
The infrastructural role of LRs requires that they be (1) designed, built and validated together with potential users (hence the need to involve companies), (2) built by reusing available “partial” resources, (3) made available to the whole community and (4) harmonized with the resources of other languages (hence the importance of, and the reference to, international standards).
The major building blocks to set up an LR infrastructure are presented in [CAL 99]:
– LR reusability: directly related to the importance of “large-scale” LRs within the increasingly dominant data-driven approach;
– LR development;
– LR distribution.
Other dimensions were soon added as a necessary complement to achieve the required robustness and data coverage and to assess results obtained with current methodologies and techniques, that is:
– automatic acquisition of LRs or of linguistic information;
– use of LRs for evaluation campaigns.
Crucial to LR reusability and development was the definition of operational standards, and the value of agreeing on international standards was soon recognized as critical. Without standards underlying applications and resources, users of LT would have remained ill-served. The application areas would have continued to be severely hampered, and only niche or highly specialized applications would have seen success (e.g. speech aids for the disabled and spelling checkers). It would never have been possible to build on the results of past work, whether in terms of resources or the systems that used them.
The significance of standardization was thus recognized, in that it would open up the application field, allow an expansion of activities, sharing of expensive resources, reuse of components and rapid construction of integrated, robust, multilingual language processing environments for end-users.

1.3. The foundations: the Grosseto Workshop and the “X-Lex” projects

During the 1980s there was a dramatic growth in interest in the lexicon. The main reasons for this were, on the one hand, the theoretical developments in linguistics that placed increasing emphasis on the lexical component and, on the other hand, the awareness of the wealth of information in lexicons that could be exploited by automatic NLP systems. A turning point in the field was marked by the workshop “On automating the lexicon”, held at Marina di Grosseto (Italy) in 1986 [WAL 95], when a pool of actors in the field gathered to establish a baseline for the current state of research and issued a set of recommendations for the sector. The most relevant recommendation – as far as the future LMF is concerned – was the need for a metaformat for the representation of lexical entries, that is, an abstract model of a computerized lexicon able to accommodate different theories and linguistic models. The following years saw a flourishing of events around this new notion of a “meta-entry”, for instance the workshop on “The Lexical Entry” held in New York City immediately after Grosseto, and the meeting held in Pisa by the so-called Polytheoretical Group in 1987, where the possibilities of a neutral lexicon were explored [WAL 87].
This contributed to creating a favorable climate for converging toward the common goal of demonstrating the feasibility of large lexicons, which needed to be reusable, polytheoretical and multifunctional. This reflection led to the definition of the concept of reusability of lexical resources as (1) the possibility of reusing the wealth of information contained in machine-readable dictionaries, by converting their data for incorporation into a variety of different NLP modules; and (2) the feasibility of building large-scale lexical resources that can be reused in different theoretical frameworks, for different types of application and by different users [CAL 91].
The first sense of reusability was clearly addressed by the ACQUILEX project, funded by the European ESPRIT Basic Research Program [BOG 88]. The second sense inspired the Eurotra-7 (ET-7) project, which had the goal of providing a methodology and recommending steps toward the construction of sharable lexical resources [HEI 91].
The need for standards in the second sense of reusability was addressed by other initiatives, often publicly funded, such as the EUREKA industrial project GENELEX [GEN 94], which concentrated on a generic model for monolingual reusable lexicons [ANT 94], and the CEC ESPRIT project MULTILEX, whose objective was to devise a model for multilingual lexicons [KHA 93]. GENELEX, with its generic model, fulfilled the requirements of being “theory welcoming” and of having wide linguistic coverage. A standardized format was designed as a means of encoding information originating from different lexicographic theories, with the aim of making it possible to exchange lexical data and of allowing the development of a set of tools for a lexicographic workstation.
These “X-Lex” projects assessed the feasibility of some elementary standards for the description of lexical entries at different levels of linguistic description (phonetic, phonological, etc.) and laid the foundations for all the subsequent standardization initiatives.
It became evident that progress in NLP and speech applications was hampered by a lack of generic technologies and reusable LRs, by a proliferation of different information formats, by the variable linguistic specificity of existing information and by the high cost of developing resources. This had to change in order to make it possible to build on the results of past work, whether in terms of resources or the systems that use them.

1.4. EAGLES and ISLE

EAGLES, which started in 1993, was a direct descendant of the previous initiatives and represented the bridge between them and a number of subsequent projects funded by the EC [CAL 96]. EAGLES was set up to improve the situation of many lexical initiatives by bringing together representatives of major collaborative European R&D projects in relevant areas, to determine which aspects of the field were open to short-term de facto standardization and to encourage the development of such standards for the benefit of consumers and producers of LT. This work was conducted with a view to providing the foundation for any future recommendations for international standards that might be formulated under the aegis of ISO.
The aim of EAGLES was to support academic and industrial research and development in HLT by accelerating the provision of standards, common guidelines and best practice recommendations for:
– very large-scale LRs (such as text corpora, computational lexicons and speech and multimodal resources);
– means of manipulating such knowledge, via computational linguistic formalisms, mark-up languages and various software tools;
– means of assessing and evaluating resources, tools and products.
The structure of EAGLES resulted from recommendations made by leading industrial and academic centers and by the EC Language Engineering strategy committees. More than 30 research centers, industrial organizations, professional associations and networks across the EU contributed to the common effort, and more than 100 sites were involved in the various EAGLES groups or subgroups. In addition, reports from the EC Language Engineering strategy committees had strongly endorsed standardization efforts in language engineering.
Moreover, there was a recognition that standardization work is not only important, but is also a necessary component of any strategic program to create a coherent market, which demands sustained effort and investment. ISLE, a standard-oriented transatlantic initiative under the HLT program, started in 2000, was a continuation of the long-standing European EAGLES initiative [CAL 01, CAL 02].
It is important to note that the work of EAGLES/ISLE must be seen in a long-term perspective. This is especially true for any attempt aiming at standardization in terms of international standards. EAGLES did not and could not result in standards of such an impact: this is the preserve of the ISO. The basic idea behind EAGLES/ISLE work was for the group to act as a catalyst in order to pool concrete results coming from major international/national/industrial projects.

1.5. Setting up methodologies and principles for standards

From a retrospective point of view, it is important to note that EAGLES and its guidelines were the first attempt at defining standards that directly responded to commonly perceived needs, in order to overcome common problems. To offer workable, compromise solutions, these guidelines had to be based on a solid platform of accepted facts and acceptable practices.
Since the formation of EAGLES, work related to standards in the EU has largely been concentrated within this initiative. Related efforts elsewhere were closely linked with EAGLES and fed off it. The Lexicon and Corpus groups’ recommendations were soon applied in a large number of Eu...
