Chapter 1
LMF – Historical Context and Perspectives
1.1. Introduction
The value of agreeing on standards for lexical resources was first recognized in the 1980s, with the pioneering initiatives in the field of machine-readable dictionaries, and afterwards with the EC-sponsored projects ACQUILEX, MULTILEX and GENELEX. Later on, the importance of designing standards for language resources (LRs) was firmly established, starting with the Expert Advisory Group on Language Engineering Standards (EAGLES) and International Standards for Language Engineering (ISLE) initiatives. EAGLES drew inspiration from the results of previous major projects, set up the basic methodological principles for standardization and contributed to advancing the common understanding of harmonization issues. ISLE consolidated the uncontroversial basic notion of a lexical metamodel, that is an abstract representation format for lexical entries, the Multilingual ISLE Lexical Entry (MILE). MILE was a general schema for the encoding of multilingual lexical information, and was intended as a common representational layer for multilingual lexical resources. As such, all these initiatives contain the seeds of what later evolved into the Lexical Markup Framework (LMF). From a methodological point of view, MILE was based on a very extensive survey of common practices in lexical encoding, and was the result of cooperative work toward a consensual view, carried out by several groups of experts worldwide. Both EAGLES and ISLE stressed the importance of reaching a consensus on (linguistic and non-linguistic) "content", in addition to agreement on formats and encoding issues, and also began to address the needs of content processing and Semantic Web technologies. The recommendations for standards and best practices issued within these projects were then brought, through the INTERA and mainly the LIRICS projects, to the International Organization for Standardization (ISO), within the ISO TC37/SC4 committee, where LMF was developed.
Thanks to the results of these initiatives that culminated in LMF, there is worldwide recognition that the EU is at the forefront in the areas of LRs and standards. LMF now testifies to the full maturity reached by the field of LRs.
1.2. The context
The 1990s saw a widespread acknowledgment of the crucial role played by LRs in language technology (LT). LRs started to be considered as having an infrastructural role, that is, as an enabling component of Human Language Technologies (HLTs). HLTs (i.e. natural language processing tools, systems, applications and evaluations) depend on LRs, which also strongly influence their quality and indirectly generate value for producers and users.
This recognition was also shown through the financial support from the European Commission to projects aiming at designing and building different types of LRs. Under the support of US agencies (NSF, DARPA, NSA, etc.) and the EC, LRs were unanimously indicated as themes of utmost priority.
One of the major tenets was the recognition of the essential infrastructural role that LRs play as the necessary common platform on which new technologies and applications must be based. To avoid massive and wasteful duplication of effort, public funding of LR development, at least in part, is critical to ensure public availability (although not necessarily at no cost). A prerequisite to such a publicly funded effort is careful consideration of the needs of the community, in particular the needs of industry. In a multilingual setting such as today's global economy, the need for standardized wide-coverage LRs is even stronger. Another tenet is the recognition of the need for a global strategic vision, encompassing different types of (and methodologies of building) LRs, for an articulated and coherent development of this field.
The infrastructural role of LRs requires that they are (1) designed, built and validated together with potential users (therefore, the need for involving companies), (2) built reusing available "partial" resources, (3) made available to the whole community and (4) harmonized with the resources of other languages (therefore, the importance and the reference to international standards).
The major building blocks to set up an LR infrastructure are presented in [CAL 99]:
– LR reusability: directly related to the importance of "large-scale" LRs within the increasingly dominant data-driven approach;
– LR development;
– LR distribution.
Other dimensions were soon added as a necessary complement to achieve the required robustness and data coverage and to assess results obtained with current methodologies and techniques, that is:
– automatic acquisition of LRs or of linguistic information;
– use of LRs for evaluation campaigns.
Crucial to LR reusability and development was the definition of operational standards, and the value of agreeing on international standards was also soon recognized as critical. Without standards underlying applications and resources, users of LT would have remained ill-served. The application areas would have continued to be severely hampered and only niche or highly specialized applications would have seen success (e.g. speech aids for the disabled and spelling checkers). In general, it would never have been possible to build on the results of past work, whether in terms of resources or the systems that used them.
The significance of standardization was thus recognized, in that it would open up the application field, allow an expansion of activities, sharing of expensive resources, reuse of components and rapid construction of integrated, robust, multilingual language processing environments for end-users.
1.3. The foundations: the Grosseto Workshop and the "X-Lex" projects
During the 1980s there was a dramatic growth in interest in the lexicon. The main reasons for this were, on the one hand, the theoretical developments in linguistics that placed increasing emphasis on the lexical component, and on the other hand the awareness of the wealth of information in lexicons that could be exploited by automatic NLP systems. A turning point in the field was marked by the workshop "On automating the lexicon" held at Marina di Grosseto (Italy) in 1986 [WAL 95], when a pool of actors in the field gathered to establish a baseline for the current state of research and issued a set of recommendations for the sector. The most relevant recommendation – as far as the future LMF is concerned – was the need for a metaformat for the representation of lexical entries, that is an abstract model of a computerized lexicon enabling the accommodation of different theories and linguistic models. The following years saw a flourishing of events around this new notion of a "meta-entry", for instance the workshop on "The Lexical Entry", held in New York City immediately after Grosseto, and the meeting held in Pisa by the so-called Polytheoretical Group in 1987, where the possibilities of a neutral lexicon were explored [WAL 87].
This contributed to the creation of a favorable climate for converging toward the common goal of demonstrating the feasibility of large lexicons, which needed to be reusable, polytheoretical and multifunctional. This reflection led to the definition of the concept of reusability of lexical resources as (1) the possibility of reusing the wealth of information contained in machine-readable dictionaries, by converting their data for incorporation into a variety of different NLP modules; (2) the feasibility of building large-scale lexical resources that can be reused in different theoretical frameworks, for different types of application, and by different users [CAL 91].
The first sense of reusability was clearly addressed by the ACQUILEX project, funded by the European ESPRIT Basic Research Program [BOG 88]. The second sense inspired the Eurotra-7 (ET-7) project, which had the goal of providing a methodology and recommending steps toward the construction of sharable lexical resources [HEI 91].
The need for standards in the second sense of reusability was addressed by other initiatives, often publicly funded, such as the EUREKA industrial project GENELEX [GEN 94], which concentrated on a generic model for monolingual reusable lexicons [ANT 94], and the CEC ESPRIT project MULTILEX, whose objective was to devise a model for multilingual lexicons [KHA 93]. GENELEX, with its generic model, fulfilled the requirements of being "theory welcoming" and having a wide linguistic coverage. A standardized format was designed as a means for encoding information originating from different lexicographic theories, with the aim of making it possible to exchange lexical data and to allow the development of a set of tools for a lexicographic workstation.
These "X-Lex" projects assessed the feasibility of some elementary standards for the description of lexical entries at different levels of linguistic description (phonetic, phonological, etc.) and laid the foundations for all the subsequent standardization initiatives.
It became evident that progress in NLP and speech applications was hampered by a lack of generic technologies and reusable LRs, by a proliferation of different information formats, by the variable linguistic specificity of existing information and by the high cost of development of resources. This had to be changed to be able to build on the results of past work, whether in terms of resources or the systems that used them.
1.4. EAGLES and ISLE
EAGLES, which started in 1993, was a direct descendant of the previous initiatives, and represented the bridge between them and a number of subsequent projects funded by the EC [CAL 96]. EAGLES was set up to improve the situation of many lexical initiatives by bringing together representatives of major collaborative European R&D projects in relevant areas, to determine which aspects of the field were open to short-term de facto standardization and to encourage the development of such standards for the benefit of consumers and producers of LT. This work was conducted with a view to providing the foundation for any future recommendations for International Standards that might be formulated under the aegis of ISO.
The aim of EAGLES was to support academic and industrial research and development in HLT by accelerating the provision of standards, common guidelines and best practice recommendations for:
– very large-scale LRs (such as text corpora, computational lexicons and speech and multimodal resources);
– means of manipulating such knowledge, via computational linguistic formalisms, mark-up languages and various software tools;
– means of assessing and evaluating resources, tools and products.
The structure of EAGLES resulted from recommendations made by leading industrial and academic centers, and by the EC Language Engineering strategy committees. More than 30 research centers, industrial organizations, professional associations and networks across the EU provided labor toward the common effort, and more than 100 sites were involved in different EAGLES groups or subgroups. In addition, reports from EC Language Engineering strategy committees had strongly endorsed standardization efforts in language engineering.
Moreover, there was a recognition that standardization work is not only important, but is also a necessary component of any strategic program to create a coherent market, which demands sustained effort and investment. ISLE, a standards-oriented transatlantic initiative under the HLT program that started in 2000, was a continuation of the long-standing European EAGLES initiative [CAL 01, CAL 02].
It is important to note that the work of EAGLES/ISLE must be seen in a long-term perspective. This is especially true for any attempt aiming at standardization in terms of international standards. EAGLES did not and could not result in standards of such an impact: this is the preserve of the ISO. The basic idea behind EAGLES/ISLE work was for the group to act as a catalyst in order to pool concrete results coming from major international/national/industrial projects.
1.5. Setting up methodologies and principles for standards
From a retrospective point of view, it is important to note that EAGLES and its guidelines were the first attempt at defining standards that directly responded to commonly perceived needs in order to overcome common problems. To offer workable compromise solutions, they had to be based on a solid platform of accepted facts and acceptable practices.
Since the formation of EAGLES, the work related to standards in the EU was largely concentrated within this initiative. Related efforts elsewhere were closely linked with EAGLES and fed off it. The Lexicon and Corpus groups' recommendations were soon applied in a large number of Eu...