1. Connectionist Natural Language Processing
Noel E. Sharkey
Department of Computer Science, University of Exeter, Exeter, U.K.
Ronan G. Reilly
Department of Computer Science, University College Dublin, Dublin, Ireland
INTRODUCTION
Computational research on natural language has been going on for decades in artificial intelligence and computational linguistics. These disciplines generated enormous excitement in the '60s and '70s, but they have not entirely realised their promise and have now reached what seems to be a plateau. Why should connectionist natural language processing (CNLP) be any different? There are a number of reasons. For many, the connectionist approach provides a new way of looking at old issues. For these researchers, connectionism provides an expanded toolkit with which to invigorate old research projects with new ideas. For instance, connectionist systems can learn from examples so that, in the context of a rule-based system, all of the rules need not be specified a priori. Connectionist systems have very powerful generalisation capabilities. Content-addressable memory or pattern completion falls naturally out of distributed connectionist systems, making them ideal for filling in missing information.
However, for many new researchers, connectionism is a whole new way of looking at language. The big promise is that the integration of learning and representation (e.g. Hanson & Burr, 1990) will be a source of new theoretical ideas. Connectionist devices are very good at constructing representations from the statistical regularities of a domain. They do so in a form that is not directly interpretable. Nevertheless, it is the nature of these representations, uninfluenced by a priori theoretical considerations, that holds the most promise for the discipline. Currently, connectionists are seeking ways of analysing such representations as a means of developing a new understanding of the problems facing automated language processing.
A Brief History
As far as we know, the first paper that discussed language in terms of parallel distributed processing was by Hinton (1981). Although that paper was really about implementing semantic nets in parallel hardware, many of the problem areas described by Hinton have been explored further in the natural language papers of the 1980s. The Hinton system took as input a distributed representation of word triples consisting of ROLE1 RELATION ROLE2, that is, simple propositions such as ELEPHANT COLOUR GREY. When the system had finished learning the propositions, its task was to complete the third term of an input triple given only two of the terms. For example, given the terms ELEPHANT and COLOUR the system filled in the missing term GREY. This was very similar to the notion of default reasoning in AI. But Hinton went further, to discuss how his system could generalise its experience to novel examples. If the system knew that CLYDE was an elephant (i.e. the token CLYDE contained the type ELEPHANT microfeatures), then, given the two terms CLYDE and COLOUR, the third term GREY would be filled in.
What was interesting about Hinton's work was that he described two types of representation that have become commonplace in CNLP. The first concerns the input to a language system. In any sort of natural language system it is important to preserve the ordering of the input elements. Hinton did this by partitioning the input vector so that the first n bits represented the ROLE1 words, the second n bits represented the RELATION words, and the final n bits represented the ROLE2 words. There are a number of problems with this representational approach, such as redundancy, fixed length, and absence of semantic similarity among identical elements in different roles. Nonetheless, it has been widely used in the literature, both for input and output, and has only been superseded in the last two years, as we shall see.
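To make the partitioning scheme concrete, the sketch below encodes a triple by concatenating one fixed-length slot per role. It is purely illustrative: the toy vocabulary, the one-hot coding, and the helper names (one_hot, encode_triple) are our own assumptions rather than part of Hinton's model, which used distributed microfeature patterns within each slot. Even so, the sketch shows why the scheme is redundant, fixed in length, and blind to the similarity of identical words appearing in different slots.

```python
# Illustrative sketch of a partitioned (slot-based) input vector, assuming a
# toy one-hot code per slot; Hinton's model used distributed microfeatures.
import numpy as np

VOCAB = ["CLYDE", "ELEPHANT", "COLOUR", "GREY"]   # toy vocabulary
N = len(VOCAB)                                    # n bits per slot

def one_hot(word):
    v = np.zeros(N)
    v[VOCAB.index(word)] = 1.0
    return v

def encode_triple(role1, relation, role2):
    # One fixed slot of n bits per position: ROLE1, then RELATION, then ROLE2.
    return np.concatenate([one_hot(role1), one_hot(relation), one_hot(role2)])

x = encode_triple("ELEPHANT", "COLOUR", "GREY")
print(x.shape)   # (12,): three slots of n bits; the same word gets unrelated
                 # bits in different slots, and the length is fixed in advance
```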
The second type of representation used by Hinton was a distributed coarse-coded or compact representation. That is, the vector of input activations was recoded into a compact representation by random weights connected to a second layer of units. The states of this second layer of units were then fed back to the input layer and the weights were adjusted until the states from the second layer accurately reproduced the input. This is how the system filled in the missing term. It was also from the distributed representation that this system gained its generalisation abilities. Although such content-addressable memory systems were already well known, no one had used them in a language-related problem before.
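The pattern-completion behaviour itself can be sketched with a Hebbian auto-associator standing in for Hinton's learned recoding; the bipolar toy patterns and the train/complete helpers below are our simplifying assumptions, not the original procedure. A stored proposition is recovered from a probe in which one slot has been left blank.

```python
# Minimal sketch of pattern completion with a Hopfield-style Hebbian
# auto-associator, used here only as a stand-in for Hinton's learned recoding.
import numpy as np

def train(patterns):
    # Store bipolar (+1/-1) patterns with an outer-product (Hebbian) rule.
    d = patterns.shape[1]
    W = np.zeros((d, d))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)
    return W

def complete(W, probe, steps=5):
    # Iteratively update the states until the missing slot is filled in.
    x = probe.copy()
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1.0
    return x

stored = np.array([[+1, -1, +1, -1, -1, +1]], dtype=float)  # one toy proposition
probe  = np.array([+1, -1, +1, -1,  0,  0], dtype=float)    # last slot unknown
W = train(stored)
print(complete(W, probe))   # recovers the stored pattern, filling in the blanks
```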
The next four years from 1981 onwards saw only a few published papers, and most of these did not employ distributed representations. Distributed representations have a number of advantages over nondistributed or 'localist' representations. For example, they have a greater psychological plausibility, they are more economical in the use of memory resources, and they are more resistant to disruption. However, prior to the development of sufficiently powerful learning algorithms, researchers found localist representations to be easier to work with, since they could readily be hand-coded. Small, Cottrell, and Shastri (1982) made a first brave stab at connectionist parsing. Though not greatly successful, this localist work opened the way for other linguistic-style work and provided a basis for Cottrell's (1985) thesis research, at Rochester, on word sense disambiguation. That year also saw a Technical Report from another Rochester student, Fanty (1985), that attempted to employ localist techniques to do context-free parsing. The same year, Selman (1985) presented a master's thesis that utilised the Boltzmann learning algorithm (Hinton, Sejnowski & Ackley, 1984) for syntactic parsing. There were many interesting ideas in Selman's thesis, but the use of simulated annealing proved to be too cumbersome for language (but see Sampson, 1989). Also in that year, a special issue of the Cognitive Science journal featured a language article by Waltz and Pollack (1985) who were not only concerned with parsing but also with contextual semantics. Prior to this paper, only Reilly (1984) had attempted a connectionist approach to the higher-level phenomena in his paper on anaphoric resolution.
Then, in 1986, there was a relative explosion of language-related papers. First, there were papers on the use of connectionist techniques for language work using AI style theory (e.g. Golden, 1986; Lehnert, 1986; Sharkey, Sutcliffe, & Wobcke, 1986). These papers were followed closely by the publication of the two volumes on parallel distributed processing (PDP) edited by David Rumelhart and Jay McClelland (Rumelhart & McClelland, 1986b; McClelland & Rumelhart, 1986). The two volumes contained a number of papers relating to aspects of natural language processing such as case-role assignment (McClelland & Kawamoto, 1986); learning the past tense of verbs (Rumelhart & McClelland, 1986a); and word recognition in reading (McClelland, 1986). Furthermore, the two volumes opened up the issue of representation in natural language which had started with Hinton (1981).
However, one paper (Rumelhart, Hinton, & Williams, 1986) in the PDP volumes significantly changed the style of much of connectionist research. This paper described a new learning algorithm employing a generalisation of a learning rule first proposed by Widrow and Hoff (1960). The new algorithm, usually referred to as the backpropagation algorithm, opened up the field of connectionist research, because now we could process input patterns that were not restricted by the constraint that they be in linearly separable classes (cf. Allen, 1987 for a number of language studies employing the new algorithm). In the same year, Sejnowski and Rosenberg (1986) successfully applied the backpropagation algorithm to the problem of text-to-speech translation. And Hinton (1986) applied it to the learning of family trees (inheritance relations). These papers began a line of research devoted to examining the type of internal representation learned by connectionist networks in order to compute the required input-output mapping (cf. Hanson & Burr, 1990).
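The classic illustration of what the new algorithm made possible is the XOR problem, which no single layer of weights can solve because the two classes are not linearly separable. The sketch below, with an arbitrary choice of network size, learning rate, and number of epochs, shows backpropagation learning XOR through a hidden layer; it illustrates the algorithm only, and is not a reconstruction of any of the cited studies.

```python
# Minimal sketch of backpropagation on XOR; the network size, learning rate
# and epoch count are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(scale=1.0, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)               # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)    # backward pass (squared-error gradient)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

print(out.round(2).ravel())                # approaches [0, 1, 1, 0]
```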
A significant extension to the representational capacity of connectionist networks was made by Jordan (1986). He proposed an architectural variant of the standard feed-forward backpropagation network. This variant involved feedback from the output layer to the input layer (thus, forming a recurrent network) which enabled the construction of powerful sequencing systems. By using the recurrent links to store a contextual history of any particular sequence, they overcame many of the difficulties that connectionist systems had in dealing with problems having a temporal structure. Later work by Elman (1988; 1989) utilised a similar architecture, but ran the recurrent links from the hidden units rather than from the output units. This variant enabled Elman to develop a CNLP model that appeared to have many of the properties of conventional symbol-processing models, such as sensitivity to compositional structure. This latter property had earlier been pinpointed by Fodor and Pylyshyn (1988) in their critique of connectionism as a significant and irredeemable deficit in CNLP systems. Another important advantage of Elmanās approach was that words (from a sentence) could be presented to the system in sequence. This departure from Hintonās (1981) vector partitioning approach overcame problems of redundancy, lack of semantic similarity between identical items, and fixed input length.
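A minimal sketch of the forward pass of such a simple recurrent network is given below; the layer sizes, the random weights, and the helper names (W_ih, W_ch, W_ho, step) are our own illustrative assumptions, and a real model would train the weights with backpropagation. The context units hold a copy of the previous hidden state (Elman's variant; Jordan's fed back the output instead), so the words of a sentence can be presented one at a time while the network retains a summary of the sequence so far.

```python
# Minimal sketch of an Elman-style simple recurrent network forward pass,
# assuming untrained random weights and arbitrary layer sizes.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 5, 8

W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))    # input   -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hid, n_hid))   # context -> hidden (recurrent copy)
W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden  -> output

def step(x, context):
    # The context units carry the previous hidden state, giving the network
    # a memory of the sequence seen so far.
    hidden = np.tanh(W_ih @ x + W_ch @ context)
    output = W_ho @ hidden
    return output, hidden

context = np.zeros(n_hid)
sentence = [rng.integers(0, 2, n_in).astype(float) for _ in range(3)]  # toy word codes
for word in sentence:
    output, context = step(word, context)   # words are presented one at a time
```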
Since 1986, many more papers on language issues have begun to appear which are too numerous to mention here. Among these was further work on the application of world knowledge to language understanding (e.g. Chun & Mimo, 1987; Dolan & Dyer, 1987; Miikkulainen, 1990; Sharkey, 1989a). Research on various aspects of syntax and parsing has increased sharply (e.g. Benello, Mackie, & Anderson, 1989; Hanson & Kegl, 1987; Howells, 1988; Kwasny & Faisal, 1990; Rager & Berg, 1990). Moreover, there has been an increase in research on other aspects of natural language such as speech production (Dell, 1986; Seidenberg & McClelland, 1989), sentence and phrase generation (e.g. Gasser, 1988; Kukich, 1987), question answering (Allen, 1988), prepositional attachment (e.g. Cosic & Munro, 1988), anaphora (Allen & Riecken, 1988), cognitive linguistics (Harris, 1990), discourse topic (Karen, 1990), lexical processing (Sharkey, 1989b; Sharkey & Sharkey, 1989; Kawamoto, 1989), variable binding (Smolensky, 1987), and speech processing (e.g., Kohonen, 1989; Hare, 1990; Port, 1990).
OVERVIEW OF CHAPTERS
The book is divided into four sections. The first section (Semantics) contains four chapters that deal with connectionist issues in both lexical and structural semantics. The second section (Syntax) contains two chapters dealing with connectionist parsing. The third section (Representational Adequacy) contains three chapters dealing with the controversial issue of the representational adequacy of connectionist representations. The fourth and final section (Computational Psycholinguistics) contains four chapters which focus on the cognitive modelling role of connectionism and which address a variety of topics in the area of computational psycholinguistics.
In what follows we will give a brief introduction to each of the chapters. For a more detailed discussion of some of the relevant issues, we provide an introduction at the beginning of each section.
Semantics
The four chapters in this section can be divided into two pairs. The first pair of chapters deal with what can be best characterised as lexical semantics and the second pair with sentential or structural semantics.
In the first chapter of this section, Dyer et al. discuss a method for modifying distributed representations dynamically, by maintaining a separate, distributed connectionist network as a symbol memory, where each symbol is composed of a pattern of activation. Symbol representations start out as random patterns of activation. Over time they are 'recirculated' through the symbolic tasks being demanded of them, and as a result, gradually form distributed representations that aid in the performance of these tasks. These distributed symbols enter into structured relations with other symbols, while exhibiting features of distributed representations, e.g. tolerance to noise and similarity-based generalisation to novel cases. Dyer et al. discuss in detail a method of symbol recirculation based on using entire weight matrices, formed in one network, as patterns of activation in a larger network. In the case of natural language processing, the resulting symbol memory can serve as a store for lexical entries, symbols, and relations among symbols, and thus represent semantic information.
In his chapter, Sutcliffe focuses on how the meaning of concepts is represented using microfeatures. He shows how microfeatural representations can be constructed, how they can be compared using the dot product, and why normalisation of microfeature vectors is required. He then goes on to describe the use of such representations in the construction of a lexicon for a story-paraphrasing system. Finally, he discusses the properties of the chosen representation and describes possible further developments of the work.
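As a concrete illustration of the comparison step, the sketch below computes a normalised dot product between toy microfeature vectors; the feature vectors and the similarity helper are invented for illustration and are not taken from Sutcliffe's lexicon. Normalisation makes the measure independent of vector length, so concepts with many active microfeatures do not automatically look similar to everything.

```python
# Minimal sketch of comparing microfeature vectors with a normalised dot
# product; the toy vectors and feature assignments are invented.
import numpy as np

def similarity(a, b):
    # Normalising both vectors turns the dot product into a length-independent
    # (cosine) measure of microfeature overlap.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

dog = np.array([1, 1, 0, 1, 0], dtype=float)   # e.g. animate, furry, ...
cat = np.array([1, 1, 0, 0, 1], dtype=float)
car = np.array([0, 0, 1, 0, 0], dtype=float)

print(similarity(dog, cat))   # high: many shared microfeatures
print(similarity(dog, car))   # low: little overlap
```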
Wermter and Lehnert describe an approach combining natural language processing and connectionist learning. Concentrating on the domain of scientific language and the task of structural noun phrase disambiguation, they present NOCON, a system which shows how learning can supply a memory model as the basis for understanding noun phrases. NOCON consists of two levels: a learning level at the bottom for learning semantic relationships between nouns and an integration level at the top for integrating semantic and syntactic constraints needed for structural noun phrase disambiguation. Wermter and Lehnert argue that this architecture is potentially strong enough to provide a learning and integrating memory model for natural language systems.
In the final chapter of this section St. John and McClelland argue that the parallel constraint satisfaction mechanism of connectionist models is a useful language comprehension algorithm; it allows syntactic and semantic constraints to be combined easily so that an interpretation which satisfies the most constraints can be found. It also allows interpretations to be revised easily, knowledge from different contexts to be shared, and it makes inferences an inherent part of comprehension. They present a model of sentence comprehension that addresses a range of important language phenomena. They show that the model can be extended to story comprehension. Both the sentence and story models view their input as evidence that constrains a complete interpretation. This view facilitates difficult aspects of sentence comprehension such as assigning thematic roles. It also facilitates difficult aspects of story comprehension such as inferring missing propositions, resolving pronouns, and sharing knowledge between contexts.
Syntax
The section on syntax provides a number of differing perspectives on how to deal with syntax in a connectionist network. The chapters are similar in that the models described are predominantly localist in nature. Rager focuses on robustness in parsing and Schnelle...