Introduction: Text Power and Intelligent Systems
Paul S. Jacobs
Artificial Intelligence Program
GE Research and Development Center
Schenectady, NY 12301 USA
1.1 A New Opportunity
Huge quantities of readily available on-line text raise new challenges and opportunities for artificial intelligence systems. The ease of acquiring text knowledge suggests replacing, or at least augmenting, knowledge-based systems with "text-based" intelligence wherever possible. Making use of this text knowledge demands more work in robust processing, retrieval, and presentation of information, but it opens up a host of new applications of AI technologies, where on-line information exists but knowledge bases do not.
Most AI programs have failed to "scale up" because of the difficulty of developing large, robust knowledge bases. At the same time, rapid advances in networks and information storage now provide access to knowledge bases millions of times larger, in text form. No knowledge representation claims the expressive power or the compactness of this raw text. The next generation of AI applications, therefore, may well be "text-based" rather than knowledge-based, deriving more power from large quantities of stored text than from hand-crafted rules.
Text-based intelligent systems can combine artificial intelligence techniques with more robust but "shallower" methods. Natural language processing (NLP) research has been hampered, on the one hand, by the limitations of deep systems that work only on a very small number of texts (often only one), and, on the other hand, by the failure of more mature technologies, such as parsing, to apply to practical systems. Information retrieval (IR) systems offer a vehicle where selected NLP methods can produce useful results; hence, there is a natural and potentially important marriage between IR and NLP. This synergy extends beyond the traditional realms of either technology to a variety of emerging applications.
As examples, consider what a knowledge-based system can offer in medical diagnosis, on-line operating systems, fault diagnosis in engines, or financial advising that cannot be found in a medical textbook, a user's manual, a design specification, or a tax preparation handbook. Computers should help make the right information from these documents accessible and comprehensible to the user. Harnessing the power of volumes of available text, through information retrieval, natural language analysis, knowledge representation, and conceptual information extraction, will pose a major challenge for AI into the next century.
Advocates of the text-based approach to intelligent systems must accept its inherent limitations. Some of the traditional AI problems, such as reasoning, inference, and pragmatics, will necessarily play a limited role. But there is evidence of substantial progress in building robust text processing systems that rely more heavily on shallower methods. The rest of this paper describes the combination of applications, methodologies, and techniques that forms the backbone of work on Text-Based Intelligent Systems.
1.2 A New Name
To merit their own label, "text-based intelligent systems" must suggest something distinctly different from prevailing research. As the introduction has implied, a text-based intelligent system (TBIS) is a program that derives its power from large quantities of raw text, in an intelligent manner. Such systems differ from traditional information retrieval systems in that they must be more flexible and responsive, possibly segmenting, combining, or synthesizing a response rather than just retrieving texts. The systems differ from traditional natural language programs in that they must be much more robust.
The category of text-based intelligent systems includes, for example:
⢠Text extraction systemsâprograms that analyze volumes of unstructured text, selecting certain features from the text and potentially storing such features in a structured form. These systems currently exist in limited domains. Examples of this type of system are news reading programs [Jacobs and Rau, 1990] (see the papers by Hobbs et al. and McDonald in this volume), database generation programs that produce fixed-field information from free text, and transaction handling programs, such as those that read banking transfer messages [Lytinen and Gershman, 1986; Young and Hayes, 1985].
⢠Automated indexing and hypertextâknowledge-based programs that determine key terms and topics by which to select texts or portions of text [Jonak, 1984] or automatically link portions of text that relate to one another (see the paper by Salton and Buckley in this volume).
⢠Summarization and abstractingâprograms that integrate multiple texts that repeat, correct, or augment one another, as in following the course of a news story over time such as a corporate merger or political event [Rau, 1987].
⢠Intelligent information retrievalâsystems with enhanced information retrieval capabilities, through robust query processing, user modeling, or limited inference [IPM, 1987] (see also the paper by Croft and Turtle in this volume).
This volume contains position papers covering all of the topics above, along with discussions of underlying problems in constructing TBISs, such as the representation and storage of knowledge about texts or about language, and robust text processing techniques. Many of the positions describe research related to substantial systems in one of the above categories, and virtually all address the issue of robust processing of some sort. The next section describes the apparent methodological themes of this sort of research.
1.3 No More "Donkeys"
Much of this research combines the discipline of information retrieval with some of the techniques of natural language processing. Historically, the methodology of information retrieval has been to develop new methods and conduct experiments to compare those methods with other approaches. By contrast, the methodology of natural language processing has been either to develop theories that apply to broad but carefully selected linguistic phenomena, or to develop programs that apply to carefully selected texts. In other words, there has been very little effort within natural language to produce results such as "This program performs the following task with 95% accuracy on the following set of 1000 texts".
As a result of its more theoretical orientation, natural language as a field has devoted much of its attention to paradigmatic but improbable examples. Researchers in natural language were trained to think about contrived sentences: "Every man who owns a donkey beats it" or "The box is in the pen." These are so familiar that one might stand up with a question at the end of a presentation and ask, "But what about the 'donkey' sentences?" Researchers are acquainted enough with the examples that they needn't be repeated, in spite of the fact that they hardly seem representative of the problems we actually encounter.
The current methodological shift in the experimental element of natural language processing (by no means the dominant segment of the field) brings text processing, as experimental computer science, closer to information retrieval. Rather than seek out examples that support or challenge theories, the experimental methodology uses sets of naturally occurring examples as test cases, possibly ignoring certain interesting problems that simply do not occur in a particular task. While this approach has some disadvantages, it has the benefit of focusing work on the issues in natural language processing that inhibit robustness.
Another example of the experimental shift is the area of language acquisition. During the 1970s and most of the 1980s, the field of language acquisition concentrated on the techniques through which knowledge, especially grammatical knowledge, could be acquired. The result of this effort was a host of theories and techniques, but very little in the way of sizable knowledge bases. Recently, however, the research focus in language acquisition has been on achieving the goal of acquisition rather than on the process, resulting in extensive lexicons and knowledge bases for use in processing texts [Zernik, 1991].
While the methodology of natural language may be drifting toward information retrieval, information retrieval is slowly changing in focus. The extreme difficulty of producing significant improvements using traditional document retrieval metrics suggests exploring new retrieval strategies as well as devising new measures. As the combined fields of natural language processing and information retrieval continue to make progress, the demand grows for test collections and metrics that evaluate meaningful tasks, including not only the accuracy of document retrieval, but also the accuracy, speed, transportability, and ease of use of systems that perform functions such as those outlined in the previous section. This new direction involves the constant interplay of two goals: (1) produce new measurable results and (2) produce new measures of new results.
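For concreteness, the traditional document retrieval metrics alluded to above are precision and recall measured against a test collection of relevance judgments. The sketch below shows the arithmetic; the document identifiers and judgments are hypothetical and not tied to any particular collection or system discussed here.

```python
# Minimal sketch of the traditional document retrieval metrics, precision
# and recall, scored against a test collection. The document identifiers
# and relevance judgments are hypothetical.

def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

if __name__ == "__main__":
    system_output = ["d3", "d7", "d9", "d12"]  # documents the system returned
    judgments = ["d3", "d9", "d15"]            # documents judged relevant
    p, r = precision_recall(system_output, judgments)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```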
The resulting experimental methodology has spawned a host of research projects emphasizing robust processing, large-scale systems, knowledge acquisition, and performance evaluation. As the new research is still taking shape, one shouldn't expect any breakthroughs as yet. The next section considers the limited progress that has already resulted.
1.4 Where We Are Now
While text-based intelligent systems are very much a futuristic concept, the recent emphasis on experiment and performance has brought some noticeable changes during the last several years:
⢠Evaluation:
In government, academia, and industry, the desire for results has led to new metrics for evaluating system performance. While metrics and benchmarks often spark debate, they also show clear progress. For example, a government-sponsored message processing conference three years ago featured a small set of programs performing different functions in different domains, while a more recent similar conference included nine substantial programs performing a common task on a set of over 100 real messages, and produced meaningful results [Sundheim, 1989] (see Hobbs et al., this volume). New evaluation metrics have appeared also in other tasks, such as text categorization (cf. Hayes, this volume).
⢠Scale:
Natural language programs typically have operated on a handful of texts; recently, programs have emerged that process streams of hundreds of thousands of words or more, depending on the level of semantic processing. Along with their broader capabilities, the knowledge bases that such programs use have been expanding. While a typical lexicon until recently might have included 100 or 200 words, many systems now have real lexicons of 10,000 roots or more.
⢠Commercialization:
The number of industrial scientists represented in this volume is an indicator of the emerging commercial applications of robust text processing and information retrieval technology, as is the increasing number of commercially available systems. Many commercial applications that formerly used relational databases or other structured knowledge sources are shifting to textual databases because of the availability of on-line text information, and many hardware and software vendors are packaging their products with substantial text databases. These products generally do not employ the sort of technology discussed here, but do provide a vehicle for the ultimate application of the technology.
⢠Cooperation and Competition:
Until recently, schools of thought in text processing and information retrieval were dogmatic enough to ignore most other related work. In many areas, recent projects have spawned cooperative efforts in collecting data and lexical knowledge, in assembling test collections, and in collaboration between industry and academia. Competition, on the other hand, was hardly possible because of the general lack of evaluation criteria. Now there is a growing interest in holding "showdowns" that objectively compare different methods.
While there has been some visible progress toward text-based intelligent systems, we aren't very close to a desirable state of technology. The next section addresses some of the obstacles we must overcome.
1.5 Why We Aren't There Yet
Many of us have workstations on top of our desks that have access via computer networks to trillions of words of text: encyclopedias, almanacs, dictionaries, literature, news, and electronic bulletin boards. Ironically, we are loath to attempt to use most of this information because a combination of factors, mainly the difficulty of finding any particular bit of knowledge we desire, makes it a gross waste of time.
Much of this crudeness in information access boils down to relatively mundane issues, having little to do with text content: the speed of transmission across networks, compatibility of hardware, security, legal and copyright concerns, the lack of standards for storing and transmitting on-line text, and so forth. As the motivation for using on-line text helps dissolve some of these issues, we can hope for better opportunities to use the advanced technologies for content analysis that are reported here.
In addition to these mundane communication and standardization issues, there is a more relevant problem of how to market the technology that we are developing. Too often we ignore the strengths of the competition, in this case simple text search, Boolean query, and keyword retrieval methods. While these simpler methods lack the power and intuitive appeal of, say, natural language analysis or concept-based information retrieval, they have certain features that appeal to users of large text databases: they are fast, portable, relatively inexpensive, and relatively easy to learn. The techniques are compatible with many software packages, run on many hardware platforms, and are easier to implement in hardware. By contrast, natural language processing can be slow, brittle, and expensive. In order to bring the technology to the marketplace in the near future (say, the next dozen or so years), we will either have to minimize these disadvantages or demonstrate dramatic improvements over simpler methods.
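To make the comparison concrete, the sketch below shows roughly how little machinery simple keyword retrieval needs: an inverted index plus a Boolean AND over the query words. The documents and query are hypothetical; the point is the simplicity and speed that such methods trade against the deeper analysis discussed in this volume.

```python
# Minimal sketch of keyword retrieval: an inverted index and a Boolean AND
# query. Documents and query are hypothetical; the point is how little
# machinery these methods need compared with full language analysis.
from collections import defaultdict

DOCS = {
    "d1": "jet engine fault diagnosis manual",
    "d2": "tax preparation handbook for financial advising",
    "d3": "operating manual for the jet engine test cell",
}

def build_index(docs):
    """Map each lower-cased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the documents containing every word of the query."""
    word_sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

if __name__ == "__main__":
    index = build_index(DOCS)
    print(sorted(boolean_and(index, "jet engine")))  # ['d1', 'd3']
```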
Some key technical barriers stand in the way of the all-knowing desktop librarian. These technical barriers form some of the focal points of the research reported in this volume, as well as of the progress that is likely to be made in the rest of the century. Four such issues are (1) robustness of analysis, (2) retrieval strategy, (3) presentation of information, and (4) cultivation of applications. The next section will outline the technical challenges in each of these areas.
1.6 Challenges for the 1990s
Intelligent access to information from texts is the central theme of this research. The following are some of the key thrusts of this theme, including the topics of many of the papers here:
⢠Robustness:
The next generation of language analyzers must do much of the same sort of processing that current systems do, but must do it more accurately, faster, and with less domain-dependent knowledge. Robustness applies both to extending techniques that are already robust, such as parsing and morphology, and to increasing the robustness of more knowledge-intensive techniques, such as semantic analysis.
⢠Retrieval Strategy:
Current retrieval methods are oriented toward the retrieval of documents, not information in general. Text-based systems must address the broader issue of satisfying the information needs of many different systems and users. Within this broader information processing context, the concept of success must be redefined to be more than reproducing "relevant" texts, and new retrieval strategies must address this new notion of success. For example, if a user wants to know a specific piece of informatio...