Introduction: Text Power and Intelligent Systems
Paul S. Jacobs
Artificial Intelligence Program
GE Research and Development Center
Schenectady, NY 12301 USA
1.1 A New Opportunity
Huge quantities of readily available on-line text raise new challenges and opportunities for artificial intelligence systems. The ease of acquiring text knowledge suggests replacing, or at least augmenting, knowledge-based systems with "text-based" intelligence wherever possible. Making use of this text knowledge demands more work in robust processing, retrieval, and presentation of information, but it opens up a host of new applications of AI technologies, where on-line information exists but knowledge bases do not.
Most AI programs have failed to "scale up" because of the difficulty of developing large, robust knowledge bases. At the same time, rapid advances in networks and information storage now provide access to knowledge bases millions of times larger, in text form. No knowledge representation claims the expressive power or the compactness of this raw text. The next generation of AI applications, therefore, may well be "text-based" rather than knowledge-based, deriving more power from large quantities of stored text than from hand-crafted rules.
Text-based intelligent systems can combine artificial intelligence techniques with more robust but "shallower" methods. Natural language processing (NLP) research has been hampered, on the one hand, by the limitations of deep systems that work only on a very small number of texts (often only one), and, on the other hand, by the failure of more mature technologies, such as parsing, to apply to practical systems. Information retrieval (IR) systems offer a vehicle where selected NLP methods can produce useful results; hence, there is a natural and potentially important marriage between IR and NLP. This synergy extends beyond the traditional realms of either technology to a variety of emerging applications.
As examples, consider what a knowledge-based system can offer in medical diagnosis, on-line operating systems, fault diagnosis in engines, or financial advising that cannot be found in a medical textbook, a user's manual, a design specification, or a tax preparation handbook. Computers should help make the right information from these documents accessible and comprehensible to the user. Harnessing the power of volumes of available text, through information retrieval, natural language analysis, knowledge representation, and conceptual information extraction, will pose a major challenge for AI into the next century.
Advocates of the text-based approach to intelligent systems must accept its inherent limitations. Some of the traditional AI problems, such as reasoning, inference, and pragmatics, will necessarily play a limited role. But there is evidence of substantial progress in building robust text processing systems that rely more heavily on shallower methods. The rest of this paper describes the combination of applications, methodologies, and techniques that forms the backbone of work on Text-Based Intelligent Systems.
1.2 A New Name
To merit their own label, "text-based intelligent systems" must suggest something distinctly different from prevailing research. As the introduction has implied, a text-based intelligent system (TBIS) is a program that derives its power from large quantities of raw text, in an intelligent manner. Such systems differ from traditional information retrieval systems in that they must be more flexible and responsive, possibly segmenting, combining, or synthesizing a response rather than just retrieving texts. The systems differ from traditional natural language programs in that they must be much more robust.
The category of text-based intelligent systems includes, for example:
⢠Text extraction systemsâprograms that analyze volumes of unstructured text, selecting certain features from the text and potentially storing such features in a structured form. These systems currently exist in limited domains. Examples of this type of system are news reading programs [Jacobs and Rau, 1990] (see the papers by Hobbs et al. and McDonald in this volume), database generation programs that produce fixed-field information from free text, and transaction handling programs, such as those that read banking transfer messages [Lytinen and Gershman, 1986; Young and Hayes, 1985].
⢠Automated indexing and hypertextâknowledge-based programs that determine key terms and topics by which to select texts or portions of text [Jonak, 1984] or automatically link portions of text that relate to one another (see the paper by Salton and Buckley in this volume).
⢠Summarization and abstractingâprograms that integrate multiple texts that repeat, correct, or augment one another, as in following the course of a news story over time such as a corporate merger or political event [Rau, 1987].
⢠Intelligent information retrievalâsystems with enhanced information retrieval capabilities, through robust query processing, user modeling, or limited inference [IPM, 1987] (see also the paper by Croft and Turtle in this volume).
This volume contains position papers covering all of the topics above, along with discussions of underlying problems in constructing TBISs, such as the representation and storage of knowledge about texts or about language, and robust text processing techniques. Many of the positions describe research related to substantial systems in one of the above categories, and virtually all address the issue of robust processing of some sort. The next section describes the apparent methodological themes of this sort of research.
1.3 No More "Donkeys"
Much of this research combines the discipline of information retrieval with some of the techniques of natural language processing. Historically, the methodology of information retrieval has been to develop new methods and conduct experiments to compare those methods with other approaches. By contrast, the methodology of natural language processing has been either to develop theories that apply to broad but carefully selected linguistic phenomena, or to develop programs that apply to carefully selected texts. In other words, there has been very little effort within natural language to produce results such as "This program performs the following task with 95% accuracy on the following set of 1000 texts".
As a result of its more theoretical orientation, natural language as a field has devoted much of its attention to paradigmatic but improbable examples. Researchers in natural language were trained to think about contrived sentences: "Every man who owns a donkey beats it" or "The box is in the pen." These are so familiar that one might stand up with a question at the end of a presentation and ask, "But what about the 'donkey' sentences?" Researchers are acquainted enough with the examples that they needn't be repeated, in spite of the fact that they hardly seem representative of the problems we actually encounter.
The current methodological shift in the experimental element of natural language processing (by no means the dominant segment of the field) brings text processing, as experimental computer science, closer to information retrieval. Rather than seek out examples that support or challenge theories, the experimental methodology uses sets of naturally occurring examples as test cases, possibly ignoring certain interesting problems that simply do not occur in a particular task. While this approach has some disadvantages, it has the benefit of focusing work on the issues in natural language processing that inhibit robustness.
Another example of the experimental shift is the area of language acquisition. During the 1970s and most of the 1980s, the field of language acquisition concentrated on the techniques through which knowledge, especially grammatical knowledge, could be acquired. The result of this effort was a host of theories and techniques, but very little in the way of sizable knowledge bases. Recently, however, the research focus in language acquisition has been on achieving the goal of acquisition rather than on the process, resulting in extensive lexicons and knowledge bases for use in processing texts [Zernik, 1991].
While the methodology of natural language may be drifting toward information retrieval, information retrieval is slowly changing in focus. The extreme difficulty of producing significant improvements using traditional document retrieval metrics suggests exploring new retrieval strategies as well as devising new measures. As the combined fields of natural language processing and information retrieval continue to make progress, the demand grows for test collections and metrics that evaluate meaningful tasks, including not only the accuracy of document retrieval, but also the accuracy, speed, transportability, and ease of use of systems that perform functions such as those outlined in the previous section. This new direction involves the constant interplay of two goals: (1) produce new measurable results and (2) produce new measures of new results.
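For concreteness, the traditional document retrieval metrics alluded to above are precision and recall measured against a test collection of relevance judgments. The sketch below shows the arithmetic; the document identifiers and judgments are hypothetical and not tied to any particular collection or system discussed here.

```python
# Minimal sketch of the traditional document retrieval metrics, precision
# and recall, scored against a test collection. The document identifiers
# and relevance judgments are hypothetical.

def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

if __name__ == "__main__":
    system_output = ["d3", "d7", "d9", "d12"]  # documents the system returned
    judgments = ["d3", "d9", "d15"]            # documents judged relevant
    p, r = precision_recall(system_output, judgments)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```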
The resulting experimental methodology has spawned a host of research projects emphasizing robust processing, large-scale systems, knowledge acquisition, and performance evaluation. As the new research is still taking shape, one shouldn't expect any breakthroughs as yet. The next section considers the limited progress that has already resulted.
1.4 Where We Are Now
While text-based intelligent systems are very much a futuristic concept, the recent emphasis on experiment and performance has brought some noticeable changes during the last several years:
⢠Evaluation:
In government, academia, and industry, the desire for results has led to new metrics for evaluating system performance. While metrics and benchmarks often spark debate, they also show clear progress. For example, a government-sponsored message processing conference three years ago featured a small set of programs performing different functions in different domains, while a more recent similar conference included nine substantial programs performing a common task on a set of over 100 real messages, and produced meaningful results [Sundheim, 1989] (see Hobbs et al., this volume). New evaluation metrics have appeared also in other tasks, such as text categorization (cf. Hayes, this volume).
⢠Scale:
Natural language programs typically have operated on a handful of texts; recently, programs have emerged that process streams of hundreds of thousands of words or more, depending on the level of semantic processing. Along with their broader capabilities, the knowledge bases that such programs use have been expanding. While a typical lexicon until recently might have included 100 or 200 words, many systems now have real lexicons of 10,000 roots or more.
⢠Commercialization:
The number of industrial scientists represented in this volume is an indicator of the emerging commercial applications of robust text processing and information retrieval technology, as is the increasing number of commercially available systems. Many commercial applications that formerly used relational databases or other structured knowledge sources are shifting to textual databases because of the availability of on-line text information, and many hardware and software vendors are packaging their products with substantial text databases. These products generally do not employ the sort of technology discussed here, but do provide a vehicle for the ultimate application of the technology.
⢠Cooperation and Competition:
Until recently, schools of thought in text processing and information retrieval were dogmatic enough to ignore most other related work. In many areas, recent projects have spawned cooperative efforts in collecting data and lexical knowledge, in assembling test collections, and in collaboration between industry and academia. Competition, on the other hand, was hardly possible because of the general lack of evaluation criteria. Now there is a growing interest in holding "showdowns" that objectively compare different methods.
While there has been some visible progress toward text-based intelligent systems, we aren't very close to a desirable state of technology. The next section addresses some of the obstacles we must overcome.
1.5 Why We Aren't There Yet
Many of us have workstations on top of our desks that have access via computer networks to trillions of words of text: encyclopedias, almanacs, dictionaries, literature, news, and electronic bulletin boards. Ironically, we are loath to attempt to use most of this information because a combination of factors, mainly the difficulty of finding any particular bit of knowledge we desire, makes it a gross waste of time.
Much of this crudeness in information access boils down to relatively mundane issues, having little to do with text content: the speed of transmission across networks, compatibility of hardware, security, legal and copyright concerns, the lack of standards for storing and transmitting on-line text, and so forth. As the motivation for using on-line text helps dissolve some of these issues, we can hope for better opportunities to use the advanced technologies for content analysis that are reported here.
In addition to these mundane communication and standardization issues, there is a more relevant problem of how to market the technology that we are developing. Too often we ignore the strengths of the competition, in this case simple text search, Boolean query, and keyword retrieval methods. While these simpler methods lack the power and intuitive appeal of, say, natural language analysis or concept-based information retrieval, they have certain features that appeal to users of large text databases: they are fast, portable, relatively inexpensive, and relatively easy to learn. The techniques are compatible with many software packages, run on many hardware platforms, and are easier to implement in hardware. By contrast, natural language processing can be slow, brittle, and expensive. In order to bring the technology to the marketplace in the near future (say, the next dozen or so years), we will either have to minimize these disadvantages or demonstrate dramatic improvements over simpler methods.
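To make the comparison concrete, the sketch below shows roughly how little machinery simple keyword retrieval needs: an inverted index plus a Boolean AND over the query words. The documents and query are hypothetical; the point is the simplicity and speed that such methods trade against the deeper analysis discussed in this volume.

```python
# Minimal sketch of keyword retrieval: an inverted index and a Boolean AND
# query. Documents and query are hypothetical; the point is how little
# machinery these methods need compared with full language analysis.
from collections import defaultdict

DOCS = {
    "d1": "jet engine fault diagnosis manual",
    "d2": "tax preparation handbook for financial advising",
    "d3": "operating manual for the jet engine test cell",
}

def build_index(docs):
    """Map each lower-cased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the documents containing every word of the query."""
    word_sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

if __name__ == "__main__":
    index = build_index(DOCS)
    print(sorted(boolean_and(index, "jet engine")))  # ['d1', 'd3']
```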
Some key technical barriers stand in the way of the all-knowing desktop librarian. These technical barriers form some of the focal points of the research reported in this volume, as well as of the progress that is likely to be made in the rest of the century. Four such issues are (1) robustness of analysis, (2) retrieval strategy, (3) presentation of information, and (4) cultivation of applications. The next section will outline the technical challenges in each of these areas.
1.6 Challenges for the 1990s
Intelligent access to information from texts is the central theme of this research. The following are some of the key thrusts of this theme, including the topics of many of the papers here:
⢠Robustness:
The next generation of language analyzers must do much of the same sort of processing that current systems do, but must do it more accurately, faster, and with less domain-dependent knowledge. Robustness applies both to extending techniques that are already robust, such as parsing and morphology, and to increasing the robustness of more knowledge-intensive techniques, such as semantic analysis.
⢠Retrieval Strategy:
Current retrieval methods are oriented toward the retrieval of documents, not information in general. Text-based systems must address the broader issue of satisfying the information needs of many different systems and users. Within this broader information processing context, the concept of success must be redefined to be more than reproducing "relevant" texts, and new retrieval strategies must address this new notion of success. For example, if a user wants to know a specific piece of informatio...