The Handbook of Computational Linguistics and Natural Language Processing
eBook - ePub

The Handbook of Computational Linguistics and Natural Language Processing

  1. English
  2. ePUB (mobile friendly)
  3. Available on iOS & Android
eBook - ePub

The Handbook of Computational Linguistics and Natural Language Processing

About this book

This comprehensive reference work provides an overview of the concepts, methodologies, and applications in computational linguistics and natural language processing (NLP).
  • Features contributions by the top researchers in the field, reflecting the work that is driving the discipline forward
  • Includes an introduction to the major theoretical issues in these fields, as well as the central engineering applications that the work has produced
  • Presents the major developments in an accessible way, explaining the close connection between scientific understanding of the computational properties of natural language and the creation of effective language technologies
  • Serves as an invaluable state-of-the-art reference source for computational linguists and software engineers developing NLP applications in industrial research and development labs of software companies

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription.
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
Perlego offers two plans: Essential and Complete
  • Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
  • Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.
Both plans are available with monthly, semester, or annual billing cycles.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Yes! You can use the Perlego app on both iOS or Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app.
Yes, you can access The Handbook of Computational Linguistics and Natural Language Processing by Alexander Clark, Chris Fox, Shalom Lappin, Alexander Clark,Chris Fox,Shalom Lappin in PDF and/or ePUB format, as well as other popular books in Languages & Linguistics & Linguistics. We have over one million books available in our catalogue for you to explore.
Part I Formal Foundations

1 Formal Language Theory

SHULY WINTNER

1 Introduction

This chapter provides a gentle introduction to formal language theory, aimed at readers with little background in formal systems. The motivation is natural language processing (NLP), and the presentation is geared towards NLP applications, with linguistically motivated examples, but without compromising mathematical rigor.
The text covers elementary formal language theory, including: regular languages and regular expressions; languages vs. computational machinery; finite state automata; regular relations and finite state transducers; context-free grammars and languages; the Chomsky hierarchy; weak and strong generative capacity; and mildly context-sensitive languages.

2 Basic Notions

Formal languages are defined with respect to a given alphabet, which is a finite set of symbols, each of which is called a letter. This notation does not mean, however, that elements of the alphabet must be “ordinary” letters; they can be any symbol, such as numbers, or digits, or words. It is customary to use Σ to denote the alphabet. A finite sequence of letters is called a string, or a word. For simplicity, we usually forsake the traditional sequence notation in favor of a more straightforward representation of strings.
Example 1 (Strings). Let Σ = {0,1} be an alphabet. Then all binary numbers are strings over Σ. Instead of 〈0,1,1,0,1〉 we usually write 01101. If Σ ={a,b,c,d,...,y,z} is an alphabet, then cat, incredulous, and supercalifragilisticexpialidocious are strings, as are tac, qqq, and kjshdflkwjehr.
The length of a string ω is the number of letters in the sequence, and is denoted |ω|. The unique string of length 0 is called the empty string and is usually denoted (but sometimes λ).
Let ω1 =x1, . . .,xnand ω2 =y1,. . .,ym〉be two strings over the same alphabet Σ. The concatenation of ω1 and ω2, denoted ω1 ·ω2, is the string 〈x1,. . .,xn,y1,. . .,ym. Note that the length of ω1 ·ω2is the sum of the lengths of ω1and ω2: | ω1 ·ω2| = |ω1| +|ω2|. When it is clear from the context, we sometimes omit the ‘·’ symbol when depicting concatenation.
Example 2 (Concatenation). Let Σ = {a,b,c,d,. . .,y,z} be an alphabet. Then master · mind = mastermind, mind · master = mindmaster, and master · master = mastermaster. Similarly, learn · s = learns, learn · ed= learned, and learn · ing= learning.
Notice that when the empty string is concatenated with any string ω, the resulting string is ω. Formally, for every string ω, ω · = ·ω =ω.
We define an exponent operator over strings in the following way: for every string ω, ω0 (read: ω raised to the power of zero) is defined as . Then, for n >0, ωn is defined as ωn-1 ·ω. Informally, ωn is obtained by concatenating ω with itself n times. In particular, ω1 =ω.
Example 3 (Exponent). If ω = go, then ω0 =, ω1 =ω =go, ω2 =ω1 ·ω =ω ·ω =gogo, ω3 =gogogo, and so on.
A few other notions that will be useful in the sequel: the reversal of a string ω is denoted ωR and is obtained by writing ω in the reverse order. Thus, if ω =x 1,x2,. . .,xn, ωR =xn,xn-1,. . .,x1〉.
Example 4 (Reversal). Let Σ = {a,b,c,d,...,y,z} be an alphabet. If ω is the string saw, then ωR is the string was. If ω =madam, then ωR =madam =ω. In this case we say that ω is a palindrome.
Given a string ω, a substring of ω is a sequence formed by taking contiguous symbols of ω in the order in which they occur in ω: ωc is a substring of ω if and only if there exist (possibly empty) strings ωl and ωr such that ω =ωl ·ωc ·ωr. Two special cases of substrings are prefix and suffix: if ω =ωl ·ωc ·ωr then ωlis a prefix of ω and ωr is a suffix of ω. Note that every prefix and every suffix is a substring, but not every substring is a prefix or a suffix.
Example 5 (Substrings). Let Σ ={a,b,c,d,...,y,z} be an alphabet and ω =indistinguishable a string over Σ. Then , in, indis, indistinguish, and indistinguishable are prefixes of ω, while , e, able, distinguishable and indistinguishable are suffixes of ω. Substrings that are neither prefixes nor suffixes include distinguish, gui, and is.
Given an alphabet Σ, the set of all strings over Σ is denoted by Σ*(the reason for this notation will become clear presently). Notice that no matter what the Σ is, as long as it includes at least one symbol, Σ* is always infinite. A formal language over an alphabet Σ is any subset of Σ*. Since Σ*is always infinite, the number of formal languages over Σ is also infinite.
As the following example demonstrates, formal languages are quite unlike what one usually means when one uses the term “language” informally. They are essentially sets of strings of characters. Still, all natural languages are, at least superficially, such string sets. Higher-level notions, relating the strings to objects and actions in the world, are completely ignored by this view. While this is a rather radical idealization, it is a useful one.
Example 6 (Languages). Let Σ = {a,b,c,. . .,y,z}. Then Σ*is the set of all strings over the Latin alphabet. Any subset of this set is a language. In particular, the following are formal languages:
  • Σ*;
  • the set of strings consisting of consonants only;
  • the set of strings consisting of vowels only;
  • the set of strings each of which contains at least one vowel and at least one consonant;
  • the set of palindromes: strings that read the same from right to left and from left to right;
  • the set of strings whose length is less than 17 letters;
  • the set of single-letter strings;
  • the set {i, you, he, she, it, we, they};
  • the set of words occurring in Joyce’s Ulysses (ignoring punctuation etc.);
  • the empty set.
Note that the first five languages are infinite while the last five are finite.
We can now lift some of the string operations defined above to languages. If L is a language then the reversal of L, denoted LR, is the language {ω | ωR L}, that is, the set of reversed L-strings. Concatenation can also be lifted to languages: if L1and L2are languages, then L1 · L2is the language defined as {ω1 · ω2| ω1L1 and ω2L2}: the concatenation of two languages is the set of strings obtained by concatenating some word of the first language with some word of the second.
Example 7 (Language operations). Let L1 = {i, you, he, she, it, we, they}...

Table of contents

  1. Cover
  2. Praise for The Handbook of Computational Linguistics and Natural Language Processing
  3. Series page
  4. Dedication
  5. Title page
  6. Copyright page
  7. List of Figures
  8. List of Tables
  9. Notes on Contributors
  10. Preface
  11. Introduction
  12. Part I: Formal Foundations
  13. Part II: Current Methods
  14. Part III: Domains of Application
  15. Part IV: Applications
  16. References
  17. Author Index
  18. Subject Index