Computer Science

Regular Expressions

Regular expressions are sequences of characters that define a search pattern, used for string manipulation and searching within text. They provide a powerful way to match, search, and manipulate text based on patterns, allowing for complex and flexible text processing. In computer science, regular expressions are widely used in tasks such as text parsing, data validation, and pattern matching.

Written by Perlego with AI-assistance

11 Key excerpts on "Regular Expressions"

  • Compiler Theory and Construction Handbook
    Chapter 8: Regular Expression

    In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification. The following examples illustrate a few specifications that could be expressed in a regular expression:
    • The sequence of characters car appearing consecutively in any context, such as in car, cartoon, or bicarbonate
    • The sequence of characters car occurring in that order with other characters between them, such as in Icelander or chandler
    • The word car when it appears as an isolated word
    • The word car when preceded by the word blue or red
    • The word car when not preceded by the word motor
    • A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, $100 or $245.99)

    Regular expressions can be much more complex than these examples. They are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. Some of these languages, including Perl, Ruby, Awk, and Tcl, have fully integrated regular expressions into the syntax of the core language itself. Others, like C, C++, .NET, Java, and Python, instead provide access to regular expressions only through libraries. Utilities provided by Unix distributions, including the editor ed and the filter grep, were the first to popularize the concept of regular expressions. As an example of the syntax, the regular expression \bex can be used to search for all instances of the string ex that occur after word boundaries (signified by the \b).
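The specifications listed in this excerpt can be sketched with Python's standard `re` module. This is only an illustration under stated assumptions: the pattern names and test strings are invented here, and the "preceded by blue or red" case is approximated by matching the whole phrase rather than with a lookbehind.

```python
import re

# Hypothetical pattern names; each mirrors one specification from the list above.
CAR_ANYWHERE = re.compile(r"car")                      # "car" in any context
CAR_SPREAD = re.compile(r"c\w*a\w*r")                  # c, a, r in order within one word
CAR_WORD = re.compile(r"\bcar\b")                      # "car" as an isolated word
CAR_AFTER_COLOUR = re.compile(r"\b(?:blue|red) car\b") # whole phrase "blue car" / "red car"
CAR_NOT_AFTER_MOTOR = re.compile(r"(?<!motor )\bcar\b")# negative lookbehind for "motor "
DOLLARS = re.compile(r"\$\d+(?:\.\d{2})?")             # $ digits, optional .dd
EX_AFTER_BOUNDARY = re.compile(r"\bex")                # "ex" right after a word boundary

if __name__ == "__main__":
    print(bool(CAR_ANYWHERE.search("bicarbonate")))
    print(bool(CAR_SPREAD.search("chandler")))
    print(bool(CAR_WORD.search("cartoon")))
    print(DOLLARS.findall("costs $100 or $245.99"))
    print(EX_AFTER_BOUNDARY.findall("an example of excess in a text"))
```

Note that `\b` only has its word-boundary meaning inside a raw string (`r"..."`); in an ordinary Python string literal it would be read as a backspace character.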
  • Computing Skills for Biologists
    Chapter 5: Regular Expressions

    5.1 What Are Regular Expressions?

    Sometimes, you need to extract data from text. For example, you might want to extract all the protein accession numbers from a paper, the DNA motifs from sequence data, or the geographical coordinates of your sample sites from a large and complicated text file. Often, it is not feasible to search for all possible occurrences exactly as they appear in the text, but you can describe the pattern you’re looking for in your own words (e.g., find all words starting with 3 uppercase letters, followed by 4 digits). The question is how to explain such a pattern to a computer. The answer is to use regular expressions.

    Regular expressions are used to find a match for a particular pattern in a string of text. We’ve already used them in section 1.6.5: the Unix command grep stands for “global regular expression print.” There, we conducted exclusively literal searches, meaning that we searched for lines containing an exact match of the input we provided. The power of regular expressions, however, is that we can use special syntax to describe patterns in a general way (e.g., find anything that looks like a Latin binomial), and then easily list all the occurrences of a pattern in a string or text file.

    5.2 Why Use Regular Expressions?

    Ask several programmers what they think about regular expressions, and you might hear that they are the greatest thing since sliced bread, or one of the greatest nuisances ever invented. Despite the polarized opinion, for many biological problems, regular expressions can save the day. They can be used to collect information: Search for patterns corresponding to structural or functional features in sequence data (e.g., degenerated primer binding sites, transcription factor binding sites). Similarly, simple searches can match accession and gene numbers, or extract references from a manuscript.
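The verbal pattern mentioned in this excerpt ("words starting with 3 uppercase letters, followed by 4 digits") might be expressed as follows. This is a minimal sketch: the name `ACCESSION_LIKE` and the sample sentence are made up for illustration, and the pattern assumes the word consists of exactly three letters and four digits.

```python
import re

# Exactly three uppercase ASCII letters followed by exactly four digits,
# delimited by word boundaries on both sides.
ACCESSION_LIKE = re.compile(r"\b[A-Z]{3}[0-9]{4}\b")

# Invented sample text; not taken from any real data set.
text = "Samples ABC1234 and XYZ9876 were kept; abc1234 and ABCD123 were discarded."
matches = ACCESSION_LIKE.findall(text)

if __name__ == "__main__":
    print(matches)
```

Loosening the pattern (for example, allowing more digits with `[0-9]{4,}`) is a one-character change, which is much of the appeal over listing every literal occurrence.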
  • Learn Linux Shell Scripting – Fundamentals of Bash 4.4

    Learn Linux Shell Scripting – Fundamentals of Bash 4.4

    A comprehensive guide to automating administrative tasks with the Bash shell

    Regular Expressions

    This chapter introduces Regular Expressions, and the main commands that we can use to leverage their power. We'll first look at the theory behind Regular Expressions, before moving deeper into practical examples of using Regular Expressions with grep and sed.
    We will also explain globbing, and how it is used on the command line.
    The following commands will be introduced in this chapter: grep, set, egrep, and sed. The following topics will be covered in this chapter:
    • What are Regular Expressions?
    • Globbing
    • Using Regular Expressions with egrep and sed

    Technical requirements

    All scripts for this chapter can be found on GitHub: https://github.com/tammert/learn-linux-shell-scripting/tree/master/chapter_10 . Other than this, the Ubuntu virtual machine is still our way of testing and running the scripts in this chapter.

    Introducing Regular Expressions

    You might have heard the term regular expression, or regex, before. For many people, a regular expression is something that seems very complicated, and is often plucked somewhere from the internet or a textbook, without fully grasping what it does.
    While that is fine for completing a set task, understanding Regular Expressions better than the average systems administrator really allows you to differentiate yourself, both in creating scripts and working on the Terminal.
    A nicely tailored regular expression can really help you keep your scripts short, simple, and robust to changes in the future.

    What is a regular expression?

    In essence, a regular expression is a piece of text that functions as a search pattern for other text. Regular Expressions make it possible to easily say, for example, that I want to select all lines that contain a word that is five characters in length, or look for all files that end in .log.
    An example might help with your understanding. First, we need a command that we can use to explore Regular Expressions. The most famous command used in Linux with Regular Expressions is grep.
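The two tasks named in this excerpt (selecting lines that contain a five-character word, and finding file names that end in .log) can be sketched in Python rather than grep. The pattern names, sample lines, and file names below are assumptions made for the illustration, and "word" is taken to mean a run of ASCII letters.

```python
import re

# A word of exactly five ASCII letters, bounded on both sides.
FIVE_LETTER_WORD = re.compile(r"\b[A-Za-z]{5}\b")
# A name whose final four characters are ".log".
LOG_SUFFIX = re.compile(r"\.log$")

# Invented sample data.
lines = ["a tiny cat", "hello there", "extraordinary", "seven words"]
names = ["syslog", "auth.log", "auth.log.1", "notes.txt"]

selected = [line for line in lines if FIVE_LETTER_WORD.search(line)]
log_files = [name for name in names if LOG_SUFFIX.search(name)]

if __name__ == "__main__":
    print(selected)
    print(log_files)
```

The backslash before the dot matters: an unescaped `.` matches any character, so `.log$` would also accept a name such as `syslog`.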
  • Learning AWK Programming
    Generally, all editors have the ability to perform search-and-replace operations. Some editors can only search for patterns, others can also replace them, and others can also print the line containing that pattern. A regular expression goes many steps beyond this simple search, replace, and printing functionality, and hence it is more powerful and flexible. We can search for a word of a certain size, such as a word that has four characters or numbers. We can search for a word that ends with a particular character, let's say e. You can search for phone numbers, email IDs, and so on, and can also perform validation using Regular Expressions. They simplify complex pattern-matching tasks, and hence form an important part of AWK programming. Other regular expression variations also exist, notably those for Perl.

    Using Regular Expressions with AWK

    There are mainly two types of Regular Expressions in Linux:
    • Basic Regular Expressions that are used by vi, sed, grep, and so on
    • Extended Regular Expressions that are used by awk, nawk, gawk, and egrep
    Here, we will refer to extended Regular Expressions as Regular Expressions in the context of AWK. In AWK, Regular Expressions are enclosed in forward slashes, '/', (forming the AWK pattern) and match every input record whose text belongs to that set.
    The simplest regular expression is a string of letters, numbers, or both that matches itself. For example, here we use the ly regular expression string to print all lines that contain the ly pattern in them. We just need to enclose the regular expression in forward slashes in AWK:
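As a rough Python analogue of an AWK rule such as /ly/ (this is not the book's AWK listing), the sketch below keeps every line in which the pattern matches somewhere. The helper name `matching_lines` and the sample lines are hypothetical.

```python
import re

def matching_lines(pattern, lines):
    """Return the lines in which the pattern matches anywhere, in input order."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Invented sample input standing in for the records AWK would read.
sample = ["The sky was only slightly grey.", "No match here.", "Fly away."]
kept = matching_lines(r"ly", sample)

if __name__ == "__main__":
    for line in kept:
        print(line)
```

Using `search` rather than `match` mirrors the behaviour described above, where the pattern may occur at any position in the record.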
  • Regular Expressions

    Regular Expressions

    Pocket Primer

    CHAPTER 1
    INTRODUCTION TO Regular Expressions
    This chapter introduces you to basic Regular Expressions, often abbreviated as REs, that will prepare you for the material in subsequent chapters. The REs in this chapter are illustrated via the Unix grep utility that is available on any Unix-related platform, including Linux and MacBook (OS X). If you are a complete neophyte, you’ll learn a decent variety of REs by the time you have finished reading this chapter.
    In fact, this chapter does not require you to understand any of the deeper theory that underlies REs: simply launch the grep (or egrep) utility from the command line to see the result of matching REs to various strings. In most cases, the text strings are placed in text files so that the REs can be tested against multiple strings simultaneously.
    In essence, this chapter acts as “ground zero” for REs, starting from the simplest search strings (i.e., hard-coded strings), to search strings that contain REs involving uppercase letters, lowercase letters, numbers, special characters, and various combinations of such strings.
    If you have some experience working with REs, skim through the code samples in this chapter (you might find something new to you). If you are impatient, see if you can explain the purpose of the following RE: [^ ]*?@[^ ]*?\.[^ ]* . If you know the answer, then you can probably go directly to Chapter 2.
    The first section in this chapter (which comprises most of the chapter) contains code snippets that illustrate how to perform very simple pattern matching with lines of text in a text file. This section also introduces the metacharacters ^, $, ., \, and ?
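One way to explore the RE posed to impatient readers above is to run it against a sample string. The sketch below uses Python's `re` module instead of grep, which is an assumption on our part: the lazy quantifier `*?` is supported by Python but not by every grep variant. The sample sentence and address are invented.

```python
import re

# The RE quoted in the excerpt: non-space characters, "@", more non-space
# characters, a literal dot, then any remaining non-space characters.
PATTERN = re.compile(r"[^ ]*?@[^ ]*?\.[^ ]*")

sample = "contact me at jane@example.com today"  # invented sample text
match = PATTERN.search(sample)
found = match.group(0) if match else None

if __name__ == "__main__":
    print(found)
```

Because `[^ ]` excludes only the space character, the pattern picks out a space-delimited token containing an "@" followed later by a dot, which is a loose approximation of an email address rather than a validator.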
  • Python for Linguists
    An RE is a string defined in terms of the recursive operations above: concatenation, union, and Kleene star. An RE itself is a finite sequence of symbols, but it defines a potentially infinite set of strings. For example, the RE ab*c is a finite sequence of symbols, but defines an infinite set: {ac, abc, abbc, abbbc, ...}. Similarly, (ab)|(c*) defines the infinite set {ab, ε, c, cc, ccc, ...}. We use ε to indicate the empty string.

    The general idea in pattern matching is that if the string we are matching against contains any of the strings that the relevant RE defines, then there is a match. For example, if we try to match the RE a|b against the string pancake, we have a match because the RE defines the string set {a, b} and pancake contains the substring a. Similarly, the RE a* will match any string because the set of strings it defines includes ε and every string definitionally includes the empty string.

    The following chart gives examples of simple REs (along the left) and whether they match various strings (along the top). A ✓ marks a match and a - marks no match.

    RE        a    b    ab   acb  ba
    a         ✓    -    ✓    ✓    ✓
    ab        -    -    ✓    -    -
    a|b       ✓    ✓    ✓    ✓    ✓
    a*        ✓    ✓    ✓    ✓    ✓
    a|(bc)    ✓    -    ✓    ✓    ✓
    (a|b)c    -    -    -    ✓    -
    a*|b      ✓    ✓    ✓    ✓    ✓
    a|(b*)    ✓    ✓    ✓    ✓    ✓
    (a|b)*    ✓    ✓    ✓    ✓    ✓
    a(b*)     ✓    -    ✓    ✓    ✓
    (ab)*     ✓    ✓    ✓    ✓    ✓

    Most programming languages use REs to do pattern matching because by using these it is possible to be extremely efficient when checking whether some string matches some pattern. If we were to enrich our pattern-matching system significantly beyond the three operations listed above, we would lose this efficiency. There is a tradeoff, however. With this efficient system, there are patterns we cannot specify. Most programming languages, Python included, thus go slightly beyond REs in what they allow in pattern-matching syntax.

    Footnote 2: Formally, one can show that these are more restricted than a phrase-structure grammar.
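The chart in this excerpt can be checked mechanically. The sketch below assumes that Python's `re.search`, which looks for a match anywhere in the string, corresponds to the excerpt's notion of a string "containing" one of the strings an RE defines; the names `contains_match` and `table` are introduced here for illustration.

```python
import re

patterns = ["a", "ab", "a|b", "a*", "a|(bc)", "(a|b)c",
            "a*|b", "a|(b*)", "(a|b)*", "a(b*)", "(ab)*"]
strings = ["a", "b", "ab", "acb", "ba"]

def contains_match(pattern, string):
    """True if some substring of `string` (possibly empty) is matched by `pattern`."""
    return re.search(pattern, string) is not None

# One row of booleans per RE, in the same column order as `strings`.
table = {p: [contains_match(p, s) for s in strings] for p in patterns}

if __name__ == "__main__":
    for pattern, row in table.items():
        marks = "  ".join("Y" if hit else "-" for hit in row)
        print(f"{pattern:8} {marks}")
```

Any RE that can match the empty string, such as a* or (ab)*, yields a full row, which is the point the excerpt makes about every string including the empty string.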
  • WORKING WITH grep, sed, AND awk Pocket Primer
    Chapter 6: Regular Expressions

    This chapter explores regular expressions, a very powerful language feature in many programming languages (such as JavaScript and Java). Consequently, the knowledge that you gain from the material in this chapter will be useful to you outside of awk. Although you have seen examples of regular expressions in previous chapters, this chapter consolidates those code samples and provides a more extensive discussion of the variety of regular expressions that you can define in an awk command. As a result, this chapter contains a mixture of code blocks and complete code samples, with varying degrees of complexity, that are suitable for beginners as well as people who have had some exposure to regular expressions.

    There is also a good chance that you have used regular expressions in commands that you have launched from the command line on a laptop, whether it be Windows, Unix, or Linux-based systems. Examples of such commands involve the DIR command on Windows for listing files with a given suffix, and the ls command for performing the same action on a Linux machine or MacBook. In this chapter, you will learn how to define and use more complex regular expressions than the regular expressions that you have used from the command line.

    The first part of this chapter discusses metacharacters and character classes, followed by code samples that define regular expressions with digits and letters (uppercase as well as lowercase), and how to use character classes in regular expressions. The second portion contains code samples with regular expressions involving metacharacters, such as “.”, “^”, “$”, and “|”. In addition, you will also learn how to match subsets of strings via regular expressions. The third portion of this chapter shows you how to use the built-in sub() and gsub() functions in awk to remove digits, characters, and consecutive characters via awk commands.
  • Applied Automata Theory
    An Introduction to Regular Expressions
    Robert McNaughton, Rensselaer Polytechnic Institute, Troy, New York

    1. Introduction
    2. Definitions
    3. Some Laws
    4. Historical Remarks
    5. Establishing Equations
    6. Proof by Reparsing
    7. Logical Systems
    8. Graph Manipulation
    9. Star Height
    References

    1. INTRODUCTION

    Although this chapter is written as an introduction, it will seek to develop a new point of view toward regular expressions. Later sections will describe some of the current research interests in a nontechnical fashion. Brzozowski's survey [1], written several years ago, gives a good account of the research problems that were then current. This chapter is a slightly revised version of an article in the work edited by Hart and Takasu [1a].

    2. DEFINITIONS

    To begin with, regular expressions are expressions standing for regular events (or regular languages), which are certain sets of words. A word is a string of symbols over some alphabet, denoted by Σ. Although Σ is in general any finite set, very often the alphabet whose symbols are 0 and 1 will be used for examples. Regular expressions are made up of letters of the alphabet, and signs standing for certain operators. The three operators are union, concatenation, and star (or closure). Union is the ordinary set-theoretic union, whose sign is ∪. Concatenation is written as a dot and sometimes is denoted by mere juxtaposition. The concatenation of two words is obtained by writing the first word, and then the second word following it without any space. The concatenation of two sets of words tends to be thought of as a Cartesian product: however, it is not quite that. Let α and β be two events. α·β or αβ, the concatenation of α and β, is the set of all words that can be obtained by concatenating a word from α and a word from β in that order. For example, if α = {0, 01, 001} and β = {1, 11}, then αβ = {01, 011, 0111, 0011, 00111}.
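The worked example of concatenating two events can be reproduced with ordinary Python sets. This is a minimal sketch; the function name `concatenate` is introduced here and the Greek letters are spelled out as variable names.

```python
def concatenate(alpha, beta):
    """Set of all words u + v with u taken from alpha and v taken from beta."""
    return {u + v for u in alpha for v in beta}

alpha = {"0", "01", "001"}
beta = {"1", "11"}
product = concatenate(alpha, beta)

if __name__ == "__main__":
    # Sort by length, then alphabetically, for a stable display order.
    print(sorted(product, key=lambda w: (len(w), w)))
    print(len(product), "distinct words from", len(alpha) * len(beta), "pairs")
```

The six ordered pairs yield only five distinct words because 0·11 and 01·1 both produce 011, which illustrates why the excerpt says concatenation is not quite a Cartesian product.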
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.