Suffix Tree

A suffix tree is a data structure used in computer science for efficiently storing and searching strings. It is a tree-like structure that represents all the suffixes of a given string. Suffix trees are commonly used in applications such as text editors, search engines, and bioinformatics.

Written by Perlego with AI-assistance

10 Key excerpts on "Suffix Tree"

  • Book cover image for: String Searching Algorithms
    Majster and Reiser's (1980) on-line construction allows the index to be built as the input string is read in a symbol at a time from left to right, but does so in greater than linear time. Ukkonen (1992b, 1993) has, however, recently developed a linear-time, on-line, suffix-tree construction algorithm. In contrast to these linear-time techniques, the straightforward, brute-force method of suffix-tree construction requires quadratic time in the worst case. Its time complexity has, however, been shown to be O(n log n) in the expected case (Apostolico and Szpankowski, 1992). Before specific suffix-tree construction strategies are examined, a general description of the data structure itself shall firstly be given, starting off with a discussion of the related suffix trie.
    4.1.1 Suffix Tries
    A trie is a type of digital search tree (Knuth, 1973), and thus represents a set of pattern strings, or keys, over a finite alphabet. The term was coined by Fredkin (1959, 1960) from 'information retrieval' for a table-based implementation, and the structure was also independently proposed, in a form employing linked lists, by de la Briandais (1959). For a set of strings over a finite alphabet Σ, each edge of the trie for the set represents a symbol from Σ, and sibling edges must represent distinct symbols. The maximum degree of any node in the trie is thus equal to |Σ|. As an example, the trie for the keywords EDGE, END, ENDING, WARP, and WASP is shown in Figure 4.1. The shaded nodes correspond to complete keywords, spelt out by the sequence of symbols represented by the edges comprising the path from the root to the node in question. Note in passing that the Aho-Corasick pattern-matching automaton employed in multiple-string searching (as discussed in Chapter 2) may be represented by the trie of the pattern strings. The nodes of the trie correspond to the states of the automaton, and the edges to the forward state-transitions.
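The keyword trie described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-dictionary layout, the `END` sentinel (standing in for the shaded nodes of Figure 4.1), and the names `build_trie` and `contains` are hypothetical and do not come from the book.

```python
# Hypothetical sentinel key: a node that carries it ends a complete keyword
# (the role played by the shaded nodes in the figure).
END = object()


def build_trie(keywords):
    """Build a trie as nested dicts; each dict key is one edge symbol."""
    root = {}
    for word in keywords:
        node = root
        for symbol in word:
            # Sibling edges carry distinct symbols, so a dict per node suffices.
            node = node.setdefault(symbol, {})
        node[END] = True
    return root


def contains(trie, word):
    """Return True only if `word` is one of the stored keywords."""
    node = trie
    for symbol in word:
        if symbol not in node:
            return False
        node = node[symbol]
    return END in node


trie = build_trie(["EDGE", "END", "ENDING", "WARP", "WASP"])
print(contains(trie, "END"))  # True: END is a complete keyword
print(contains(trie, "EN"))   # False: EN is only a prefix of keywords
```

Note that END and ENDING share the path E, N, D, which is why a marker for complete keywords is needed at interior nodes.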
  • Book cover image for: Advanced Data Structures
    If the trie nodes are realized as linked lists, the operation make Suffix Tree preprocesses a string of length n over an alphabet A in time O(|A|n) into a structure of size O(n), which supports find string queries for a string q in time O(|A| length(q)). The Suffix Tree structure turned out to be very useful for various string pattern processing tasks (Apostolico 1985; Gusfield 1997). Some applications motivated variants of the underlying structure, like parametrized strings introduced in Baker (1993) and further discussed in Kosaraju (1995) and Cole and Hariharan (2003); a parametrized string consists of characters of the underlying alphabet and variables, where all occurrences of the same variable have to be replaced by the same string. This can be viewed as an equivalence class of strings, for example, a program under renaming of variables. Another variant are the two-dimensional strings, rectangular arrays of symbols from an alphabet, which can be viewed as abstraction of images, where a two-dimensional substring corresponds to a match of a translate of a small image in the big image. Two-dimensional Suffix Trees were introduced in Giancarlo (1995) and further developed in Choi and Lam (1997) and Cole and Hariharan (2003); higher-dimensional versions are discussed in Kim, Kim, and Park (2003). Suffix Trees can also be used to find repetitions in text, which is an important subtask of dictionary-based compression methods like Lempel-Ziv. A closely related structure is the directed acyclic word graph (DAWG), which is the smallest automaton that accepts the subwords of a given word (Blumer et al. 1985; Blumer 1987; Holub and Crochemore 2002); it can also be constructed by the same algorithms as Suffix Trees (Chen and Seiferas 1987; Ukkonen 1995). Yet another variant is the affix tree studied in Maass (2003).
  • Book cover image for: Algorithms in Bioinformatics
    eBook - PDF

    Algorithms in Bioinformatics

    A Practical Introduction

    Chapter 3 Suffix Tree
    3.1 Introduction
    The suffix tree of a string is a fundamental data structure for pattern matching [305]. It has many biological applications. The rest of this book will discuss some of its applications, including
    • Biological database searching (Chapter 5)
    • Whole genome alignment (Chapter 4)
    • Motif finding (Chapter 10)
    In this chapter, we define a suffix tree and present simple applications of a suffix tree. Then, we discuss a linear suffix tree construction algorithm proposed by Farach. Finally, we discuss the variants of a suffix tree like suffix array and FM-index. We also study the application of suffix tree related data structures on approximate matching problems.
    3.2 Suffix Tree
    This section defines a suffix tree. First, we introduce a suffix trie. A trie (derived from the word retrieval) is a rooted tree where every edge is labeled by a character. It represents a set of strings formed by concatenating the characters on the unique paths from the root to the leaves of the trie. A suffix trie is simply a trie storing all possible suffixes of a string S. Figure 3.1(a) shows all possible suffixes of a string S[1..7] = acacag$, where $ is a special symbol that indicates the end of S. The corresponding suffix trie is shown in Figure 3.1(b). In a suffix trie, every possible suffix of S is represented as a path from the root to a leaf. A suffix tree is a rooted tree formed by contracting all internal nodes in the suffix trie with single child and single parent. Figure 3.1(c) shows the suffix tree made from the suffix trie in Figure 3.1(b). For each edge in the suffix tree, the edge label is defined as the concatenation of the characters on the
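The two steps in this excerpt, building the suffix trie of acacag$ and then contracting its single-child internal nodes, can be sketched as follows. This is a quadratic-time illustration, not the linear construction by Farach that the chapter goes on to discuss. The nested-dictionary representation and the function names `build_suffix_trie`, `contract`, and `count_leaves` are assumptions made for the example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
    return root


def contract(node):
    """Merge each chain of single-child nodes into one edge with a string label."""
    compact = {}
    for symbol, child in node.items():
        label = symbol
        while len(child) == 1:
            # Exactly one outgoing edge: absorb it into the current label.
            (next_symbol, grandchild), = child.items()
            label += next_symbol
            child = grandchild
        compact[label] = contract(child)
    return compact


def count_leaves(node):
    """A node without children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


tree = contract(build_suffix_trie("acacag$"))
print(sorted(tree))        # ['$', 'a', 'ca', 'g$']
print(count_leaves(tree))  # 7, one leaf per suffix of acacag$
```

The end marker $ matters here: because it occurs only once, no suffix is a prefix of another, so every suffix ends at its own leaf.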
  • Book cover image for: Analytic Pattern Matching
    eBook - PDF

    Analytic Pattern Matching

    From DNA to Twitter

    Part II APPLICATIONS
    CHAPTER 6 Algorithms and Data Structures
    The second part of the book, which begins with this chapter, is tuned towards the applications of pattern matching to data structures and algorithms on strings. In particular, we study digital trees such as tries and digital search trees, Suffix Trees, the Lempel–Ziv’77 and Lempel–Ziv’78 data compression algorithms, and string complexity (i.e., how many distinct words there are in a string).
    In the present chapter we take a break from analysis and describe some popular data structures on strings. We will present simple constructions of digital trees known as tries and digital search trees and will analyze these structures in the chapters to come.
    6.1. Tries
    A trie, or prefix tree, is an ordered tree data structure. Its purpose is to store keys, usually represented by strings, known in this context as records. Tries were proposed by de la Briandais (1959) and Fredkin (1960), who introduced the name; it is derived from retrieval. Tries belong to a large class of digital trees that store strings in one form or another. In this chapter, besides tries we also consider two other digital trees, namely Suffix Trees and digital search trees. A trie, like every recursive data structure, consists of a root and |A| subtrees, where A is an alphabet; each of the subtrees is also a trie. Tries store strings or records in leaves while their internal nodes act as routing nodes to the leaves. The edges to the subtrees are usually labeled by the symbols from A. To insert a new string into an existing trie, one follows a path dictated by the symbols of the new string until an empty terminal node is found (i.e., no other string shares the same prefix), in which the string is stored. Observe that we either store the whole string or just the suffix that is left after the path to the leaf.
  • Book cover image for: Bioinformatics Algorithms
    eBook - ePub

    Bioinformatics Algorithms

    Design and Implementation in Python

    • Miguel Rocha, Pedro G. Ferreira(Authors)
    • 2018(Publication Date)
    • Academic Press
      (Publisher)
    Figure 16.8 Example of the search for the patterns TA, in (A), and ACG, in (B), over a Suffix Tree (for the sequence TACTA). Nodes in blue (mid gray in print version) represent the walk over the tree, considering the symbols in the pattern, while nodes in red (dark gray in print version) show nodes where the symbol in the sequence does not have a matching edge. Leaves marked in blue (light gray in print version) represent all suffixes matching the pattern.
    Note that if the previous process fails in a given node, i.e. if there is no branch leaving the node marked with the symbol in the sequence, this means that the pattern does not occur. This is the case with the search for pattern ACG in Fig. 16.8B.
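The search walk described here can be sketched in Python, the language of the book this excerpt comes from. The code below is not the book's implementation. It assumes an uncompacted suffix trie stored as nested dictionaries, a `$` terminator appended to the sequence, and a hypothetical `LEAF` key under which each leaf stores the start position of its suffix.

```python
LEAF = None  # hypothetical key: a leaf stores its suffix start position here


def build_suffix_trie(sequence, terminator="$"):
    """Insert every suffix of sequence + terminator, one symbol per edge."""
    text = sequence + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node[LEAF] = start
    return root


def find_pattern(trie, pattern):
    """Walk the pattern from the root; return sorted start positions of matches."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return []  # no branch for this symbol: the pattern does not occur
        node = node[symbol]
    # Every leaf below the reached node is a suffix that starts with the pattern.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key is LEAF:
                positions.append(value)
            else:
                stack.append(value)
    return sorted(positions)


trie = build_suffix_trie("TACTA")
print(find_pattern(trie, "TA"))   # [0, 3]
print(find_pattern(trie, "ACG"))  # []
```

Positions are 0-based, so TA is reported at the start of TACTA and again at index 3, while the walk for ACG fails at the symbol G.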
    Since Suffix Trees can be used in scenarios where the target sequence is very large, one of their main problems is the amount of memory needed, since trees will become very large. Noticing that in many cases there are linear segments, i.e. sequences of nodes only with a single leaving edge, these may be compacted by considering only the first and last node of the segment, and concatenating the symbols in the path.
    An example of this process is provided by Fig. 16.9, where the tree from Fig. 16.7 is compacted. Notice that the tree is shown in two versions: the first shows edges with strings of symbols, while the latter shows a tree with ranges of positions. Indeed, since the strings in the edges are always sub-strings of the target sequence, to avoid redundancy it is sufficient to keep the starting and end positions for this string.
    Figure 16.9 Example of a compact Suffix Tree, representing the same sequence as the one in Fig. 16.7. (A) Edges show sub-strings; (B) Edges show position intervals.
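The idea of labelling edges with position intervals instead of sub-strings can be illustrated with a small sketch. It uses a simple recursive grouping of suffixes rather than the book's own code, and it assumes the sequence TACTA with a `$` terminator, 0-based inclusive intervals, and the hypothetical names `build_compact` and `edge_label`.

```python
def build_compact(text, starts=None, depth=0):
    """Build a compact suffix tree whose edges are (first, last) index pairs.

    `starts` holds the start positions of suffixes that agree on their
    first `depth` symbols.
    """
    if starts is None:
        starts = list(range(len(text)))
    groups = {}
    for s in starts:
        if s + depth < len(text):
            groups.setdefault(text[s + depth], []).append(s)
    node = {}
    for group in groups.values():
        first = group[0]
        length = 1
        # Extend the edge while every suffix in the group still agrees.
        while all(s + depth + length < len(text)
                  and text[s + depth + length] == text[first + depth + length]
                  for s in group):
            length += 1
        interval = (first + depth, first + depth + length - 1)
        node[interval] = build_compact(text, group, depth + length)
    return node


def edge_label(text, interval):
    """Recover the sub-string that an interval stands for."""
    first, last = interval
    return text[first:last + 1]


text = "TACTA$"
tree = build_compact(text)
print(sorted(tree))                                 # [(0, 1), (1, 1), (2, 5), (5, 5)]
print(sorted(edge_label(text, iv) for iv in tree))  # ['$', 'A', 'CTA$', 'TA']
```

Each edge costs two integers regardless of how long its label is, which is the redundancy saving the excerpt points to.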
    Suffix Trees can also be used for a number of different tasks when handling strings, and in particular biological sequences. Indeed, from what was explained above, it seems clear that they are useful to search for repeats of patterns in sequences, thus enabling the identification of many types of repeats in genomes.
    Also, Suffix Trees may be created from more than one sequence, which makes it possible to address problems such as discovering which sequences contain a given pattern, finding the longest sub-string shared by a set of sequences, or calculating the maximum overlap of a set of sequences.
    In Fig. 16.10
  • Book cover image for: Genome-Scale Algorithm Design
    eBook - PDF

    Genome-Scale Algorithm Design

    Biological Sequence Analysis in the Era of High-Throughput Sequencing

    • Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, Alexandru I. Tomescu(Authors)
    • 2015(Publication Date)
    The following theorem immediately follows.
    Theorem 8.18 Assume we are given ST_T and the suffix links from all its leaves. Then, the suffix links from all internal nodes of ST_T can be built in O(n) time.
    Figure 8.6 Building the suffix links of internal nodes from the suffix links of leaves. (Top) The intervals [sl(i_v)..sl(j_v)] for every internal node v (gray bars), where i_v and j_v are the first and the last position of the interval of v in SA, and the parenthesization produced by a depth-first traversal of ST. For clarity, character labels are omitted and nodes are assigned numerical identifiers. (Bottom) Left-to-right scan of the parenthesization, represented as a doubly-linked list with shortcuts connecting open and closed parentheses. Triangles show the current position of the pointer.
    8.4 Applications of the Suffix Tree
    In this section we sketch some prototypical applications of Suffix Trees. The purpose is to show the versatility and power of this fundamental data structure. The exercises of this chapter illustrate some additional uses, and later chapters revisit the same problems with more space-efficient solutions. In what follows, we often consider the Suffix Tree of a set of strings instead of just one string. Then, a property of a set S = {S_1, S_2, ..., S_d} should be interpreted as a property of the concatenation C = S_1 $_1 S_2 $_2 ··· $_{d−1} S_d #, where characters $_i and # are all distinct and do not appear in the strings of S.
    8.4.1 Maximal repeats
    A repeat in a string T = t_1 t_2 ··· t_n is a substring X that occurs more than once in T. A repeat X is right-maximal (respectively, left-maximal) if it cannot be extended to the right (respectively, to the left), by even a single character, without losing at least one of its occurrences. A repeat is maximal if it is both left- and right-maximal. For example, the maximal repeats of T = ACAGCAGT are A and CAG.
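The definition of a maximal repeat can be checked directly with a brute-force sketch. It enumerates all substrings instead of using a suffix tree, so it only illustrates the definition and not the book's suffix-tree method. The function name `maximal_repeats` is hypothetical, and the start or end of the string is treated as a context different from every character.

```python
def maximal_repeats(text):
    """Return all maximal repeats of `text` by brute force (cubic time)."""
    n = len(text)
    occurrences = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences.setdefault(text[i:j], []).append(i)

    result = []
    for sub, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat
        # Characters just before and just after each occurrence
        # (None marks the boundary of the string).
        left = {text[s - 1] if s > 0 else None for s in starts}
        right = {text[s + len(sub)] if s + len(sub) < n else None
                 for s in starts}
        # If all occurrences shared one neighbour, the repeat could be
        # extended in that direction without losing an occurrence.
        if len(left) > 1 and len(right) > 1:
            result.append(sub)
    return sorted(result)


print(maximal_repeats("ACAGCAGT"))  # ['A', 'CAG']
```

For instance, CA is a repeat but not right-maximal, because both of its occurrences are followed by G and it can be extended to CAG.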
  • Book cover image for: Data Structures and Algorithms in Java
    • Michael T. Goodrich, Roberto Tamassia, Michael H. Goldwasser(Authors)
    • 2014(Publication Date)
    • Wiley
      (Publisher)
    We must still store the different strings in S, of course, but we nevertheless reduce the space for the trie. Searching in a compressed trie is not necessarily faster than in a standard tree, since there is still need to compare every character of the desired pattern with the potentially multicharacter labels while traversing paths in the trie.
    13.3.3 Suffix Tries
    One of the primary applications for tries is for the case when the strings in the collection S are all the suffixes of a string X. Such a trie is called the suffix trie (also known as a Suffix Tree or position tree) of string X. For example, Figure 13.11a shows the suffix trie for the eight suffixes of string “minimize.” For a suffix trie, the compact representation presented in the previous section can be further simplified. Namely, the label of each vertex is a pair “j..k” indicating the string X[j..k]. (See Figure 13.11b.) To satisfy the rule that no suffix of X is a prefix of another suffix, we can add a special character, denoted with $, that is not in the original alphabet Σ at the end of X (and thus to every suffix). That is, if string X has length n, we build a trie for the set of n strings X[j..n − 1]$, for j = 0, ..., n − 1.
    Saving Space
    Using a suffix trie allows us to save space over a standard trie by using several space compression techniques, including those used for the compressed trie. The advantage of the compact representation of tries now becomes apparent for suffix tries. Since the total length of the suffixes of a string X of length n is 1 + 2 + ··· + n = n(n + 1)/2, storing all the suffixes of X explicitly would take O(n^2) space. Even so, the suffix trie represents these strings implicitly in O(n) space, as formally stated in the following proposition.
    Proposition 13.6: The compact representation of a suffix trie T for a string X of length n uses O(n) space.
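Two claims in this excerpt are easy to check numerically: the total length of all suffixes is n(n + 1)/2, and appending $ guarantees that no suffix is a prefix of another. The sketch below is in Python rather than the book's Java, to match the other examples on this page. The helper names `suffixes` and `some_suffix_is_prefix_of_another`, and the string "banana", are assumptions made for the illustration.

```python
def suffixes(x):
    """All non-empty suffixes of x, longest first."""
    return [x[j:] for j in range(len(x))]


def some_suffix_is_prefix_of_another(x):
    """True if one suffix of x is a proper prefix of a different suffix."""
    all_suffixes = suffixes(x)
    return any(a != b and b.startswith(a)
               for a in all_suffixes for b in all_suffixes)


x = "minimize"
n = len(x)
total = sum(len(s) for s in suffixes(x))
print(total, n * (n + 1) // 2)  # 36 36

# "banana" breaks the rule (the suffix "a" is a prefix of "ana"),
# and the added end marker repairs it.
print(some_suffix_is_prefix_of_another("banana"))   # True
print(some_suffix_is_prefix_of_another("banana$"))  # False
```

Storing only an index pair per vertex avoids paying for those 36 characters explicitly, which is the point of the compact representation.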
  • Book cover image for: Shared-Memory Parallelism Can be Simple, Fast, and Scalable
    • Julian Shun(Author)
    • 2017(Publication Date)
    • ACM Books
      (Publisher)
    Ukkonen 1995] and there have been many implementations of these algorithms. Although originally designed for fixed-sized alphabets with deterministic linear work, Weiner’s algorithm can work on an alphabet {0, ..., n − 1}, henceforth [n], in linear expected work simply by using hashing to access the children of a node.
    The algorithm of Weiner and its derivatives are all incremental and inherently sequential. The first parallel algorithm for Suffix Trees was given by Apostolico et al. [1988] and was based on a quite different doubling approach. For a parameter 0 < ε ≤ 1 the algorithm runs in O((1/ε) log n) depth, O((n/ε) log n) work, and O(n^(1+ε)) space on the CRCW PRAM for arbitrary alphabets. Although reasonably simple, this algorithm is likely not practical since it is not work-efficient and uses superlinear memory (by a polynomial factor). The parallel construction of Suffix Trees was later improved to linear work and polynomial space by Sahinalp and Vishkin [1994], with an algorithm taking O(log^2 n) depth on the CRCW PRAM (they note that linear space can be obtained by using hashing and randomization) and linear work and linear space by Hariharan [1994], with an algorithm taking O(log^4 n) depth on the CREW PRAM. Farach and Muthukrishnan improved the depth to O(log n) with high probability on the CRCW PRAM [Farach and Muthukrishnan 1996
  • Book cover image for: Biocomputing 2008 - Proceedings Of The Pacific Symposium
    • Russ B Altman, A Keith Dunker, Lawrence Hunter(Authors)
    • 2007(Publication Date)
    • World Scientific
      (Publisher)
    Let Σ* be the set of all possible strings (or sequences) that can be constructed using Σ. Let $ ∉ Σ be the terminal character, used to mark the end of a string. Let S = s_0 s_1 s_2 ... s_{n−1} be the input string where S ∈ Σ* and its length |S| = n. The ith suffix of S is represented as S_i = s_i s_{i+1} s_{i+2} ... s_{n−1}. For convenience, we append the terminal character to the string, and refer to it by s_n. The Suffix Tree of the string S, denoted as T_S, stores all the suffixes of S in a tree structure, where suffixes that share a common prefix lie on the same path from the root of the tree. A Suffix Tree has two kinds of nodes: internal and leaf nodes. An internal node in the Suffix Tree, except the root, has at least 2 children, where each edge to a child begins with a different character. Since the terminal character is unique, there are as many leaves in the Suffix Tree as there are suffixes, namely n + 1 leaves (counting $ as the “empty” suffix). Each leaf node thus corresponds to a unique suffix S_i. Let σ(v) denote the substring obtained by concatenating all characters from the root to node v. Each internal node v also maintains a suffix link to the internal node w, where σ(w) is the immediate suffix of σ(v). A suffix tree example is given in Fig. 1; circles represent internal nodes, square nodes denote leaves, and dashed lines indicate suffix links. Internal nodes are labeled in depth-first order, and leaf nodes are labeled by the suffix start position. The edges are also shown in the encoded form, giving the start and end positions of the edge label.
    Figure 1. Suffix Tree T_S for S = ACGACG$
    3. The Basic Trellis+ Approach
    TRELLIS+ follows the same overall approach as TRELLIS [13]. Let S denote the input sequence, which may be a single genome, or the string obtained by concatenating many sequences. TRELLIS+ follows a partitioning and merging approach to build a disk-based Suffix Tree.
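For the example string ACGACG$ from Figure 1 of this excerpt, the internal nodes and their suffix links can be listed with a short brute-force sketch. This is not the TRELLIS+ algorithm. It relies on the fact that, with a unique terminal character, the non-root internal nodes are exactly the substrings followed by at least two different characters. The function names and the use of the empty string for the root are assumptions.

```python
def internal_node_labels(text):
    """Path labels of non-root internal nodes of the suffix tree of `text`.

    `text` must already end with a unique terminal character such as "$".
    """
    followers = {}
    for i in range(len(text)):
        for j in range(i + 1, len(text)):
            # text[j] is the character that follows this occurrence of text[i:j].
            followers.setdefault(text[i:j], set()).add(text[j])
    return sorted(label for label, nxt in followers.items() if len(nxt) >= 2)


def suffix_links(text):
    """Map each internal node label to its immediate suffix ("" is the root)."""
    return {label: label[1:] for label in internal_node_labels(text)}


text = "ACGACG$"
print(internal_node_labels(text))  # ['ACG', 'CG', 'G']
print(suffix_links(text))          # {'ACG': 'CG', 'CG': 'G', 'G': ''}
print(len(text))                   # 7 leaves: n + 1 with n = 6
```

The links form the chain ACG, CG, G, root, and every link target is itself an internal node or the root, as the definition in the excerpt requires.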
  • Book cover image for: 125 Problems in Text Algorithms
    eBook - PDF
    The historically first such construction was by Kärkkäinen and Sanders [153, 154] (see [74]), then by Ko and Aluru [163] and by Kim et al. [159], followed by several others.
    48 Linear Suffix Trie
    The Suffix trie of a word can be of quadratic size according to the word length. On the contrary, its Suffix Tree requires only a linear amount of space for its storage, but the space should include the word itself. The goal is to design a Suffix trie with edges labelled by single letters and that can be stored in linear space without the word itself. This is done by adding extra nodes and a few elements to the Suffix Tree. A node of the Suffix trie of y is identified to the factor of y that labels the path from the root to the node. Nodes in the linear Suffix trie LST(y) that are not in the Suffix Tree ST(y) are of the form au, where a is a letter and u is a node of ST(y). That is, denoting by s the suffix link of the tree, s(au) = u. When nodes are added to ST(y) to create LST(y) edges are relabelled accordingly.
    Question. Show the number of extra nodes added to the Suffix Tree of a word y to create its linear Suffix trie is less than |y|.
    Labels of edges in LST(y) are reduced to the first letter of the corresponding factor as follows. If v, |v| > 1, labels the edge from u to uv in ST(y), the label of the associated edge in LST(y) is the first letter of v and the node uv is marked with the + sign to indicate the actual label is longer.
    Question. Design an algorithm that checks if x occurs in y using the linear Suffix trie LST(y) and runs in time O(|x|) on a fixed alphabet. [Hint: Edge labels can be recovered using suffix links.]
    [Figure: the linear Suffix trie of aababbab, with its Suffix Tree shown below it.]
    The above picture illustrates the linear Suffix trie of aababbab. White-coloured nodes are those of its Suffix Tree (below with explicit edge labels), doubly circled when they are suffixes.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.