
Huffman Coding

Huffman coding is a method of lossless data compression that assigns variable-length codes to input characters based on their frequencies: frequent characters receive short codewords and rare characters receive long ones. It is an optimal prefix code, meaning that for a given set of symbol frequencies no other prefix code achieves a smaller average codeword length. Huffman coding is widely used in file compression algorithms, such as those used to create ZIP files, to reduce the size of data for storage or transmission.

Written by Perlego with AI-assistance

12 Key excerpts on "Huffman Coding"

  • Classical and Quantum Information Theory: An Introduction for the Telecom Scientist
    9 Optimal coding and compression
    The previous chapter introduced the concept of coding optimality, as based on variable-length codewords. As we have learnt, an optimal code is one for which the mean codeword length closely approaches or is equal to the source entropy. There exist several families of codes that can be called optimal, as based on various types of algorithms. This chapter, and the following, will provide an overview of this rich subject, which finds many applications in communications, in particular in the domain of data compression. In this chapter, I will introduce Huffman codes, and then I will describe how they can be used to perform data compression to the limits predicted by Shannon. I will then introduce the principle of block codes, which also enable data compression.
    9.1 Huffman codes
    As we have learnt earlier, variable-length codes are in the general case more efficient than fixed-length ones. The most frequent source symbols are assigned the shortest codewords, and the reverse for the less frequent ones. The coding-tree method makes it possible to find some heuristic codeword assignment, according to the above rule. Despite the lack of further guidance, the result proved effective, considering that we obtained η = 96.23% with a ternary coding of the English-character source (see Fig. 8.3, Table 8.3). But we have no clue as to whether other coding trees with greater coding efficiencies may ever exist, unless we try out all the possibilities, which is impractical. The Huffman Coding algorithm provides a near-final answer to the above code-optimality issue.
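As a side note on the efficiency figure quoted in this excerpt: the coding efficiency η is the ratio of the source entropy to the mean codeword length. A minimal Python sketch of that calculation (illustrative only; the probabilities and code lengths below are made up, not the English-character source of Table 8.3):

```python
import math

def coding_efficiency(probs, code_lengths, radix=2):
    """eta = H(S) / (mean codeword length), both measured in radix-ary units."""
    entropy = -sum(p * math.log(p, radix) for p in probs if p > 0)
    mean_length = sum(p * n for p, n in zip(probs, code_lengths))
    return entropy / mean_length

# Hypothetical dyadic source: a Huffman code matches the entropy exactly here.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
print(f"eta = {coding_efficiency(probs, lengths):.4f}")   # eta = 1.0000
```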
  • The Communications Handbook
    • Jerry D. Gibson (Author)
    • 2018 (Publication Date)
    • CRC Press (Publisher)
    We then describe universal coding techniques, which do not require any a priori knowledge of the statistics of the source. Finally, we look at two of the three most popular areas for lossless compression: text and images. The third area, compression of facsimile, is covered in the next chapter.
    93.2 Entropy Coders
    The idea behind entropy coding is very simple: use shorter codes for more frequently occurring symbols (or sets of symbols). This idea has been around for a long time and was used by Samuel Morse in the development of Morse code. As the codes generated are of variable length, it is essential that a sequence of codewords be decoded to a unique sequence of symbols. One way of guaranteeing this is to make sure that no codeword is a prefix of another code. This is called the prefix condition, and codes that satisfy this condition are called prefix codes. The prefix condition, while sufficient, is not necessary for unique decoding. However, it can be shown that given any uniquely decodable code that is not a prefix code, we can always find a prefix code that performs at least as well in terms of compression. Because prefix codes are also easier to decode, most of the work on lossless coding has dealt with prefix codes. The entropy of the source is H(S) = −Σ_{i=1}^{m} P(X_i) log P(X_i).
    Huffman Codes
    The Huffman Coding algorithm was developed by David A. Huffman as part of a class assignment [Huffman, 1952]. The algorithm is based on two observations about optimum prefix codes:
    1. In an optimum code, symbols that occur more frequently will have shorter codewords than symbols that occur less frequently.
    2. In an optimum code, the two symbols that occur least frequently will have the same length.
    The Huffman procedure is obtained by adding the simple requirement that the codewords corresponding to the two lowest probability symbols differ only in the last bit.
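The two observations above lead almost directly to an implementation: repeatedly merge the two least frequent subtrees, so the two least probable symbols become siblings whose codewords differ only in the last bit. A minimal Python sketch of this procedure (not the handbook's code; the frequencies are hypothetical):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a binary Huffman code from a {symbol: frequency} map.

    Assumes at least two distinct symbols.
    """
    tie = count()  # breaks ties so heapq never compares the code dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, high = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in low.items()}
        merged.update({s: "1" + w for s, w in high.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical frequencies, just to exercise the sketch.
print(huffman_code({"e": 12, "t": 9, "a": 8, "o": 7, "q": 1}))
```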
  • Efficient C/C++ Programming: Smaller, Faster, Better
    CHAPTER: Cn U Rd Ths Qkly? A Data Compression Utility
    TOPICS DISCUSSED: Huffman Coding, Arithmetic Coding, Lookup Tables, Assembly Language Enhancements
    Introduction
    In this chapter we will examine the Huffman Coding and arithmetic coding methods of data compression and develop an implementation of the latter algorithm. The arithmetic coding algorithm allows a tradeoff between memory consumption and compression ratio; our emphasis will be on minimum memory consumption, on the assumption that the result would eventually be used as embedded code within a larger program. We will also be applying one of the most powerful methods of improving the efficiency of a given algorithm: recoding critical areas of the program in assembly language. While the exact details of this enhancement are specific to machines based on the 80x86 architecture, the principles that we will use to focus our effort are generally applicable.
    Huffman Coding
    Huffman Coding is widely recognized as the most efficient method of encoding characters for data compression. This algorithm is a way of encoding different characters in different numbers of bits, with the most common characters encoded in the fewest bits. For example, suppose we have a message made up of the letters 'A', 'B', and 'C', which can occur in any combination. Figure 4.1 shows the relative frequency of each of these letters and the Huffman code assigned to each one. The codes are determined by the frequencies: as mentioned above, the letter with the greatest frequency, 'C', has the shortest code, of one bit. The other two letters, 'A' and 'B', have longer codes. On the other hand, the simplest code to represent any of three characters would use two bits for each character. How would the length of an encoded message be affected by using the Huffman code rather than the fixed-length one? Let's encode the message CABCCABC using both codes.
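Carrying out that comparison in a small Python sketch (Python rather than the book's C/C++; the bit patterns are assumptions consistent with the text, where the most frequent letter 'C' gets a one-bit code and 'A' and 'B' get two bits):

```python
# Assumed code tables; only the codeword lengths matter for the size comparison.
huffman_code = {"C": "0", "A": "10", "B": "11"}
fixed_code   = {"C": "00", "A": "01", "B": "10"}

message = "CABCCABC"
for name, table in (("Huffman", huffman_code), ("fixed-length", fixed_code)):
    bits = "".join(table[ch] for ch in message)
    print(f"{name:12s} {bits}  ({len(bits)} bits)")
# Huffman encodes the message in 12 bits versus 16 bits for the 2-bit fixed code.
```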
  • Algorithms and Theory of Computation Handbook, Volume 1
    Coding the text is just replacing each symbol (more exactly, each occurrence of it) by its new codeword. The method works for any length of blocks (not only 8 bits), but the running time grows exponentially with the length. In the following, we assume that symbols are originally encoded on 8 bits to simplify the description. The Huffman algorithm uses the notion of prefix code. A prefix code is a set of words containing no word that is a prefix of another word of the set. The advantage of such a code is that decoding is immediate. Moreover, it can be proved that this type of code does not weaken the compression. A prefix code on the binary alphabet {0, 1} corresponds to a binary tree in which the links from a node to its left and right children are labeled by 0 and 1, respectively. Such a tree is called a (digital) trie. Leaves of the trie are labeled by the original characters, and labels of branches are the words of the code (codewords of characters). Working with a prefix code implies that codewords are identified with leaves only. Moreover, in the present method codes are complete: they correspond to complete tries, i.e., trees in which all internal nodes have exactly two children. In the model where characters of the text are given new codewords, the Huffman algorithm builds a code that is optimal in the sense that the compression is the best possible (if the model of the source text is a zero-order Markov process, that is, if the probabilities of symbol occurrences are independent). The length of the encoded text is minimum. The code depends on the input text, and more precisely on the frequencies of characters in the text. The most frequent characters are given the shortest codewords, while the least frequent symbols correspond to the longest codewords.
    14.2.1 Encoding
    The complete compression algorithm is composed of three steps: counting the character frequencies, constructing the prefix code, and encoding the text.
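The trie view described in this excerpt also makes decoding immediate: walk the binary trie bit by bit and emit a symbol whenever a leaf is reached. A minimal Python sketch of that idea (not the handbook's code; the example code table is hypothetical but complete in the sense used above):

```python
def build_trie(code):
    """Build the binary (digital) trie of a prefix code {symbol: codeword}.

    Internal nodes are dicts keyed by '0'/'1'; leaves are the symbols.
    A ValueError means the prefix condition is violated.
    """
    root = {}
    for sym, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
            if not isinstance(node, dict):
                raise ValueError(f"{sym!r}: prefix condition violated")
        if word[-1] in node:
            raise ValueError(f"{sym!r}: prefix condition violated")
        node[word[-1]] = sym
    return root

def decode(bits, trie):
    """Decoding is immediate: follow branches, emit a symbol at each leaf."""
    out, node = [], trie
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):   # reached a leaf
            out.append(node)
            node = trie
    return "".join(out)

code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # hypothetical complete code
print(decode("0100110111", build_trie(code)))          # -> "abacd"
```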
  • Embedded DSP Processor Design: Application Specific Instruction Set Processors
    However, the file after compression and decompression must exactly match the original information. The principle of lossless compression is to find and remove any redundant information in the data. For example, when encoding the characters in a computer system, the length of the code assigned to each character is the same. However, some characters may appear more often than the others, thus making it more efficient to use shorter codes for representing these characters. A common lossless compression scheme is Huffman Coding, which has the following properties:
    ■ Codes for more probable symbols are shorter than those for less probable symbols.
    ■ Each code can be uniquely decoded.
    A Huffman tree is used in Huffman Coding. This tree is built based on statistical measurements of the data to be encoded. As an example, the frequencies of the different symbols in the sequence ABAACDAAAB are calculated and listed in Table 1.4. The Huffman tree is illustrated in Figure 1.14. In this case, there are four different symbols (A, B, C, and D), and at least two bits per symbol are needed. Thus 10 × 2 = 20 bits are required for encoding the string ABAACDAAAB. If the Huffman codes in Table 1.4 are used, only 6×1 + 2×2 + 1×3 + 1×3 = 16 bits are needed. Thus four bits are saved. Once the Huffman codes have been decided, the code of each symbol can be found from a simple lookup table. Decoding can be illustrated by the same example. Assume that the bit stream 01000110111 was generated from the Huffman codes in Table 1.4. This binary code will be translated in the following way: 0 → A, 10 → B, 0 → A, 0 → A, 110 → C, and 111 → D. Obviously the Huffman tree must be known to the decoder.
    Table 1.4 Symbol Frequencies and Huffman Codes
    Symbol   Frequency   Normal code   Huffman code
    A        6           00            0
    B        2           01            10
    C        1           10            110
    D        1           11            111
    [Figure 1.14: Huffman tree.]
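The decoding walk-through above is easy to reproduce. A short Python sketch (not the book's code) using the Table 1.4 codes:

```python
CODES = {"A": "0", "B": "10", "C": "110", "D": "111"}   # Table 1.4 Huffman codes
DECODE = {bits: sym for sym, bits in CODES.items()}

def decode(stream):
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in DECODE:          # prefix property: the first match is the symbol
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

encoded = "".join(CODES[s] for s in "ABAACDAAAB")
print(encoded, f"({len(encoded)} bits, versus 10 x 2 = 20 fixed-length bits)")
print(decode("01000110111"))       # -> ABAACD, as in the walk-through above
```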
  • Still Image and Video Compression with MATLAB
    CHAPTER 5: LOSSLESS CODING
    5.1 INTRODUCTION
    The input to a lossless coder is a symbol and the output is a codeword, typically a binary codeword. We are familiar with ASCII codes for the English alphabet. In an image compression system, a symbol could be a quantized pixel or a quantized transform coefficient. Then, the codeword assigned to the symbol by the lossless coder is either transmitted or stored. If the lengths of all codewords assigned to the different symbols are fixed or constant, then the lossless coder is called a fixed rate coder. On the other hand, if the code lengths of the codewords are variable, then the coder is a variable-length coder. A pulse code modulator assigns equal length codewords to its input symbols. Later in the chapter, we will see that a Huffman coder assigns variable-length codewords to its input symbols. In order for a lossless coder to assign codewords to its input symbols, it must first have the codewords precomputed and stored. We will discuss a method to generate the codebook of a Huffman coder later in the chapter.
    A lossless decoder performs the inverse operation, which is to output a symbol corresponding to its input—a codeword. Therefore, it is imperative that a lossless decoder has the same codebook as the encoder. The practicality of a lossless coder depends on the size of the codebook it requires to encode its input symbols as well as its encoding complexity. A Huffman coder may require a large codebook, while an arithmetic coder requires no codebook at all, though it may be more complex than the Huffman coder. The choice of a particular lossless coder depends not only on the above-mentioned two factors but also on achievable compression.
    Before we describe various lossless encoding methods and their implementation, we must have a rudimentary knowledge of information theory. Information theory gives us a means to quantify a source of symbols in terms of its average information content as well as the theoretical limit of its achievable lossless compression. This is important because such quantitative means enable us to compare the performance of different compression schemes. Therefore, we will briefly describe a few basic definitions of information theory and then proceed with the lossless coders.
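The "average information content" mentioned here is the first-order entropy of the source, which lower-bounds the average codeword length of any lossless coder. A small Python sketch (illustrative only; the sample sequence is arbitrary):

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """First-order entropy: average information content in bits/symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "ABAACDAAAB"                       # arbitrary sample sequence
print(f"{entropy_bits_per_symbol(sample):.3f} bits/symbol")
# No lossless coder can average fewer bits/symbol than this on a memoryless source.
```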
  • Document and Image Compression
    • Mauro Barni (Author)
    • 2018 (Publication Date)
    • CRC Press (Publisher)
    . . . , |A|^N}. By applying Huffman Coding to the new symbols, it is possible to reduce the average number of bits per original symbol. For example, in the above case of |A| = 3, consider a block of N = 5 occurrences of original symbols {a_i, i = 1, . . . , 3}, generating new symbols {b_i, i = 1, . . . , 243}. Even without formally applying the Huffman Coding procedure, we note that a fixed codeword length of L = 8 bits can be used to uniquely represent the 243 possible blocks of N = 5 original symbols. This gives us a length of 8/5 = 1.6 bits/original symbol, significantly closer to the bound of 1.585 bits/symbol. In this simple case it was possible to use a suitable block size to get close to the bound. With a large set of symbols and widely varying probabilities, an alternative method called arithmetic coding, which we describe below, proves to be more effective in approaching the bound in a systematic way.
    5.2.3.2 Arithmetic Coding
    Huffman Coding, introduced in the previous section, was the entropy coder of choice for many years. Huffman Coding in its basic form has some deficiencies, namely that it does not meet the entropy unless all probabilities are integral powers of 1/2, it does not code conditional probabilities, and the probabilities are fixed. All these deficiencies may be addressed by alphabet extension, by having multiple Huffman codes (one for each set of probabilities), and by using adaptive Huffman Coding, evaluating and possibly redesigning the Huffman tree when the (estimated) probabilities change. All approaches are used in practice, but especially if all are to be used at once, the complexity becomes prohibitive. An alternative is given by arithmetic coding [43], which elegantly deals with all these issues. Arithmetic coding is often referred to as coding by interval subdivision.
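The block-coding arithmetic at the start of this excerpt is easy to reproduce. Assuming, as the example implies, three roughly equiprobable original symbols grouped into blocks of N = 5, the 3^5 = 243 blocks can be given fixed 8-bit codewords, i.e. 1.6 bits per original symbol against the log2(3) ≈ 1.585 bound. A small sketch:

```python
import math

def block_code_rate(alphabet_size, block_size):
    """Bits per original symbol when each block of symbols gets a fixed-length code."""
    n_blocks = alphabet_size ** block_size           # e.g. 3**5 = 243 new symbols
    bits_per_block = math.ceil(math.log2(n_blocks))  # fixed codeword length, e.g. 8
    return bits_per_block / block_size

print(block_code_rate(3, 5))   # 1.6 bits/original symbol
print(math.log2(3))            # ~1.585, the bound for equiprobable symbols
```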
  • Lossless Compression Handbook
    A Tunstall code associates variable-length strings of symbols to a set of fixed-length codes rather than associating a code to every symbol. We can build a codebook with the first 2^k − n codes (for fixed-length codes of k bits) associated with the 2^k − n most a priori likely strings, under i.i.d. assumptions, and keep n codes for literals, because not all possible strings will be present in the dictionary. We could also consider using another algorithm altogether, one that naturally extracts repeated strings from the input. One such algorithm is Welch's variation [12] on the Ziv-Lempel dictionary-based scheme [13], known as LZW. We could also consider the use of arithmetic coding and estimate the probabilities with a small-order probabilistic model, like an order-m Markov chain or a similar mechanism.
    4.3.4 Length-Constrained Huffman Codes
    One may want to limit in length the codes generated by Huffman's procedure. There are many possible reasons for doing so. One possible reason for limiting the length of Huffman-type codes is to prevent getting very long codes when symbol probabilities are underestimated. Often, we get approximate probability information by sampling part of the data. For rarer symbols, this may lead to symbols getting a near-zero probability, despite their occurring much more frequently in the data. Another reason could be that we want to limit the number of bits required to be held in memory (or a CPU register) in order to decode the next symbol from a compressed data stream. We may also wish to put an upper bound on the number of steps needed to decode a symbol. This situation arises when a compression application is severely constrained in time, for example, in multimedia or telecommunication applications, where timing is crucial. There are several algorithms to compute length-constrained Huffman codes.
  • The Mathematics That Power Our World: How Is It Made?
    • Joseph Khoury, Gilles Lamothe (Authors)
    • 2016 (Publication Date)
    • World Scientific (Publisher)
    Chapter 2: Basics of data compression, prefix-free codes and Huffman codes
    2.1 Introduction
    In our era of digital intelligence, data compression has become such a necessity that without it our modern lifestyle as we know it would come to a halt. We all use data compression on a daily basis, without realizing it in most cases. Saving a file on your computer, downloading or uploading a picture from or to the internet, taking a photo with your digital camera, sending or receiving a fax, or undergoing an MRI medical scan are a few examples of daily activities that require data compression. Without this technology, it would be virtually impossible to do things as simple as viewing a friend’s photo album on a social network, let alone complex electronic transactions for businesses and industries. In this chapter, we go over some of the compression recipes where mathematics is the main ingredient.
    2.1.1 What is data compression and do we really need it?
    In the context of this chapter, the word data stands for a digital form of information that can be analyzed by a computer. Before it is converted to a digital form, information usually comes in a raw form that we call source data. Source data can be a text, an image, an audio, or a video. The answer to “why do we need data compression?” is simple: reduce the cost of using available technologies. Imagine you own a business for selling or renting moving boxes. If you do not use foldable boxes or collapsible containers, you would need a huge storage facility and your business would not be financially sustainable. Similarly, uncompressed text, image, audio, and video files, or information transfer over digital networks, require substantial storage capacity certainly not available on the standard machines you use at the office or at home.
  • Compression Algorithms for Real Programmers
    Also, it is possible to use these statistical methods on words instead of characters. The first three sections of this chapter describe three different ways to achieve the same end goal. The algorithms are all very similar and produce close to the same results with files. All are extensions of the work that Claude Shannon did on characterizing information and developing a way to measure it. This work is described later in the chapter to provide a theoretical basis for understanding the algorithms, how they work, when they will succeed, and when they will fail. The final section provides some strategies for optimizing the algorithms with data.
    The first approach is commonly known as "Huffman encoding" and named after David Huffman [Huf52]. It provides a simple way of producing a set of replacement bits. The algorithm is easy to describe, simple to code, and comes with a proof that it is optimal, at least in some sense.
    The algorithm finds the strings of bits for each letter by creating a tree and using the tree to read off the codes. The common letters end up near the top of the tree, while the least common letters end up near the bottom. The paths from the root of the tree to the node containing the character are used to compute the bit patterns. Here's the basic algorithm for creating the tree. Figure 2.1 shows how it works for a simple sample.
    1. For each character, create a node.
    2. Attach a value (its weight) to this node.
    3. Add each of these raw nodes to a set that is a collection of trees. At this point, there is one single-node tree for each character.
    [Figure 2.1 (Huffman encoding): the basic algorithm for creating a Huffman tree used to find Huffman encodings; the four levels show the collection of trees after each pass through the loop, starting from the leaf weights A = .125, B = .125, C = .250, D = .5.]
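A small Python sketch in the spirit of the excerpt's steps and Figure 2.1 (not the book's code): start with one single-node tree per character, then repeatedly merge the two lowest-weight trees and print the collection after each pass.

```python
import heapq, itertools

weights = {"A": 0.125, "B": 0.125, "C": 0.250, "D": 0.5}   # the Figure 2.1 example

tick = itertools.count()            # tie-breaker so heapq never compares trees
forest = [(w, next(tick), sym) for sym, w in weights.items()]
heapq.heapify(forest)
print("start :", [(w, t) for w, _, t in sorted(forest)])

passes = 0
while len(forest) > 1:
    w1, _, t1 = heapq.heappop(forest)      # the two lowest-weight trees
    w2, _, t2 = heapq.heappop(forest)
    heapq.heappush(forest, (w1 + w2, next(tick), (t1, t2)))
    passes += 1
    print(f"pass {passes}:", [(w, t) for w, _, t in sorted(forest)])

# The path from the root of the final tree down to a character gives its bit
# pattern: the common letter ('D') sits near the top, rare letters sit lower.
```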
  • Communication Systems
    • Simon Haykin, Michael Moher (Authors)
    • 2021 (Publication Date)
    • Wiley (Publisher)
    . . . matter to the decoder. Any one of them will produce the same output sequence. (Some more advanced algorithms use the first occurrence, as it may be represented by fewer bits in general.) The decoding algorithm is much simpler than the encoding algorithm, as the decoder knows exactly where to look in the decoded stream (search buffer) to find the matching string. The decoder starts with an empty (all zeros) search buffer, then:
    • For each codeword received, the decoder reads the string from the search buffer at the indicated position and length and appends it to the right-hand end of the search buffer.
    • The next character is then appended to the search buffer.
    • The search buffer is then slid to the right so the pointer occurs immediately after the last known symbol, and the process is repeated.
    From the example described here, we note that, in contrast to Huffman Coding, the Lempel–Ziv algorithm uses fixed-length codes to represent a variable number of source symbols. If errors occur in the transmission of a data sequence that has been encoded with the Lempel–Ziv algorithm, the decoding is susceptible to error propagation. For short sequences of characters, the matching strings found in the search buffer are unlikely to be very long. In this case, the output of the Lempel–Ziv algorithm may be a "compressed" sequence that is longer than the input sequence. The Lempel–Ziv algorithm only achieves its true advantage when processing long strings of data, for example, large files. For a long time, Huffman Coding was unchallenged as the algorithm of choice for lossless data compression. Then, the Lempel–Ziv algorithm took over almost completely from the Huffman algorithm and became the standard algorithm for file compression. In recent years, more advanced data compression algorithms have been developed, building upon the ideas of Huffman, Lempel, and Ziv.
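A minimal sketch of the decoder loop described above, written in Python. The codeword layout used here, a triple of (offset back into the already-decoded buffer, match length, next literal character), is an assumption for illustration, not the textbook's exact format:

```python
def lz_decode(codewords):
    """Decode LZ77-style (offset, length, literal) triples into a string."""
    out = []
    for offset, length, literal in codewords:
        start = len(out) - offset
        for i in range(length):        # copy the match from the search buffer
            out.append(out[start + i]) # element-wise copy also handles overlaps
        out.append(literal)            # then append the explicit next character
    return "".join(out)

# (offset 0, length 0) emits a bare literal; later triples reuse earlier output.
print(lz_decode([(0, 0, "a"), (1, 1, "b"), (3, 3, "b")]))   # -> "aabaabb"
```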
  • Digital Signal Compression: Principles and Practice
    Alphabets of such large size lead to practical difficulties of codebook design and memory. Large alphabets are especially troublesome when attempting adaptive coding or trying to exploit statistical dependence between samples in conditional coding. Large alphabets arise not only in run-length coding of documents, but also in transform coding, to be covered later, and when coding multiple symbols together that generate alphabet extensions. One method that is used extensively to alleviate the problem of large alphabets is called modified Huffman Coding. Suppose that the alphabet consists of integers with a large range, such as the run-lengths. We break the range into a series of intervals of length m, so that a number n is expressed as n = qm + R. The variable q is the quotient and R is the remainder in the division of n by m. Given m, there is a one-to-one correspondence between n and the pair of numbers q and R. Encoding n is equivalent to coding the pair (q, R). If the pair members are coded separately and the size of the alphabet of n is N, then the size of the pair alphabet is ⌈N/m⌉ + m, which can amount to a significant reduction. For instance, consider the 1728-size alphabet for document coding. A typical m = 64 yields a size of 27 for the quotient q and 64 for the remainder R. Although the choice of the code is open, the quotient is usually coded with a Huffman code and the remainder either with a separate Huffman code or with natural binary (uncoded). A special case of the modified Huffman code is often used when the probabilities of values beyond a certain value are all very small. To illustrate, suppose that 0 ≤ n ≤ N and values n > m have very small probability. Then, we code 0 ≤ n ≤ m with a Huffman code and the larger values with an ESCAPE codeword followed by n − m in natural binary. The loss of efficiency is often small when the statistics have certain properties and m is chosen judiciously.
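The split n = qm + R described here is plain integer division. A small Python sketch (the m = 64 and the 1728-value run-length alphabet follow the text's example; the run length n = 200 is arbitrary):

```python
def modified_huffman_split(n, m=64):
    """Represent n as (quotient, remainder) with n = q*m + R."""
    return divmod(n, m)

n = 200                                  # arbitrary run length
q, r = modified_huffman_split(n)
print(q, r)                              # 3, 8  (200 = 3*64 + 8)
print(format(r, "06b"))                  # remainder sent as natural binary (2**6 = 64)
# The quotient q would get a Huffman code; only ceil(N/m) + m codewords are
# needed instead of N, e.g. 27 + 64 for the 1728-value run-length alphabet.
```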
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.