
Bloom Filters

Bloom Filters are probabilistic data structures used to test whether an element is a member of a set. They work by hashing the element with several hash functions and checking whether the corresponding bits in a bit array are set. False positives are possible, but false negatives are not. They are commonly used in applications where memory is limited and speed is important.
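
To make the mechanism concrete, here is a minimal sketch in Python. The class name, array size, and double-hashing scheme are our own illustrative choices, not taken from any of the excerpts below.

    import hashlib

    class BloomFilter:
        """A minimal Bloom filter: an m-bit array probed by k hash functions."""

        def __init__(self, m=1024, k=3):
            self.m, self.k = m, k
            self.bits = bytearray((m + 7) // 8)

        def _positions(self, item):
            # Derive k probe positions via double hashing: (h1 + i*h2) mod m.
            digest = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("alice")
    print("alice" in bf)   # always True: no false negatives
    print("bob" in bf)     # usually False; True only on a false positive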

Written by Perlego with AI-assistance

7 Key excerpts on "Bloom Filters"

  • Overlay Networks: Toward Information Networking
    eBook - PDF

    The key idea behind the data structures discussed in this chapter is that, by allowing the representation of the set of elements to lose some information, in other words to become lossy, the storage requirements can be significantly reduced. The data structures presented in this chapter for probabilistic representation of sets are based on the seminal work by Burton Bloom in 1970. Bloom first described a compact probabilistic data structure that was used to represent words in a dictionary. There was little interest in using Bloom Filters for networking until 1995, after which the area gained widespread interest in both academia and industry. Bloom Filters are an efficient mechanism for probabilistic representation of sets and support membership queries [32]. Bloom Filters have many applications in dictionaries, networking, measurement, and P2P systems [40]. Meta-databases, which direct queries to actual external databases, are an example application domain of Bloomier filters. Toward the end of the chapter, we consider four types of applications pertaining to distributed operation and networking: caching, P2P networks, packet routing and forwarding, and measurement.

    7.2 Bloom Filters

    The Bloom filter is a space-efficient probabilistic data structure that supports set membership queries. The data structure was conceived by Burton H. Bloom in 1970. The structure offers a compact probabilistic way to represent a set that can result in false positives but never in false negatives. This makes Bloom Filters useful for many different kinds of tasks that involve lists and sets. The basic operations involve adding elements to the set and querying for element membership in the probabilistic set representation. The basic Bloom filter does not support the removal of elements; however, a number of extensions have been developed that also support removals, one of which is sketched below.
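
    As a concrete illustration of such an extension, here is a minimal counting Bloom filter sketch, which replaces each bit with a small counter so that elements can be removed. This is a generic design of our own, not the specific scheme of any book quoted here.

        import hashlib

        class CountingBloomFilter:
            """A Bloom filter variant that keeps a small counter per position
            instead of a single bit, so that removal becomes possible."""

            def __init__(self, m=1024, k=3):
                self.m, self.k = m, k
                self.counts = [0] * m

            def _positions(self, item):
                digest = hashlib.sha256(item.encode()).digest()
                h1 = int.from_bytes(digest[:8], "big")
                h2 = int.from_bytes(digest[8:16], "big") | 1
                return [(h1 + i * h2) % self.m for i in range(self.k)]

            def add(self, item):
                for pos in self._positions(item):
                    self.counts[pos] += 1

            def remove(self, item):
                # Safe only for items that were actually added earlier.
                for pos in self._positions(item):
                    if self.counts[pos] > 0:
                        self.counts[pos] -= 1

            def __contains__(self, item):
                return all(self.counts[pos] > 0 for pos in self._positions(item))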
  • Algorithms and Data Structures for Massive Datasets
    • Dzejla Medjedovic, Emin Tahirovic, Ines Dedovic (Authors)
    • 2022 (Publication Date)
    • Manning (Publisher)

    3 Approximate membership: Bloom and quotient filters

    This chapter covers
    • Learning what Bloom Filters are and why and when they are useful
    • Configuring a Bloom filter in a practical setting
    • Exploring the interplay in Bloom filter parameters
    • Learning about quotient filters as Bloom filter replacements
    • Comparing the performance of a Bloom filter and a quotient filter
    Bloom Filters have become a standard in systems that process large datasets. Their widespread use, especially in networks and distributed databases, comes from the effectiveness they exhibit in situations where we need hash table functionality but do not have the luxury of space. They were invented in 1970 by Burton Bloom [1], but they only really “bloomed” in the last few decades due to an increasing need to tame and compress big datasets. Bloom Filters have also piqued the interest of the computer science research community, which has developed many variants on top of the basic data structure to address some of the filters’ shortcomings and adapt them to different contexts.
    One simple way to think about Bloom Filters is that they support insert and lookup in the same way that hash tables do, but using very little space (i.e., 1 byte per item or less). This is a significant saving when keys take up 4-8 bytes. Bloom Filters do not store the items themselves, and they use less space than the lower theoretical limit required to store the data correctly; therefore, they exhibit an error rate. They have false positives, but they do not have false negatives, and the one-sidedness of the error can be used to our benefit. When the Bloom filter reports an item as Found/Present, there is a small chance it is not telling the truth, but when it reports an item as Not Found/Not Present, we know it’s telling the truth. In situations where the query answer is expected to be Not Present…
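
    That one-sided error is what makes a Bloom filter useful as a cheap guard in front of an expensive store: a Not Present answer lets us skip the slow lookup entirely. A minimal sketch of the pattern, with a hypothetical slow_disk_lookup invented for illustration:

        def get(key, bloom, slow_disk_lookup):
            # "Not Present" is always truthful, so the expensive lookup
            # can be skipped entirely on that answer.
            if key not in bloom:
                return None
            # "Present" may be a false positive, so the slow store must
            # still confirm; at worst we pay one wasted read.
            return slow_disk_lookup(key)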
  • Advanced Algorithms and Data Structures
    • Marcello La Rocca (Author)
    • 2021 (Publication Date)
    • Manning (Publisher)
    Another area where Bloom filter-based caching helps is reducing the unnecessary fetching/storage of expensive IO resources. The mechanism is the same as with crawling: the operation is only performed when we have a “miss,” while “hits” usually trigger a more in-depth comparison (for instance, on a hit, retrieving from disk just the first few lines or the first block of a document, and comparing them).

    4.7.5 Spell checker

    Simpler versions of spell checkers used to employ Bloom Filters as dictionaries. For every word of the text examined, a lookup on a Bloom filter would validate the word as correct or mark it as a spelling error. Of course, false positive occurrences would cause some spelling errors to go undetected, but the odds of this happening could be controlled in advance, as the sizing sketch below shows. Today, however, spell checkers mostly take advantage of tries: these data structures provide good performance on text searches without the false positives.
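
    "Controlled in advance" means sizing the filter before loading the dictionary: for n words and a target false positive rate p, the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2 give the bit count and hash count. A small sketch, with function and variable names of our own choosing:

        import math

        def bloom_parameters(n, p):
            """Bit count m and hash count k for n items at target
            false-positive rate p (standard sizing formulas)."""
            m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
            k = max(1, round((m / n) * math.log(2)))
            return m, k

        # A 100,000-word dictionary tolerating at most 1% undetected misspellings:
        m, k = bloom_parameters(100_000, 0.01)
        print(m, k)   # 958506 bits (about 117 KB) and 7 hash functions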

    4.7.6 Distributed databases and file systems

    Cassandra uses Bloom Filters for index scans to determine whether an SSTable has data for a particular row.
    Likewise, Apache HBase uses Bloom Filters as an efficient mechanism to test whether a StoreFile contains a specific row or row-column cell. This in turn boosts the overall read speed by filtering out unnecessary disk reads of HFile blocks that don’t contain a particular row or row-column.
    We are at the end of our excursus on practical ways to use Bloom Filters. It’s worth mentioning that other applications of Bloom Filters include rate limiters, blacklists, synchronization speedup, and estimating the size of joins in DBs.

    4.8 Why Bloom Filters work

    So far, we have asked you to take for granted that Bloom Filters do work as we described. Now it’s time to look more closely and explain why a Bloom filter actually works. Although this section is not strictly needed to implement or use Bloom Filters, reading it might help you understand this data structure in more depth.
    As already mentioned, Bloom Filters are a tradeoff between memory and accuracy. If you create an instance of a Bloom filter with a storage capacity of 8 bits and then try to store 1 million objects in it, chances are that you won’t get great performance. In fact, with an 8-bit buffer, the whole buffer would be set to 1 after approximately 10-20 hashes. At that point, all calls to contains will just return true.
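
    A tiny, self-contained experiment (our own illustration, not the book's code) makes this saturation visible:

        import hashlib

        m, k = 8, 2          # a deliberately tiny 8-bit buffer
        bits = 0

        def positions(item):
            d = hashlib.sha256(item.encode()).digest()
            h1, h2 = d[0], d[1] | 1
            return [(h1 + i * h2) % m for i in range(k)]

        for i in range(20):                    # far more items than 8 bits can bear
            for p in positions(f"obj{i}"):
                bits |= 1 << p

        print(bin(bits))                       # very likely 0b11111111: saturated
        print(all(bits >> p & 1 for p in positions("never-added")))  # True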
  • Practical Cryptography: Algorithms and Implementations Using C++
    eBook - PDF
    • Saiful Azad, Al-Sakib Khan Pathan (Authors)
    • 2014 (Publication Date)
    From earlier in this chapter we know that blooming is performed by calculating one or more hash keys and updating the value of the filter by OR-ing each hash key with its current state. This is referred to as the insert operation. The lookup operation is done by taking the bitwise AND between a given hash key and the current state of the filter. The decision making from this point on can go in one of two directions:
    • The result of the AND is the same as the value of the hash key: this is either a true positive or a false positive, with no way to tell between the two.
    • The result of the AND is not the same as the value of the hash key: this is a 100% reliable true negative.
    One common way to describe this lookup behavior of Bloom Filters is to describe the filter as a person with memory who can only answer the question “Have you seen this item before?” reliably. This is not to underestimate the utility of the filter, as the answer to this exact question is exactly what is needed in many practical situations.
    Let us look at the Bloom filter design from the viewpoint of hashing, especially given that the state of the filter is gradually built by adding more hash keys onto its state. Let n be the number of items and m the bit length of the hash keys, and therefore of the filter. We know from before that each bit in a hash key is set to 1 with 50% probability. Therefore, omitting details, the optimal number of hash functions can be calculated as

    k = (m/n) ln 2 ≈ 0.69 (m/n)

    If each hash function is perfectly independent of all the others, then the probability of a bit remaining 0 after n elements have been inserted is

    p = (1 - 1/m)^(kn) ≈ e^(-kn/m)

    An important performance metric of a Bloom filter, the false positive rate, is then

    p_FP = (1 - p)^k ≈ (1 - e^(-kn/m))^k ≈ 2^(-k)

    for the optimal k.
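
    To make these formulas concrete, here is a quick numeric check at a common design point of 10 bits per item; the numbers are our own illustrative choice, not from the book:

        import math

        m_over_n = 10                        # bits per stored item
        k_opt = m_over_n * math.log(2)       # optimal k ~ 6.93; round to k = 7
        p_fp = (1 - math.exp(-7 / m_over_n)) ** 7
        print(round(k_opt, 2), round(p_fp, 4))   # 6.93 0.0082 (vs 2**-7 ~ 0.0078)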
  • Handbook on Theoretical and Algorithmic Aspects of Sensor, Ad Hoc Wireless, and Peer-to-Peer Networks
    …that likely exist in nearby nodes. However, the approach alone fails to find replicas far away from the query source. Bloom filters [50] are often used to approximately and efficiently summarize the elements in a set. A Bloom filter is a bit-string of length m that is associated with a family of independent hash functions. Each hash function takes as input any set element and outputs an integer in [0, m). To generate a representation of a set using a Bloom filter, every set element is hashed using all hash functions. Any bit in the Bloom filter whose position matches a hash function result is set to 1. To determine whether an element is in the set described by a Bloom filter, that element is hashed using the same family of hash functions. If any matching bit is not set to 1, the element is definitely not in the set. If all matching bits in the Bloom filter are set to 1, the element is probably in the set. If the element indeed is not in the set, this is called a false positive.
    Attenuated Bloom filters are extensions of Bloom filters. An attenuated Bloom filter of depth d is an array of d regular Bloom filters of the same length w. A level is assigned to each regular Bloom filter in the array: level 1 is assigned to the first Bloom filter, level 2 to the second, and so on. The higher levels are considered attenuated with respect to the lower levels. Each node stores an attenuated Bloom filter for each neighbor. The i-th Bloom filter in an attenuated Bloom filter (depth d; i ≤ d) for a neighbor B at a node A summarizes the set of documents that will probably be found through B on all nodes i hops away from A. Figure 37.2 illustrates an attenuated Bloom filter for neighbor C at node B. “File3” and “File4” are available at a two-hop distance from B through C. They are hashed to {0, 5, 6} and {2, 5, 8}, respectively. Therefore, the second Bloom filter contains 1 at bits 0, 2, 5, 6, and 8.
    To route a query for a file, the querying node hashes the file name using the family of hash functions. Then the querying node checks level 1 of its attenuated Bloom filters. If level 1 of an attenuated Bloom filter for a neighbor has 1s at all matching positions, the file…
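
    A minimal sketch of the attenuated structure just described, with d levels of width w; the hashing helper is our own illustrative choice, so the bit positions differ from those in the book's figure:

        import hashlib

        def positions(name, w, k=3):
            # k hash positions in [0, w) for a file name (illustrative hashing,
            # not the specific values of the book's Figure 37.2).
            d = hashlib.sha256(name.encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return [(h1 + i * h2) % w for i in range(k)]

        class AttenuatedBloomFilter:
            """d regular Bloom filters of equal width w; level i summarizes the
            documents probably reachable i hops away through one neighbor."""

            def __init__(self, d, w):
                self.d, self.w = d, w
                self.levels = [[0] * w for _ in range(d)]

            def add(self, name, hops):
                for p in positions(name, self.w):
                    self.levels[hops - 1][p] = 1

            def probably_reachable(self, name, hops):
                return all(self.levels[hops - 1][p] for p in positions(name, self.w))

        abf = AttenuatedBloomFilter(d=3, w=10)
        abf.add("File3", hops=2)    # files at two-hop distance, as in the excerpt
        abf.add("File4", hops=2)
        print(abf.probably_reachable("File3", hops=2))   # True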
  • Advanced Data Structures: Theory and Applications
    eBook - ePub
    16.3   Distributed Caching
    A distributed system is a collection of independent computers connected by some medium. Data caching provides solutions to many serious problems in distributed environments.
    The hash-based Bloom filter is used extensively to manage data caching in distributed environments.
    Feng et al. [179] proposed a system using Bloom Filters to distribute data cache information. When a local cache lookup misses, the summary cache system issues a query to determine whether another station’s cache holds the desired data, reducing the communication and time costs of fetching it from the original station. To reduce message traffic, stations periodically broadcast a Bloom filter that represents their cache instead of transferring the entire cache contents.
    Each station checks the other stations’ Bloom filters for data availability. False positives and false negatives trigger delays due to hash collisions caused by the Bloom filter’s limited buffer size. Distributed caching has proven useful in Google’s BigTable, Google Maps, Google Earth, Web indexing, and other distributed storage systems for structured data. These applications utilize Bloom Filters to reduce disk lookups. Summary cache systems are used extensively in cloud computing and map-reduce paradigms. Bloom Filters optimize reduction operations. Summary caches divide applications into small chunks to achieve parallel efficiency [180].
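
    In outline, a summary-cache lookup can be sketched as follows; the function and parameter names are invented for illustration rather than taken from Feng et al.’s protocol:

        def lookup(key, local_cache, neighbor_filters, fetch_from):
            """Summary-cache style lookup: consult the broadcast Bloom filters
            before paying the network cost of querying another station."""
            if key in local_cache:                 # local hit: no traffic at all
                return local_cache[key]
            for station, bloom in neighbor_filters.items():
                if key in bloom:                   # "probably cached there"
                    value = fetch_from(station, key)
                    if value is not None:          # guard against a false positive
                        return value
            return fetch_from("origin", key)       # fall back to the original station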
    16.4   Data Structures for Building File Systems
    Disk file systems use bitmaps to track free blocks and handle queries related to specific disk blocks. Disk files need good data structures to store directories and efficiently handle queries and fast lookups. Microsoft’s early FAT32 system used arrays for file allocations. The ext and ext2 systems use linked lists. The XFS and NTFS systems use B+ trees for directory and security-related metadata indexing. The ext3 and ext4 file systems use modified B+ trees (also known as H trees) for file indexing [181].
  • Reconfigurable and Adaptive Computing
    eBook - PDF
    • Nadia Nedjah, Chao Wang (Authors)
    • 2018 (Publication Date)
    In Dharmapurikar et al. [22] and Dharmapurikar and Lockwood [24], the Bloom filter, instead of including just presence bits at hash locations, included an address for the microcontroller. Some regular expressions could be supported by linking string literals together in software. A few studies have used the ClamAV pattern set in their evaluations. Ho and Lemieux [25] and Tsung Lin Ho and Lemieux [26] used Bloomier filters in their PERG and PERG-RX architectures. The Bloomier filter is an extension of the Bloom filter. PERG supports the single-byte and displacement wildcards ? and {n}, which insert fixed-length gaps between string fragments. PERG-RX adds support for other wildcards that require arbitrary-length gaps. It operates at a rate of 1.2 Gbps.

    3.4 BACKGROUND

    To let a computing system run, four elements are necessary: an algorithm, a data structure, hardware, and software. In this section, important topics related to a computing platform will be highlighted first. Then, basic virus database concepts will be introduced. Finally, the Bloom filter algorithm will be introduced.

    3.4.1 Computing Platform

    A processor requires an appropriate host system architecture. In a well-balanced system, the performance of all components should fit together. However, in the case of a general-purpose processor, the optimal properties of a host system cannot be strictly defined; that is, they differ for various algorithms. In practice, the best general-purpose solutions are built using state-of-the-art components: processors, memory chips, graphics cards, storage devices, and so on. The policy of choosing the best available components on the market works in practice because such systems execute many different kinds of applications. But for IO-bound tasks, cutting-edge processors do not help when the storage is too slow. Conversely, when a single-purpose system is proposed, it is possible to establish a proper balance between the performance of the computer’s components.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.