Bloom filter

The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether or not an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

For example, one might use a Bloom filter to do spell-checking in a space-efficient way. A Bloom filter to which a dictionary of correct words has been added will accept all words in the dictionary and reject almost all words that are not in it, which is good enough in some cases. Depending on the false positive rate, the resulting data structure can require as little as a byte per dictionary word. One peculiar attribute of this spell-checker is that it is not possible to extract the list of correct words from it – at best, one can extract a list containing the correct words plus a significant number of false positives.

Contents

1 Algorithm description

2 Space and time advantages

3 Probability of false positives

4 Interesting properties

5 Counting filters

6 Bloomier filters

7 External links

Algorithm description

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps a key value to one of the m array positions.

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
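
The add and query operations above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the class name `BloomFilter` and the way the k hash functions are obtained (salting SHA-256 with the function index) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m, k):
        self.m = m                            # number of bits in the array
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # all bits start at 0

    def _positions(self, item):
        # Assumed hashing scheme: the i-th hash function is SHA-256
        # salted with i, reduced modulo m.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bit at each of the k positions to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # A single 0 bit proves the item was never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


words = BloomFilter(m=1024, k=7)
words.add("apple")
words.add("banana")
found = "apple" in words   # always True for an element that was added
```

A query for a word that was never added usually returns False, but it may return True if other insertions happened to set all k of its bits.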

Unfortunately, removing an element from this simple Bloom filter is impossible. Although setting any one of the element's k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. The result is a possibility of false negatives, which are not allowed.
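
The following small sketch illustrates why clearing a bit can produce a false negative. The array positions are hand-picked stand-ins for k = 2 hash functions (an assumption made so that the collision is visible), not the output of real hashing.

```python
# Hypothetical, hand-picked positions standing in for k = 2 hash functions.
POSITIONS = {"cat": (1, 4), "dog": (4, 6)}

bits = [0] * 8
for word in ("cat", "dog"):
    for pos in POSITIONS[word]:
        bits[pos] = 1


def query(word):
    return all(bits[pos] for pos in POSITIONS[word])


# "Remove" cat by clearing one of its bits, which dog happens to share.
bits[4] = 0
cat_present = query("cat")   # False, as intended
dog_present = query("dog")   # also False: dog was added but is now rejected
```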

Space and time advantages

While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with a 1% false positive rate and an optimal value of k, on the other hand, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each additional 4.8 bits per element or so reduces it by a factor of ten.
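
The figures quoted above can be checked with a short calculation. The sketch below assumes the optimal-k false positive rate given in the section on the probability of false positives, solved for m/n, which gives m/n = -ln p / (ln 2)^2; the function name `bits_per_element` is illustrative.

```python
import math


def bits_per_element(p):
    """Bits per element (m/n) for false positive rate p at the optimal k."""
    return -math.log(p) / (math.log(2) ** 2)


one_percent = bits_per_element(0.01)                  # about 9.6
extra_for_tenfold = bits_per_element(0.001) - one_percent   # about 4.8
```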

However, if the number of potential values is small and many of them can be in the set, then the Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for each potential element. Note also that hash tables gain a space and time advantage if they begin ignoring collisions and only store whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1.

Bloom filters also have the unusual property that the time needed to add an item or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized.

Probability of false positives

Assume that a hash function selects each array position with equal probability. The probability that a certain bit is not set to one by a certain hash function during the insertion of an element is then 1-1/m. The probability that it is not set by any of the hash functions is

<math>\left(1-\frac{1}{m}\right)^k</math>.

If we have inserted n elements, the probability that a certain bit is still 0 is

<math>\left(1-\frac{1}{m}\right)^{kn}</math>;

the probability that it is 1 is therefore

<math>1-\left(1-\frac{1}{m}\right)^{kn}</math>.

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is then

<math>\left(1-\left(1-\frac{1}{m}\right)^{kn}\right)^k \approx \left(1-e^{-kn/m}\right)^k</math>.

Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

<math>\frac{m}{n}\ln 2 \approx \frac{9m}{13n} \approx 0.7\frac{m}{n}</math>,

which gives a probability of

<math>\left(\frac{1}{2^{\ln 2}}\right)^{m/n} \approx 0.62^{m/n}</math>.
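
The formulas in this section translate directly into code. The sketch below evaluates the exact expression, its exponential approximation, and the optimal k; the function names are illustrative, and in practice k has to be rounded to a whole number of hash functions.

```python
import math


def false_positive_rate(m, n, k):
    """Exact expression (1 - (1 - 1/m)^(kn))^k."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def false_positive_rate_approx(m, n, k):
    """Approximation (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Value of k that minimizes the approximate rate: (m/n) ln 2."""
    return (m / n) * math.log(2)


m, n = 9600, 1000                 # 9.6 bits per element
k = round(optimal_k(m, n))        # 6.65... rounds to 7
rate = false_positive_rate(m, n, k)   # close to 0.01
```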

Interesting properties

Unlike sets based on hash tables, any Bloom filter can represent the entire universe of elements. In this case, all bits are 1 and the false positive rate is 100%. Another consequence of this property is that add never fails due to the data structure "filling up."

Union and intersection of Bloom filters with the same size and set of hash functions can be implemented with bitwise OR and AND operations, respectively.
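
A minimal sketch of these two operations, using a Python integer as the bit array. The sizes `M` and `K` and the salted SHA-256 hashing are assumptions made for the example; what matters is that both filters share them.

```python
import hashlib

M, K = 256, 3   # illustrative sizes; both filters must use the same ones


def positions(item):
    # Assumed hashing scheme: K salted SHA-256 digests reduced modulo M.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % M


def make_filter(items):
    bits = 0                       # a Python int used as an M-bit array
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits, item):
    return all(bits >> pos & 1 for pos in positions(item))


a = make_filter(["red", "green"])
b = make_filter(["green", "blue"])

union = a | b          # same bits as make_filter(["red", "green", "blue"])
intersection = a & b   # has every bit that make_filter(["green"]) has
```

The union is exactly the filter that would result from inserting all elements of both sets. The intersection still reports every common element, but it may have more bits set than a filter built directly from the common elements, so its false positive rate can be somewhat higher.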

Counting filters

Counting filters provide a way to implement a delete operation on a Bloom filter without recreating the filter afresh. In a counting filter the array positions (buckets) are extended from being a single bit to being a multi-bit counter. In fact, regular Bloom filters can be considered as counting filters with a bucket size of one bit. Counting filters were introduced by L. Fan, P. Cao, J. Almeida, and A. Broder in "Summary cache: A scalable wide-area Web cache sharing protocol", Proceedings of SIGCOMM ’98, 1998.

The insert operation is extended to increment the value of the buckets and the lookup operation checks that each of the required buckets is non-zero. The delete operation, obviously, then consists of decrementing the value of each of the respective buckets.

Arithmetic overflow of the buckets is a problem and the buckets should be sufficiently large to make this case rare. If it does occur then the increment and decrement operations must leave the bucket set to the maximum possible value in order to retain the properties of a Bloom filter. In practice, 4-bit counters are sufficient.
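
The insert, lookup and delete operations, together with the saturation rule for overflow, might look like the following sketch. The class name `CountingFilter`, the `counter_bits` parameter and the salted SHA-256 hashing are assumptions made for the example; the counters are kept in a plain list rather than packed into 4-bit fields.

```python
import hashlib


class CountingFilter:
    """Counting filter with saturating counters (4 bits by default)."""

    def __init__(self, m, k, counter_bits=4):
        self.m, self.k = m, k
        self.max_count = (1 << counter_bits) - 1
        self.counts = [0] * m           # one small counter per bucket

    def _positions(self, item):
        # Assumed hashing scheme: k salted SHA-256 digests reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            if self.counts[pos] < self.max_count:
                self.counts[pos] += 1   # saturate instead of wrapping around

    def remove(self, item):
        # Only meaningful for items that were actually added.
        for pos in self._positions(item):
            # A saturated counter stays at the maximum, because its true
            # value is no longer known.
            if 0 < self.counts[pos] < self.max_count:
                self.counts[pos] -= 1

    def __contains__(self, item):
        return all(self.counts[pos] > 0 for pos in self._positions(item))


cf = CountingFilter(m=64, k=3)
cf.add("a")
cf.add("b")
cf.remove("a")
still_there = "b" in cf   # True: removing "a" does not disturb "b"
```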

Bloomier filters

In 2004, Bernard Chazelle, Joe Kilian, Ronitt Rubinfeld, and Ayellet Tal designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of Bloomier filters, a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.

The simplest Bloomier filter is near-optimal and fairly simple to describe. Suppose initially that the only possible values are 0 and 1. We create a pair of Bloom filters A₀ and B₀ which contain, respectively, all keys mapping to 0 and all keys mapping to 1. Then, to determine which value a given key maps to, we look it up in both filters. If it is in neither, then the key is not in the map. If the key is in A₀ but not B₀, then it does not map to 1, and has a high probability of mapping to 0. Conversely, if the key is in B₀ but not A₀, then it does not map to 0 and has a high probability of mapping to 1.

A problem arises, however, when both filters claim to contain the key. We never insert a key into both, so one or both of the filters is lying (producing a false positive), but we don't know which. To determine this, we have another, smaller pair of filters A₁ and B₁. A₁ contains keys that map to 0 and which are false positives in B₀; B₁ contains keys that map to 1 and which are false positives in A₀. Whenever A₀ and B₀ both produce positives, at most one of these cases can occur, and so we simply have to determine which, if any, of the two filters A₁ and B₁ contains the key, which is another instance of our original problem.

It may so happen again that both filters produce a positive; we apply the same idea recursively to solve this problem. Because each pair of filters only contains keys that are in the map and produced false positives on all previous filter pairs, the number of keys is extremely likely to quickly drop to a very small quantity that can be easily stored in an ordinary deterministic map, such as a pair of small arrays with linear search. Moreover, the average total search time is a constant, because almost all queries will be resolved by the first pair, almost all remaining queries by the second pair, and so on. The total space required is independent of n, and is almost entirely occupied by the first filter pair.

Now that we have the structure and a search algorithm, we also need to know how to insert new key/value pairs. The program must not attempt to insert the same key with both values. If the value is 0, insert the key into A₀ and then test if the key is in B₀. If so, this is a false positive for B₀, and the key must also be inserted into A₁ recursively in the same manner. If we reach the last level, we simply insert it. When the value is 1, the operation is similar but with A and B reversed.
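
The cascade of filter pairs for 0/1 values can be sketched as follows. Several things here are assumptions made for the example: the class names `Bloom` and `BitBloomier`, the salted SHA-256 hashing, three levels that halve in size, and a dict as the final deterministic map. The sketch also builds every level in one batch from a complete mapping instead of inserting keys one at a time; this choice is made here because, with one-at-a-time insertion, a key inserted early could become a false positive in the opposite filter only after later insertions.

```python
import hashlib


class Bloom:
    """Small Bloom filter backed by a Python int (illustrative helper)."""

    def __init__(self, m, k, salt):
        self.m, self.k, self.salt, self.bits = m, k, salt, 0

    def _positions(self, key):
        for i in range(self.k):
            text = f"{self.salt}:{i}:{key}".encode("utf-8")
            digest = hashlib.sha256(text).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))


class BitBloomier:
    """Maps keys to 0 or 1 using a cascade of Bloom filter pairs."""

    def __init__(self, mapping, m=1024, k=4, levels=3):
        self.pairs = []
        pending = dict(mapping)          # keys not yet resolved unambiguously
        for i in range(levels):
            size = max(m >> i, 8)        # each pair is smaller than the last
            a, b = Bloom(size, k, f"A{i}"), Bloom(size, k, f"B{i}")
            for key, value in pending.items():
                (a if value == 0 else b).add(key)
            self.pairs.append((a, b))
            # Pass on only the keys that the opposite filter wrongly claims.
            pending = {key: value for key, value in pending.items()
                       if key in (b if value == 0 else a)}
        self.last = pending              # small deterministic map

    def lookup(self, key):
        for a, b in self.pairs:
            in_a, in_b = key in a, key in b
            if not in_a and not in_b:
                return None              # key is not in the map
            if in_a != in_b:
                return 0 if in_a else 1  # exactly one filter claims the key
        return self.last.get(key)        # ambiguous at every level


table = BitBloomier({"alpha": 0, "beta": 1, "gamma": 0})
value = table.lookup("beta")   # 1: stored keys always return their value
```

A lookup for a key that was never stored usually returns None, but it can return 0 or 1 as a false positive.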

Now that we can map a key to the value 0 or 1, how does this help us map to general values? This is simple. We create a single such Bloomier filter for each bit of the result. If the values are large, we can instead map keys to hash values that can be used to retrieve the actual values. The space required for a Bloomier filter with n-bit values is typically slightly more than the space for 2n Bloom filters.

External links

Original paper (http://portal.acm.org/citation.cfm?id=362692&dl=ACM&coll=portal)
Table of false-positive rates for different configurations (http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html)
Online Bloom filter calculator (http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html)
Bloom filters in Python (http://www.imperialviolet.org/pybloom.html)
Bloom filters in Perl (http://search.cpan.org/author/MCEGLOWS/Bloom-Filter)
Network Applications of Bloom Filters: A Survey. A. Broder and M. Mitzenmacher. Allerton Conference 2002. (http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf)
Spectral Bloom Filters. S. Cohen and Y. Matias. SIGMOD 2003. (http://www.acm.org/sigmod/sigmod03/eproceedings/)
The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. SODA 2004. (http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf)
An Optimal Bloom Filter Replacement. In Proc. ACM-SIAM Symposium on Discrete Algorithms, SODA 2005. (http://www.it-c.dk/people/pagh/papers/bloom.pdf)

