Why Bloom filters work the way they do

Imagine you’re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access http://domain/etc. You’d like a way of checking whether domain is known to be a malicious site. What’s a good way of doing this?

An obvious naive way is for your browser to maintain a list or set data structure containing all known malicious domains. A problem with this approach is that it may consume a considerable amount of memory. If you know of a million malicious domains, and domains need (say) an average of 20 bytes to store, then you need 20 megabytes of storage. That’s quite an overhead for a single feature in your web browser. Is there a better way?

In this post I’ll describe a data structure which provides an excellent way of solving this kind of problem. The data structure is known as a Bloom filter. Bloom filters are much more memory efficient than the naive “store-everything” approach, while remaining extremely fast. I’ll describe how Bloom filters work, as well as some extensions of Bloom filters that solve more general problems.

Most explanations of Bloom filters cut to the chase, quickly explaining the detailed mechanics of how Bloom filters work. Such explanations are informative, but I must admit that they made me uncomfortable when I was first learning about Bloom filters. In particular, I didn’t feel that they helped me understand why Bloom filters are put together the way they are. I couldn’t fathom the mindset that would lead someone to invent such a data structure. And that left me feeling that all I had was a superficial, surface-level understanding of Bloom filters.

In this post I take an unusual approach to explaining Bloom filters. We won’t begin with a full-blown explanation. Instead, I’ll gradually build up to the full data structure in stages. My goal is to tell a plausible story explaining how one could invent Bloom filters from scratch, with each step along the way more or less “obvious”. Of course, hindsight is 20-20, and such a story shouldn’t be taken too literally. Rather, the benefit of developing Bloom filters in this way is that it will deepen our understanding of why Bloom filters work in just the way they do. We’ll explore some alternative directions that plausibly could have been taken – and see why they don’t work as well as Bloom filters ultimately turn out to work. At the end we’ll understand much better why Bloom filters are constructed the way they are.

Of course, this means that if your goal is just to understand the mechanics of Bloom filters, then this post isn’t for you. Instead, I’d suggest looking at a more conventional introduction – the Wikipedia article, for example, perhaps in conjunction with an interactive demo, like the nice one here. But if your goal is to understand why Bloom filters work the way they do, then you may enjoy the post.

A stylistic note: Most of my posts are code-oriented. This post is much more focused on mathematical analysis and algebraic manipulation: the point isn’t code, but rather how one could come to invent a particular data structure. That is, it’s the story behind the code that implements Bloom filters, and as such it requires rather more attention to mathematical detail.

General description of the problem: Let’s begin by abstracting away from the “safe web browsing” problem that began this post. We want a data structure which represents a set S of objects. That data structure should enable two operations: (1) the ability to add an extra object x to the set; and (2) a test to determine whether a given object x is a member of S. Of course, there are many other operations we might imagine wanting – for example, maybe we’d also like to be able to delete objects from the set. But we’re going to start with just these two operations of adding and testing. Later we’ll come back and ask whether operations such as deleting objects are also possible.

Idea: store a set of hashed objects: Okay, so how can we solve the problem of representing S in a way that’s more memory efficient than just storing all the objects in S? One idea is to store hashed versions of the objects in S, instead of the full objects. If the hash function is well chosen, then the hashed objects will take up much less memory, but there will be little danger of making errors when testing whether an object is an element of the set or not.

Let’s be a little more explicit about how this would work. We have a set S of objects x_0, x_1, \ldots, x_{|S|-1}, where |S| denotes the number of objects in S. For each object we compute an m-bit hash function h(x_j) – i.e., a hash function which takes an arbitrary object as input, and outputs m bits – and the set S is represented by the set \{ h(x_0), h(x_1), \ldots, h(x_{|S|-1}) \}. We can test whether x is an element of S by checking whether h(x) is in the set of hashes. This basic hashing approach requires roughly m |S| bits of memory.
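To make this concrete, here’s a minimal Python sketch of the idea – just an illustration, with SHA-256 truncated to m bits standing in for the hash function (a Python set of integers won’t literally achieve the m|S|-bit figure, but the add and test logic is the point):

    import hashlib

    def h(obj, m=32):
        """An m-bit hash: hash the object's string form and keep the low m bits."""
        digest = hashlib.sha256(str(obj).encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** m)

    class HashedSet:
        """Represent a set by storing only the m-bit hashes of its elements."""
        def __init__(self, m=32):
            self.m = m
            self.hashes = set()

        def add(self, x):
            self.hashes.add(h(x, self.m))

        def test(self, x):
            # May give false positives (hash collisions), but never false negatives.
            return h(x, self.m) in self.hashes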

(As an aside, in principle it’s possible to store the set of hashed objects more efficiently, using just m|S|-\log(|S|!) bits, where \log is to base two. The -\log(|S|!) saving is possible because the ordering of the objects in a set is redundant information, and so in principle can be eliminated using a suitable encoding. However, I haven’t thought through what encodings could be used to do this in practice. In any case, the saving is likely to be minimal, since \log(|S|!)  \approx |S| \log |S|, and m will usually be quite a bit bigger than \log |S| – if that weren’t the case, then hash collisions would occur all the time. So I’ll ignore the terms -\log(|S|!) for the rest of this post. In fact, in general I’ll be pretty cavalier in later analyses as well, omitting lower order terms without comment.)

A danger with this hash-based approach is that an object x outside the set S might have the same hash value as an object inside the set, i.e., h(x) = h(x_j) for some j. In this case, test will erroneously report that x is in S. That is, this data structure will give us a false positive. Fortunately, by choosing a suitable value for m, the number of bits output by the hash function, we can reduce the probability of a false positive as much as we want. To understand how this works, notice first that the probability of test giving a false positive is 1 minus the probability of test correctly reporting that x is not in S. This occurs when h(x) \neq h(x_j) for all j. If the hash function is well chosen, then the probability that h(x) \neq h(x_j) is (1-1/2^m) for each j, and these are independent events. Thus the probability of test failing is:

   p = 1-(1-1/2^m)^{|S|}.

This expression involves three quantities: the probability p of test giving a false positive, the number m of bits output by the hash function, and the number of elements in the set, |S|. It’s a nice expression, but it’s more enlightening when rewritten in a slightly different form. What we’d really like to understand is how many bits of memory are needed to represent a set of size |S|, with probability p of a test failing. To understand that we let \# be the number of bits of memory used, and aim to express \# as a function of p and |S|. Observe that \# = |S|m, and so we can substitute for m = \#/|S| to obtain

   p = 1-\left(1-1/2^{\#/|S|}\right)^{|S|}.

This can be rearranged to express \# in terms of p and |S|:

   \# = |S|\log \frac{1}{1-(1-p)^{1/|S|}}.

This expression answers the question we really want answered, telling us how many bits are required to store a set of size |S| with a probability p of a test failing. Of course, in practice we’d like p to be small – say p = 0.01 – and when this is the case the expression may be approximated by a more transparent expression:

   \# \approx |S|\log \frac{|S|}{p}.

This makes intuitive sense: test failure occurs when x is not in S, but h(x) is in the hashed version of S. Because this happens with probability p, the hashed version of S must occupy a fraction p of the total space of possible hash outputs. And so the size of the space of all possible hash outputs must be about |S|/p. As a consequence we need \log(|S|/p) bits to represent each hashed object, in agreement with the expression above.

How memory efficient is this hash-based approach to representing S? It’s obviously likely to be quite a bit better than storing full representations of the objects in S. But we’ll see later that Bloom filters can be far more memory efficient still.
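To put rough numbers on that, here’s a quick check of the exact and approximate expressions for the opening example of a million malicious domains, with p = 0.01:

    from math import log2

    S, p = 10**6, 0.01

    exact_bits = S * log2(1 / (1 - (1 - p) ** (1 / S)))
    approx_bits = S * log2(S / p)

    print(exact_bits / 8e6, approx_bits / 8e6)  # both roughly 3.3 megabytes

So the hashed representation needs roughly 3.3 megabytes, compared with the 20 megabytes of the naive store-everything approach.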

The big drawback of this hash-based approach is the false positives. Still, for many applications it’s fine to have a small probability of a false positive. For example, false positives turn out to be okay for the safe web browsing problem. You might worry that false positives would cause some safe sites to erroneously be reported as unsafe, but the browser can avoid this by maintaining a (small!) list of safe sites which are false positives for test.

Idea: use a bit array: Suppose we want to represent some subset S of the integers 0, 1, 2, \ldots, 999. As an alternative to hashing or to storing S directly, we could represent S using an array of 1000 bits, numbered 0 through 999. We would set bits in the array to 1 if the corresponding number is in S, and otherwise set them to 0. It’s obviously trivial to add objects to S, and to test whether a particular object is in S or not.
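As a rough illustration, the bit array can be packed into a Python bytearray, one bit per possible element:

    class BitArraySet:
        """A subset of {0, 1, ..., 999}, stored as 1000 bits packed into bytes."""
        def __init__(self, universe_size=1000):
            self.bits = bytearray((universe_size + 7) // 8)

        def add(self, n):
            self.bits[n // 8] |= 1 << (n % 8)

        def test(self, n):
            return bool(self.bits[n // 8] & (1 << (n % 8)))

For example, after s = BitArraySet() and s.add(17), the call s.test(17) returns True while s.test(18) returns False.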

The memory cost to store S in this bit-array approach is 1000 bits, regardless of how big or small |S| is. Suppose, for comparison, that we stored S directly as a list of 32-bit integers. Then the cost would be 32 |S| bits. When |S| is very small, this approach would be more memory efficient than using a bit array. But as |S| gets larger, storing S directly becomes much less memory efficient. We could ameliorate this somewhat by storing elements of S using only 10 bits, instead of 32 bits. But even if we did this, it would still be more expensive to store the list once |S| got beyond one hundred elements. So a bit array really would be better for modestly large subsets.

Idea: use a bit array where the indices are given by hashes: A problem with the bit array example described above is that we needed a way of numbering the possible elements of S, 0,1,\ldots,999. In general the elements of S may be complicated objects, not numbers in a small, well-defined range.

Fortunately, we can use hashing to number the elements of S. Suppose h(\cdot) is an m-bit hash function. We’re going to represent a set S = \{x_0,\ldots,x_{|S|-1}\} using a bit array containing 2^m elements. In particular, for each x_j we set the h(x_j)th element in the bit array, where we regard h(x_j) as a number in the range 0,1,\ldots,2^m-1. More explicitly, we can add an element x to the set by setting bit number h(x) in the bit array. And we can test whether x is an element of S by checking whether bit number h(x) in the bit array is set.
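In code this is just the bit array from the previous sketch, indexed by a hash. A rough sketch, reusing the h defined earlier (and keeping m modest – a 2^m-bit array with m = 32 would occupy half a gigabyte):

    class SingleHashFilter:
        """A bit array of 2**m bits, indexed by a single m-bit hash function."""
        def __init__(self, m=20):
            self.m = m
            self.bits = bytearray((2 ** m + 7) // 8)

        def add(self, x):
            i = h(x, self.m)
            self.bits[i // 8] |= 1 << (i % 8)

        def test(self, x):
            i = h(x, self.m)
            return bool(self.bits[i // 8] & (1 << (i % 8)))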

This is a good scheme, but the test can fail to give the correct result, which occurs whenever x is not an element of S, yet h(x) = h(x_j) for some j. This is exactly the same failure condition as for the basic hashing scheme we described earlier. By exactly the same reasoning as used then, the failure probability is

   [*] \,\,\,\, p = 1-(1-1/2^m)^{|S|}.

As we did earlier, we’d like to re-express this in terms of the number of bits of memory used, \#. This works differently than for the basic hashing scheme, since the number of bits of memory consumed by the current approach is \# = 2^m, as compared to \# = |S|m for the earlier scheme. Using \# = 2^m and substituting for m in Equation [*], we have:

   p = 1-(1-1/\#)^{|S|}.

Rearranging this to express \# in terms of p and |S| we obtain:

   \# = \frac{1}{1-(1-p)^{1/|S|}}.

When p is small this can be approximated by

   \# \approx \frac{|S|}{p}.

This isn’t very memory efficient! We’d like the probability of failure p to be small, and that makes the 1/p dependence bad news when compared to the \log(|S|/p) dependence of the basic hashing scheme described earlier. The only time the current approach is better is when |S| is very, very large. To get some idea for just how large, if we want p = 0.01, then 1/p is only better than \log(|S|/p) when |S| gets to be more than about 1.27 * 10^{28}. That’s quite a set! In practice, the basic hashing scheme will be much more memory efficient.

Intuitively, it’s not hard to see why this approach is so memory inefficient compared to the basic hashing scheme. The problem is that with an m-bit hash function, the basic hashing scheme uses m|S| bits of memory, while hashing into a bit array uses 2^m bits, without changing the probability of failure. That’s exponentially more memory!

At this point, hashing into bit arrays looks like a bad idea. But it turns out that by tweaking the idea just a little we can improve it a lot. To carry out this tweaking, it helps to name the data structure we’ve just described (where we hash into a bit array). We’ll call it a filter, anticipating the fact that it’s a precursor to the Bloom filter. I don’t know whether “filter” is a standard name, but in any case it’ll be a useful working name.

Idea: use multiple filters: How can we make the basic filter just described more memory efficient? One possibility is to try using multiple filters, based on independent hash functions. More precisely, the idea is to use k filters, each based on an (independent) m-bit hash function, h_0, h_1, \ldots, h_{k-1}. So our data structure will consist of k separate bit arrays, each containing 2^m bits, for a grand total of \# = k 2^m bits. We can add an element x by setting the h_0(x)th bit in the first bit array (i.e., the first filter), the h_1(x)th bit in the second filter, and so on. We can test whether a candidate element x is in the set by simply checking whether all the appropriate bits are set in each filter. For the test to give a false positive, every individual filter must fail. Because the hash functions are independent of one another, the probability of this is the kth power of the probability of a single filter failing:

   p = \left(1-(1-1/2^m)^{|S|}\right)^k.
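Here’s a rough sketch of the repeated-filter idea. One detail glossed over above is where the k independent hash functions come from; a common trick, assumed in the sketch below, is to salt a single hash with the filter’s index (again reusing the h defined earlier):

    class RepeatedFilters:
        """k separate bit arrays of 2**m bits, one per (salted) hash function."""
        def __init__(self, k, m=20):
            self.k, self.m = k, m
            self.arrays = [bytearray((2 ** m + 7) // 8) for _ in range(k)]

        def _index(self, i, x):
            # The i-th hash function: salt the input with i before hashing.
            return h((i, x), self.m)

        def add(self, x):
            for i, bits in enumerate(self.arrays):
                j = self._index(i, x)
                bits[j // 8] |= 1 << (j % 8)

        def test(self, x):
            # x passes only if the corresponding bit is set in every filter.
            return all(
                bits[self._index(i, x) // 8] & (1 << (self._index(i, x) % 8))
                for i, bits in enumerate(self.arrays)
            )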

The number of bits of memory used by this data structure is \# = k 2^m and so we can substitute 2^m = \#/k and rearrange to get

   [**] \,\,\,\, \# = \frac{k}{1-(1-p^{1/k})^{1/|S|}}.

Provided p^{1/k} is much smaller than 1, this expression can be simplified to give

   \# \approx \frac{k|S|}{p^{1/k}}.

Good news! This repetition strategy is much more memory efficient than a single filter, at least for small values of k. For instance, moving from k = 1 repetition to k = 2 repetitions changes the denominator from p to \sqrt{p} – typically a huge improvement, since p is very small. And the only price paid is doubling the numerator. So this is a big win.

Intuitively, and in retrospect, this result is not so surprising. Putting multiple filters in a row, the probability of error drops exponentially with the number of filters. By contrast, in the single filter scheme, the probability of error falls off only inversely with the number of bits, roughly like |S|/\#. (This follows from considering Equation [*] in the limit where 1/2^m is small.) So using multiple filters is a good strategy.

Of course, a caveat to the last paragraph is that this analysis requires that p^{1/k} \ll 1, which means that k can’t be too large before the analysis breaks down. For larger values of k the analysis is somewhat more complicated. In order to find the optimal value of k we’d need to figure out what value of k minimizes the exact expression [**] for \#. We won’t bother – at best it’d be tedious, and, as we’ll see shortly, there is in any case a better approach.

Overlapping filters: This is a variation on the idea of repeating filters. Instead of having k separate bit arrays, we use just a single array of 2^m bits. When adding an object x, we simply set all the bits h_0(x), h_1(x),\ldots, h_{k-1}(x) in the same bit array. To test whether an element x is in the set, we simply check whether all the bits h_0(x), h_1(x),\ldots, h_{k-1}(x) are set or not.
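The change relative to the repeated filters above is tiny: one shared bit array instead of k separate ones. A sketch, with the same salted-hash trick:

    class OverlappingFilters:
        """A single bit array of 2**m bits, shared by all k (salted) hash functions."""
        def __init__(self, k, m=20):
            self.k, self.m = k, m
            self.bits = bytearray((2 ** m + 7) // 8)

        def _indices(self, x):
            return [h((i, x), self.m) for i in range(self.k)]

        def add(self, x):
            for j in self._indices(x):
                self.bits[j // 8] |= 1 << (j % 8)

        def test(self, x):
            return all(self.bits[j // 8] & (1 << (j % 8)) for j in self._indices(x))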

What’s the probability of the test failing? Suppose x \not \in S. Failure occurs when h_0(x) = h_{i_0}(x_{j_0}) for some i_0 and j_0, and also h_1(x) = h_{i_1}(x_{j_1}) for some i_1 and j_1, and so on for all the remaining hash functions, h_2, h_3,\ldots, h_{k-1}. These are independent events, and so the probability they all occur is just the product of the probabilities of the individual events. A little thought should convince you that each individual event will have the same probability, and so we can just focus on computing the probability that h_0(x) = h_{i_0}(x_{j_0}) for some i_0 and j_0. The overall probability p of failure will then be the kth power of that probability, i.e.,

   p = p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0)^k

The probability that h_0(x) = h_{i_0}(x_{j_0}) for some i_0 and j_0 is one minus the probability that h_0(x) \neq h_{i_0}(x_{j_0}) for all i_0 and j_0. These are independent events for the different possible values of i_0 and j_0, each with probability 1-1/2^m, and so

   p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0) = 1-(1-1/2^m )^{k|S|},

since there are k|S| different pairs of possible values (i_0, j_0). It follows that

   p = \left(1-(1-1/2^m )^{k|S|}\right)^k.

Substituting 2^m = \# we obtain

   p = \left(1-(1-1/\# )^{k|S|}\right)^k

which can be rearranged to obtain

   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}}.

This is remarkably similar to the expression [**] derived above for repeating filters. In fact, provided p^{1/k} is much smaller than 1, we get

   \# \approx \frac{k|S|}{p^{1/k}},

which is exactly the same as [**] when p^{1/k} is small. So this approach gives quite similar outcomes to the repeating filter strategy.

Which approach is better, repeating or overlapping filters? In fact, it can be shown that

   \frac{1}{1-(1-p^{1/k})^{1/k|S|}} \leq \frac{k}{1-(1-p^{1/k})^{1/|S|}},

and so the overlapping filter strategy is more memory efficient than the repeating filter strategy. I won’t prove the inequality here – it’s a straightforward (albeit tedious) exercise in calculus. The important takeaway is that overlapping filters are the more memory-efficient approach.

How do overlapping filters compare to our first approach, the basic hashing strategy? I’ll defer a full answer until later, but we can get some insight by choosing p = 0.0001 and k=4. Then for the overlapping filter we get \# \approx 40|S|, while the basic hashing strategy gives \# = |S| \log( 10000 |S|). Basic hashing is worse whenever |S| is more than about 100 million – a big number, but also a big improvement over the roughly 1.27 * 10^{28} at which a single filter started to beat basic hashing. Given that we haven’t yet made any attempt to optimize k, this ought to encourage us that we’re onto something.

Problems for the author

  • I suspect that there’s a simple intuitive argument that would let us see upfront that overlapping filters will be more memory efficient than repeating filters. Can I find such an argument?

Bloom filters: We’re finally ready for Bloom filters. In fact, Bloom filters involve only a few small changes to overlapping filters. In describing overlapping filters we hashed into a bit array containing 2^m bits. We could, instead, have used hash functions with a range 0,\ldots,M-1 and hashed into a bit array of M (instead of 2^m) bits. The analysis goes through unchanged if we do this, and we end up with

   p = \left(1-(1-1/\# )^{k|S|}\right)^k

and

   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}},

exactly as before. The only reason I didn’t do this earlier is because in deriving Equation [*] above it was convenient to re-use the reasoning from the basic hashing scheme, where m (not M) was the convenient parameter to use. But the exact same reasoning works.

What’s the best value of k to choose? Put another way, what value of k should we choose in order to minimize the number of bits, \#, given a particular value for the probability of error, p, and a particular size |S|? Equivalently, what value of k will minimize p, given \# and |S|? I won’t go through the full analysis here, but with calculus and some algebra you can show that choosing

   k \approx \frac{\#}{|S|} \ln 2

minimizes the probability p. (Note that \ln denotes the natural logarithm, not logarithms to base 2.) By choosing k in this way we get:

   [***] \,\,\,\, \# = \frac{|S|}{\ln 2} \log \frac{1}{p}.
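(For the curious, here is one route through that calculus, in outline. Using the approximation (1-1/\#)^{k|S|} \approx e^{-k|S|/\#}, we want to choose k to minimize

   \ln p \approx k \ln\left(1-e^{-k|S|/\#}\right).

Setting the derivative with respect to k to zero and writing t = e^{-k|S|/\#}, the condition becomes (1-t)\ln(1-t) = t \ln t, which by symmetry is satisfied at t = 1/2, i.e., at k = \frac{\#}{|S|} \ln 2. At that optimum each factor 1-e^{-k|S|/\#} equals 1/2, so p = (1/2)^k, and taking logarithms gives \log \frac{1}{p} = k = \frac{\#}{|S|} \ln 2, which rearranges to [***].)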

This really is good news! Not only is it better than a bit array, it’s actually (usually) much better than the basic hashing scheme we began with. In particular, it will be better whenever

   \frac{1}{\ln 2} \log \frac{1}{p} \leq \log \frac{|S|}{p},

which is equivalent to requiring

   |S| \geq p^{1-1/\ln 2} \approx \frac{1}{p^{0.44}}.

If we want (say) p=0.01 this means that a Bloom filter will be better whenever |S| \geq 8, which is obviously an extremely modest set size.

Another way of interpreting [***] is that a Bloom filter requires \frac{1}{\ln 2} \log \frac{1}{p} \approx 1.44 \log \frac{1}{p} bits per element of the set being represented. In fact, it’s possible to prove that any data structure supporting the add and test operations will require at least \log \frac{1}{p} bits per element in the set. This means that Bloom filters are near-optimal. Further work has been done finding even more memory-efficient data structures that actually meet the \log \frac{1}{p} bound. See, for example, the paper by Anna Pagh, Rasmus Pagh, and S. Srinivasa Rao.

Problems for the author

  • Are the more memory-efficient algorithms practical? Should we be using them?

In actual applications of Bloom filters, we won’t know S in advance, nor |S|. So the way we usually specify a Bloom filter is to specify the maximum size n of set that we’d like to be able to represent, and the maximal probability of error, p, that we’re willing to tolerate. Then we choose

   \# = \frac{n}{\ln 2} \log \frac{1}{p}

and

   k = \log \frac{1}{p}.

This gives us a Bloom filter capable of representing any set up to size n, with probability of error guaranteed to be at most p. The size n is called the capacity of the Bloom filter. Actually, these expressions are slight simplifications, since the terms on the right may not be integers – to be a little more pedantic, we choose

   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil

and

   k = \lceil \log \frac{1}{p} \rceil.
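Plugging in the opening example – a capacity of a million domains, with p = 0.01 – as a quick sanity check:

    from math import ceil, log, log2

    n, p = 10**6, 0.01
    num_bits = ceil(n * log2(1 / p) / log(2))   # about 9.6 million bits
    num_hashes = ceil(log2(1 / p))              # 7 hash functions
    print(num_bits, num_hashes)

That’s roughly 1.2 megabytes – compared with about 3.3 megabytes for the basic hashing scheme, and the 20 megabytes of the naive list we started with.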

One thing that still bugs me about Bloom filters is the expression for the optimal value for k. I don’t have a good intuition for it – why is it logarithmic in 1/p, and why does it not depend on |S|? There’s a tradeoff going on here that’s quite strange when you think about it: bit arrays on their own aren’t very good, but if you repeat or overlap them just the right number of times, then performance improves a lot. And so you can think of Bloom filters as a kind of compromise between an overlap strategy and a bit array strategy. But it’s really not at all obvious (a) why choosing a compromise strategy is the best; or (b) why the right point at which to compromise is where it is, i.e., why k has the form it does. I can’t quite answer these questions at this point – I can’t see that far through Bloom filters. I suspect that understanding the k = 2 case really well would help, but haven’t put in the work. Anyone with more insight is welcome to speak up!

Summing up Bloom filters: Let’s collect everything together. Suppose we want a Bloom filter with capacity n, i.e., capable of representing any set S containing up to n elements, and such that test produces a false positive with probability at most p. Then we choose

   k = \lceil \log \frac{1}{p} \rceil

independent hash functions, h_0, h_1, \ldots, h_{k-1}. Each hash function has a range 0,\ldots,\#-1, where \# is the number of bits of memory our Bloom filter requires,

   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil.

We number the bits in our Bloom filter from 0,\ldots,\#-1. To add an element x to our set we set the bits h_0(x), h_1(x), \ldots, h_{k-1}(x) in the filter. And to test whether a given element x is in the set we simply check whether bits h_0(x), h_1(x), \ldots, h_{k-1}(x) in the bit array are all set.
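Here’s a rough, self-contained sketch of those mechanics in Python – an illustration rather than production code, using a bytearray in place of a proper bit array class, and deriving the k hash functions by salting SHA-256 with an index (any reasonable family of independent hashes would do):

    import hashlib
    from math import ceil, log, log2

    class BloomFilter:
        def __init__(self, capacity, error_rate):
            # \# = ceil(capacity * log2(1/p) / ln 2), k = ceil(log2(1/p)).
            self.num_bits = ceil(capacity * log2(1 / error_rate) / log(2))
            self.num_hashes = ceil(log2(1 / error_rate))
            self.bits = bytearray((self.num_bits + 7) // 8)

        def _indices(self, x):
            # The i-th hash function: salt the input with i, reduce mod the array size.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
                yield int.from_bytes(digest, "big") % self.num_bits

        def add(self, x):
            for j in self._indices(x):
                self.bits[j // 8] |= 1 << (j % 8)

        def test(self, x):
            return all(self.bits[j // 8] & (1 << (j % 8)) for j in self._indices(x))

    # Example: capacity one million, 1% false positive rate.
    # bf = BloomFilter(10**6, 0.01)
    # bf.add("evil.example.com")
    # bf.test("evil.example.com")   # True
    # bf.test("safe.example.com")   # False, except with probability about 0.01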

That’s all there is to the mechanics of how Bloom filters work! Beyond the rough sketch above I won’t give polished sample code – I usually provide code samples in Python, but the Python standard library lacks bit arrays, so much of a complete implementation would be concerned with defining a bit array class. That didn’t seem like it’d be terribly illuminating. Of course, it’s not difficult to find libraries implementing Bloom filters. For example, Jason Davies has written a JavaScript Bloom filter which has a fun and informative online interactive visualisation. I recommend checking it out. And I’ve personally used Mike Axiak’s fast C-based Python library pybloomfiltermmap – the documentation is clear, it took just a few minutes to get up and running, and I’ve generally had no problems.

Problems

  • Suppose we have two Bloom filters, corresponding to sets S_1 and S_2. How can we construct the Bloom filters corresponding to the sets S_1 \cup S_2 and S_1 \cap S_2?

Applications of Bloom filters: Bloom filters have been used to solve many different problems. Here are just a few examples to give the flavour of how they can be used. An early idea was Manber and Wu’s 1994 proposal to use Bloom filters to store lists of weak passwords. Google’s BigTable storage system uses Bloom filters to speed up queries, by avoiding disk accesses for rows or columns that don’t exist. Google Chrome uses Bloom filters to do safe web browsing – the opening example in this post was quite real! More generally, it’s useful to consider using Bloom filters whenever a large collection of objects needs to be stored. They’re not appropriate for all purposes, but at the least it’s worth thinking about whether or not a Bloom filter can be applied.

Extensions of Bloom filters: There are many clever ways of extending Bloom filters. I’ll briefly describe one, just to give you the flavour, and provide links to several more.

A delete operation: It’s possible to modify Bloom filters so they support a delete operation that lets you remove an element from the set. You can’t do this with a standard Bloom filter: it would require unsetting one or more of the bits h_0(x), h_1(x), \ldots in the bit array. This could easily lead us to accidentally delete other elements in the set as well.

Instead, the delete operation is implemented using an idea known as a counting Bloom filter. The basic idea is to take a standard Bloom filter, and replace each bit in the bit array by a bucket containing several bits (usually 3 or 4 bits). We’re going to treat those buckets as counters, initially set to 0. We add an element x to the counting Bloom filter by incrementing each of the buckets numbered h_0(x), h_1(x), \ldots. We test whether x is in the counting Bloom filter by looking to see whether each of the corresponding buckets are non-zero. And we delete x by decrementing each bucket.

This strategy avoids the accidental deletion problem, because when two elements of the set x and y hash into the same bucket, the count in that bucket will be at least 2. Deleting one of the elements, say x, will still leave the count for the bucket at least 1, so y won’t be accidentally deleted. Of course, you could worry that this will lead us to erroneously conclude that x is still in the set after it’s been deleted. But that can only happen if other elements in the set hash into every single bucket that x hashes into. That will only happen if |S| is very large.
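Here’s a minimal sketch of the idea. For simplicity it uses a whole byte per bucket rather than the 3 or 4 bits mentioned above, and it simply caps counters at 255 rather than handling overflow properly; the salted-hash construction is the same trick as before:

    import hashlib

    class CountingBloomFilter:
        def __init__(self, num_buckets, num_hashes):
            self.num_buckets = num_buckets
            self.num_hashes = num_hashes
            self.buckets = bytearray(num_buckets)  # one byte-sized counter per bucket

        def _indices(self, x):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
                yield int.from_bytes(digest, "big") % self.num_buckets

        def add(self, x):
            for j in self._indices(x):
                if self.buckets[j] < 255:   # crude cap instead of real overflow handling
                    self.buckets[j] += 1

        def test(self, x):
            return all(self.buckets[j] > 0 for j in self._indices(x))

        def delete(self, x):
            # Only safe for elements that were actually added; otherwise counts corrupt.
            for j in self._indices(x):
                if self.buckets[j] > 0:
                    self.buckets[j] -= 1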

Of course, that’s just the basic idea behind counting Bloom filters. A full analysis requires us to understand issues such as bucket overflow (when a counter gets incremented too many times), the optimal size for buckets, the probability of errors, and so on. I won’t get into that here, but there are details in the further reading, below.

Other variations and further reading: There are many more variations on Bloom filters. Just to give you the flavour of a few applications: (1) they can be modified to be used as lookup dictionaries, associating a value with each element added to the filter; (2) they can be modified so that the capacity scales up dynamically; and (3) they can be used to quickly approximate the number of elements in a set. There are many more variations as well: Bloom filters have turned out to be a very generative idea! This is part of why it’s useful to understand them deeply, since even if a standard Bloom filter can’t solve the particular problem you’re considering, it may be possible to come up with a variation which does. You can get some idea of the scope of the known variations by looking at the Wikipedia article. I also like the survey article by Andrei Broder and Michael Mitzenmacher. It’s a little more dated (2004) than the Wikipedia article, but nicely written and a good introduction. For a shorter introduction to some variations, there’s also a recent blog post by Matthias Vallentin. You can get the flavour of current research by looking at some of the papers citing Bloom filters here. Finally, you may enjoy reading the original paper on Bloom filters, as well as the original paper on counting Bloom filters.

Understanding data structures: I wrote this post because I recently realized that I didn’t understand any complex data structure in any sort of depth. There are, of course, a huge number of striking data structures in computer science – just look at Wikipedia’s amazing list! And while I’m familiar with many of the simpler data structures, I’m ignorant of most complex data structures. There’s nothing wrong with that – unless one is a specialist in data structures there’s no need to master a long laundry list. But what bothered me is that I hadn’t thoroughly mastered even a single complex data structure. In some sense, I didn’t know what it means to understand a complex data structure, at least beyond surface mechanics. By trying to reinvent Bloom filters, I’ve found that I’ve deepened my own understanding and, I hope, written something of interest to others.

Interested in more? Please subscribe to this blog, or follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.

20 comments

  1. Very nice post about a topic in which I didn’t know I was interested! I find I am interested because of some analogies with physics: Storing a list of hashes is like representing a multiparticle state in first quantized form, e.g., a list of |S| positions. Storing them as |S| 1s in a bit array indexed by hashes—a filter—is like representing the state in second quantized form, i.e., as the occupation numbers of a list of positions. Moving to multiple (k) filters is what one would do if the particles live in k dimensions.

    So what’s the particle (physics) analogy for overlapping the filters? It is the inverse of the possibly more familiar idea of taking k particles in 1 dimension and representing their state by a single “particle” in k dimensions with coordinates the set of 1 dimensional coordinates. Bloom filters optimize the choice of the number of dimensions in which the “particle” lives.

    Finally, handling the delete operation by moving to counting Bloom filters has the physics analogy of removing the exclusion principle on the second quantized representation, i.e., switching from fermions to bosons (or at least partly removing it in the case where the buckets have some finite maximum size).

    1. David – Interesting way of thinking. I wonder what exactly is being optimized in this analogy? In what sense is this best? Relatedly, what do multiple hash collisions mean?

  2. Oh cool, you lost me with your mathematic notation nonsense. You should have stayed with the web browser example.

      1. I agree that sometimes the mathematical notation is necessary – but a lot of people discovered a love for math and algorithmics later in life. I had horrible math teachers growing up that could not explain these things and didn’t care if it was as muddy as pea soup.

        But from a self-taught perspective – the explanation of “this operation will take n*n*n number of iterations in a list” is less terse but easier to explain than the notational representation.

        http://bigocheatsheet.com/

        Good info about understanding the “notational nonsense”

  3. You wrote:
    ” I suspect that there’s a simple intuitive argument that would let us see upfront that overlapping filters will be more memory efficient than repeating filters. Can I find such an argument? ”

    I’ll take a quick stab at this one.

    I think the best way to tackle this is to start by visualizing the extreme cases: What happens when the bit map is all ones? What does it signify if the bit map is all zeros? Then back off from the extremes just a tad: Almost all ones, or almost all zeros?

    Once that is done it is easy to rationalize that either extreme is inefficient in terms of memory. Almost all ones means the bit map is too small for a given set, has very little information density (large p), and is thus inefficient. Almost all zeros means the bit map is too large for a given p and thus we are wasting memory. Intuitively we can grasp that optimum memory efficiency is somewhere between the two extremes. Wikipedia says max efficiency occurs at ~50% ones, but I would think optimum would be a smidgen over 50%

    So if we set our goal to be ~ 50% ones, we can evaluate the two choices: overlapping vs. repeating in that light.

    The next step is a bit harder to do without resorting to math. Let’s ask the question does p linearly track bit size? Intuitively I do not think so because I see from my study of the extremes that p will move very slowly near the all zeros condition, but will increase very quickly as the bit map becomes saturated near the all ones condition. If p went up linearly then overlapping vs. repeating algorithms would yield approximately the same p for a given memory size. But since it is not linear, just a small increase in the size of the bit map will cause a large change in p as we near the extreme all ones case. The overlapping filter option allows a small increase in bit map size thus dropping p quickly vs. the repeating filter option which stair steps the memory at k*M (hope I wrote that right)

    Putting it together. We want to be somewhere near 50% ones. The repeating filter option is k times as large memory-wise as the overlapping filter option. Since p does not track linearly with bit map size, we can always create a more efficient bit map using the overlapping filters than we can with the repeating filters. The repeating method will always have more zeros than ones for a given p and will thus not have the same degree of information density for a given memory size.

    Hope my thinking is clear and accurate. Even if not I had a lot of fun reading your article. This one post got me to subscribe. Thanks.

  4. Hi Michael, just finished reading your book on Reinventing Discovery (fantastic!) and thought I’d look you up on twitter, and then ran across this exposition on Bloom filters. I’d seen it before but hadn’t connected the dots to you!

    Anyway, thought you might be interested in our work in applying Bloom filters to DNA sequence assembly — http://www.pnas.org/content/early/2012/07/25/1121464109.full.pdf. You can also look at rayan chaikhi’s follow up work on actually building an assembler on Bloom filters, too.

  5. I think another way to think about it is that you can improve the resolution of a simple bitvector by hashing to subsets of cells rather than just single cells. Of course, then analysis of collisions becomes more complicated…

  6. My own take on the first author’s problem, as seen here: http://iconcurandfurthermore.tumblr.com/post/47807349687/why-bloom-filters-work-the-way-they-do-ddi:

    Overlapping k filters of size |S| memory size # with Probability of Collision p each, in the worst case where the set bits of each filter never overlap, would be like increasing |S| by a factor of k. This would yield for each PoC ~ k|S|/# = kp.

    Increasing those filters’ memory sizes to k# would improve their specific PoCs to |S|/k# = p/k.

    Therefore, doing both would leave each filter’s PoC at ~ kp/k = p, which means k filters of memory size # and k overlapping filters of memory size k# perform the same in the same budget of k# memory.

    In practice there are collisions between the filters’ set bits, making the effect of overlap less significant than the factor of k without changing any other calculation we did. Therefore, overlapping filters are (at least a bit) better.

  7. I could’ve used this for my database of a billion protein accessions to check for annotations. I worked around the problem with table precomputations, which was probably better in the long run, but it is difficult to keep all the data sets in sync with each other now.

  8. Great post and very helpful. I might actually end up studying more of Bloom filters because of this and finally being able to understand what they really are might help me understand why and how a lot of large-scale datastores like Cassandra, Elasticsearch, etc. use it.

    Thanks!

  9. Hi,

    I’m not sure to understand how you go from:

    k = # / |S| * ln(2)
    # = |S| * log(1/p) / ln(2)

    To:
    k = ln(1/p)

    By simply replacing (# / |S|) in the expression of k, we get:
    k = log(1/p)

    So, k would be a base-2 logarithm, not a natural logarithm.
