{"id":75,"date":"2012-09-26T18:21:24","date_gmt":"2012-09-26T22:21:24","guid":{"rendered":"https:\/\/michaelnielsen.org\/ddi\/?p=75"},"modified":"2012-09-27T08:21:57","modified_gmt":"2012-09-27T12:21:57","slug":"why-bloom-filters-work-the-way-they-do","status":"publish","type":"post","link":"https:\/\/michaelnielsen.org\/ddi\/why-bloom-filters-work-the-way-they-do\/","title":{"rendered":"Why Bloom filters work the way they do"},"content":{"rendered":"<p>Imagine you&#8217;re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access <tt>http:\/\/domain\/etc<\/tt>. You&#8217;d like a way of checking whether <tt>domain<\/tt> is known to be a malicious site.  What&#8217;s a good way of doing this?<\/p>\n<p>An obvious naive way is for your browser to maintain a list or set data structure containing all known malicious domains. A problem with this approach is that it may consume a considerable amount of memory. If you know of a million malicious domains, and domains need (say) an average of 20 bytes to store, then you need 20 megabytes of storage. That&#8217;s quite an overhead for a single feature in your web browser.  Is there a better way?<\/p>\n<p>In this post I&#8217;ll describe a data structure which provides an excellent way of solving this kind of problem. The data structure is known as a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bloom_filter\">Bloom   filter<\/a>.  Bloom filter are much more memory efficient than the naive &#8220;store-everything&#8221; approach, while remaining extremely fast.  I&#8217;ll describe both how Bloom filters work, and also some extensions of Bloom filters to solve more general problems.<\/p>\n<p>Most explanations of Bloom filters cut to the chase, quickly explaining the detailed mechanics of how Bloom filters work.  Such explanations are informative, but I must admit that they made me uncomfortable when I was first learning about Bloom filters.  In particular, I didn&#8217;t feel that they helped me understand <em>why<\/em> Bloom filters are put together the way they are.  I couldn&#8217;t fathom the mindset that would lead someone to <em>invent<\/em> such a data structure.  And that left me feeling that all I had was a superficial, surface-level understanding of Bloom filters.<\/p>\n<p>In this post I take an unusual approach to explaining Bloom filters. We <em>won&#8217;t<\/em> begin with a full-blown explanation.  Instead, I&#8217;ll gradually build up to the full data structure in stages.  My goal is to tell a plausible story explaining how one could invent Bloom filters from scratch, with each step along the way more or less &#8220;obvious&#8221;.  Of course, hindsight is 20-20, and such a story shouldn&#8217;t be taken too literally.  Rather, the benefit of developing Bloom filters in this way is that it will deepen our understanding of why Bloom filters work in just the way they do.  We&#8217;ll explore some alternative directions that plausibly <em>could<\/em> have been taken &#8211; and see why they don&#8217;t work as well as Bloom filters ultimately turn out to work.  At the end we&#8217;ll understand much better why Bloom filters are constructed the way they are.<\/p>\n<p>Of course, this means that if your goal is just to understand the mechanics of Bloom filters, then this post isn&#8217;t for you.  Instead, I&#8217;d suggest looking at a more conventional introduction &#8211; the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bloom_filter\">Wikipedia article<\/a>, for example, perhaps in conjunction with an interactive demo, like the nice one <a href=\"http:\/\/www.jasondavies.com\/bloomfilter\/\">here<\/a>.  But if your goal is to understand why Bloom filters work the way they do, then you may enjoy the post.<\/p>\n<p><strong>A stylistic note:<\/strong> Most of my posts are code-oriented.  This post is much more focused on mathematical analysis and algebraic manipulation: the point isn&#8217;t code, but rather how one could come to invent a particular data structure.  That is, it&#8217;s the story <em>behind<\/em> the code that implements Bloom filters, and as such it requires rather more attention to mathematical detail.<\/p>\n<p><strong>General description of the problem:<\/strong> Let&#8217;s begin by abstracting away from the &#8220;safe web browsing&#8221; problem that began this post.  We want a data structure which represents a set <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> of objects.  That data structure should enable two operations: (1) the ability to <tt>add<\/tt> an extra object <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> to the set; and (2) a <tt>test<\/tt> to determine whether a given object <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is a member of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>.  Of course, there are many other operations we might imagine wanting &#8211; for example, maybe we&#8217;d also like to be able to <tt>delete<\/tt> objects from the set.  But we&#8217;re going to start with just these two operations of <tt>add<\/tt>ing and <tt>test<\/tt>ing.  Later we&#8217;ll come back and ask whether operations such as <tt>delete<\/tt>ing objects are also possible.<\/p>\n<p><strong>Idea: store a set of hashed objects:<\/strong> Okay, so how can we solve the problem of representing <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> in a way that&#8217;s more memory efficient than just storing all the objects in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>?  One idea is to store hashed versions of the objects in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>, instead of the full objects.  If the hash function is well chosen, then the hashed objects will take up much less memory, but there will be little danger of making errors when <tt>test<\/tt>ing whether an object is an element of the set or not.<\/p>\n<p>Let&#8217;s be a little more explicit about how this would work.  We have a set <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> of objects <img src='https:\/\/s0.wp.com\/latex.php?latex=x_0%2C+x_1%2C+%5Cldots%2C+x_%7B%7CS%7C-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_0, x_1, \\ldots, x_{|S|-1}' title='x_0, x_1, \\ldots, x_{|S|-1}' class='latex' \/>, where <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> denotes the number of objects in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>.  For each object we compute an <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/>-bit hash function <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x_j)' title='h(x_j)' class='latex' \/> &#8211; i.e., a hash function which takes an arbitrary object as input, and outputs <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/> bits &#8211; and the set <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> is represented by the set <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%7B+h%28x_0%29%2C+h%28x_1%29%2C+%5Cldots%2C+h%28x_%7B%7CS%7C-1%7D%29+%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\{ h(x_0), h(x_1), \\ldots, h(x_{|S|-1}) \\}' title='\\{ h(x_0), h(x_1), \\ldots, h(x_{|S|-1}) \\}' class='latex' \/>. We can <tt>test<\/tt> whether <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is an element of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> by checking whether <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' \/> is in the set of hashes.  This basic hashing approach requires roughly <img src='https:\/\/s0.wp.com\/latex.php?latex=m+%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m |S|' title='m |S|' class='latex' \/> bits of memory.<\/p>\n<p>(As an aside, in principle it&#8217;s possible to store the set of hashed objects more efficiently, using just <img src='https:\/\/s0.wp.com\/latex.php?latex=m%7CS%7C-%5Clog%28%7CS%7C%21%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m|S|-\\log(|S|!)' title='m|S|-\\log(|S|!)' class='latex' \/> bits, where <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log' title='\\log' class='latex' \/> is to base two.  The <img src='https:\/\/s0.wp.com\/latex.php?latex=-%5Clog%28%7CS%7C%21%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='-\\log(|S|!)' title='-\\log(|S|!)' class='latex' \/> saving is possible because the ordering of the objects in a set is redundant information, and so in principle can be eliminated using a suitable encoding.  However, I haven&#8217;t thought through what encodings could be used to do this in practice.  In any case, the saving is likely to be minimal, since <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28%7CS%7C%21%29++%5Capprox+%7CS%7C+%5Clog+%7CS%7C%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(|S|!)  \\approx |S| \\log |S|,' title='\\log(|S|!)  \\approx |S| \\log |S|,' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/> will usually be quite a bit bigger than <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog+%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log |S|' title='\\log |S|' class='latex' \/> &#8211; if that weren&#8217;t the case, then hash collisions would occur all the time.  So I&#8217;ll ignore the terms <img src='https:\/\/s0.wp.com\/latex.php?latex=-%5Clog%28%7CS%7C%21%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='-\\log(|S|!)' title='-\\log(|S|!)' class='latex' \/> for the rest of this post.  In fact, in general I&#8217;ll be pretty cavalier in later analyses as well, omitting lower order terms without comment.)<\/p>\n<p>A danger with this hash-based approach is that an object <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> outside the set <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> might have the same hash value as an object inside the set, i.e., <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29+%3D+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) = h(x_j)' title='h(x) = h(x_j)' class='latex' \/> for some <img src='https:\/\/s0.wp.com\/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' \/>.  In this case, <tt>test<\/tt> will erroneously report that <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>.  That is, this data structure will give us a <em>false positive<\/em>.  Fortunately, by choosing a suitable value for <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/>, the number of bits output by the hash function, we can reduce the probability of a false positive as much as we want. To understand how this works, notice first that the probability of <tt>test<\/tt> giving a false positive is 1 minus the probability of <tt>test<\/tt> correctly reporting that <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is not in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>. This occurs when <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29+%5Cneq+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) \\neq h(x_j)' title='h(x) \\neq h(x_j)' class='latex' \/> for all <img src='https:\/\/s0.wp.com\/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' \/>.  If the hash function is well chosen, then the probability that <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29+%5Cneq+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) \\neq h(x_j)' title='h(x) \\neq h(x_j)' class='latex' \/> is <img src='https:\/\/s0.wp.com\/latex.php?latex=%281-1%2F2%5Em%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(1-1\/2^m)' title='(1-1\/2^m)' class='latex' \/> for each <img src='https:\/\/s0.wp.com\/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' \/>, and these are independent events. Thus the probability of <tt>test<\/tt> failing is:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+1-%281-1%2F2%5Em%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = 1-(1-1\/2^m)^{|S|}. ' title='   p = 1-(1-1\/2^m)^{|S|}. ' class='latex' \/>\n<p>This expression involves three quantities: the probability <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> of <tt>test<\/tt> giving a false positive, the number <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/> of bits output by the hash function, and the number of elements in the set, <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>.  It&#8217;s a nice expression, but it&#8217;s more enlightening when rewritten in a slightly different form.  What we&#8217;d really like to understand is how many bits of memory are needed to represent a set of size <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>, with probability <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> of a <tt>test<\/tt> failing.  To understand that we let <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/> be the number of bits of memory used, and aim to express <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/> as a function of <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>.  Observe that <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%3D+%7CS%7Cm&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# = |S|m' title='\\# = |S|m' class='latex' \/>, and so we can substitute for <img src='https:\/\/s0.wp.com\/latex.php?latex=m+%3D+%5C%23%2F%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m = \\#\/|S|' title='m = \\#\/|S|' class='latex' \/> to obtain<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+1-%5Cleft%281-1%2F2%5E%7B%5C%23%2F%7CS%7C%7D%5Cright%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = 1-\\left(1-1\/2^{\\#\/|S|}\\right)^{|S|}. ' title='   p = 1-\\left(1-1\/2^{\\#\/|S|}\\right)^{|S|}. ' class='latex' \/>\n<p>This can be rearranged to express <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/> in term of <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%3D+%7CS%7C%5Clog+%5Cfrac%7B1%7D%7B1-%281-p%29%5E%7B1%2F%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# = |S|\\log \\frac{1}{1-(1-p)^{1\/|S|}}. ' title='   \\# = |S|\\log \\frac{1}{1-(1-p)^{1\/|S|}}. ' class='latex' \/>\n<p>This expression answers the question we really want answered, telling us how many bits are required to store a set of size <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> with a probability <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> of a <tt>test<\/tt> failing.  Of course, in practice we&#8217;d like <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> to be small &#8211; say <img src='https:\/\/s0.wp.com\/latex.php?latex=p+%3D+0.01&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = 0.01' title='p = 0.01' class='latex' \/> &#8211; and when this is the case the expression may be approximated by a more transparent expression:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%5Capprox+%7CS%7C%5Clog+%5Cfrac%7B%7CS%7C%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# \\approx |S|\\log \\frac{|S|}{p}. ' title='   \\# \\approx |S|\\log \\frac{|S|}{p}. ' class='latex' \/>\n<p>This makes intuitive sense: <tt>test<\/tt> failure occurs when <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is not in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>, but <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' \/> is in the hashed version of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>.  Because this happens with probability <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/>, it must be that <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> occupies a fraction <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> of the total space of possible hash outputs.  And so the size of the space of all possible hash outputs must be about <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|\/p' title='|S|\/p' class='latex' \/>.  As a consequence we need <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28%7CS%7C%2Fp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(|S|\/p)' title='\\log(|S|\/p)' class='latex' \/> bits to represent each hashed object, in agreement with the expression above.  <\/p>\n<p>How memory efficient is this hash-based approach to representing <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>? It&#8217;s obviously likely to be quite a bit better than storing full representations of the objects in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>.  But we&#8217;ll see later that Bloom filters can be far more memory efficient still.<\/p>\n<p>The big drawback of this hash-based approach is the false positives. Still, for many applications it&#8217;s fine to have a small probability of a false positive.  For example, false positives turn out to be okay for the safe web browsing problem.  You might worry that false positives would cause some safe sites to erroneously be reported as unsafe, but the browser can avoid this by maintaining a (small!)  list of safe sites which are false positives for <tt>test<\/tt>.<\/p>\n<p><strong>Idea: use a bit array:<\/strong> Suppose we want to represent some subset <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> of the integers <img src='https:\/\/s0.wp.com\/latex.php?latex=0%2C+1%2C+2%2C+%5Cldots%2C+999&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0, 1, 2, \\ldots, 999' title='0, 1, 2, \\ldots, 999' class='latex' \/>.  As an alternative to hashing or to storing <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> directly, we could represent <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> using an array of <img src='https:\/\/s0.wp.com\/latex.php?latex=1000&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1000' title='1000' class='latex' \/> bits, numbered <img src='https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' \/> through <img src='https:\/\/s0.wp.com\/latex.php?latex=999&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='999' title='999' class='latex' \/>.  We would set bits in the array to <img src='https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' \/> if the corresponding number is in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>, and otherwise set them to <img src='https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' \/>.  It&#8217;s obviously trivial to <tt>add<\/tt> objects to <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>, and to <tt>test<\/tt> whether a particular object is in <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> or not.<\/p>\n<p>The memory cost to store <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> in this bit-array approach is <img src='https:\/\/s0.wp.com\/latex.php?latex=1000&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1000' title='1000' class='latex' \/> bits, regardless of how big or small <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> is.  Suppose, for comparison, that we stored <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> directly as a list of 32-bit integers. Then the cost would be <img src='https:\/\/s0.wp.com\/latex.php?latex=32+%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='32 |S|' title='32 |S|' class='latex' \/> bits.  When <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> is very small, this approach would be more memory efficient than using a bit array.  But as <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> gets larger, storing <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> directly becomes much less memory efficient.  We could ameliorate this somewhat by storing elements of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> using only 10 bits, instead of 32 bits.  But even if we did this, it would still be more expensive to store the list once <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> got beyond one hundred elements.  So a bit array really would be better for modestly large subsets.<\/p>\n<p><strong>Idea: use a bit array where the indices are given by hashes:<\/strong> A problem with the bit array example described above is that we needed a way of numbering the possible elements of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>, <img src='https:\/\/s0.wp.com\/latex.php?latex=0%2C1%2C%5Cldots%2C999&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,1,\\ldots,999' title='0,1,\\ldots,999' class='latex' \/>.  In general the elements of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> may be complicated objects, not numbers in a small, well-defined range.<\/p>\n<p>Fortunately, we can use hashing to number the elements of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>. Suppose <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28%5Ccdot%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(\\cdot)' title='h(\\cdot)' class='latex' \/> is an <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/>-bit hash function.  We&#8217;re going to represent a set <img src='https:\/\/s0.wp.com\/latex.php?latex=S+%3D+%5C%7Bx_0%2C%5Cldots%2Cx_%7B%7CS%7C-1%7D%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S = \\{x_0,\\ldots,x_{|S|-1}\\}' title='S = \\{x_0,\\ldots,x_{|S|-1}\\}' class='latex' \/> using a bit array containing <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' \/> elements.  In particular, for each <img src='https:\/\/s0.wp.com\/latex.php?latex=x_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_j' title='x_j' class='latex' \/> we set the <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x_j)' title='h(x_j)' class='latex' \/>th element in the bit array, where we regard <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x_j)' title='h(x_j)' class='latex' \/> as a number in the range <img src='https:\/\/s0.wp.com\/latex.php?latex=0%2C1%2C%5Cldots%2C2%5Em-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,1,\\ldots,2^m-1' title='0,1,\\ldots,2^m-1' class='latex' \/>.  More explicitly, we can <tt>add<\/tt> an element <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> to the set by setting bit number <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' \/> in the bit array.  And we can <tt>test<\/tt> whether <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is an element of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> by checking whether bit number <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' \/> in the bit array is set.<\/p>\n<p>This is a good scheme, but the <tt>test<\/tt> can fail to give the correct result, which occurs whenever <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is not an element of <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/>, yet <img src='https:\/\/s0.wp.com\/latex.php?latex=h%28x%29+%3D+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) = h(x_j)' title='h(x) = h(x_j)' class='latex' \/> for some <img src='https:\/\/s0.wp.com\/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' \/>.  This is exactly the same failure condition as for the basic hashing scheme we described earlier.  By exactly the same reasoning as used then, the failure probability is<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5B%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+p+%3D+1-%281-1%2F2%5Em%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [*] \\,\\,\\,\\, p = 1-(1-1\/2^m)^{|S|}. ' title='   [*] \\,\\,\\,\\, p = 1-(1-1\/2^m)^{|S|}. ' class='latex' \/>\n<p>As we did earlier, we&#8217;d like to re-express this in terms of the number of bits of memory used, <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/>.  This works differently than for the basic hashing scheme, since the number of bits of memory consumed by the current approach is <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%3D+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# = 2^m' title='\\# = 2^m' class='latex' \/>, as compared to <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%3D+%7CS%7Cm&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# = |S|m' title='\\# = |S|m' class='latex' \/> for the earlier scheme.  Using <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%3D+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# = 2^m' title='\\# = 2^m' class='latex' \/> and substituting for <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/> in Equation [*], we have:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+1-%281-1%2F%5C%23%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = 1-(1-1\/\\#)^{|S|}. ' title='   p = 1-(1-1\/\\#)^{|S|}. ' class='latex' \/>\n<p>Rearranging this to express <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/> in term of <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> we obtain:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7B1%7D%7B1-%281-p%29%5E%7B1%2F%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# = \\frac{1}{1-(1-p)^{1\/|S|}}. ' title='   \\# = \\frac{1}{1-(1-p)^{1\/|S|}}. ' class='latex' \/>\n<p>When <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> is small this can be approximated by<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%5Capprox+%5Cfrac%7B%7CS%7C%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# \\approx \\frac{|S|}{p}. ' title='   \\# \\approx \\frac{|S|}{p}. ' class='latex' \/>\n<p>This isn&#8217;t very memory efficient!  We&#8217;d like the probability of failure <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> to be small, and that makes the <img src='https:\/\/s0.wp.com\/latex.php?latex=1%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1\/p' title='1\/p' class='latex' \/> dependence bad news when compared to the <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28%7CS%7C%2Fp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(|S|\/p)' title='\\log(|S|\/p)' class='latex' \/> dependence of the basic hashing scheme described earlier.  The only time the current approach is better is when <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> is very, very large.  To get some idea for just how large, if we want <img src='https:\/\/s0.wp.com\/latex.php?latex=p+%3D+0.01&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = 0.01' title='p = 0.01' class='latex' \/>, then <img src='https:\/\/s0.wp.com\/latex.php?latex=1%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1\/p' title='1\/p' class='latex' \/> is only better than <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog%28%7CS%7C%2Fp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log(|S|\/p)' title='\\log(|S|\/p)' class='latex' \/> when <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> gets to be more than about <img src='https:\/\/s0.wp.com\/latex.php?latex=1.27+%2A+10%5E%7B28%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1.27 * 10^{28}' title='1.27 * 10^{28}' class='latex' \/>. That&#8217;s quite a set!  In practice, the basic hashing scheme will be much more memory efficient.<\/p>\n<p>Intuitively, it&#8217;s not hard to see why this approach is so memory inefficient compared to the basic hashing scheme.  The problem is that with an <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/>-bit hash function, the basic hashing scheme used <img src='https:\/\/s0.wp.com\/latex.php?latex=m%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m|S|' title='m|S|' class='latex' \/> bits of memory, while hashing into a bit array uses <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' \/> bits, but doesn&#8217;t change the probability of failure.  That&#8217;s exponentially more memory!<\/p>\n<p>At this point, hashing into bit arrays looks like a bad idea.  But it turns out that by tweaking the idea just a little we can improve it a lot.  To carry out this tweaking, it helps to name the data structure we&#8217;ve just described (where we hash into a bit array).  We&#8217;ll call it a <em>filter<\/em>, anticipating the fact that it&#8217;s a precursor to the Bloom filter.  I don&#8217;t know whether &#8220;filter&#8221; is a standard name, but in any case it&#8217;ll be a useful working name.<\/p>\n<p><strong>Idea: use multiple filters:<\/strong> How can we make the basic filter just described more memory efficient?  One possibility is to try using multiple filters, based on independent hash functions. More precisely, the idea is to use <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> filters, each based on an (independent) <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/>-bit hash function, <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%2C+h_1%2C+%5Cldots%2C+h_%7Bk-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0, h_1, \\ldots, h_{k-1}' title='h_0, h_1, \\ldots, h_{k-1}' class='latex' \/>.  So our data structure will consist of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> separate bit arrays, each containing <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' \/> bits, for a grand total of <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%3D+k+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# = k 2^m' title='\\# = k 2^m' class='latex' \/> bits.  We can <tt>add<\/tt> an element <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> by setting the <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x)' title='h_0(x)' class='latex' \/>th bit in the first bit array (i.e., the first filter), the <img src='https:\/\/s0.wp.com\/latex.php?latex=h_1%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_1(x)' title='h_1(x)' class='latex' \/>th bit in the second filter, and so on.  We can <tt>test<\/tt> whether a candidate element <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is in the set by simply checking whether all the appropriate bits are set in each filter.  For this to fail, each individual filter must fail.  Because the hash functions are independent of one another, the probability of this is the <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/>th power of any single filter failing:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F2%5Em%29%5E%7B%7CS%7C%7D%5Cright%29%5Ek.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \\left(1-(1-1\/2^m)^{|S|}\\right)^k. ' title='   p = \\left(1-(1-1\/2^m)^{|S|}\\right)^k. ' class='latex' \/>\n<p>The number of bits of memory used by this data structure is <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%3D+k+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# = k 2^m' title='\\# = k 2^m' class='latex' \/> and so we can substitute <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em+%3D+%5C%23%2Fk&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m = \\#\/k' title='2^m = \\#\/k' class='latex' \/> and rearrange to get<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5B%2A%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%23+%3D+%5Cfrac%7Bk%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2F%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [**] \\,\\,\\,\\, \\# = \\frac{k}{1-(1-p^{1\/k})^{1\/|S|}}. ' title='   [**] \\,\\,\\,\\, \\# = \\frac{k}{1-(1-p^{1\/k})^{1\/|S|}}. ' class='latex' \/>\n<p>Provided <img src='https:\/\/s0.wp.com\/latex.php?latex=p%5E%7B1%2Fk%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1\/k}' title='p^{1\/k}' class='latex' \/> is much smaller than <img src='https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' \/>, this expression can be simplified to give<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%5Capprox+%5Cfrac%7Bk%7CS%7C%7D%7Bp%5E%7B1%2Fk%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# \\approx \\frac{k|S|}{p^{1\/k}}. ' title='   \\# \\approx \\frac{k|S|}{p^{1\/k}}. ' class='latex' \/>\n<p>Good news!  This repetition strategy is much more memory efficient than a single filter, at least for small values of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/>.  For instance, moving from <img src='https:\/\/s0.wp.com\/latex.php?latex=k+%3D+1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k = 1' title='k = 1' class='latex' \/> repetitions to <img src='https:\/\/s0.wp.com\/latex.php?latex=k+%3D+2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k = 2' title='k = 2' class='latex' \/> repititions changes the denominator from <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> to <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Csqrt%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\sqrt{p}' title='\\sqrt{p}' class='latex' \/> &#8211; typically, a huge improvement, since <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> is very small.  And the only price paid is doubling the numerator.  So this is a big win.<\/p>\n<p>Intuitively, and in retrospect, this result is not so surprising. Putting multiple filters in a row, the probability of error drops exponentially with the number of filters.  By contrast, in the single filter scheme, the drop in the probability of error is roughly linear with the number of bits.  (This follows from considering Equation [*] in the limit where <img src='https:\/\/s0.wp.com\/latex.php?latex=1%2F2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1\/2^m' title='1\/2^m' class='latex' \/> is small.)  So using multiple filters is a good strategy.<\/p>\n<p>Of course, a caveat to the last paragraph is that this analysis requires that <img src='https:\/\/s0.wp.com\/latex.php?latex=p%5E%7B1%2Fk%7D+%5Cll+1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1\/k} \\ll 1' title='p^{1\/k} \\ll 1' class='latex' \/>, which means that <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> can&#8217;t be too large before the analysis breaks down.  For larger values of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> the analysis is somewhat more complicated.  In order to find the optimal value of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> we&#8217;d need to figure out what value of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> minimizes the exact expression [**] for <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/>.  We won&#8217;t bother &#8211; at best it&#8217;d be tedious, and, as we&#8217;ll see shortly, there is in any case a better approach.<\/p>\n<p><strong>Overlapping filters:<\/strong> This is a variation on the idea of repeating filters.  Instead of having <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> separate bit arrays, we use just a single array of <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' \/> bits.  When <tt>add<\/tt>ing an object <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/>, we simply set all the bits <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C%5Cldots%2C+h_%7Bk-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x),\\ldots, h_{k-1}(x)' title='h_0(x), h_1(x),\\ldots, h_{k-1}(x)' class='latex' \/> in the same bit array.  To <tt>test<\/tt> whether an element <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is in the set, we simply check whether all the bits <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C%5Cldots%2C+h_%7Bk-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x),\\ldots, h_{k-1}(x)' title='h_0(x), h_1(x),\\ldots, h_{k-1}(x)' class='latex' \/> are set or not.<\/p>\n<p>What&#8217;s the probability of the <tt>test<\/tt> failing?  Suppose <img src='https:\/\/s0.wp.com\/latex.php?latex=x+%5Cnot+%5Cin+S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x \\not \\in S' title='x \\not \\in S' class='latex' \/>.  Failure occurs when <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) = h_{i_0}(x_{j_0})' title='h_0(x) = h_{i_0}(x_{j_0})' class='latex' \/> for some <img src='https:\/\/s0.wp.com\/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' \/>, and also <img src='https:\/\/s0.wp.com\/latex.php?latex=h_1%28x%29+%3D+h_%7Bi_1%7D%28x_%7Bj_1%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_1(x) = h_{i_1}(x_{j_1})' title='h_1(x) = h_{i_1}(x_{j_1})' class='latex' \/> for some <img src='https:\/\/s0.wp.com\/latex.php?latex=i_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_1' title='i_1' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=j_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_1' title='j_1' class='latex' \/>, and so on for all the remaining hash functions, <img src='https:\/\/s0.wp.com\/latex.php?latex=h_2%2C+h_3%2C%5Cldots%2C+h_%7Bk-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_2, h_3,\\ldots, h_{k-1}' title='h_2, h_3,\\ldots, h_{k-1}' class='latex' \/>.  These are independent events, and so the probability they all occur is just the product of the probabilities of the individual events.  A little thought should convince you that each individual event will have the same probability, and so we can just focus on computing the probability that <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) = h_{i_0}(x_{j_0})' title='h_0(x) = h_{i_0}(x_{j_0})' class='latex' \/> for some <img src='https:\/\/s0.wp.com\/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' \/>.  The overall probability <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/> of failure will then be the <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/>th power of that probability, i.e.,<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+p%28h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29+%5Cmbox%7B+for+some+%7D+i_0%2Cj_0%29%5Ek+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = p(h_0(x) = h_{i_0}(x_{j_0}) \\mbox{ for some } i_0,j_0)^k ' title='   p = p(h_0(x) = h_{i_0}(x_{j_0}) \\mbox{ for some } i_0,j_0)^k ' class='latex' \/>\n<p>The probability that <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) = h_{i_0}(x_{j_0})' title='h_0(x) = h_{i_0}(x_{j_0})' class='latex' \/> for some <img src='https:\/\/s0.wp.com\/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' \/> is one minus the probability that <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29+%5Cneq+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) \\neq h_{i_0}(x_{j_0})' title='h_0(x) \\neq h_{i_0}(x_{j_0})' class='latex' \/> for all <img src='https:\/\/s0.wp.com\/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' \/>.  These are independent events for the different possible values of <img src='https:\/\/s0.wp.com\/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' \/>, each with probability <img src='https:\/\/s0.wp.com\/latex.php?latex=1-1%2F2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1-1\/2^m' title='1-1\/2^m' class='latex' \/>, and so<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p%28h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29+%5Cmbox%7B+for+some+%7D+i_0%2Cj_0%29+%3D+1-%281-1%2F2%5Em+%29%5E%7Bk%7CS%7C%7D%2C+++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p(h_0(x) = h_{i_0}(x_{j_0}) \\mbox{ for some } i_0,j_0) = 1-(1-1\/2^m )^{k|S|},   ' title='   p(h_0(x) = h_{i_0}(x_{j_0}) \\mbox{ for some } i_0,j_0) = 1-(1-1\/2^m )^{k|S|},   ' class='latex' \/>\n<p>since there are <img src='https:\/\/s0.wp.com\/latex.php?latex=k%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k|S|' title='k|S|' class='latex' \/> different pairs of possible values <img src='https:\/\/s0.wp.com\/latex.php?latex=%28i_0%2C+j_0%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(i_0, j_0)' title='(i_0, j_0)' class='latex' \/>.  It follows that<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F2%5Em+%29%5E%7Bk%7CS%7C%7D%5Cright%29%5Ek.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \\left(1-(1-1\/2^m )^{k|S|}\\right)^k. ' title='   p = \\left(1-(1-1\/2^m )^{k|S|}\\right)^k. ' class='latex' \/>\n<p>Substituting <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em+%3D+%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m = \\#' title='2^m = \\#' class='latex' \/> we obtain<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F%5C%23+%29%5E%7Bk%7CS%7C%7D%5Cright%29%5Ek+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \\left(1-(1-1\/\\# )^{k|S|}\\right)^k ' title='   p = \\left(1-(1-1\/\\# )^{k|S|}\\right)^k ' class='latex' \/>\n<p>which can be rearranged to obtain<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7B1%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2Fk%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# = \\frac{1}{1-(1-p^{1\/k})^{1\/k|S|}}. ' title='   \\# = \\frac{1}{1-(1-p^{1\/k})^{1\/k|S|}}. ' class='latex' \/>\n<p>This is remarkably similar to the expression [**] derived above for repeating filters.  In fact, provided <img src='https:\/\/s0.wp.com\/latex.php?latex=p%5E%7B1%2Fk%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1\/k}' title='p^{1\/k}' class='latex' \/> is much smaller than <img src='https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' \/>, we get<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%5Capprox+%5Cfrac%7Bk%7CS%7C%7D%7Bp%5E%7B1%2Fk%7D%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# \\approx \\frac{k|S|}{p^{1\/k}}, ' title='   \\# \\approx \\frac{k|S|}{p^{1\/k}}, ' class='latex' \/>\n<p>which is exactly the same as [**] when <img src='https:\/\/s0.wp.com\/latex.php?latex=p%5E%7B1%2Fk%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1\/k}' title='p^{1\/k}' class='latex' \/> is small.  So this approach gives quite similar outcomes to the repeating filter strategy.<\/p>\n<p>Which approach is better, repeating or overlapping filters?  In fact, it can be shown that<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5Cfrac%7B1%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2Fk%7CS%7C%7D%7D+%5Cleq+%5Cfrac%7Bk%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2F%7CS%7C%7D%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\frac{1}{1-(1-p^{1\/k})^{1\/k|S|}} \\leq \\frac{k}{1-(1-p^{1\/k})^{1\/|S|}}, ' title='   \\frac{1}{1-(1-p^{1\/k})^{1\/k|S|}} \\leq \\frac{k}{1-(1-p^{1\/k})^{1\/|S|}}, ' class='latex' \/>\n<p>and so the overlapping filter strategy is more memory efficient than the repeating filter strategy.  I won&#8217;t prove the inequality here &#8211; it&#8217;s a straightforward (albeit tedious) exercise in calculus.  The important takeaway is that overlapping filters are the more memory-efficient approach.<\/p>\n<p>How do overlapping filters compare to our first approach, the basic hashing strategy?  I&#8217;ll defer a full answer until later, but we can get some insight by choosing <img src='https:\/\/s0.wp.com\/latex.php?latex=p+%3D+0.0001&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = 0.0001' title='p = 0.0001' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=k%3D4&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k=4' title='k=4' class='latex' \/>.  Then for the overlapping filter we get <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%5Capprox+40%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# \\approx 40|S|' title='\\# \\approx 40|S|' class='latex' \/>, while the basic hashing strategy gives <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23+%3D+%7CS%7C+%5Clog%28+10000+%7CS%7C%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\# = |S| \\log( 10000 |S|)' title='\\# = |S| \\log( 10000 |S|)' class='latex' \/>. Basic hashing is worse whenever <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> is more than about 100 million &#8211; a big number, but also a big improvement over the <img src='https:\/\/s0.wp.com\/latex.php?latex=1.27+%2A+10%5E%7B28%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1.27 * 10^{28}' title='1.27 * 10^{28}' class='latex' \/> required by a single filter.  Given that we haven&#8217;t yet made any attempt to optimize <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/>, this ought to encourage us that we&#8217;re onto something.<\/p>\n<h3>Problems for the author<\/h3>\n<ul>\n<li> I suspect that there&#8217;s a simple intuitive argument that would   let us see upfront that overlapping filters will be more memory   efficient than repeating filters.  Can I find such an argument? <\/ul>\n<p><strong>Bloom filters:<\/strong> We&#8217;re finally ready for Bloom filters.  In fact, Bloom filters involve only a few small changes to overlapping filters.  In describing overlapping filters we hashed into a bit array containing <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' \/> bits.  We could, instead, have used hash functions with a range <img src='https:\/\/s0.wp.com\/latex.php?latex=0%2C%5Cldots%2CM-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,\\ldots,M-1' title='0,\\ldots,M-1' class='latex' \/> and hashed into a bit array of <img src='https:\/\/s0.wp.com\/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' \/> (instead of <img src='https:\/\/s0.wp.com\/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' \/>) bits. The analysis goes through unchanged if we do this, and we end up with<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F%5C%23+%29%5E%7Bk%7CS%7C%7D%5Cright%29%5Ek+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \\left(1-(1-1\/\\# )^{k|S|}\\right)^k ' title='   p = \\left(1-(1-1\/\\# )^{k|S|}\\right)^k ' class='latex' \/>\n<p>and<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7B1%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2Fk%7CS%7C%7D%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# = \\frac{1}{1-(1-p^{1\/k})^{1\/k|S|}}, ' title='   \\# = \\frac{1}{1-(1-p^{1\/k})^{1\/k|S|}}, ' class='latex' \/>\n<p>exactly as before. The only reason I didn&#8217;t do this earlier is because in deriving Equation [*] above it was convenient to re-use the reasoning from the basic hashing scheme, where <img src='https:\/\/s0.wp.com\/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' \/> (not <img src='https:\/\/s0.wp.com\/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' \/>) was the convenient parameter to use.  But the exact same reasoning works.<\/p>\n<p>What&#8217;s the best value of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> to choose?  Put another way, what value of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> should we choose in order to minimize the number of bits, <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/>, given a particular value for the probability of error, <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/>, and a particular sizek <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>?  Equivalently, what value of <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> will minimize <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/>, given <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>?  I won&#8217;t go through the full analysis here, but with calculus and some algebra you can show that choosing<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++k+%5Capprox+%5Cfrac%7B%5C%23%7D%7B%7CS%7C%7D+%5Cln+2+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k \\approx \\frac{\\#}{|S|} \\ln 2 ' title='   k \\approx \\frac{\\#}{|S|} \\ln 2 ' class='latex' \/>\n<p>minimizes the probability <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/>.  (Note that <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cln&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\ln' title='\\ln' class='latex' \/> denotes the natural logarithm, not logarithms to base 2.)  By choosing <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> in this way we get:<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5B%2A%2A%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%23+%3D+%5Cfrac%7B%7CS%7C%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [***] \\,\\,\\,\\, \\# = \\frac{|S|}{\\ln 2} \\log \\frac{1}{p}. ' title='   [***] \\,\\,\\,\\, \\# = \\frac{|S|}{\\ln 2} \\log \\frac{1}{p}. ' class='latex' \/>\n<p>This really is good news!  Not only is it better than a bit array, it&#8217;s actually (usually) much better than the basic hashing scheme we began with.  In particular, it will be better whenever<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5Cfrac%7B1%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Cleq+%5Clog+%5Cfrac%7B%7CS%7C%7D%7Bp%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\frac{1}{\\ln 2} \\log \\frac{1}{p} \\leq \\log \\frac{|S|}{p}, ' title='   \\frac{1}{\\ln 2} \\log \\frac{1}{p} \\leq \\log \\frac{|S|}{p}, ' class='latex' \/>\n<p>which is equivalent to requiring<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%7CS%7C+%5Cgeq+p%5E%7B1-1%2F%5Cln+2%7D+%5Capprox+%5Cfrac%7B1%7D%7Bp%5E%7B0.44%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   |S| \\geq p^{1-1\/\\ln 2} \\approx \\frac{1}{p^{0.44}}. ' title='   |S| \\geq p^{1-1\/\\ln 2} \\approx \\frac{1}{p^{0.44}}. ' class='latex' \/>\n<p>If we want (say) <img src='https:\/\/s0.wp.com\/latex.php?latex=p%3D0.01&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p=0.01' title='p=0.01' class='latex' \/> this means that Bloom filter will be better whenever <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C+%5Cgeq+8&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S| \\geq 8' title='|S| \\geq 8' class='latex' \/>, which is obviously an extremely modest set size.<\/p>\n<p>Another way of interpreting [***] is that a Bloom filter requires <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7B1%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Capprox+1.44+%5Clog+%5Cfrac%7B1%7D%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\frac{1}{\\ln 2} \\log \\frac{1}{p} \\approx 1.44 \\log \\frac{1}{p}' title='\\frac{1}{\\ln 2} \\log \\frac{1}{p} \\approx 1.44 \\log \\frac{1}{p}' class='latex' \/> bits per element of the set being represented.  In fact, it&#8217;s possible to prove that any data structure supporting the <tt>add<\/tt> and <tt>test<\/tt> operations will require at least <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog+%5Cfrac%7B1%7D%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log \\frac{1}{p}' title='\\log \\frac{1}{p}' class='latex' \/> bits per element in the set.  This means that Bloom filters are near-optimal.  Futher work has been done finding even more memory-efficient data structures that actually meet the <img src='https:\/\/s0.wp.com\/latex.php?latex=%5Clog+%5Cfrac%7B1%7D%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\log \\frac{1}{p}' title='\\log \\frac{1}{p}' class='latex' \/> bound.  See, for example, the paper by <a href=\"http:\/\/scholar.google.ca\/scholar?cluster=13031359803369786500&#038;hl=en&#038;as_sdt=0,5\">Anna   Pagh, Rasmus Pagh, and S. Srinivasa Rao<\/a>.<\/p>\n<h3>Problems for the author<\/h3>\n<ul>\n<li> Are the more memory-efficient algorithms practical?  Should we   be using them? <\/ul>\n<p>In actual applications of Bloom filters, we won&#8217;t know <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> in advance, nor <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>. So the way we usually specify a Bloom filter is to specify the <em>maximum<\/em> size <img src='https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' \/> of set that we&#8217;d like to be able to represent, and the maximal probability of error, <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/>, that we&#8217;re willing to tolerate.  Then we choose<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7Bn%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# = \\frac{n}{\\ln 2} \\log \\frac{1}{p} ' title='   \\# = \\frac{n}{\\ln 2} \\log \\frac{1}{p} ' class='latex' \/>\n<p>and<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++k+%3D+%5Cln+%5Cfrac%7B1%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k = \\ln \\frac{1}{p}. ' title='   k = \\ln \\frac{1}{p}. ' class='latex' \/>\n<p>This gives us a Bloom filter capable of representing any set up to size <img src='https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' \/>, with probability of error guaranteed to be at most <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/>.  The size <img src='https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' \/> is called the <em>capacity<\/em> of the Bloom filter.  Actually, these expressions are slight simplifications, since the terms on the right may not be integers &#8211; to be a little more pedantic, we choose<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%3D+%5Clceil+%5Cfrac%7Bn%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# = \\lceil \\frac{n}{\\ln 2} \\log \\frac{1}{p} \\rceil ' title='   \\# = \\lceil \\frac{n}{\\ln 2} \\log \\frac{1}{p} \\rceil ' class='latex' \/>\n<p>and<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++k+%3D+%5Clceil+%5Cln+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k = \\lceil \\ln \\frac{1}{p} \\rceil. ' title='   k = \\lceil \\ln \\frac{1}{p} \\rceil. ' class='latex' \/>\n<p>One thing that still bugs me about Bloom filters is the expression for the optimal value for <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/>.  I don&#8217;t have a good intuition for it &#8211; why is it logarithmic in <img src='https:\/\/s0.wp.com\/latex.php?latex=1%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1\/p' title='1\/p' class='latex' \/>, and why does it not depend on <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/>? There&#8217;s a tradeoff going on here that&#8217;s quite strange when you think about it: bit arrays on their own aren&#8217;t very good, but if you repeat or overlap them just the right number of times, then performance improves a lot.  And so you can think of Bloom filters as a kind of compromise between an overlap strategy and a bit array strategy.  But it&#8217;s really not at all obvious (a) why choosing a compromise strategy is the best; or (b) why the right point at which to compromise is where it is, i.e., why <img src='https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' \/> has the form it does.  I can&#8217;t quite answer these questions at this point &#8211; I can&#8217;t see that far through Bloom filters.  I suspect that understanding the <img src='https:\/\/s0.wp.com\/latex.php?latex=k+%3D+2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k = 2' title='k = 2' class='latex' \/> case really well would help, but haven&#8217;t put in the work.  Anyone with more insight is welcome to speak up!<\/p>\n<p><strong>Summing up Bloom filters:<\/strong> Let&#8217;s collect everything together. Suppose we want a Bloom filter with capacity <img src='https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' \/>, i.e., capable of representing any set <img src='https:\/\/s0.wp.com\/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' \/> containing up to <img src='https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' \/> elements, and such that <tt>test<\/tt> produces a false positive with probability at most <img src='https:\/\/s0.wp.com\/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' \/>. Then we choose<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++k+%3D+%5Clceil+%5Cln+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k = \\lceil \\ln \\frac{1}{p} \\rceil ' title='   k = \\lceil \\ln \\frac{1}{p} \\rceil ' class='latex' \/>\n<p>independent hash functions, <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%2C+h_1%2C+%5Cldots%2C+h_%7Bk-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0, h_1, \\ldots, h_{k-1}' title='h_0, h_1, \\ldots, h_{k-1}' class='latex' \/>.  Each hash function has a range <img src='https:\/\/s0.wp.com\/latex.php?latex=0%2C%5Cldots%2C%5C%23-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,\\ldots,\\#-1' title='0,\\ldots,\\#-1' class='latex' \/>, where <img src='https:\/\/s0.wp.com\/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\\#' title='\\#' class='latex' \/> is the number of bits of memory our Bloom filter requires,<\/p>\n<img src='https:\/\/s0.wp.com\/latex.php?latex=+++%5C%23+%3D+%5Clceil+%5Cfrac%7Bn%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \\# = \\lceil \\frac{n}{\\ln 2} \\log \\frac{1}{p} \\rceil. ' title='   \\# = \\lceil \\frac{n}{\\ln 2} \\log \\frac{1}{p} \\rceil. ' class='latex' \/>\n<p>We number the bits in our Bloom filter from <img src='https:\/\/s0.wp.com\/latex.php?latex=0%2C%5Cldots%2C%5C%23-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,\\ldots,\\#-1' title='0,\\ldots,\\#-1' class='latex' \/>.  To <tt>add<\/tt> an element <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> to our set we set the bits <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots%2C+h_%7B%5C%23-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \\ldots, h_{\\#-1}(x)' title='h_0(x), h_1(x), \\ldots, h_{\\#-1}(x)' class='latex' \/> in the filter.  And to <tt>test<\/tt> whether a given element <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is in the set we simply check whether bits <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots%2C+h_%7B%5C%23-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \\ldots, h_{\\#-1}(x)' title='h_0(x), h_1(x), \\ldots, h_{\\#-1}(x)' class='latex' \/> in the bit array are all set.<\/p>\n<p>That&#8217;s all there is to the mechanics of how Bloom filters work!  I won&#8217;t give any sample code &#8211; I usually provide code samples in Python, but the Python standard library lacks bit arrays, so nearly all of the code would be concerned with defining a bit array class. That didn&#8217;t seem like it&#8217;d be terribly illuminating.  Of course, it&#8217;s not difficult to find libraries implementing Bloom filters.  For example, <a href=\"http:\/\/www.jasondavies.com\/\">Jason Davies<\/a> has written a javascript Bloom filter which has a fun and informative <a href=\"http:\/\/www.jasondavies.com\/bloomfilter\/\">online interactive   visualisation<\/a>.  I recommend checking it out.  And I&#8217;ve personally used <a href=\"http:\/\/mike.axiak.net\/\">Mike Axiak<\/a>&#8216;s fast C-based Python library <a href=\"https:\/\/github.com\/axiak\/pybloomfiltermmap\">pybloomfiltermmap<\/a> &#8211; the documentation is clear, it took just a few minutes to get up and running, and I&#8217;ve generally had no problems.<\/p>\n<h3>Problems<\/h3>\n<ul>\n<li> Suppose we have two Bloom filters, corresponding to sets <img src='https:\/\/s0.wp.com\/latex.php?latex=S_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_1' title='S_1' class='latex' \/>   and <img src='https:\/\/s0.wp.com\/latex.php?latex=S_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_2' title='S_2' class='latex' \/>.  How can we construct the Bloom filters corresponding to   the sets <img src='https:\/\/s0.wp.com\/latex.php?latex=S_1+%5Ccup+S_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_1 \\cup S_2' title='S_1 \\cup S_2' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=S_1+%5Ccap+S_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_1 \\cap S_2' title='S_1 \\cap S_2' class='latex' \/>?\n<\/ul>\n<p><strong>Applications of Bloom filters:<\/strong> Bloom filters have been used to solve many different problems.  Here&#8217;s just a few examples to give the flavour of how they can be used.  An early idea was Manber and Wu&#8217;s 1994 <a href=\"http:\/\/scholar.google.ca\/scholar?cluster=2662095141699171069\">proposal<\/a> to use Bloom filters to store lists of weak passwords.  Google&#8217;s <a href=\"http:\/\/research.google.com\/archive\/bigtable.html\">BigTable<\/a> storage system uses Bloom filters to speed up queries, by avoiding disk accesses for rows or columns that don&#8217;t exist.  Google Chrome uses Bloom filters to do <a href=\"http:\/\/src.chromium.org\/viewvc\/chrome\/trunk\/src\/chrome\/browser\/safe_browsing\/bloom_filter.h?view=markup\">safe   web browsing<\/a> &#8211; the opening example in this post was quite real! More generally, it&#8217;s useful to consider using Bloom filters whenever a large collection of objects needs to be stored.  They&#8217;re not appropriate for all purposes, but at the least it&#8217;s worth thinking about whether or not a Bloom filter can be applied.<\/p>\n<p><strong>Extensions of Bloom filters:<\/strong> There&#8217;s many clever ways of extending Bloom filters.  I&#8217;ll briefly describe one, just to give you the flavour, and provide links to several more.<\/p>\n<p><strong>A delete operation:<\/strong> It&#8217;s possible to modify Bloom filters so they support a <tt>delete<\/tt> operation that lets you remove an element from the set.  You can&#8217;t do this with a standard Bloom filter: it would require unsetting one or more of the bits <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \\ldots' title='h_0(x), h_1(x), \\ldots' class='latex' \/> in the bit array.  This could easily lead us to accidentally <tt>delete<\/tt> <em>other<\/em> elements in the set as well.<\/p>\n<p>Instead, the <tt>delete<\/tt> operation is implemented using an idea known as a <em>counting Bloom filter<\/em>.  The basic idea is to take a standard Bloom filter, and replace each bit in the bit array by a bucket containing several bits (usually 3 or 4 bits).  We&#8217;re going to treat those buckets as counters, initially set to <img src='https:\/\/s0.wp.com\/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' \/>.  We <tt>add<\/tt> an element <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> to the counting Bloom filter by <em>incrementing<\/em> each of the buckets numbered <img src='https:\/\/s0.wp.com\/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \\ldots' title='h_0(x), h_1(x), \\ldots' class='latex' \/>.  We <tt>test<\/tt> whether <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is in the counting Bloom filter by looking to see whether each of the corresponding buckets are non-zero.  And we <tt>delete<\/tt> <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> by decrementing each bucket.<\/p>\n<p>This strategy avoids the accidental deletion problem, because when two elements of the set <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> and <img src='https:\/\/s0.wp.com\/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' \/> hash into the same bucket, the count in that bucket will be at least <img src='https:\/\/s0.wp.com\/latex.php?latex=2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2' title='2' class='latex' \/>.  <tt>delete<\/tt>ing one of the elements, say <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/>, will still leave the count for the bucket at least <img src='https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' \/>, so <img src='https:\/\/s0.wp.com\/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' \/> won&#8217;t be accidentally deleted.  Of course, you could worry that this will lead us to erroneously conclude that <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> is still in the set after it&#8217;s been deleted.  But that can only happen if other elements in the set hash into every single bucket that <img src='https:\/\/s0.wp.com\/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' \/> hashes into.  That will only happen if <img src='https:\/\/s0.wp.com\/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' \/> is very large.<\/p>\n<p>Of course, that&#8217;s just the basic idea behind counting Bloom filters. A full analysis requires us to understand issues such as bucket overflow (when a counter gets incremented too many times), the optimal size for buckets, the probability of errors, and so on. I won&#8217;t get into that, but you there&#8217;s details in the further reading, below.<\/p>\n<p><strong>Other variations and further reading:<\/strong> There are many more variations on Bloom filters.  Just to give you the flavour of a few applications: (1) they can be modified to be used as lookup dictionaries, associating a value with each element <tt>add<\/tt>ed to the filter; (2) they can be modified so that the capacity scales up dynamically; and (3) they can be used to quickly approximate the number of elements in a set.  There are many more variations as well: Bloom filters have turned out to be a very generative idea!  This is part of why it&#8217;s useful to understand them deeply, since even if a standard Bloom filter can&#8217;t solve the particular problem you&#8217;re considering, it may be possible to come up with a variation which does.  You can get some idea of the scope of the known variations by looking at the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bloom_filter\">Wikipedia article<\/a>. I also like the <a href=\"http:\/\/scholar.google.ca\/scholar?cluster=7837240630058449829&#038;hl=en&#038;as_sdt=0,5\">survey   article<\/a> by <a href=\"http:\/\/en.wikipedia.org\/wiki\/Andrei_Broder\">Andrei Broder<\/a> and <a href=\"http:\/\/mybiasedcoin.blogspot.ca\/\">Michael Mitzenmacher<\/a>.  It&#8217;s a little more dated (2004) than the Wikipedia article, but nicely written and a good introduction.  For a shorter introduction to some variations, there&#8217;s also a recent <a href=\"http:\/\/matthias.vallentin.net\/blog\/2011\/06\/a-garden-variety-of-bloom-filters\/\">blog   post<\/a> by <a href=\"http:\/\/matthias.vallentin.net\/\">Matthias Vallentin<\/a>. You can get the flavour of current research by looking at some of the papers citing Bloom filters <a href=\"http:\/\/academic.research.microsoft.com\/Publication\/772630\/space-time-trade-offs-in-hash-coding-with-allowable-errors\">here<\/a>. Finally, you may enjoy reading the <a href=\"http:\/\/scholar.google.ca\/scholar?cluster=11454588508174765009&#038;hl=en&#038;as_sdt=0,5\">original   paper on Bloom filters<\/a>, as well as the <a href=\"http:\/\/scholar.google.ca\/scholar?cluster=18066790496670563714&#038;hl=en&#038;as_sdt=0,5\">original   paper on counting Bloom filters<\/a>.<\/p>\n<p><strong>Understanding data structures:<\/strong> I wrote this post because I recently realized that I didn&#8217;t understand any complex data structure in any sort of depth.  There are, of course, a huge number of striking data structures in computer science &#8211; just look at <a href=\"http:\/\/en.wikipedia.org\/wiki\/List_of_data_structures\">Wikipedia&#8217;s   amazing list<\/a>!  And while I&#8217;m familiar with many of the simpler data structures, I&#8217;m ignorant of most complex data structures.  There&#8217;s nothing wrong with that &#8211; unless one is a specialist in data structures there&#8217;s no need to master a long laundry list.  But what bothered me is that I hadn&#8217;t <em>thoroughly<\/em> mastered even a single complex data structure.  In some sense, I didn&#8217;t know what it means to understand a complex data structure, at least beyond surface mechanics.  By trying to reinvent Bloom filters, I&#8217;ve found that I&#8217;ve deepened my own understanding and, I hope, written something of interest to others.<\/p>\n<p>  <em>Interested in more?  Please <a href=\"https:\/\/michaelnielsen.org\/ddi\/feed\/\">subscribe to this blog<\/a>, or <a href=\"http:\/\/twitter.com\/\\#!\/michael_nielsen\">follow me on Twitter<\/a>.  You may also enjoy reading my new book about  open science, <a href=\"http:\/\/www.amazon.com\/Reinventing-Discovery-New-Networked-Science\/dp\/product-description\/0691148902\">Reinventing Discovery<\/a>.  <\/em> <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Imagine you&#8217;re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access http:\/\/domain\/etc. You&#8217;d like a way of checking whether domain is known to be&hellip; <a class=\"more-link\" href=\"https:\/\/michaelnielsen.org\/ddi\/why-bloom-filters-work-the-way-they-do\/\">Continue reading <span class=\"screen-reader-text\">Why Bloom filters work the way they do<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-75","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/75","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/comments?post=75"}],"version-history":[{"count":0,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/posts\/75\/revisions"}],"wp:attachment":[{"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/media?parent=75"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/categories?post=75"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/michaelnielsen.org\/ddi\/wp-json\/wp\/v2\/tags?post=75"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}