jeffparsons 4 days ago

I recently discovered "compact approximators", which can be seen as a generalisation of Bloom filters. Instead of telling you "element X is probably in the set", they tell you "a lower bound on the value associated with key X".

If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does — the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.

My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search to see what I need to invalidate at the time the group member was removed.

I'm new to using them, so I keep getting excited about new use cases — much like when I was first introduced to Bloom filters.
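
A rough sketch of the shape of such a structure, in Go (illustrative only; the FNV hashing, probe count, and names are arbitrary choices, not any particular published design): each cell keeps the largest timestamp merged into it, and a query takes the minimum over a key's cells, so with no collisions you get the exact value, and a collision can only push a reported removal time toward "more recent", which is the conservative direction for invalidation.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Approximator: like a Bloom filter, but each cell holds the largest
    // timestamp merged into it instead of a single bit.
    type Approximator struct {
        cells []int64
        k     int // probes per key
    }

    func NewApproximator(size, k int) *Approximator {
        return &Approximator{cells: make([]int64, size), k: k}
    }

    func (a *Approximator) probe(i int, key string) int {
        h := fnv.New64a()
        fmt.Fprintf(h, "%d:%s", i, key) // one derived hash per probe
        return int(h.Sum64() % uint64(len(a.cells)))
    }

    // Record notes that key (e.g. "member X removed from group Y") changed at ts.
    func (a *Approximator) Record(key string, ts int64) {
        for i := 0; i < a.k; i++ {
            if p := a.probe(i, key); ts > a.cells[p] {
                a.cells[p] = ts
            }
        }
    }

    // LastChanged reports when key last changed; the error is one-sided,
    // so the answer is never earlier than the truth.
    func (a *Approximator) LastChanged(key string) int64 {
        best := int64(-1)
        for i := 0; i < a.k; i++ {
            if v := a.cells[a.probe(i, key)]; best < 0 || v < best {
                best = v
            }
        }
        return best
    }

    func main() {
        a := NewApproximator(1<<20, 4)
        a.Record("group:42/member:7", 1700000000)
        fmt.Println(a.LastChanged("group:42/member:7")) // 1700000000 (exact if no collision)
        fmt.Println(a.LastChanged("group:42/member:9")) // 0: nothing recorded for this key
    }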

  • thomasmg 4 days ago

    This is a bit similar to count-min sketch, which can be used to calculate frequencies in a stream.

    Also interesting are "Invertible Bloom Lookup Tables" (IBLT), https://arxiv.org/pdf/1101.2245 , which allow adding and removing data, and retrieving the remaining data if "little" data remains. That means they can be used for error correction: add all the _correct_ data (let's say 1 GB of correct data) into a 1 MB IBLT. Say a downloader finds that the checksum doesn't match: they can download that 1 MB IBLT and remove the data from their (invalid) download. What remains is the delta: the error correction data. I know there are other ways you could do error correction, but this is very fast, and quite interesting technically.
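
    A toy sketch of the mechanism in Go (illustrative only; real IBLT implementations differ in detail, and in a real protocol both sides must share the hash seed): each record goes into k cells, each cell keeps a count plus XOR-sums of keys and key checksums, deletion is the same with the sign flipped, and "peeling" pure cells recovers whatever small difference remains.

        package main

        import (
            "fmt"
            "hash/maphash"
        )

        // A toy IBLT over uint64 "records" (in the download example a record
        // could be a block index combined with a block hash).
        type cell struct {
            count   int
            keySum  uint64 // XOR of all keys in this cell
            hashSum uint64 // XOR of a checksum of each key, to spot "pure" cells
        }

        type IBLT struct {
            cells []cell
            k     int
            seed  maphash.Seed // both sides must agree on this
        }

        func NewIBLT(cellsPerSubtable, k int, seed maphash.Seed) *IBLT {
            return &IBLT{cells: make([]cell, cellsPerSubtable*k), k: k, seed: seed}
        }

        func (t *IBLT) hash(i int, key uint64) uint64 {
            var h maphash.Hash
            h.SetSeed(t.seed)
            fmt.Fprintf(&h, "%d:%d", i, key)
            return h.Sum64()
        }

        func (t *IBLT) update(key uint64, delta int) {
            sub := len(t.cells) / t.k
            chk := t.hash(-1, key) // checksum of the key itself
            for i := 0; i < t.k; i++ {
                c := &t.cells[i*sub+int(t.hash(i, key)%uint64(sub))]
                c.count += delta
                c.keySum ^= key
                c.hashSum ^= chk
            }
        }

        func (t *IBLT) Insert(key uint64) { t.update(key, +1) }
        func (t *IBLT) Delete(key uint64) { t.update(key, -1) }

        // Peel recovers what remains, as long as "little" remains: keys inserted
        // but never deleted, and keys deleted that were never inserted.
        func (t *IBLT) Peel() (added, removed []uint64, ok bool) {
            for changed := true; changed; {
                changed = false
                for i := range t.cells {
                    c := t.cells[i]
                    pure := (c.count == 1 || c.count == -1) && t.hash(-1, c.keySum) == c.hashSum
                    if !pure {
                        continue
                    }
                    if c.count == 1 {
                        added = append(added, c.keySum)
                        t.Delete(c.keySum)
                    } else {
                        removed = append(removed, c.keySum)
                        t.Insert(c.keySum)
                    }
                    changed = true
                }
            }
            for _, c := range t.cells {
                if c.count != 0 || c.keySum != 0 || c.hashSum != 0 {
                    return added, removed, false // too much left to peel
                }
            }
            return added, removed, true
        }

        func main() {
            seed := maphash.MakeSeed()
            t := NewIBLT(32, 3, seed)
            for _, rec := range []uint64{1, 2, 3, 4, 5} { // the publisher's correct records
                t.Insert(rec)
            }
            for _, rec := range []uint64{1, 2, 3, 4, 99} { // downloader subtracts their (corrupt) copy
                t.Delete(rec)
            }
            fmt.Println(t.Peel()) // typically: [5] [99] true
        }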

    • nullc 3 days ago

      IBLT operates over sets, though; your downloading example is not over a set (the data has a position). You can make it work over data with positions by adding a position number to every encoding unit, but it's pure overhead. Other codes are more efficient.

      IBLT is also particularly bandwidth inefficient when the amount of correction data is not very large. Ordinary codes can recover a missing block (known error location) with just the amount of data missing or with twice the amount if the location(s) are unknown.

      There are codes with quasi-linear performance, though for your example application the number of errors is small, so it shouldn't really matter if the code is cubic in the number of errors to decode (which is about the worst any should be).

      • thomasmg 3 days ago

        Yes, I know it is not designed to be an error-correction code, and other codes (turbo codes, fountain codes) are a lot more efficient. But I wanted to mention it because it's related to the topic.

        > You can make it work over data with positions by adding a position number

        Yes, that's what I did in my experiments. I personally found it really simple to implement; I'm not sure how easy it is to implement a "real" error correction code.

        • nullc 3 days ago

          A fountain code is decoded by essentially the same mechanism; it's called peeling. It's quite fun to implement. Though a plain fountain code has similar bandwidth inefficiencies... also the implementations that exist only correct erasures (though you can turn errors into erasures using a checksum/MAC).

          You can see the fountain code as a very sparse linear system that is usually solvable through a fairly simple greedy algorithm. You can augment the sparse linear system with a dense one like an RS code, which is solved by Gaussian elimination, which is also fun if less braindead simple.

          This github user has a whole host of different competently implemented very high performance erasure codes:

          https://github.com/catid/wirehair
          https://github.com/catid/leopard
          https://github.com/catid/fecal
          https://github.com/catid/siamese

          As far as direct error codes go: there is a competent BCH code implementation in the Linux kernel, but it only works over small fields, so it doesn't support many independent blocks/packets.

          Then there is https://github.com/bitcoin-core/minisketch which I am a coauthor of: a set corrector like IBLT that achieves perfect communication efficiency (at a trade-off of cubic decode performance instead of linear, but with good constant factors). I mention it mostly because it contains fast implementations of the relevant and tricky number-theory algorithms needed for a fast RS error-correcting code over big fields (any size up to 2^64 at least)... but I'm not actually aware of a performant large-field RS error-correcting implementation. (IIRC there is an RS error corrector in GNU Radio, but again small field.)

          (Mostly the small-field stuff doesn't work for bigger fields because it uses a Chien search to find the roots of a polynomial, which is linear in the field size. Minisketch finds roots by factoring in log(|field|) time. Root finding is needed in algebraic error codes because it's how you find where the errors are. IIRC the Linux kernel BCH implementation uses factoring like we do, but is constructed to use a log table to turn multiplications into XORs.)

  • kwillets 3 days ago

    That's an interesting reference. I came up with what may be a simpler case of that -- a structure that estimates the last time a key was seen. It's an upper bound, as collisions make the time more recent, but never the other way. Instead of a semi-order it's a simple ordering, but it may be similar to the compact approximator.

    Rate limiters apparently use count-min over a fixed time interval, which is bursty, but I took the same hash structure and came up with "timestamp-min" that allows even pacing (also without "fill up"). The one-sided error is also useful for checking if a cache entry is too old.

    It doesn't fix everything (eg DDOS), but it can prevent any single client from over-requesting, or stealing another client's requests (collisions can be made as hard as necessary).

    https://github.com/KWillets/RecencySketch

  • nyrikki 4 days ago

    While there are real use cases for the above, if you are looking at Bloom filters make sure you check that the above actually works for your need.

    First-order logic with least fixed point, FO(LFP), captures P. Since P = co-P but we think NP != co-NP, Bloom filters often have value as a small and fast path to co-NP.

    In that case the collisions don't matter, because excluding membership is often more valuable than proving membership, and you have to choose one.

    That is why Outlook used one to cut down on address-completion lookups: even if a hit requires the expensive call to the server, the set of email addresses not in your address book is far larger.

    But it is all horses for courses.

    If your need does fit in P you have a lot of options, while the options for co-NP are far more rarefied.

  • taneq 4 days ago

    Ooh, this seems like it could be useful for collision avoidance (where you need to know a lower bound on your time to impact).

Const-me 4 days ago

There’s a useful but non-obvious property of these filters.

If you have two Bloom filters which use the same set of possible keys and the same filter setup, you can compute the intersection of their represented sets with a bitwise AND operation over the bitsets in these filters.

If you store the bitsets aligned and padded to the SIMD vector size (for example, an array of __m128i in C++ as opposed to the array of uint64 values in the Go example), computing such an intersection is very efficient on modern computers.

I once used that technique to filter 3D objects from a large set using the intersection of 3 range queries, one per coordinate. Compute two Bloom filters filled with the objects returned by the X and Y queries, intersect the filters, then iterate over the objects returned by the remaining Z query, testing against the intersected Bloom filter. Due to the probabilistic nature of the Bloom filter you still need to run the X and Y tests on the objects that pass the filter; however, with a proper setup the filter should reject the vast majority of them.
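
The intersection step itself is tiny; a rough sketch in Go terms (a plain uint64 loop, which a SIMD path such as an __m128i AND simply widens), assuming both filters share the same size and hash functions:

    // Intersect ANDs two Bloom filters built with identical parameters.
    // The result acts as a filter for the intersection of the two sets,
    // though with a somewhat higher false-positive rate than a filter
    // built directly from the true intersection.
    func Intersect(a, b []uint64) []uint64 {
        if len(a) != len(b) {
            panic("filters must share the same setup")
        }
        out := make([]uint64, len(a))
        for i := range a {
            out[i] = a[i] & b[i] // 64 bits at a time; SIMD widens this to 128/256/512
        }
        return out
    }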

  • ChuckMcM 4 days ago

    That's pretty clever. In my 3D experiments I always struggled with minimizing the number of objects in the world I had to consider for rendering on each frame based on the view/fog(range). This seems like it would help with that.

    • undefuser 3 days ago

      Have you tried using a quadtree?

      • ChuckMcM 3 days ago

        Yes! There was a great Graphics Gems article on implementing quadtrees. The thing that changes is the sheer number of things you might have in your environment.

        • undefuser 2 days ago

          So, in the end, did the quadtree help you solve the problem, or did you need something else?

          • ChuckMcM 12 hours ago

            In the end I gave up :-) The problem I was trying to solve was a physics-based deformable environment. Something like Minecraft but without the blockiness. Basically allowing one to dig as deep as one wanted into the planet, build as high as one wanted, and re-arrange as much as one wanted, all with physics rules combined with some matter-composition rules that would give one an 'authentic' experience. It was a stretch in the 90s and I think I stopped poking at it around 2005 :-). But part of the problem was that when you broke things you got multiple sub-objects. Potentially down to the level of "sand" (which in my world was a .001cc chunk).

  • econ 3 days ago

    From what I understand, Bloom filters hash each item, but when I invented (hah) them I used a bit array for each property, where each bit describes the item at the same offset. When searching for some properties one can AND those entire arrays together and eliminate candidates really fast.
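
    What that describes is essentially a bitmap index rather than a Bloom filter; a toy Go sketch of the layout (the names are made up), with one bitset per property and the item's offset as the bit position:

        // One bitset per property; bit i says whether item i has that property.
        type Index struct {
            n     int // number of items
            props map[string][]uint64
        }

        func (ix *Index) Set(prop string, item int) {
            if ix.props == nil {
                ix.props = map[string][]uint64{}
            }
            if ix.props[prop] == nil {
                ix.props[prop] = make([]uint64, (ix.n+63)/64)
            }
            ix.props[prop][item/64] |= 1 << (item % 64)
        }

        // Candidates ANDs the requested property bitsets; any surviving bit is
        // an item that has every property, found without touching the items.
        func (ix *Index) Candidates(props ...string) []uint64 {
            out := make([]uint64, (ix.n+63)/64)
            for i := range out {
                out[i] = ^uint64(0)
            }
            for _, p := range props {
                bits, ok := ix.props[p]
                if !ok {
                    return make([]uint64, (ix.n+63)/64) // unknown property: nothing matches
                }
                for i, w := range bits {
                    out[i] &= w
                }
            }
            return out
        }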

ozgrakkurt 4 days ago

Would recommend reading the RocksDB implementations of the Bloom filter and ribbon filter to anyone wanting to learn more about the production-level implementation side. They are extremely well explained in the comments and are state-of-the-art implementations as far as I know.

https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...

https://github.com/facebook/rocksdb/blob/main/util/ribbon_al...

  • almostgotcaught 3 days ago

    These are the kinds of comments that I write when I work really hard (and very long) on a PR and I know no one will really dig into it (kind of like "well at least I committed the findings to posterity").

    • econ 3 days ago

      If you keep at it someone someday will be blown away.

      • almostgotcaught 3 days ago

        Nah, ain't no one got time for that - and I don't blame anyone either (not like I read other people's comments).

celeritascelery 4 days ago

The author says that with a 1.2 GB Bloom filter and 7 hash functions, lookups are only 30ns. I have to assume this is because the cache has loaded all the benchmark values. My guess is that the benchmark is only testing a few elements. In real-world scenarios with a Bloom filter this big, most of those 7 probes will hit cold cache, since 1.2 GB is far too large to stay cached. That means lookups take much longer than 30ns. Probably still faster than going to the network or a database, though.

  • eliben 4 days ago

    Updated the result to 80ns - thanks for flagging this. This grows with the size of the data (because of more cache misses), and running the benchmark on the full billion takes a while.

    [That said, on a hot production Bloom filter, much can be loaded into caches anyway, so it's not an entirely unrealistic scenario that some of these are cache hits]

  • returningfory2 4 days ago

    This is the benchmark they wrote: https://github.com/eliben/code-for-blog/blob/7278526923168d2...

    The benchmark alternates between ~1 million different keys to check in the filter, explicitly to account for cache effects.

    • Tuna-Fish 4 days ago

      A single lookup is going to take more than 30ns; the reason they only see that number is that the OoO machinery of their CPU is good enough to run those lookups in parallel.

  • vlmutolo 2 days ago

    A simple extension of the Bloom filter called “block Bloom filters” fixes this. The idea is that the first hash is the index of a small constant-size block of your array, and the rest of the indices are within that block.

    So a single query to the filter should only have one or two cache misses, depending on the size of your block. Or even if your block is larger than a cache line, you can probably issue all the loads at once and only pay for the latency of one memory access.

    The downside of doing this is slightly more space usage relative to simple Bloom filters. I’d almost always reach for block Bloom filters, though, once the filter becomes a significant fraction of cache size.

    I implemented block Bloom filters for fairly large (~GB) arrays and saw about 35ns per lookup. They're excellent data structures, pretty much as fast as you can get for approximate membership tests (though other filters have better space-time tradeoffs).
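
    A rough sketch of the layout (the block size, hash splitting, and names are illustrative choices): the first part of the hash picks a 64-byte block, and the remaining bits pick positions inside that block, so a query touches at most one cache line.

        const blockWords = 8 // 8 × 64 bits = one 64-byte cache line per block

        type BlockBloom struct {
            blocks [][blockWords]uint64
            k      int
        }

        func NewBlockBloom(numBlocks, k int) *BlockBloom {
            return &BlockBloom{blocks: make([][blockWords]uint64, numBlocks), k: k}
        }

        // split one 64-bit hash into a block index plus k bit positions.
        func (f *BlockBloom) positions(h uint64) (block uint64, bits []uint) {
            block = h % uint64(len(f.blocks))
            h /= uint64(len(f.blocks))
            for i := 0; i < f.k; i++ {
                bits = append(bits, uint(h%(blockWords*64)))
                h = h>>9 | h<<55 // recycle hash bits; fine for a sketch
            }
            return block, bits
        }

        func (f *BlockBloom) Add(h uint64) {
            block, bits := f.positions(h)
            for _, b := range bits {
                f.blocks[block][b/64] |= 1 << (b % 64)
            }
        }

        func (f *BlockBloom) MayContain(h uint64) bool {
            block, bits := f.positions(h)
            for _, b := range bits {
                if f.blocks[block][b/64]&(1<<(b%64)) == 0 {
                    return false
                }
            }
            return true
        }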

  • Tuna-Fish 4 days ago

    Yes, 30ns means it's in cache. But Bloom filters are surprisingly fast for the number of memory accesses they do, because those accesses all happen in parallel and there is a lot of parallelism in modern memory subsystems, so you essentially only pay the cost of a single random read for the entire lookup. If using 1GB pages, you can still realistically talk about <100ns lookups.

    • bjornsing 3 days ago

      > so that you essentially only pay the cost of a single random read for the entire lookup

      Why would you ever pay more than that for a bloom filter lookup? I mean, I don’t see how that has anything to do with parallelism in memory subsystems. But I may be missing something.

      • Tuna-Fish 3 days ago

        A Bloom filter needs to do multiple loads from different memory locations for each single lookup (7, in the example 1.2 GB filter). But unlike, say, a tree, it knows all the addresses after computing the hashes, without having to wait for the results of previous loads. So it can start all of them in parallel.

aranw 4 days ago

Sam Rose has also written a great article on Bloom filters, with some nice animations to help illustrate them. Definitely worth checking out: https://samwho.dev/bloom-filters/

noelwelsh 4 days ago

Hash functions are kinda magic. Bloom filters are one of the fun tricks you can play with them (though Bloom filters themselves are quite old at this point; check the literature for various alternatives that are better in specific situations.) The field of data sketches and streaming algorithms has many others. k-Minimum values is a nice example of a set cardinality estimator that is very easy to understand.

  • f_devd 4 days ago

    > check the literature for various alternatives that are better in specific situations

    For direct replacements, see Xor(+)-Filters: https://arxiv.org/abs/1912.08258 and the subsequent evolution into Binary Fuse Filters: https://arxiv.org/abs/2201.01174

    • sparkie 4 days ago

      Also quite recent are Ribbon filters: https://arxiv.org/abs/2103.02515

      But these, and xor/binary-fuse-filters are only direct replacements for Bloom filters in some problems. There are still problems for which Bloom filters are a better solution.

      Bloom filters have the advantage that the filter produced by adding two elements is the same as the bitwise-or of a pair of filters made by adding each element separately.

          bloom([x, y]) == bloom([x]) | bloom([y])
      
      Also, the filter produced by bitwise-and, `bloom([x]) & bloom([y])`, cannot have any bits set in it which are not also set by `bloom([x, y])`. We can assert that

          bloom([x]) & bloom([x, y]) == bloom([x])
          bloom([y]) & bloom([x, y]) == bloom([y])
          (bloom([x]) & bloom([y])) | bloom([x, y]) == bloom([x, y])
      
      There are practical applications of this which are not satisfied by the "optimized" variants. The bitwise-or does set union (or least upper bound), and the bitwise-and does a set intersection (or greatest lower bound), though the filter produced by `bloom([x, y])` has a higher false positive rate than the filter produced by `bloom([x]) & bloom([y])`.

                 bloom([x]) | bloom([y])              == bloom([x, y])
                🡥                       🡤
          bloom([x])                   bloom([y])
                🡤                       🡥
                 bloom([x]) & bloom([y])
    • samuel 4 days ago

      Interesting, but not quite the same:

      Xor and binary fuse filters require access to the full set of keys at construction time. In this sense, they are immutable. Alternatives typically have a fixed memory usage and a maximal capacity, but they also allow more flexibility, such as progressive construction (adding keys one by one).

    • virexene 4 days ago

      I may have missed something when skimming the paper, but it sounds like Xor filters are constructed offline and can't be modified efficiently afterwards, whereas Bloom filters can be inserted into efficiently. So they don't seem to be an exact replacement

      • f_devd 4 days ago

        No, you are correct. I misremembered; I was thinking of cuckoo filters.

  • Snawoot 4 days ago

    Cuckoo filters are a more efficient alternative. The displacement algorithm is also very notable: how we can use random swaps to place data in cells optimally.

thrance 4 days ago

I've known about Bloom Filters for a while now, and I like the idea of them, but I've never had a need for them yet.

  • rasz 2 days ago

    Browser :visited CSS pseudo-class lookup is one good everyday candidate. No idea if that's how it's implemented in Chrome/FF though; last time I checked, Chrome stores every single link ever clicked in a Visited file, so it must build a fast lookup data structure on every browser start.

  • speed_spread 4 days ago

    Good for you. Their main utility is in adtech and surveillance.

    • okr 4 days ago

      Or, in general, all data intensive applications.

gitroom 4 days ago

Pretty cool seeing how stuff like this keeps getting new uses - I always wanna tinker with these things but never have the right problem yet, tbh.

klaussilveira 4 days ago

You can even use Bloom filters for rate limiting with a counting BF:

https://github.com/FastFilter/fastfilter_cpp/blob/f27873fac4...
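
A toy sketch of the counting idea applied to rate limiting (this is the plain counting-filter form, not the succinct blocked variant in that file; the double-hashing scheme, saturation, and per-window reset are illustrative choices):

    // CountingBloom keeps a small counter per position instead of one bit.
    // Allow() reads the minimum of a key's counters, which upper-bounds how
    // many times that key was seen this window, then increments them.
    // h1, h2 are two independent hashes of the key (double hashing).
    type CountingBloom struct {
        counters []uint8
        k        int
    }

    func NewCountingBloom(size, k int) *CountingBloom {
        return &CountingBloom{counters: make([]uint8, size), k: k}
    }

    func (c *CountingBloom) pos(h1, h2 uint64, i int) uint64 {
        return (h1 + uint64(i)*h2) % uint64(len(c.counters))
    }

    // Allow reports whether the key is still under `limit` for this window.
    // Collisions can only throttle a key too early, never let extra requests through.
    func (c *CountingBloom) Allow(h1, h2 uint64, limit uint8) bool {
        min := uint8(255)
        for i := 0; i < c.k; i++ {
            if v := c.counters[c.pos(h1, h2, i)]; v < min {
                min = v
            }
        }
        if min >= limit {
            return false
        }
        for i := 0; i < c.k; i++ {
            if p := c.pos(h1, h2, i); c.counters[p] < 255 {
                c.counters[p]++ // saturate rather than wrap
            }
        }
        return true
    }

    // Reset clears the window; a full counting filter would instead support
    // per-key decrement (which is what the counters buy over plain bits).
    func (c *CountingBloom) Reset() {
        for i := range c.counters {
            c.counters[i] = 0
        }
    }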

  • thomasmg 4 days ago

    Yes, this is the succinct counting blocked Bloom filter I wrote. I wanted to write a paper about it. It requires twice the space of a blocked Bloom filter (which is about half the space of a regular counting Bloom filter). Reads are very fast, and updates are okish.

    • klaussilveira 4 days ago

      Well, thank you for that! That library has taught me a lot. :)

fooker 4 days ago

Here's my use case for it: I want really small filters (say 64 bits or 128 bits), and use these to represent 50-100 element subsets of a ~1000-2000 element set.

I know that with a traditional Bloom filter this would give me a lot of false positives. Is there an alternative that would fare better?

  • thomasmg 4 days ago

    Well, what problem do you want to solve? What you describe is not a use case but a possible solution... this smells like an xy problem...

    • fooker 4 days ago

      The problem is:

      Represent (possibly overlapping, hence trees are tricky) subsets of size 50-100 out of a set of size 1000-2000.

      Some amount of false positives is acceptable but not like 50%.

      • thomasmg 3 days ago

        So, for this it sounds like you only need one Bloom filter (not multiple), and each subset is an _entry_ in the filter. The total set size doesn't matter; what matters for the size of the Bloom filter is the total number of entries you put into it and the false-positive rate. You can then do a membership test (with a configurable false-positive rate; 1% is typical) to find out if an entry is in the set. BTW, you cannot use Bloom filters to store and retrieve entries: for that you need something else (an array, a hash map, or a retrieval data structure).
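
        For sizing, the usual rule-of-thumb formulas, with n the number of entries and p the target false-positive rate (a sketch using Go's math package):

            import "math"

            // BloomParams gives the usual estimates for m (bits) and k (hashes):
            //   m ≈ -n·ln(p) / (ln 2)²   (≈ 9.6 bits per entry at p = 1%)
            //   k ≈ (m/n)·ln 2
            func BloomParams(n int, p float64) (bits, hashes int) {
                m := -float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)
                return int(math.Ceil(m)), int(math.Round(m / float64(n) * math.Ln2))
            }

            // BloomParams(10000, 0.01) ≈ 95851 bits (about 12 KB) and 7 hashes.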

bobosha 4 days ago

what are some real-world usecases that people use it for?

  • mfrw 4 days ago

    I like to think of it as a magical checklist: it helps us quickly check if something _might_ be there, without actually looking through everything.

    A few non-exhaustive real world use-cases that come to mind:

    - Databases: To quickly check if a record might exist before doing a costly disk lookup.

    - Spell Checkers: To check if a word might be in the dictionary.

    - Spam Filters: To check if an email sender is on a list of known spammers.

    - Browser Security: Chrome uses Bloom filters to check if a site might be malicious.

    - Password Checker: To check if a password is known to be leaked.

    - Web Caches: To check if a URL or resource is definitely not in the cache.

    - Distributed Systems: To avoid sending data that another system definitely doesn’t need.

    • hinkley 3 days ago

      It’s big for caches. As your application grows in complexity you get dependent lookups. I need the user id to get the company id. I need the company id to figure out what features the user has access to. And then I need to run a bunch of queries to pull data for those features.

      And while using the cache may cut an order of magnitude off the overall time (you’ve gone from hundreds of milliseconds to tens), with a Bloom filter you can figure out that you have a cache miss faster and start fetching the data sooner. The user may not notice the small improvement in response time, but by Little’s Law your cluster size can be smaller for the same traffic.

      Web browsers use Bloom filters to determine which CSS rules apply to which elements. IIRC Chrome removed a perf screen for CSS rules because most people were getting results below the noise floor for the timing function. The time to load the CSS was still relevant (maybe more so due to the higher setup cost of the filters).

  • bilbo-b-baggins 2 days ago

    There are loads of high-throughput data cases where asking “does this already exist?” is costly, and Bloom filters answer it with minimal resources. If the cost of a “no” is 100x lower than a “yes”, then a 1% Bloom filter breaks even. If it’s 10000x, then a Bloom filter is 100x better.

    One of the most interesting applications is rolling, time-indexed filters. If you have T filters representing the current time interval, you can retain D filters to represent a value having been seen within a duration. Each new time segment I you add one filter representing I/T time and test against all the filters in D. A consecutive positive result across T filters gives a range of existence of I-T for set membership: both a positive result and the time range within which it was found.
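
    One way to realise that rolling window, sketched in Go (the minimal filter type, the double hashing, and the slice counts are illustrative): keep a ring of per-interval filters, add to the newest, rotate out the oldest each interval, and query all of them to get both a membership answer and the slice(s) it was seen in.

        // bitset is a minimal Bloom filter standing in for whatever filter you use.
        type bitset struct{ bits []uint64 }

        func newBitset(m int) *bitset { return &bitset{bits: make([]uint64, (m+63)/64)} }

        func (b *bitset) add(h1, h2 uint64, k int) {
            for i := 0; i < k; i++ {
                p := (h1 + uint64(i)*h2) % uint64(len(b.bits)*64)
                b.bits[p/64] |= 1 << (p % 64)
            }
        }

        func (b *bitset) has(h1, h2 uint64, k int) bool {
            for i := 0; i < k; i++ {
                p := (h1 + uint64(i)*h2) % uint64(len(b.bits)*64)
                if b.bits[p/64]&(1<<(p%64)) == 0 {
                    return false
                }
            }
            return true
        }

        // Rolling keeps D slice filters, each covering one interval, so together
        // they cover a sliding window of D intervals.
        type Rolling struct {
            slices []*bitset
            bits   int // bits per slice
            k      int
        }

        func NewRolling(d, bitsPerSlice, k int) *Rolling {
            r := &Rolling{slices: make([]*bitset, d), bits: bitsPerSlice, k: k}
            for i := range r.slices {
                r.slices[i] = newBitset(bitsPerSlice)
            }
            return r
        }

        // Rotate is called once per interval: drop the oldest slice, start a fresh one.
        func (r *Rolling) Rotate() {
            copy(r.slices, r.slices[1:])
            r.slices[len(r.slices)-1] = newBitset(r.bits)
        }

        func (r *Rolling) Add(h1, h2 uint64) { r.slices[len(r.slices)-1].add(h1, h2, r.k) }

        // SeenIn reports, per slice (oldest first), whether the key was (probably)
        // seen there: a membership answer plus the time range it falls in.
        func (r *Rolling) SeenIn(h1, h2 uint64) []bool {
            out := make([]bool, len(r.slices))
            for i, s := range r.slices {
                out[i] = s.has(h1, h2, r.k)
            }
            return out
        }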

  • withinboredom 4 days ago

    If you have a whole cluster of machines that have data on them and you need to ask "does this machine probably have this data or not?", a Bloom filter will give you an answer. It can save a ton of time, since a Bloom filter's answer is either "probably yes" or "definitely no."

  • andrewstuart 4 days ago

    Bloom filters can tell you extremely fast if something is not in a dataset.

    They can give false positives, incorrectly indicating that an element might be in the set when it's not, but never false negatives.

    Knowing (fast) if something is not in a dataset can be very useful.

  • la_fayette 4 days ago

    Initially, Bitcoin light wallets were built with Bloom filters: a full node would only propagate, to a light client, the transactions that match a Bloom filter given by that light client. The Bloom filter turned out not to be privacy-preserving, which was one reason this was abandoned by some wallets. However, bitcoinj and wallets built on top of it might still use this...

  • leeoniya 4 days ago

    permission checks (allow/denylists)

    • sparkie 4 days ago

      Using them for whitelists is probably not a great idea because they can give false positives. An attacker could potentially flood the filter with fake accounts and increase the rate of false positives, increasing the chance they're granted access.

      For blacklists, potentially more suitable, but since it can also give false positives, it could deny permission to people who should have it. An attacker might also attack this - by flooding the filter with accounts that deliberately get blacklisted, they could lock out people who should have access.

      Obviously this is very use-case specific - it's probably not the best approach to doing permissions if security is paramount.

      • withinboredom 4 days ago

        No, but they can tell you a user is definitely not in an allowlist or blocklist. That is useful, especially if it can save a database lookup on every check.

        • sparkie 4 days ago

          That may work, but there are potential issues with that regarding timing attacks. If an attacker could make many attempts to access a resource, they may be able to figure out who (probably) has access with a brute-force timing test, and narrow down an attack target.

          • withinboredom 4 days ago

            I'm not sure I understand. Generally, an allow-list/block-list is for authorized resources? By the time you are doing this check, the user is already authenticated and this is part of authorization. So, the user shouldn't be able to authenticate as arbitrary users to do a timing attack. If they can, you have bigger problems.

andrewstuart 4 days ago

Bloom filters - when I eventually learned about them - were nothing at all like what I had assumed from the name. And very interesting too.

  • eimrine 4 days ago

    "Bloom's filter" is a more correct name, one which doesn't invite any assumptions.

    "Hashmap" is maybe a less correct name, but it leads to more correct assumptions.

londons_explore 4 days ago

A whole load of data structures, like bloom filters, are 'efficiency tricks'.

For example, "check if we might have Mr Smith's data cached by looking in this bloom filter". As Long as the answer is usually right, that's good enough.

I do wonder if in the future we'll use a mini online-trained AI to achieve the same thing. "Ask the AI if we have Mr Smith's data cached".

Rather like you might ask an office clerk: "Is Mr Smith that troublesome customer you were dealing with last week? Do you still have his file on your desk?"

  • nkrisc 4 days ago

    Why would that be better than a bloom filter (or similar)?

    • chupy 4 days ago

      Because it has AI in its name /s

  • ozgrakkurt 4 days ago

    There are actually "learned" Bloom filters, if anyone is interested in the machine learning / Bloom filter connection. But it is not related to chatbots.

  • esafak 4 days ago

    The way that would work is that the LLM would translate that sentence into a tool call, which would query a data store that does the heavy lifting. Also, there is an ML task called "Learning to hash", which is about optimizing the hash for the task: https://learning2hash.github.io/

  • hinkley 3 days ago

    I would imagine a heuristic to keep enough of the filter in L2 cache to avoid pipeline stalls might be useful. Sort of a double bloom, but weighted for common lookups.