5
gvnix
5y

I finally found the time to write a small article on Medium about Bloom Filters. At least this lockdown has allowed me to work on pending stuff.

Read it here - https://medium.com/@gvnix/...

Comments
  • 0
    @JosiasAurel thanks man!
  • 1
    I see a problem with your database index logic - you almost never have to seek data on disk when using an index, and even when you do, it gets loaded the first time and then stays in memory - an index with 10^6 keys takes less than 100 MB, after all. DDR4 has latency under 20 ns. In that case, your time complexity is N log N * 20 ns; for 10^6 that's slightly less than 400 ms (see the quick check below), not anywhere close to 38 hours. 7 minutes to insert 10^6 rows sounds perfectly reasonable.

    Also, you should be looking at log base 2, not the natural logarithm, so it should be 19.931 instead of 13.815.

    Now, assuming your function is properly random and has 100 collisions, those collisions will take an additional 40 ms (+ 10 ms to load the index for the first time), which again is probably less than it'll take to transfer 10^6 rows from your app to the database.
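
    A quick sanity check of the arithmetic above (assuming 20 ns per memory access; numbers only, not a benchmark):

    ```python
    import math

    N = 10**6          # rows to insert
    LATENCY_NS = 20    # assumed DDR4 access latency

    # Each insert does a log2(N)-deep lookup in an in-memory index.
    total_ns = N * math.log2(N) * LATENCY_NS
    print(f"log2(10^6) = {math.log2(N):.3f}")   # 19.931, not ln(10^6) = 13.815
    print(f"total = {total_ns / 1e6:.0f} ms")   # ~399 ms, i.e. just under 400 ms
    ```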
  • 0
    @hitko Looks like I used the natural log by mistake - thanks for that.

    However, the Bloom filter is not built during the API call; it's pre-calculated.

    We completely avoid a DB call if we use it.
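
    Roughly what I mean, as a minimal sketch (the hash count, sizes, and the `db_has_id` helper are made up for illustration):

    ```python
    from hashlib import sha256

    class BloomFilter:
        """Minimal Bloom filter: k hashes over an m-bit array."""
        def __init__(self, m_bits: int, k_hashes: int):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray(m_bits // 8 + 1)

        def _positions(self, item: str):
            for i in range(self.k):
                yield int(sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def db_has_id(new_id: str) -> bool:
        return False   # stand-in for a real SELECT; hypothetical

    # Pre-built from existing IDs at startup, not during the API call.
    seen = BloomFilter(m_bits=10_000_000, k_hashes=7)
    seen.add("existing-id-123")

    def is_id_free(new_id: str) -> bool:
        if not seen.might_contain(new_id):
            return True                  # definite "no" from the filter: no DB call
        return not db_has_id(new_id)     # DB lookup only on a filter hit
    ```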
  • 0
    Also, AFAIK an index can be cached and then evicted under memory pressure, so you will get inconsistent performance.

    Am I missing something here?
  • 0
    @gvnix You're missing the "real world" part, i.e. the fact that a database which often has to load indexes from disk will be so slow you won't even care about ID generation, because your inserts will be even slower.
  • 1
    Basically, a real-world database should always have enough free RAM to accommodate spikes in requests, at least for the time it takes to scale the infrastructure or take action against an attacker, so a relatively small index will always stay in memory.
  • 0
    @hitko I understand your point. That hour figure I calculated was based on the worst-case scenario where my random function is shit. :D

    Also, random functions tend to fail more often than the 0.01 percent I mentioned there.

    If the million records turn into 100 million or maybe more, then indexing won't be a quick solution.

    For example, checking whether a username already exists on Reddit or not - that's basically a real-time response.

    What I aimed to show there was that if I have enough records in my DB that my random function starts to mess up, my API will slow down significantly.

    So a solution that takes less space than an index and is faster than an index would be better (rough numbers below). This will keep the DB free to do more important stuff.

    Also, I have updated the log value - idk how I missed that.
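
    Rough numbers behind "less space than an index", using the standard sizing formula m = -n * ln(p) / (ln 2)^2, with the 0.01% false-positive rate from above and 8-byte keys assumed for the index:

    ```python
    import math

    def bloom_bits(n: int, p: float) -> float:
        """Bits needed to hold n items at false-positive rate p."""
        return -n * math.log(p) / (math.log(2) ** 2)

    for n in (10**6, 10**8):
        filter_mb = bloom_bits(n, p=0.0001) / 8 / 1e6    # 0.01% false positives
        index_mb = n * 8 / 1e6                           # 8-byte keys alone, no overhead
        print(f"n={n:.0e}: filter ~{filter_mb:.1f} MB vs index keys ~{index_mb:.0f} MB")
    ```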
  • 0
    Basically, a Bloom filter is more specialised and targeted at specific scenarios.

    Indexing is a more general approach.
  • 0
    @gvnix Let's say you have an entry for every human on earth, and 8 characters in an ID - even though that many IDs alone would take about 60 GB, without all the other columns. In that case the probability of collisions is 0.28%, so let's assume these are user-submitted values with variable length (e.g. usernames). Assuming usernames can contain about 60 different characters (English letters, numbers, and a bunch of common symbols), after the 5th letter there's a 10% chance you don't have another level of depth, and after the 6th letter it's pretty much certain you don't. So at worst you're reading 6 or 7 blocks from memory, which is about 140 ns every time you need to check if a username exists. 7 blocks roughly equals 299000 bits, so that's the maximum size if you want your filter to be as fast as an in-memory B+ prefix tree index. Unfortunately, that gives you a 100% false positive probability (checked below).
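
    That last claim is easy to verify with the usual approximation p ≈ (1 - e^(-kn/m))^k, using the numbers above:

    ```python
    import math

    n = 8 * 10**9      # one entry per human on earth
    m = 299_000        # the bit budget quoted above
    k = max(1, round(m / n * math.log(2)))   # optimal hash count, floors to 1 here

    p = (1 - math.exp(-k * n / m)) ** k
    print(f"false positive rate ≈ {p:.6f}")  # 1.000000, i.e. effectively 100%
    ```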
  • 0
    And most importantly, think about what a Bloom filter even tells you: either there isn't a KNOWN collision, or there PROBABLY is a collision. If there probably is a collision, you should by definition check the DB to verify it, and if the Bloom filter tells you there isn't a collision, you may still have a collision in case another client inserted a row with the same ID. Either way you have to make a request against your DB, which will in turn check an index to verify there aren't any duplicates.

    Bottom line: good job explaining the Bloom filter, but your use cases and supporting calculations are way off, and this is definitely not something a Bloom filter should ever be used for.
  • 0
    @hitko Oh no, the use cases are correct - let me explain.

    If there is a collision, then sure, check the DB. But that probability is defined by you and can be made low enough that you avoid 99.99 percent of calls to the DB.

    Now, about the distributed-client issue: this filter is not meant to live inside a client instance's heap; it would sit in a central location like a Redis cluster, and you update it whenever you update the DB (sketched below). So you definitely avoid the extra calls to the database.

    For more info about the use case, check the example section of -

    https://en.m.wikipedia.org/wiki/...
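
    Something like this, as a rough sketch with redis-py (the key name "ids:bloom" and the sizes are made up for illustration):

    ```python
    import hashlib
    import redis

    r = redis.Redis()      # shared by every app instance
    M = 8 * 10**7          # filter size in bits (illustrative)
    K = 7                  # number of hash functions

    def _positions(item: str):
        for i in range(K):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % M

    def bloom_add(item: str):
        # Called on the same code path that inserts the row into the DB.
        for pos in _positions(item):
            r.setbit("ids:bloom", pos, 1)

    def bloom_might_contain(item: str) -> bool:
        # One central filter, so every instance sees the same answer.
        return all(r.getbit("ids:bloom", pos) for pos in _positions(item))
    ```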
  • 0
    @hitko Also, I will update the calculations to make them more accurate.