What is Bloom Filter?
Bloom filters are for set membership which determines whether an element is present in a set or not. Bloom filter is a probabilistic data structure that works on hash-coding methods (similar to HashTable).
It is a memory-efficient, probabilistic data structure that we can use to answer the question of whether or not a given element is in a set.
When do we need a Bloom Filter?
- Suppose we have a list of some elements and we want to check whether a given element is present or not?
- Consider you are working on an email service and you are trying to implement a sign-up endpoint with a feature that a given username is already present or not?
- Suppose you have given a set of blacklisted IPs and you want to filter out whether a given IP is a blacklisted one or not?
How does Bloom Filter help?
HashSet or HashTable works well when we have limited data set, but might not fit as we move with a large data set. With a large data set, it takes a lot of time with a lot of memory. Bloom filter help in resolving this problem.
How? Let's take an example.
Assume we have a bit array of size N (let's say 24). We will initialize each bit with binary zero. Now let's assume we have two hash functions: hashFunction1 and hashFunction2 and we are storing employee names.
To store employee name 'Sebastian Lassak', we called both the hash functions.
hashFunction1("Sebastian Lassak");
hashFunction12("Sebastian Lassak");
hashFunction1 returns 2 and hashFunction2 returns 7. Now, go to indexes 2 and 7 of our bit array and mark the bit as binary 1.
Now we will put 'Filip Jakub', hashFunction1 returns 1 and hashFunction2 returns 7. Here index 7 is already marked by the previous entry so just mark index 1 as binary 1.
Now, let's check whether a person is present in the data set or not.
To check if 'Sebastian Lassak' is present or not, we will again call both the Hash Functions. hashFunction1 returns 2 and hashFunction2 returns 7. Now, go to the index 2 and 7, and check the bit, if both are marked with binary 1 then 'Sebastian Lassak' is present in the set, otherwise, it is not with the data set.
Let's search 'Lebastian Sassak', assume hashFunction1 again returns 2 and hashFunction2 returns 7. We go to the index 2 and 7, both are marked with binary 1, yet 'Lebastian Sassak' is not present in the set. Because bloom filters are probabilistic.
In the above case, indexes 2 and 7 were set to 1 by other inputs and not by 'Lebastian Sassak'. This is called collision and that’s why it is probabilistic, where the chances of happening are not 100%.
What can we expect from a Bloom filter?
When a Bloom filter says an element is not present it is 100% sure not present. But when the Bloom filter says the given element is present it is not 100% sure, because bloom filters are probabilistic.
There may be a chance due to collision all the bit of index given by hash functions has been set to 1 by other inputs.
Can we get 100% accurate results from a Bloom filter?
Yes, this could be achieved 'only' by taking more hash functions. With more numbers of the hash function, we can get a more accurate result because the chances of a collision are less.
How can we implement the Bloom filter in Java?
We can implement the Bloom filter using the Java library provided by Guava.
We need to pass the number of elements that we expect to be inserted into the filter and the desired false positive probability:
Reduce the false-positive probability
We need to add another parameter in Bloom filter object. e.g:
BloomFilter empBlooms = BloomFilter.create(Funnels.stringFunnel(Charset.forName("UTF-8")), 10000, 0.005);
Now false-positive probability has been reduced from 0.03 to 0.005. But tweaking this parameter has an effect on the side of the bloom filter.
Over-Saturated Bloom Filter
When we design our Bloom filter, it is important that we provide a reasonably accurate value for the expected number of elements. Otherwise, our filter will return false positives at a much higher rate than desired.
Give some real-life use of Bloom Filter.
1). Due to this space efficiency, the Bloom filter will easily fit in memory even for huge numbers of elements. Some databases, including Cassandra and Oracle, use this filter as the first check before going to disk or cache, for example, when a request for a specific ID comes in.
Apache Cassandra uses SSTable data structure on disk to save rows. There might be thousands of SSTable files on disk. In such cases, even tens of thousands of reading requests per unit time will cause very expensive disk IO operations to search for concerning row(s) in all the SSTables one by one in some order when data is not found in the in-memory tables of Cassandra.
Bloom filters are used to approximately identify if some row/column with the given data or id exists in an SSTable. If the result is ‘May Be Present’, Cassandra searches in the corresponding SSTable, in case the row is not found, Cassandra continues the search operation in other SSTables. So using bloom filter, Cassandra saves a lot of unnecessary SSTable scans hence saving huge disk IO operations costs.
2). Akamai and Facebook use bloom filters to avoid caching the items that are very rarely searched or searched only once (one-hit wonders). Only when they are searched more than once, they will get cached.
3). It can be used in the content recommendation system. Imagine some site recommending you some articles, news, videos, etc which you might not have seen earlier. So there are probably thousands of stuff that can be recommended and you might have seen tens or hundreds of recommendations already. In order to skip the recommendations that are already served to you, bloom filters are used.
-K Himaanshu Shuklaa..
No comments:
Post a Comment