What is Bloom Filter?
Introduction
If you’re here, there is a high chance you are into scaling and optimizing queries, and you may have come across an approach to optimize your queries for your large scale application, this approach is based on Bloom Filter. So let’s see together what Bloom Filters are, how they work, and their applications.
What is Bloom Filter?
Bloom Filter is a probabilsitic Data structutre conceived by Burton Howard in 1972. It provides an efficient way to tell that an entry is certainly not in a set, or maybe is in the set, i.e there can be false positives but never false negatives. So, if you have a database with millions or let’s say even billions of records, and you want to check whether a given username exists, then you can save querying your databases directly as the query can be expensive, and use the Bloom filter to check whether the given username exists, if the username does not exist in the Bloom Filter, it means that certainly the username is not there, and hence you will avoid querying the database and just return a response that the username does not exist. If the response from the Bloom filter is username maybe exists, in this case, you’ll query your database to confirm whether it exists or not. Let’s see how Bloom Filter works at implementation level 😉 🌸 🌺
How Bloom Filters work?
- Bloom Filter uses an array of bits initialized to 0, its size is predetermined m.
- It also uses a set of k hash functions
- n is the number of elements expected to be stored in the Bloom Filter
- To add an element to the array, it is passed to each of the k hash functions
- Each hash function produces an index for the bit array
- The bits at these indices are set to 1
Example
Let’s say we have an array of size m = 10, and 3 hash functions h1, h2, and h3, let’s try to add an element, and let’s suppose the first hash function will will place the element at index 1, the second index 5, and the third index 7:
h1(element) % 10 = 1
h2(element) % 10 = 5
h3(element) % 10 = 7
In order to check whether an element exists in the Bloom filter, we use the same hash functions. If All elements returned by the hash functions are 1, this means that the elements maybe in the Bloom filter, as the bits were altered, it means an element was added. If any of the bits is 0, it means the element is certainly not in the Bloom filter (bits remain the same, no insertion was done).
The example explains pretty well how Bloom Filter works. If we analyze the Bloom Filters, we can get some limitations and drawbacks, let’s highlight them in the following, and show how to deal with them:
- Since Bloom Filter uses hash functions, there is a risk of collisions, and in order to reduce collision rate, Bloom Filter uses multiple hash functions (multiple hash functions generates different index positions, and hence less chance that elements collide in all same indices, it also allows to share the space efficiently instead of using a large array and having lots of collisions). The hash functions need to be uniformly distributed, meaning it distributes the values uniformly across the Bloom Filter. The hash functions also need to be independent of each other, meaning the output of one hash function should not influence the output of the other hash function. The optimal number of hash functions will be shown how to calculate it below.
- When an element is added, it can share the same index as other elements, let’s say in our example above, we add another element e, for which h1(e) % 10 = 5, h2(e) % 10 = 1, and h3(e) % 10 = 6. We notice that index 1 holds value of both elements. If we delete one of the elements, we will clear their bits, they will become zero, and hence we delete other elements too that share the same index. Hence the limitation of not being able to delete elements from the Bloom Filter. An advanced version, named Counting Bloom Filters allow to delete elements, and Cuckoo filters is more advanced, more space efficient, and allows to remove elements.
- Since the Bloom Filter size needs to be known beforehand, it may not be flexible enough, although it can be scaled, and it may not be the best choice in this case since it involves looking in several Bloom Filters.
- In order to reduce collision rate and false positive rate, the size of the bit array, and the fill ratio, are important. If the Bloom filter is filled more than it should, collisions rate increases, and hence the rate of false positives increases as well (many bits are set to 1, and even if a given element does not exist, it will be give maybe it exists, and you will need to hit the database).
- Bloom Filters use hash functions, which can be time-consuming in some cases (cryptographic hash functions, and depending on CPU).
Here is the formula used for optimal number of hash function that reduces false positives:
k = (m/n) * ln(2)
Where k is the number of hash functions, m is the size of the bloom filter, and n is the number of elements to be added to the Bloom Filter.
If too few hash functions are used, there will be more collisions. If too many hash functions are used, you’ll risk to check the same bits repeatedly, that will result in slowing down performance.
The probablity of false positive is calculated as:
P = (1− [1−1/m] ^ kn) ^k
and the size of the bit array m is calculated as following:
m = −(n * ln (P)) / (ln2) ^2
Why Bloom Filters and not other data structures?
Other data strcutures can still be used, such as binary trees and hash tables. Hash table is better for large data, however, it uses a lot of space as the data grows. The thing about Bloom Filter is that it occupies little space per element (counted in bits instead of bytes), it is therefore, space-efficient.
Applications of Bloom Filters
Bloom Filters are used in a variety of applications:
- Bloom Filters are used in Redis
- In Google Chrome, used to check whether a given url is dangerous, and you’ll receive a warning if that’s the case.
- In a number of databases databases such as PostgreSQL and Cassandra to reduce disk lookups for non-existing rows.
- In Medium in filtering recommended posts to users, by showing them posts that have not been recommended to them yet.
That’s it for this article, happy blooming! 🌸🌺 😊😉