Dwarves
Memo
Type ESC to close search bar

Bloom Filter

What is Bloom Filter?

A Bloom Filter is a probabilistic data structure used for testing whether an element is a member of a set or not. It’s space-efficient compared to other data structures like hash tables, but it may give false positives (indicating that an element is in the set when it’s not) and never gives false negatives (indicating that an element is not in the set when it actually is not).

How Bloom Filter work?

Initialization: You start with a bit array of size m (usually a large prime number) initially set to all zeros, and k different hash functions.

Adding Elements: When you want to add an element to the Bloom Filter, you run it through each of the k hash functions, each of which maps the element to one of the m bits in the array. You then set those bits to 1.

Checking Membership: To check if an element is in the Bloom Filter, you again run it through each of the k hash functions. If all k bits are set to 1 in the bit array, the element is probably in the set. If any of the bits are 0, then the element is definitely not in the set. However, due to potential hash collisions, false positives are possible.

When to use?

Imagine that you have a system which millions user, and then you have a user table which had millions of records. So everytime a new user is registered, you have to query millions records table to check username is existed or not which cost Time complexity = O(n) (n will be vary upto millions) in case not aggregate further. Instead of that, you can Checking membership in Bloom Filter which just have Time complexity = O(k) (k is constant number of hash function) and then response to user that username is alreay existed. The result has a small probability of false positive(username not existed but response existed) due to potential hash collion but it acceptable by notify user can register with other username.

Besides above scenario, Bloom Filter can apply in other case like:

Overall, Bloom Filters are suitable for applications where memory usage needs to be minimized, and a small probability of false positives is acceptable. However, they are not suitable for scenarios where false positives are unacceptable or where exact membership information is required.

Pros and Cons?

Pros:

Cons:

Simple Implementation:

Nowadays, many cache system integrated bloom filter as a feature. Can use it without implementation, but if you want try to hand-on. the simple implementation is a example:

package main

import (
	"fmt"
	"hash/crc32"
	"hash/fnv"
)

type BloomFilter struct {
	bitArray []bool
	size     uint
	hashFunc []func(data string) uint
}

func NewBloomFilter(size uint) *BloomFilter {
	bf := &BloomFilter{
		bitArray: make([]bool, size),
		size:     size,
		hashFunc: []func(data string) uint{
			func(data string) uint { return uint(fnv.New32().Sum32()) },
			func(data string) uint { return uint(crc32.ChecksumIEEE([]byte(data))) },
			// Add more hash functions here if needed
		},
	}
	return bf
}

func (bf *BloomFilter) Add(data string, numHash int) error {
	if numHash > len(bf.hashFunc) {
		return fmt.Errorf("number of hash functions exceeds the limit")
	}
	for i := 0; i < numHash; i++ {
		index := bf.hashFunc[i](data) % bf.size
		bf.bitArray[index] = true
	}
	return nil
}

func (bf *BloomFilter) Contains(data string) bool {
	for _, hashFunc := range bf.hashFunc {
		index := hashFunc(data) % bf.size
		if !bf.bitArray[index] {
			return false
		}
	}
	return true
}

func main() {
	bloomFilter := NewBloomFilter(100)

	// Add user identities to the Bloom Filter
	userIdentities := []string{"user123", "user456", "user789"}
	for _, identity := range userIdentities {
		bloomFilter.Add(identity, 2)
	}

	// Check for existence of user identities
	fmt.Println(bloomFilter.Contains("user123")) // true
	fmt.Println(bloomFilter.Contains("user789")) // true
	fmt.Println(bloomFilter.Contains("user999")) // false
}