Filters
The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure used to test whether an element is a member of a set. False positives are possible, but false negatives are not.
Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
Example
One might use a Bloom filter to do spell-checking in a space-efficient way. A Bloom filter to which a dictionary of correct words has been added will accept all words in the dictionary and reject almost all words that are not in it, which is good enough in some cases. Depending on the false positive rate, the resulting data structure can require as little as a byte per dictionary word.
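To put a rough number on the "byte per word" figure, the sketch below uses the standard sizing estimate for a Bloom filter with an optimally chosen number of hash functions: bits per element is approximately -ln(p) / (ln 2)^2 for a target false positive rate p. That formula is not stated in the text above and is brought in here as an assumption; the function names are illustrative only.

```python
import math

def bits_per_element(p):
    """Approximate bits needed per stored element for a target false
    positive rate p, assuming an optimal number of hash functions."""
    return -math.log(p) / (math.log(2) ** 2)

def false_positive_rate(bits):
    """Inverse of bits_per_element: approximate false positive rate
    reached when spending `bits` bits per stored element."""
    return math.exp(-bits * math.log(2) ** 2)

if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(f"p = {p}: about {bits_per_element(p):.1f} bits per word")
    print(f"8 bits per word: p is about {false_positive_rate(8):.3f}")
```

Under this estimate, one byte per word corresponds to a false positive rate of roughly 2%, and a 1% rate costs a little under 10 bits per word.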
One peculiar attribute of this spell-checker is that the list of correct words cannot be extracted from it; at best, one can extract a list containing the correct words plus a significant number of false positives. This limitation can be considered a feature when you want to check for a set of items without disclosing those items: for example, in a security application that scans your disk for Social Security numbers, or in a program that scrubs opted-out email addresses from the lists of mass mailers, where you do not want to reveal any of the opted-out addresses to the companies using your list. This is not a completely secure solution, however, as it may be possible to separate the false positives from the real data by some other means.
Google Bigtable uses Bloom filters to reduce disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
Algorithm description
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps a key value to one of the m array positions.
To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is not in the set: if it were, all of those bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits were set to 1 during the insertion of other elements, which is a false positive.
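The add and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name, the use of SHA-256, and the trick of prefixing the key with a counter to obtain k hash functions are all assumptions made for the example, and the bit array is stored one byte per bit for readability rather than compactness.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # all zeros; one byte per bit for clarity

    def _positions(self, item):
        # Assumption: derive k hash functions by prefixing the key with
        # a different initial value (0 .. k-1) before hashing.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bit at each of the k positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # A single 0 bit proves the item was never added;
        # all 1s means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

if __name__ == "__main__":
    words = BloomFilter(m=1024, k=3)
    for w in ("apple", "banana", "cherry"):
        words.add(w)
    print("apple" in words)  # True: an added item is never rejected
    print("zebra" in words)  # very likely False, but a false positive is possible
```

Note that there is no remove method: clearing a bit could also erase evidence of other elements that share it, which is why plain Bloom filters do not support deletion.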
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k-1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
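Two of the ideas in this paragraph, slicing one wide hash into bit fields and deriving k indices by simple arithmetic on two base hashes, can be combined in a short sketch. The code below shows plain double hashing, not the enhanced double hashing or triple hashing variants credited to Dillinger & Manolios; the function name, the choice of SHA-256, and treating two slices of one digest as independent hashes are assumptions made for illustration.

```python
import hashlib

def double_hash_positions(key, k, m):
    """Return k array positions in range(m) for key, using plain
    double hashing: position_i = (h1 + i * h2) mod m."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Slice one wide hash output into two 64-bit fields.
    h1 = int.from_bytes(digest[:8], "big")
    # Force the step to be odd, so it is never zero and is coprime
    # with m whenever m is a power of two.
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]

if __name__ == "__main__":
    print(double_hash_positions("apple", k=5, m=1024))
```

Only one hash computation is needed per key regardless of k, which is the practical appeal of this approach when k is large.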
Read more at Wikipedia.org