Sunday, April 7, 2013

Using Bloom Filters to Lower Cost of Large Join Jobs



Data management company <a href=”http://liveramp.com/”> LiveRamp</a> recently began opensourcing some of their internal data analysis and management tools. In this process they added a new tool for reducing the cost of MapReduce join jobs, BloomJoin. BloomJoin is useful when you are trying to join two groups where one is a very large dataset and the other is significantly smaller with a significantly smaller proportion of the data from the larger set. To complete this job normally, a user would first sort both sets of data, and then reduce both sets. This works fairly well but is inefficient with regards to sorting the larger dataset. To alleviate this BloomJoin first applies a bloom filter based on the target dataset to the larger dataset. A bloom filter is a probabilistic representation of datasets. By giving the filter a target set of objects it rejects objects that are not found within the target. Bloom filters hash the datasets keys and note the position of targets within an array. A bloom filer never creates false negatives, but can create false positives. This is why the bloom filter does not completely eliminate the need to sort and join the datasets. By applying a bloom filter to the data before sorting, the amount of data that must be sorted can be lowered dramatically. On LiveRamp’s test dataset running the BloomJoin job versus a standard CoGroup, and found that BloomJoin spent only 49.3% of the CPU time as the CoGroup job.
This is significant because, it can allow larger join tasks to be run with reduced costs in both time and dollars, and it is available to the open source community to improve upon further.

Sources:


No comments:

Post a Comment