Data management company <a href=”http://liveramp.com/”> LiveRamp</a>
recently began opensourcing some of their internal data analysis and management
tools. In this process they added a new tool for reducing the cost of MapReduce
join jobs, BloomJoin. BloomJoin is useful when you are trying to join two
groups where one is a very large dataset and the other is significantly smaller
with a significantly smaller proportion of the data from the larger set. To
complete this job normally, a user would first sort both sets of data, and then
reduce both sets. This works fairly well but is inefficient with regards to
sorting the larger dataset. To alleviate this BloomJoin first applies a bloom
filter based on the target dataset to the larger dataset. A bloom filter is a
probabilistic representation of datasets. By giving the filter a target set of
objects it rejects objects that are not found within the target. Bloom filters
hash the datasets keys and note the position of targets within an array. A
bloom filer never creates false negatives, but can create false positives. This
is why the bloom filter does not completely eliminate the need to sort and join
the datasets. By applying a bloom filter to the data before sorting, the amount
of data that must be sorted can be lowered dramatically. On LiveRamp’s test
dataset running the BloomJoin job versus a standard CoGroup, and found that
BloomJoin spent only 49.3% of the CPU time as the CoGroup job. 
This is significant because, it can allow larger join tasks
to be run with reduced costs in both time and dollars, and it is available to
the open source community to improve upon further.
Sources:
 
No comments:
Post a Comment