Wednesday, February 6, 2013

What is MapReduce?


MapReduce is a programming model for distributing work across a cluster of machines.

The Map:
A map function is provided to transform an input key/value pair into a list of output key/value pairs:
map(key1, value1) -> list<key2, value2>
That is, for each input pair it returns a list containing zero or more output pairs.
·         The output key can differ from the input key
·         The output list can contain multiple entries with the same key
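As a concrete illustration, here is what a mapper for a word-count job might look like in Python. This is only a sketch; the function name and the (document_id, text) input format are illustrative, not from any particular framework:

```python
def map_words(key1, value1):
    """Map a (document_id, text) pair to a list of (word, 1) pairs."""
    # key1 (the document id) is ignored; the output keys are the words,
    # so the output key differs from the input key.
    return [(word.lower(), 1) for word in value1.split()]

pairs = map_words("doc1", "the cat sat on the mat")
# "the" appears twice, so the list contains two entries with the same key.
```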

The Reduce:

A reduce function is provided to take all values emitted for a specific key and generate a new list of reduced output:
·         reduce(key2, list<value2>) -> list<value3>
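Continuing the word-count sketch, the matching reducer simply sums the values that have been grouped under one key (again, the names are illustrative):

```python
def reduce_counts(key2, values2):
    """Reduce all counts emitted for one word to a single total."""
    # The engine guarantees values2 holds every value2 emitted for key2
    # across all mappers, so summing gives the global count.
    return [sum(values2)]

reduce_counts("the", [1, 1, 1])  # -> [3]
```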

The MapReduce Engine:

The key aspect of the MapReduce model is that if every map and reduce operation is independent of all other ongoing maps and reduces, then the work can run in parallel on different keys and lists of data. On a large cluster of machines, you can go one step further and run the map operations on the servers where the data lives: rather than copying the data over the network to the program, you push the program out to the machines. The output lists can then be saved to the distributed filesystem, and the reducers run to merge the results. Again, these may run in parallel, each reducing a different set of keys.
·         A distributed filesystem spreads multiple copies of the data across different machines. This not only offers reliability, it offers multiple locations to run the mapping. If a machine with one copy of the data is busy, another machine can be used.
·         A job scheduler keeps track of which jobs are being executed, monitors the success and failure of these individual tasks, and works to complete the entire task.
·         Apache Hadoop is one such MapReduce engine: it provides its own distributed filesystem and runs jobs on servers near the data stored on that filesystem.
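To make that flow concrete, here is a toy single-process "engine" in Python that runs the map phase, groups the intermediate output by key (the shuffle), and then runs the reduce phase. A real engine such as Hadoop executes each phase in parallel across many machines; this sketch only models the data flow, and all names are illustrative:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: each record is independent, so a real engine would
    # process these in parallel, close to where the data lives.
    intermediate = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            intermediate[key2].append(value2)  # shuffle: group values by key
    # Reduce phase: each key is independent, so these could also
    # run in parallel, each reducer handling a different set of keys.
    return {key2: reducer(key2, values2)
            for key2, values2 in intermediate.items()}

# Word count using a trivial mapper and reducer:
counts = run_mapreduce(
    [("doc1", "the cat sat"), ("doc2", "the mat")],
    lambda k, v: [(w, 1) for w in v.split()],
    lambda k, vs: [sum(vs)],
)
# counts == {"the": [2], "cat": [1], "sat": [1], "mat": [1]}
```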

This might seem a little confusing at first. The concept of MapReduce is illustrated in the following video, which should give you a clearer idea of the basics.


Reference: www.apache.org/mapreduce




2 comments:

  1. Over the past few weeks in class we have discussed MapReduce and the algorithms that can be used to implement it. It can be difficult to understand initially and to apply successfully to real datasets. But once you understand that the run-time system handles most of the work of running the algorithms and scheduling program execution, it becomes an easy tool to use. MapReduce jobs can run on large clusters of commodity machines, which are highly scalable and can process many terabytes of data.
    MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience in parallel and distributed systems to easily utilize the resources of a large distributed system.
    I have not yet implemented an algorithm on MapReduce with an actual dataset. I have looked at various tutorials online on how to process large datasets using such a software framework.
    Once I understand it completely and implement MapReduce on my own, I will post a tutorial or blog with my insights.

    Reference: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
