MapReduce is a programming model for distributing work across a cluster of machines.
The Map:
A map transform is provided to transform an input data row of key and value into an output list of key/value pairs:
map(key1, value1) -> list<key2, value2>
That is, for each input it returns a list containing zero or more pairs.
· The output key can differ from the input key.
· The output can have multiple entries with the same key.
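The two properties above can be made concrete with a small sketch. The post contains no code, so this is a hypothetical word-count mapper in Python, written in the map(key1, value1) -> list<key2, value2> style just described; the function name and the choice of word counting are assumptions for illustration:

```python
def map_words(key, value):
    # key: e.g. a line number (ignored here); value: the line's text.
    # Emit a (word, 1) pair for every word -- note the output keys are
    # words, not line numbers, and a repeated word yields repeated keys.
    return [(word, 1) for word in value.split()]

pairs = map_words(1, "the quick brown fox jumps over the lazy dog")
# "the" appears twice, so the list contains two entries with the same key.
```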
The Reduce:
A reduce transform is provided to take all values for a specific key and generate a new list of the reduced output:
· reduce(key2, list<value2>) -> list<value3>
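Continuing the hypothetical word-count sketch in Python, a reducer in this reduce(key2, list<value2>) -> list<value3> style just sums the counts collected for one word (the function name is an assumption for illustration):

```python
def reduce_counts(key, values):
    # key: a word; values: every count emitted for that word by the mappers.
    # Return a list, as the signature allows zero or more output values.
    return [sum(values)]

reduce_counts("the", [1, 1])  # -> [2]
```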
The MapReduce Engine:
The key aspect of the MapReduce algorithm is that if every map and reduce is independent of all other on-going maps and reduces, then the operations can run in parallel on different keys and lists of data. On a large cluster of machines, you can go one step further and run the map operations on the servers where the data lives. Rather than copying the data over the network to the program, you push the program out to the machines. The output lists can then be saved to the distributed filesystem, and the reducers run to merge the results. Again, it may be possible to run these in parallel, each reducing different keys.
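The engine described above can be sketched as a single-machine simulation in Python. This is not how a real engine like Hadoop is built (a real one distributes each phase across machines and handles failures); it only shows the map, shuffle-by-key, and reduce phases, with the word-count mapper and reducer as assumed examples:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map phase: each record is processed independently, so in a real
    # engine these calls could run in parallel on different machines.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            intermediate[out_key].append(out_value)
    # Shuffle/reduce phase: all values for one key go to one reducer call;
    # different keys could again be reduced in parallel.
    return {key: reducer(key, values)
            for key, values in intermediate.items()}

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
counts = map_reduce(
    lines,
    lambda key, value: [(word, 1) for word in value.split()],
    lambda key, values: [sum(values)],
)
# counts maps each word to its total, e.g. "the" -> [2]
```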
· A distributed filesystem spreads multiple copies of the data across different machines. This not only offers reliability, it offers multiple locations to run the mapping. If a machine with one copy of the data is busy, another machine can be used.
· A job scheduler keeps track of which jobs are being executed, monitors the success and failure of the individual tasks, and works to complete the entire job.
· Apache Hadoop is such a MapReduce engine: it provides its own filesystem and runs jobs on servers near the data stored on that filesystem.
This might seem a little confusing at first. The concept of MapReduce is illustrated in the following video, which should give you a clear idea of the basics.
Reference: www.apache.org/mapreduce
Over the past few weeks we have discussed MapReduce in class, along with algorithms that can be implemented on top of it. It can be difficult to understand initially and to apply successfully to real datasets. But once you understand that the run-time system handles most of the work of implementing the algorithms and scheduling program execution, it becomes an easy tool to use. MapReduce jobs can run on large clusters of commodity machines, which are highly scalable and can process many terabytes of data.
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience in parallel and distributed systems to easily utilize the resources of a large distributed system.
I have not yet implemented an algorithm on MapReduce with an actual dataset, but I have looked at various online tutorials on processing large datasets with such a framework.
Once I understand it completely and can implement MapReduce on my own, I will post a tutorial with my insights.
Reference: MapReduce: Simplified Data Processing on Large Clusters; Jeffrey Dean and Sanjay Ghemawat