Bash Reduce: BashReduce was developed by Erik Frey of
last.fm. It implements MapReduce for standard Unix commands such as sort, awk,
grep, etc. There is no task coordination; a master process simply fires off
jobs and data. Communication with all hosts is handled through passwordless
ssh, and unlike Hadoop, where data is centrally stored in HDFS, BashReduce
simply ships data to the workers. It is far less complex than Hadoop but also
less flexible, since it works only with Unix commands.
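To make the model concrete, here is a rough Python sketch of the same idea: fan a Unix pipeline out to worker hosts over passwordless ssh and merge the results on the master. This is not BashReduce's actual interface; the host names and file path are hypothetical.

    import subprocess

    HOSTS = ["worker1", "worker2"]  # hypothetical hosts reachable via passwordless ssh

    def run_on_hosts(remote_cmd):
        # "Map": start the same Unix pipeline on each host's local data shard.
        procs = [subprocess.Popen(["ssh", host, remote_cmd],
                                  stdout=subprocess.PIPE, text=True)
                 for host in HOSTS]
        return [p.communicate()[0] for p in procs]

    partials = run_on_hosts("grep ERROR /data/shard.log | sort")

    # "Reduce": merge the sorted partial results on the master.
    merged = sorted(line for chunk in partials for line in chunk.splitlines())
    print("\n".join(merged[:10]))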
Reference: http://www.linux-mag.com/id/7407/
Disco Project: Disco was developed at Nokia Research as a
framework for distributed data processing. Disco is written in Erlang, a scalable
functional language with built-in support for concurrency, fault tolerance, and
distribution. Users of Disco write jobs in Python, which makes it possible to
express even complex algorithms in little code. The Disco Distributed File
System (DDFS) is designed specifically to support use cases that are typical
for Disco and MapReduce. Fault tolerance and high availability are ensured by
k-way replication of both data and metadata, so the system tolerates k-1
simultaneous hardware failures without interruption. Even after a catastrophic
failure, data is recoverable using standard tools, since DDFS stores data as
regular files on local file systems such as ext3 or XFS. DDFS operates on two
main concepts: blobs and tags. Blobs are arbitrary objects that have been
pushed to DDFS; they are distributed to storage nodes and stored on their local
file systems, with multiple copies or replicas kept for each blob. Tags contain
metadata about blobs: a tag holds a list of URLs that refer to the blobs
assigned to it, and it may also include user-defined data. Disco also
has efficient job scheduling features.
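As an illustration of how compact Disco jobs can be, below is a word-count job in the style of the Disco tutorial; exact APIs may differ between Disco versions, and the input URL is only a placeholder.

    from disco.core import Job, result_iterator

    def map(line, params):
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        # Imported inside the function because its body is shipped to the nodes.
        from disco.util import kvgroup
        # kvgroup groups the sorted (word, count) pairs by word.
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://example.com/corpus.txt"],  # placeholder input
                        map=map, reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print(word, count)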
Reference: http://discoproject.org/doc/disco/intro.html
Spark: Spark was developed in the UC Berkeley AMPLab. It is
used by researchers at Berkeley to run large-scale applications such as spam
filtering and natural language processing, and to accelerate data analytics at
Conviva, Klout, Quantifind, and other companies. Spark is implemented in Scala,
a language that combines functional and object-oriented programming. Users can
query big data straight from the Scala interpreter. Spark runs on a cluster
manager called Apache Mesos, which allows Spark to co-exist with Hadoop, and it
can read any data source that Hadoop supports. To run programs faster, Spark
provides in-memory cluster computing: a job can load data into memory and query
it repeatedly, much more quickly than disk-based systems like Hadoop MapReduce.
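A minimal sketch of that in-memory pattern, written against Spark's Python API (the post mentions the Scala shell, but the idea is identical); the file path is hypothetical:

    from pyspark import SparkContext

    sc = SparkContext("local", "log-mining")

    lines = sc.textFile("hdfs:///logs/app.log")            # hypothetical path
    errors = lines.filter(lambda l: "ERROR" in l).cache()  # pin the RDD in memory

    # Both queries reuse the cached data instead of re-reading from disk.
    print(errors.count())
    print(errors.filter(lambda l: "timeout" in l).count())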
Reference: http://spark-project.org/
GraphLab: It was developed at Carnegie Mellon and is
designed for use in machine learning. GraphLab's goal is to make the design and
implementation of efficient and correct parallel machine learning algorithms
easier. GraphLab has its own map stage, called the update stage, which can both
read and modify overlapping sets of data. The user specifies the data as a
graph in which each vertex and edge has associated memory. Update phases can be
chained in such a way that one update function triggers other update functions
that operate on other vertices in the graph. This makes machine learning on
graphs more tractable and improves iterative algorithms. GraphLab's own reduce
phase is called the sync operation. The results of a sync are global and can be
used by all vertices in the graph. Syncs happen at time intervals, so there is
no strong tie between the update and sync phases. Overall, GraphLab looks like
a powerful generalization and re-specification of MapReduce.
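GraphLab itself is a C++ framework, so the following is only a plain-Python sketch of its two phases: vertex-scoped update functions that read and modify data on neighboring vertices, and a periodic global sync. The PageRank-style update and the toy graph are illustrative only.

    out = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}          # toy directed graph
    incoming = {v: [u for u in out if v in out[u]] for v in out}
    rank = {v: 1.0 for v in out}                             # data on each vertex

    def update(v):
        # Update one vertex from the data on its in-neighbors.
        rank[v] = 0.15 + 0.85 * sum(rank[u] / len(out[u]) for u in incoming[v])

    def sync():
        # Global aggregate, visible to all vertices.
        return sum(rank.values()) / len(rank)

    for sweep in range(10):        # repeated, chained updates
        for v in out:
            update(v)
        if sweep % 5 == 4:         # sync runs on its own schedule
            print("mean rank:", sync())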
Reference: http://select.cs.cmu.edu/code/graphlab/
Storm: It was developed by Nathan Marz of BackType, who is
now at Twitter. Storm operates in real time, processing data as it streams in.
Storm itself is written in Clojure, but programs can be written in any
programming language, such as Scala, Ruby, or Python. Storm uses ZeroMQ for
message passing, which removes intermediate queueing and allows messages to
flow directly between the tasks themselves. Storm implements fault detection at
the task level: when a task fails, messages are automatically reassigned so
that processing restarts quickly. Storm is a computation system that
incorporates no storage of its own; it processes data as it streams.
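As a taste of that multi-language support, here is a Python bolt in the style of Storm's multilang examples; it relies on the storm.py adapter that ships with Storm, and details vary by version.

    import storm

    class SplitSentenceBolt(storm.BasicBolt):
        # Receives one tuple at a time; failures are detected per task,
        # and unacknowledged tuples are replayed automatically.
        def process(self, tup):
            for word in tup.values[0].split():
                storm.emit([word])

    SplitSentenceBolt().run()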
HPCC Systems: It was developed by LexisNexis for massive big
data analytics. It attempts to make writing parallel-processing workflows
easier through its Enterprise Control Language (ECL). HPCC is written in C++,
which can make in-memory querying considerably faster. HPCC has two systems for
processing and serving data: the Thor data refinery cluster and the Roxie rapid
data delivery cluster. Thor is a data processor, much like Hadoop; Roxie is
similar to a data warehouse and supports transactions. HPCC uses its own
distributed file system, in which files are divided on even record boundaries
specified by the user. It uses a master architecture, with name services and
file-mapping information stored on a separate server.
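HPCC workflows themselves are written in ECL, so the short Python sketch below only illustrates the file-partitioning idea described above: splitting a fixed-width file across nodes so that every split point falls on a record boundary. The record size and part count are hypothetical.

    RECORD_SIZE = 128   # bytes per fixed-width record (hypothetical)
    PARTS = 4           # number of cluster nodes (hypothetical)

    def byte_ranges(total_records):
        # Each part gets a contiguous run of whole records, never a partial one.
        per_part = -(-total_records // PARTS)   # ceiling division
        points = [min(i * per_part, total_records) for i in range(PARTS + 1)]
        return [(points[i] * RECORD_SIZE, points[i + 1] * RECORD_SIZE)
                for i in range(PARTS)]

    print(byte_ranges(1000))  # [(0, 32000), (32000, 64000), ...]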
Reference: http://www.hpccsystems.com