Tuesday, February 19, 2013

Alternatives to Hadoop


BashReduce: BashReduce was developed by Erik Frey of last.fm. It implements MapReduce for standard Unix commands such as sort, awk and grep, and works only with such commands. There is no task coordination: a master process simply fires off jobs and data, and communication with the worker hosts is handled through passwordless ssh. Unlike Hadoop, where data is centrally stored in HDFS, there is no distributed file system. BashReduce is far less complex than Hadoop, but also less flexible, since it only works with Unix commands.
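As a rough illustration of the pattern BashReduce distributes, here is a minimal, single-machine Python sketch of a map | sort | reduce word count; the word-count logic simply stands in for the awk/sort/grep commands a real BashReduce run would fan out over ssh.

import sys
from itertools import groupby

# Single-machine sketch of the map | sort | reduce pipeline that
# BashReduce fans out across hosts using plain Unix tools and ssh.

def map_phase(lines):
    # Emit (word, 1) pairs, much like an awk one-liner would.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Sort the pairs, then sum the counts per word (compare sort | uniq -c).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, count in reduce_phase(map_phase(sys.stdin)):
        print(word, count)

Run locally as "python wordcount.py < input.txt"; BashReduce's contribution is running the map and sort stages on many machines and merging their sorted output before the reduce.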
Disco Project: Disco was developed by Nokia Research as a framework for distributed data processing. Disco is written in Erlang, a scalable functional language with built-in support for concurrency, fault tolerance and distribution. Users of Disco write jobs in Python, which makes it possible to express even complex algorithms with very little code. The Disco Distributed File System (DDFS) is designed specifically to support the use cases typical of Disco and MapReduce. Fault tolerance and high availability are ensured by K-way replication of both data and metadata, so the system tolerates K-1 simultaneous hardware failures without interruption. Even after a catastrophic failure, data is recoverable with standard tools, since DDFS stores it on ordinary local file systems such as ext3 or xfs. DDFS is built around two main concepts: blobs and tags. Blobs are arbitrary objects that have been pushed to DDFS; they are distributed to storage nodes and stored on their local file systems, with multiple replicas kept for each blob. Tags contain metadata about blobs: a tag holds a list of URLs that refer to the blobs assigned to it, and may also include user-defined data. Disco also has efficient job scheduling features.
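A sketch of a Disco job, modelled on the word-count example from the Disco tutorial (the input URL is a placeholder, and exact API details may differ between Disco versions):

from disco.core import Job, result_iterator

def map(line, params):
    # Emit (word, 1) for every word on an input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # kvgroup groups a sorted (key, value) stream by key.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    # The input URL is a placeholder; it could point at data pushed to DDFS.
    job = Job().run(input=["http://example.com/input.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)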
Spark: Spark was developed in the UC Berkeley AMPLab. It is used by researchers at Berkeley to run large-scale applications such as spam filtering and natural language processing, and to accelerate data analytics at Conviva, Klout, Quantifind and other companies. Spark is implemented in Scala, a functional object-oriented language, and users can query big data straight from the Scala interpreter. Spark runs on a cluster manager called Apache Mesos, which allows Spark to co-exist with Hadoop, and it can read any data source that Hadoop supports. To run programs faster, Spark provides in-memory cluster computing: a job can load data into memory and query it repeatedly, much faster than disk-based systems such as Hadoop MapReduce.
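For example, a small job using Spark's Python API (the master URL and file path below are placeholders) can cache a dataset in memory and query it repeatedly without re-reading it from disk:

from pyspark import SparkContext

# Placeholder master and path; on a cluster these would point at Mesos and HDFS.
sc = SparkContext("local", "LogMiner")

errors = (sc.textFile("hdfs:///logs/app.log")
            .filter(lambda line: "ERROR" in line)
            .cache())                  # keep this dataset in memory

# Both queries below reuse the cached, in-memory data rather than re-reading disk.
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())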
GraphLab: GraphLab was developed at Carnegie Mellon and is designed for use in machine learning. Its goal is to make the design and implementation of efficient and correct parallel machine learning algorithms easier. GraphLab has its own map stage, called the update phase, which can both read and modify overlapping sets of data. The user specifies the data as a graph in which every vertex and edge has its own associated memory. Update functions can be chained, so that one update function can trigger further update functions that operate on other vertices in the graph. This makes machine learning on graphs more tractable and speeds up iterative algorithms. GraphLab also has its own reduce phase, called the sync operation. The results of a sync are global and can be used by all vertices in the graph. Syncs run at regular time intervals, so there is no strong coupling between the update and sync phases. Overall, GraphLab looks like a powerful generalization and re-specification of MapReduce.
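GraphLab's real API is C++, so the following is only a conceptual Python sketch of the update/sync model described above, using a toy PageRank-style computation: an update function reads a vertex's neighbours, rewrites the vertex's own data, and schedules further updates when its value changes enough, while a sync computes a global aggregate.

from collections import deque

# Conceptual sketch of GraphLab's update/sync model (not its actual C++ API).
graph = {            # vertex -> out-neighbours
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
in_nbrs = {v: [u for u in graph if v in graph[u]] for v in graph}
rank = {v: 1.0 / len(graph) for v in graph}    # data attached to each vertex

def update(v, schedule, damping=0.85, tol=1e-4):
    # Update function: read neighbouring data, modify this vertex's data,
    # and trigger further updates on its out-neighbours if it changed enough.
    new = (1 - damping) / len(graph) + damping * sum(
        rank[u] / len(graph[u]) for u in in_nbrs[v])
    changed = abs(new - rank[v]) > tol
    rank[v] = new
    if changed:
        schedule.extend(graph[v])

def sync():
    # Sync operation: a global reduction whose result every vertex could read.
    return sum(rank.values())

schedule = deque(graph)          # start by scheduling every vertex
while schedule:
    update(schedule.popleft(), schedule)

print(rank)
print("total rank mass:", sync())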
Storm: Storm was developed by Nathan Marz of BackType, now at Twitter. Storm operates in real time, processing data as it streams in. Storm itself is written in Clojure, but programs can be written in almost any language, such as Scala or Ruby. Storm uses ZeroMQ for message passing, which removes intermediate queueing and allows messages to flow directly between the tasks themselves. Storm implements fault detection at the task level: when a task fails, its messages are automatically reassigned so that processing restarts quickly. Storm is a computation system that incorporates no storage of its own; it simply processes data as it streams through.
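Storm topologies are normally written against Storm's own Java/Clojure API, or in other languages through its multi-lang protocol, so the following is just a conceptual Python sketch of the spout/bolt idea: an unbounded source emits tuples and bolts transform them as they stream past, with no storage involved.

import itertools
import random

# Conceptual sketch of Storm's spout/bolt model, not the real Storm API.

def sentence_spout():
    # Spout: an unbounded source that keeps emitting tuples.
    sentences = ["the cow jumped over the moon", "an apple a day"]
    while True:
        yield random.choice(sentences)

def split_bolt(stream):
    # Bolt: split each sentence tuple into word tuples.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream, counts):
    # Bolt: keep running word counts as tuples flow through.
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire spout -> bolt -> bolt. In Storm these tasks run on different machines
# and tuples travel between them over ZeroMQ rather than Python generators.
counts = {}
pipeline = count_bolt(split_bolt(sentence_spout()), counts)
for word, count in itertools.islice(pipeline, 20):
    print(word, count)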
HPCC Systems: HPCC was developed by LexisNexis for massive big data analytics. It attempts to make writing parallel-processing workflows easier through its Enterprise Control Language (ECL). HPCC is written in C++, which is intended to make in-memory querying much faster. HPCC has two systems for processing and serving data: the Thor data refinery cluster and the Roxie rapid data delivery cluster. Thor is a data processor, much like Hadoop; Roxie is closer to a data warehouse and supports transactions. HPCC uses its own distributed file system, in which files are divided on even record boundaries specified by the user. It uses a master architecture, with name services and file-mapping information stored on a separate server.
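HPCC workflows are written in ECL rather than Python, so the sketch below only illustrates the record-boundary splitting idea mentioned above: it computes roughly even byte ranges for a file, each extended to the end of a record (a line here), of the kind a distributed file system could hand to separate workers. The function name and the four-way split are made up for the example.

import os

def split_on_record_boundaries(path, parts):
    # Return (start, end) byte ranges of roughly equal size, each ending
    # on a record boundary (a newline here) so no record is cut in half.
    size = os.path.getsize(path)
    chunk = max(size // parts, 1)
    ranges, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, parts):
            f.seek(min(i * chunk, size))
            f.readline()                 # skip ahead to the next record boundary
            end = min(max(f.tell(), start), size)
            ranges.append((start, end))
            start = end
        ranges.append((start, size))
    return ranges

# Example: the byte ranges four workers would each read.
# print(split_on_record_boundaries("bigfile.csv", 4))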

1 comment:

  1. Sam, nice mention of HPCC Systems. Furthermore, their open source machine learning library and matrix-processing algorithms assist data scientists and developers with business intelligence and predictive analytics. Its integration with Hadoop, R and Pentaho extends its capabilities further, providing a complete solution for data ingestion, processing and delivery. In fact, a webhdfs implementation (a web-based API provided by Hadoop) was recently released. See http://hpccsystems.com/h2h
