Thursday, March 21, 2013

Top 5 Open Source Projects in Big Data


I watched a short video from SiliconANGLE, an interview with Abhishek Mehta, the founder of Tresata, who spoke about what he considers to be the Top 5 Open Source Projects in Big Data.



1. Trevni is a core storage engine that speeds up data access. It is a columnar file format, developed within the Apache Avro project and closely tied to the Cloudera Impala project (discussed below). Trevni has several features, but I will list a few that stick out (and make sense) to me:
  • Data is stored column by column, and each column holds values of a single type, so different columns can hold different types of data.
  • Data sets are partitioned into row groups, each containing a distinct collection of rows.
  • Each row group is written as a separate file.
  • Using fewer row groups reduces the number of HDFS files created, which in turn reduces the load on the NameNode.
  • Several data types are supported (int, long, float, double, string, and byte data).
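To make the row group and column ideas above concrete, here is a minimal Python sketch. It is not Trevni's actual file format or API; the function name `to_row_groups` and the sample records are made up for illustration. It only shows how rows can be split into groups and then stored column by column, so that reading one column does not touch the others.

```python
def to_row_groups(rows, group_size):
    """Split rows into row groups and lay each group out column by column.

    Hypothetical helper for illustration only; not part of Trevni.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # One list of values per column name within this row group.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        groups.append(columns)
    return groups


# Sample records (invented for this example).
rows = [
    {"id": 1, "name": "a", "score": 0.5},
    {"id": 2, "name": "b", "score": 1.5},
    {"id": 3, "name": "c", "score": 2.5},
]

groups = to_row_groups(rows, group_size=2)

# Reading a single column only touches that column's values in each group.
scores = [value for group in groups for value in group["score"]]
```

A larger `group_size` gives fewer row groups, which in the HDFS setting described above would mean fewer files.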



2. Spark is a fast cluster computing system for algorithm processing. Because it keeps data in memory across the cluster, Spark is much quicker than systems like Hadoop MapReduce for workloads that reuse the same data. Spark is useful for both iterative algorithms and interactive data mining.
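The benefit for iterative algorithms can be sketched in plain Python. This is not Spark code: `load_dataset` is a hypothetical stand-in for an expensive read from disk, and the loop is a toy gradient descent that estimates the mean of the data. The point is that the data is loaded once and then reused in memory on every iteration, which is roughly what caching a dataset in Spark buys you.

```python
load_count = 0


def load_dataset():
    """Stand-in for an expensive read from disk (hypothetical)."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]


def step(estimate, data, rate=0.5):
    """One gradient step that moves the estimate toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient


# Load once and keep the data in memory for every iteration.
cached = load_dataset()

estimate = 0.0
for _ in range(20):
    estimate = step(estimate, cached)
```

If `load_dataset()` were called inside the loop instead, the same result would cost twenty reads rather than one.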

3. D3 (Data-Driven Documents) is a JavaScript visualization library that is used to communicate results. One example of what D3 can do is generate an HTML table from an array of numbers. D3 can also be used to create interactive charts. Here is one example of how D3 can be used to understand data: http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html?_r=0
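D3 itself is a JavaScript library, so the following Python sketch is not D3 code. It only illustrates the underlying idea of the table example: each value in an array is mapped to one element of the document. The function name `numbers_to_table` is made up for this illustration.

```python
from html import escape


def numbers_to_table(values):
    """Build a one-column HTML table with one row per value."""
    rows = "".join(f"<tr><td>{escape(str(v))}</td></tr>" for v in values)
    return f"<table>{rows}</table>"


table_html = numbers_to_table([4, 8, 15])
```

D3 goes further than this by keeping the data bound to the page elements, so the document can be updated when the data changes.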

4. HCatalog is a metadata management framework for Hadoop. Users are able to share data and metadata across Hive, Pig, and MapReduce and to write applications without concern for where their data is stored. There are three main uses of HCatalog: Complex Data Processing, Data Discovery Checkpoints, and Integrating Hadoop with Everything. To read more about these three categories, visit: http://hortonworks.com/hdp/hdp-hcatalog-metadata-services/.

5. Cloudera Impala is a real-time query engine for Hadoop. Some of the basic features of Impala are the following:
  • Real time queries in seconds
  • Support for HDFS and HBase systems
  • Low latency scheduler
  • In-memory data transfers

To view a more complete list and to read more about Cloudera Impala, visit: 


All of these open source projects are making big data easier for businesses to consume. There are many open source software systems out there, but these are the ones that should be popping up more and more in the near future!
