I watched a short video from SiliconANGLE: an interview with Abhishek Mehta,
the founder of Tresata, who spoke about what he considers to be the Top 5 Open
Source Projects in Big Data.
1. Trevni is a core storage engine that speeds up data access. It is a columnar
file format associated with the Cloudera Impala project (discussed below).
Trevni has several features, but I will list a few that stand out (and make
sense) to me:
- Each column stores a single type of data, and different columns can hold different types.
- Data sets are partitioned into row groups that contain a distinct collection of rows.
- Each row group is written as a separate file.
- Reducing the number of row groups reduces the number of HDFS files created, which in turn reduces the load on the NameNode.
- Many data types are supported (int, long, float, double, string, and bytes).
For a more complete list, see: http://venkat-sp.blogspot.com/2012/12/trevni-columnar-file-format-for.html.
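To make the row group and column ideas above more concrete, here is a minimal Python sketch. It is not Trevni's actual file format or API: the function names `to_row_groups` and `read_column` and the sample records are made up for illustration. It only shows how rows can be split into row groups, with each column's values stored together inside a group, so a reader can pull one column without touching the others.

```python
# Hypothetical sample records (row-oriented), for illustration only.
rows = [
    {"id": 1, "name": "a", "score": 0.5},
    {"id": 2, "name": "b", "score": 1.5},
    {"id": 3, "name": "c", "score": 2.5},
    {"id": 4, "name": "d", "score": 3.5},
    {"id": 5, "name": "e", "score": 4.5},
]

def to_row_groups(rows, group_size):
    """Partition rows into row groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a group, all values of one column sit together.
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        groups.append(columns)
    return groups

def read_column(groups, name):
    """Read one column across all row groups, ignoring the other columns."""
    values = []
    for group in groups:
        values.extend(group[name])
    return values

groups = to_row_groups(rows, group_size=2)
print(len(groups))                   # number of row groups
print(read_column(groups, "score"))  # only the "score" values are touched
```

A larger `group_size` yields fewer row groups, which mirrors the point above about creating fewer files.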
2. Spark is an extremely fast cluster computing system for algorithm
processing. Spark's in-memory cluster computing can be much quicker than
disk-based systems like Hadoop MapReduce. Spark is useful for both iterative
algorithms and interactive data mining.
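Here is a rough illustration of why keeping data in memory helps iterative algorithms. This is plain Python, not Spark code: `CountingSource` is a made-up stand-in for a disk-backed dataset, and the update rule is an arbitrary toy. The only point is that the cached version reads the source once, while the other version reads it again on every iteration.

```python
class CountingSource:
    """Hypothetical stand-in for a disk-backed dataset; counts how often it is read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)

def iterate_reloading(source, steps):
    """Re-read the dataset from the source on every iteration."""
    estimate = 0.0
    for _ in range(steps):
        data = source.load()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)  # toy update step
    return estimate

def iterate_cached(source, steps):
    """Read the dataset once, keep it in memory, and reuse it each iteration."""
    data = source.load()
    estimate = 0.0
    for _ in range(steps):
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)  # same toy update step
    return estimate

reloading_source = CountingSource([1, 2, 3])
cached_source = CountingSource([1, 2, 3])
result_reloading = iterate_reloading(reloading_source, steps=10)
result_cached = iterate_cached(cached_source, steps=10)
print(reloading_source.reads, cached_source.reads)
```

Both functions compute the same result; only the number of reads from the source differs.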
3. D3 (Data-Driven Documents) is a JavaScript visualization library that is
used to communicate results. One example of what D3 can do is generate an HTML
table from an array of numbers. D3 can also be used to create interactive
charts. This is just one example of how D3 can be used to understand data:
http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html?_r=0
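The "HTML table from an array of numbers" idea can be sketched in a few lines. Note that this is plain Python rather than D3 itself (D3 is a JavaScript library that works on a live web page), and `table_from_numbers` is a made-up helper. It only shows the underlying idea of turning each data value into one piece of markup.

```python
from html import escape

def table_from_numbers(numbers):
    """Return an HTML table with one row per number."""
    rows = "".join(f"<tr><td>{escape(str(n))}</td></tr>" for n in numbers)
    return f"<table>{rows}</table>"

print(table_from_numbers([4, 8, 15]))
# prints <table><tr><td>4</td></tr><tr><td>8</td></tr><tr><td>15</td></tr></table>
```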
4. HCatalog is a metadata management framework for Hadoop. Users can share data
and metadata across Hive, Pig, and MapReduce, and write applications without
worrying about where their data is stored. There are three main uses of
HCatalog: complex data processing, data discovery checkpoints, and integrating
Hadoop with everything. To read more about these three categories, visit: http://hortonworks.com/hdp/hdp-hcatalog-metadata-services/.
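As a rough sketch of what "without worrying about where the data is stored" means, here is a toy catalog in Python. This is not the HCatalog API: `ToyCatalog`, `TableInfo`, the table name, and the paths are all hypothetical. The application asks for a table by name, and the catalog keeps track of the location and schema, so the data can move without the application code changing.

```python
from dataclasses import dataclass

@dataclass
class TableInfo:
    location: str  # hypothetical storage path
    schema: dict   # column name -> type name

class ToyCatalog:
    """Hypothetical in-memory catalog; not the HCatalog API."""

    def __init__(self):
        self._tables = {}

    def register(self, name, location, schema):
        self._tables[name] = TableInfo(location, schema)

    def lookup(self, name):
        return self._tables[name]

    def relocate(self, name, new_location):
        self._tables[name].location = new_location

def columns_of(catalog, table_name):
    """Application code: refers to the table by name only, never by path."""
    return sorted(catalog.lookup(table_name).schema)

catalog = ToyCatalog()
catalog.register("clicks", "/data/raw/clicks", {"user": "string", "ts": "long"})
before = columns_of(catalog, "clicks")
catalog.relocate("clicks", "/data/archive/clicks")  # the data moves
after = columns_of(catalog, "clicks")               # application code is unchanged
print(before == after)
```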
5. Cloudera Impala is a real-time query engine for Hadoop. Some of the basic
features of Impala are the following:
- Real-time queries that return in seconds
- Support for data stored in HDFS and HBase
- Low latency scheduler
- In-memory data transfers
To view a more complete list and to read more about Cloudera
Impala, visit:
All of these open source projects are making big data easier
to consume for businesses. There are so many open source software systems out
there, but these are ones that should be popping up more and more in the near
future!