Sunday, February 17, 2013

Hadoop Distributed File System




In the past, applications that called for parallel processing, such as large scientific calculations, were run on special-purpose parallel computers with many processors and specialized hardware. However, the prevalence of large-scale web services has caused more and more computing to be done on installations of thousands of compute nodes operating more or less independently. The Google File System (GFS) was an early system built to exploit this kind of cluster computing successfully.

The Hadoop Distributed File System (HDFS) is a subproject of the Apache Hadoop project. It is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets. Hadoop is ideal for storing large amounts of data, on the order of terabytes and petabytes, and uses HDFS as its storage system. HDFS lets you connect nodes (commodity personal computers) contained within clusters over which data files are distributed, and you can then access and store those files as one seamless file system. Access to data files is handled in a streaming manner, meaning that applications or commands are executed directly using the MapReduce processing model.

HDFS has many similarities with other distributed file systems but differs in several respects. One noticeable difference is HDFS's write-once-read-many model, which relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access. Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space; HDFS provides interfaces for applications to move themselves closer to where the data is located.

HDFS can be accessed in many different ways. It provides a native Java application programming interface (API) and a native C-language wrapper for the Java API. In addition, you can use a web browser to browse HDFS files. This is a big advantage, as it gives applications portability. (A short sketch using the Java API appears below.)

The following video takes us through a tutorial about the architecture of the Hadoop Distributed File System. A notable feature of this model is that it is simplified in terms of data organization and is capable of handling very large volumes of data. It employs a form of logic where data is stored in parallel on data nodes, which are easy to access via the name node. The architecture comprises a name node and data nodes, with data distributed across multiple servers and racks. This accommodates the processing of larger data sets, and it works well with the logic employed to locate replicated data. The data replicas are stored on various data nodes to ensure redundancy. A data set is split into blocks, typically 64 MB to 128 MB in size, for processing in the cluster. A secondary name node cannot take over if the primary name node fails; instead, the name node's metadata is copied periodically, so the file system state can be reconstructed and accessed via the secondary name node. HDFS is optimized for batch processing, and it assumes commodity hardware.
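To make the Java API mentioned above concrete, here is a minimal sketch of writing and then reading a small file through the FileSystem class. It assumes the Hadoop client libraries are on the classpath and that a name node is reachable at hdfs://localhost:9000 (a placeholder URI; substitute your cluster's fs.defaultFS value). Notice how the write-once-read-many model shows up: the file is created and streamed sequentially once, and can then be opened for reading any number of times.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the name node; the URI below is a placeholder.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

            Path file = new Path("/user/demo/hello.txt"); // hypothetical path

            // Write once: the file is created and written sequentially.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS!");
            }

            // Read many: any client can now stream the same file back.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }

            fs.close();
        }
    }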
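The block size and replication factor discussed above can also be controlled per file. The sketch below, again with hypothetical paths and values, uses the create() overload that takes an explicit replication factor and block size, so a large file would be split into 128 MB blocks with three replicas spread across data nodes. In practice the cluster-wide defaults from the HDFS configuration usually suffice, so a per-file override like this is only needed in special cases.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockSettingsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(
                    new Path("/user/demo/big.dat"),   // hypothetical path
                    true,                             // overwrite if present
                    4096,                             // I/O buffer size in bytes
                    (short) 3,                        // three replicas on different data nodes
                    128L * 1024 * 1024)) {            // 128 MB block size
                out.write(new byte[]{1, 2, 3});       // stand-in for real data
            }

            fs.close();
        }
    }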
References:
Jeff Hanson, "An Introduction to the Hadoop Distributed File System," IBM developerWorks: http://www.ibm.com/developerworks/library/wa-introhdfs/
Dr. Sreerama K. Murthy, International School of Engineering
Comments:

  1. Prashant,

    Your post needs a lot of paraphrasing and references, since it matches very closely both the IBM description of Hadoop and that of the Apache Foundation. Please edit your post and provide citations.

    Thank you,
    Fadel

  2. Thank you for your feedback. I have provided the references. I have taken most of the technical points about HDFS from the referenced article, and there are some comments placed in between about my understanding. The comments in the last paragraph are on certain key points about the architecture and goals of HDFS. The video link provided talks about these parameters, which were predominant in the discussion of HDFS.

    Hope this is fine. I will make any other additional changes if required.
    Thanks,
    Prashant
