Introduction to HDFS (Hadoop Distributed File System)

    HDFS stands for Hadoop Distributed File System. When a dataset exceeds the storage capacity of the single machine that is processing it, HDFS steps in to distribute the data across multiple machines. Once the data is distributed this way, one of the biggest issues to handle is providing a suitable fault-tolerance and recovery mechanism so that no data is lost.

    The Hadoop Distributed File System follows a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant even though it is built from low-cost hardware. It holds very large amounts of data and provides easy access to it. To store such huge volumes, files are spread across multiple machines and stored redundantly, so the system can recover from data loss when a machine fails. HDFS also makes data available to applications for parallel processing.

    HDFS has several notable capabilities. It is designed for storing very large files, ranging from megabytes up to terabytes in size, and it runs on commonly available hardware that is neither expensive nor highly reliable. Because the system is designed to handle node failure, tasks can carry on without visible disruption to the user. Its key characteristics can be summarized as follows:

    • Storage unit of Hadoop, built on the principles of a distributed file system
    • Master-slave architecture
    • Data is split into blocks, with a default block size of 128 MB
    • Each block is replicated across multiple nodes (three replicas by default)
    • Can be built out of commodity hardware
    • Resistant to failure: an individual node failure does not disrupt the system

    On the other hand, HDFS has a few known limitations that make it less suitable in some scenarios. Because it is optimized for processing large amounts of data, it does not work well for applications that require low-latency access to data; in that scenario an additional layer of HBase on top of HDFS is a more suitable choice, and it will be discussed in the HBase section. HDFS also struggles when the number of files becomes very large: the name node keeps the file system metadata in memory, and on average a file or directory takes about 150 bytes of memory, so even though the data within the files can be stored without problem, the sheer number of files can exceed the name node's memory capacity.

    Main Components:

    • Name Node: Master (keeps the file system metadata)
    • Data Node: Slave (stores the actual data blocks)
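
    To make the master/slave split concrete, the following minimal sketch uses the Hadoop FileSystem Java API to write a small file and read it back. It assumes a reachable cluster whose address (fs.defaultFS) is picked up from the configuration on the classpath, and the path /user/demo/hello.txt is just a placeholder. Behind the scenes, the Name Node answers the metadata questions (which blocks, on which machines), while the actual bytes are streamed to and from Data Nodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;

    public class HdfsHelloWorld {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (the Name Node address) from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");   // placeholder path

            // Write: the Name Node allocates blocks, the bytes go to Data Nodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client asks the Name Node for block locations, then reads from Data Nodes.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }

            fs.close();
        }
    }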

    Hadoop Development Platform: 

    • Uses a MapReduce model for working with data
    • Users can program in Java, C++, and other languages.
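
    As an illustration of the MapReduce model, below is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. The class names are illustrative and the input and output directories are taken from the command line; it is meant as a sketch of the programming model rather than a production job.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in a line of input.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws java.io.IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    Such a job is typically packaged as a jar and submitted to the cluster, with both paths pointing at HDFS directories, so the map tasks can run on the nodes that already hold the input blocks.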

    How does HDFS work?

    1.) Split Data

    • Data copied into HDFS is split into blocks
    • Typical HDFS block size is 128 MB (vs. 4 KB on UNIX file systems).
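
    For reference, the sketch below shows how a client can inspect the default block size and, if needed, request a different block size for a particular file at creation time. It assumes the default configuration on the classpath; the path and the 256 MB figure are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/big-dataset.bin");   // placeholder path

            // The cluster-wide default, normally 128 MB (dfs.blocksize).
            System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");

            // A per-file block size can also be requested at creation time,
            // here 256 MB, keeping the default replication factor.
            long blockSize = 256L * 1024 * 1024;
            try (FSDataOutputStream out = fs.create(
                    file, true, 4096, fs.getDefaultReplication(file), blockSize)) {
                out.writeBytes("data destined to be split into blocks\n");
            }
            fs.close();
        }
    }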

    2.) Replication

    • Each block is replicated to multiple machines
    • This allows for node failure without data loss.
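
    The replication factor is a per-file setting. The sketch below, again assuming a placeholder path and the default configuration, reads the current replication factor of an existing file and then asks HDFS to keep five copies of it; the Name Node arranges the additional block copies in the background.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/important.log");   // placeholder path

            // Current replication factor of an existing file (3 by default).
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Ask HDFS to keep more copies of this particular file.
            fs.setReplication(file, (short) 5);
            fs.close();
        }
    }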

    Goals of HDFS

    Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick, automatic fault detection and recovery.

    Huge datasets − HDFS should scale to hundreds of nodes per cluster in order to manage applications that work with huge datasets.

    Hardware at data − A requested task can be carried out efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
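
    This idea of moving computation to the data is visible through the API as well: a client (or a scheduler) can ask the Name Node which machines hold each block of a file and place work accordingly. The sketch below, assuming a placeholder path and the default configuration, simply prints the host list for every block of a file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/big-dataset.bin");   // placeholder path

            // Ask the Name Node where every block of the file is stored.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            // A scheduler can use these host lists to run each task
            // on (or near) a machine that already holds the block.
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }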