Difference between Hadoop 1.x and Hadoop 2.x

This article compares Hadoop 1 and Hadoop 2. Hadoop 2 overcomes most of the issues that Hadoop 1 had. One simple difference between Hadoop 1.0 and Hadoop 2.0 is the default HDFS block size: 64 MB in Hadoop 1 and 128 MB in Hadoop 2.0.
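
To make the block size difference concrete, here is a minimal sketch that counts how many blocks a file occupies under each default. The function name `block_count` and the 1 GB file are illustrative assumptions, not part of any Hadoop API.

```python
MB = 1024 * 1024

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

file_size = 1024 * MB  # a hypothetical 1 GB file

blocks_v1 = block_count(file_size, 64 * MB)   # Hadoop 1 default block size
blocks_v2 = block_count(file_size, 128 * MB)  # Hadoop 2 default block size
print(blocks_v1, blocks_v2)  # 16 8
```

With the larger default, the same file needs half as many blocks, so the NameNode has fewer block entries to track for it.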

Hadoop 1 Architecture

  • JobTracker: manages cluster resources and job scheduling.
  • TaskTracker: per-node agent that manages tasks.

[Figure: Hadoop 1 architecture]

Hadoop 1 Limitations

  1. Lack of support for alternate paradigms and services: everything is forced to look like MapReduce, and iterative applications written in MapReduce can be around 10x slower.
  2. Scalability: the maximum cluster size is about 5,000 nodes, with about 40,000 concurrent tasks.
  3. Availability: a JobTracker failure kills all queued and running jobs.
  4. Hard partition of resources into map and reduce slots: this leads to non-optimal resource utilization.
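
The slot limitation in the last item can be illustrated with a toy calculation. This is a simplified sketch, assuming every task fits in one equally sized slot or container; the function names and the numbers are hypothetical.

```python
def running_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Hadoop 1 style: map tasks may use only map slots, reduce tasks only reduce slots."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)

def running_with_containers(containers, pending_maps, pending_reduces):
    """Hadoop 2 style: any free container can host any kind of task."""
    return min(containers, pending_maps + pending_reduces)

# A node with capacity for 6 tasks, and a workload of 6 map tasks and no reduce tasks.
print(running_with_slots(4, 2, 6, 0))    # 4 (the 2 reduce slots sit idle)
print(running_with_containers(6, 6, 0))  # 6
```

The same hardware runs more of the pending work once capacity is no longer split into fixed map and reduce slots.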

Hadoop 2 - YARN Architecture

  1. ResourceManager (RM), the central agent: manages and allocates cluster resources.
  2. NodeManager (NM), the per-node agent: manages and enforces node resource allocations.
  3. ApplicationMaster (AM), one per application: manages the application lifecycle and task scheduling.

[Figure: Hadoop 2 YARN architecture]

Compute Layer Architectural Evolution

In Hadoop 1.X, every non-MapReduce application had to be modeled as a MapReduce job, because compute resources were available only to MapReduce programs. In Hadoop 2.X, the YARN component (also called MRv2) generalizes the compute layer so that it can run not just MapReduce-style jobs but also newer kinds of applications, such as stream processing, in a first-class manner. The new architecture is more decentralized and allows Hadoop clusters to scale to significantly more cores and servers.

In Hadoop 1.X, a single JobTracker manages both the compute resources and the jobs that use them. In Hadoop 2.X, YARN splits these functions in two. The first is the ResourceManager (RM), which focuses on managing the cluster's resources. The second is the ApplicationMaster (AM), one per running application, which manages that application (such as a MapReduce job). The AM requests resources from the RM based on the needs and characteristics of the application being run. YARN is designed to allow multiple, diverse user applications to run on a multitenant platform, and it supports multiple processing models in addition to MapReduce. YARN is also called the next-generation execution layer of Hadoop.
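
The split between a cluster-wide RM and per-application AMs can be sketched as a toy model. This is not the YARN API: the class names only mirror the roles described above, and the model tracks memory alone, ignoring CPU, queues, and scheduling policy.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceManager:
    """Toy stand-in for the RM: tracks free cluster memory only."""
    free_mb: int
    granted: dict = field(default_factory=dict)  # app_id -> list of container sizes

    def allocate(self, app_id: str, memory_mb: int) -> bool:
        """Grant one container if enough memory is free."""
        if memory_mb > self.free_mb:
            return False
        self.free_mb -= memory_mb
        self.granted.setdefault(app_id, []).append(memory_mb)
        return True

@dataclass
class ApplicationMaster:
    """Toy stand-in for a per-application AM."""
    app_id: str
    rm: ResourceManager

    def request_containers(self, count: int, memory_mb: int) -> int:
        """Ask the RM for `count` containers and return how many were granted."""
        return sum(self.rm.allocate(self.app_id, memory_mb) for _ in range(count))

rm = ResourceManager(free_mb=8192)
mr_job = ApplicationMaster("mapreduce-job", rm)  # hypothetical application names
stream_job = ApplicationMaster("stream-job", rm)

print(mr_job.request_containers(4, 1024))      # 4
print(stream_job.request_containers(3, 2048))  # 2 (the third request does not fit)
print(rm.free_mb)                              # 0
```

Two different kinds of application share one pool of resources, and each one negotiates for what it needs through its own AM.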

The Hadoop 1 architecture has only HDFS and MapReduce, while the Hadoop 2 architecture introduces another component, YARN. In Hadoop 1, MapReduce took care of both application management and resource management. In Hadoop 2, application management stays with MapReduce and resource management is handled by YARN.

In Hadoop 2, the NameNode and ResourceManager are the master daemons, while the DataNodes and NodeManagers are the slave daemons. A NodeManager runs alongside each DataNode.

In Hadoop 1.X, the cluster's storage resources are available only to HDFS. In Hadoop 2.X, the new storage architecture generalizes the block storage layer so that it can be used not only by HDFS but also by other storage services, much as YARN generalizes the compute layer. Hadoop 2.X also supports heterogeneous storage. Hadoop 1.X treated all storage devices on a DataNode, such as spinning disks, as a single uniform pool. Hadoop 2.X differentiates between storage types and makes the storage type information available to frameworks and applications, so that they can take advantage of each type's properties.

Hadoop 1.X has a single master server, the NameNode, where all the metadata is stored. When a software or hardware failure brings the NameNode down, the cluster is unavailable until the NameNode is restarted. Hadoop 2.X handles this situation with automatic failover, in which a standby NameNode becomes active. The ZKFC (ZooKeeper-based Failover Controller) manages NameNode failover. This daemon runs on each NameNode machine and maintains a session with ZooKeeper. With ZooKeeper's coordination, one of the ZKFCs is elected leader and makes its local NameNode active. Each ZKFC periodically checks the health of its local NameNode, and the ZKFC on the active node resigns as leader when its NameNode fails a health check. Similarly, when the active NameNode's machine fails, ZooKeeper detects the loss and removes that node's ZKFC as leader. The ZKFC running on the standby node then becomes the leader and makes its local standby NameNode active. This results in automatic failover.
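
The failover sequence described above can be sketched as a small simulation. This is a toy model under stated assumptions: `LockService` stands in for ZooKeeper, `FailoverController` for a ZKFC, and the `tick` method for one monitoring round. None of these names come from Hadoop, and the sketch leaves out sessions, timeouts, and fencing of the old active node.

```python
class LockService:
    """Toy stand-in for the ZooKeeper lock: at most one holder at a time."""
    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class FailoverController:
    """Toy stand-in for a ZKFC watching one local NameNode."""
    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.healthy = True   # result of the local NameNode health check
        self.state = "standby"

    def tick(self):
        """One monitoring round: check health, then try to hold the lock."""
        if not self.healthy:
            self.lock.release(self.name)  # resign as leader
            self.state = "failed"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"
        return self.state


lock = LockService()
nn1 = FailoverController("nn1", lock)
nn2 = FailoverController("nn2", lock)

print(nn1.tick(), nn2.tick())  # active standby
nn1.healthy = False            # the active NameNode fails its health check
print(nn1.tick(), nn2.tick())  # failed active
```

Once the unhealthy node gives up the lock, the standby's next round acquires it and promotes its local NameNode without operator intervention.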

Hadoop 1.X was developed to support only the UNIX family of operating systems. With Hadoop 2.X, the Windows operating system is natively supported, helped by the fact that Hadoop is written in Java. The compute and storage components that depended on UNIX have been generalized to support Windows. This extends Hadoop's reach to the Windows Server market.

Hadoop 2.X includes several improvements to the RPC layer shared by HDFS, YARN, and MapReduce v2. The on-the-wire protocol uses Protocol Buffers instead of Java serialization, which allows the protocol to be extended in the future without breaking wire compatibility. RPC also adds support for client-side retries of operations, a key capability for supporting highly available server implementations. These improvements make it possible to run different versions of daemons within the same cluster, paving the way for rolling upgrades.
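
Client-side retries are what let a caller ride through a server failover. The following is a generic sketch of the idea, not Hadoop's RPC client: `call_with_retries`, the server names, and `get_status` are all hypothetical, and a real client would add backoff and would only retry operations that are safe to repeat.

```python
def call_with_retries(servers, operation, max_attempts=3):
    """Try `operation` against the servers in turn until one call succeeds."""
    last_error = None
    for attempt in range(max_attempts):
        server = servers[attempt % len(servers)]  # rotate to the next server
        try:
            return operation(server)
        except ConnectionError as err:
            last_error = err
    raise ConnectionError(f"all {max_attempts} attempts failed") from last_error


def get_status(server):
    """Hypothetical operation: the first server is down, the second answers."""
    if server == "nn1":
        raise ConnectionError("nn1 is down")
    return f"ok from {server}"


print(call_with_retries(["nn1", "nn2"], get_status))  # ok from nn2
```

The caller sees a single successful call even though the first server it tried was unavailable.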

Hadoop 1.X has several drawbacks: no low-latency access, no in-place updates, single points of failure (NameNode, JobTracker), poor handling of large numbers of small files, OS dependence (Linux), and a JobTracker that handles both resource allocation and job scheduling. Some of these are solved in Hadoop 2.X. Two of the most important advances in Hadoop 2.X are the introduction of HDFS federation and the resource manager YARN (Yet Another Resource Negotiator). HDFS federation adds important measures of scalability and reliability over Hadoop 1.X, and YARN brings significant performance improvements.

HDFS Federation is a concept introduced in Hadoop 2 that separates the namespace layer from the block storage layer. The NameNode holds the metadata, which is organized into the following two layers:

  • Namespace layer
  • Block storage layer

The namespace layer is responsible for the following:

  • Information about blocks and folders
  • Information about directories and files
  • File-level operations: create, modify, delete
  • Directory and file listing

The block storage layer is divided into two parts:

  • Block management
  • Physical storage

Block management is responsible for:

  • Block information
  • Replication
  • Replica placement

Physical storage is responsible for:

  • Storing the blocks
  • Providing read and write access
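
The division of labor between the two layers can be sketched with two small classes. This is a toy model, not HDFS code: the class names, the round-robin placement, and the file paths are assumptions for illustration, and real replica placement is rack-aware. The replication factor of 3 matches the usual HDFS default.

```python
class BlockManager:
    """Toy block management: tracks which DataNodes hold each block."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.locations = {}  # block_id -> list of DataNode names
        self._next_id = 0

    def allocate_block(self):
        block_id = self._next_id
        self._next_id += 1
        n = len(self.datanodes)
        copies = min(self.replication, n)
        # Naive round-robin replica placement, for illustration only.
        self.locations[block_id] = [self.datanodes[(block_id + i) % n] for i in range(copies)]
        return block_id


class Namespace:
    """Toy namespace layer: maps file paths to ordered block IDs."""
    def __init__(self, block_manager):
        self.block_manager = block_manager
        self.files = {}  # path -> list of block IDs

    def create(self, path, num_blocks):
        self.files[path] = [self.block_manager.allocate_block() for _ in range(num_blocks)]

    def delete(self, path):
        for block_id in self.files.pop(path):
            del self.block_manager.locations[block_id]

    def listing(self, prefix):
        return sorted(p for p in self.files if p.startswith(prefix))


datanodes = ["dn1", "dn2", "dn3", "dn4"]
# Two independent namespaces, each managing its own blocks on the same DataNodes.
logs = Namespace(BlockManager(datanodes))
users = Namespace(BlockManager(datanodes))

logs.create("/logs/app.log", 2)
users.create("/users/alice/data.csv", 1)
print(logs.listing("/logs"))              # ['/logs/app.log']
print(logs.block_manager.locations[0])    # ['dn1', 'dn2', 'dn3']
```

File-level operations touch only the namespace object, while replica locations live entirely in the block manager, so several namespaces can share one set of DataNodes.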