Difference between Hadoop 2.x and Hadoop 3.x

Hadoop 2 and Hadoop 3 are major versions of the Hadoop data processing framework, developed in Java and released in 2013 and 2017 respectively. Hadoop was created primarily for disk-based data analysis, known as batch processing. Consequently, native Hadoop does not support real-time analytics or interactive queries.

For data balancing, Hadoop 2 uses the HDFS balancer, whereas Hadoop 3 adds an intra-datanode balancer, which is invoked via the HDFS disk balancer CLI. Fault tolerance in Hadoop 2 is handled by replication, which wastes disk space, whereas in Hadoop 3 it can be handled by erasure coding.
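As a rough sketch of the new workflow (the datanode hostname below is a placeholder), the disk balancer is driven in three steps: generate a plan for a node, execute it, then query its progress:

```shell
# Generate a balancing plan for one datanode (hostname is hypothetical)
hdfs diskbalancer -plan datanode1.example.com

# Execute the plan; the exact plan file path is printed by the -plan step
hdfs diskbalancer -execute /system/diskbalancer/<date>/datanode1.example.com.plan.json

# Check the progress of the running plan
hdfs diskbalancer -query datanode1.example.com
```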

Spark 2.x is a processing and analytics engine developed in Scala and released in 2016. Real-time analysis of data was becoming crucial, as many giant internet services relied heavily on the ability to process data immediately. Consequently, Apache Spark was built for live data processing, and it is now popular because it can efficiently handle live streams of information and process data in an interactive mode.

Both Hadoop and Spark are open source and licensed under Apache 2.0. The release of Hadoop 3 in December 2017 marked the beginning of a new era for data science. The Hadoop framework is at the core of the entire Hadoop ecosystem, and various other libraries depend strongly on it. Hadoop works from disk, so it does not need much RAM to operate, which can make it cheaper to run than a system that requires large amounts of RAM. Hadoop 3 requires less disk space than Hadoop 2 because of the change in how fault tolerance is provided.

Spark needs a lot of RAM to operate in in-memory mode, so the total cost can be higher than for Hadoop. Generally, Hadoop is slower than Spark, as it works from disk and cannot cache data in memory. Hadoop 3 can work up to 30% faster than Hadoop 2 thanks to the addition of a native implementation of the map output collector to MapReduce, which mainly benefits shuffle-intensive jobs. Spark can process data in memory up to 100 times faster than Hadoop; when working from disk, Spark is up to 10 times faster.
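A hedged sketch of how a job might opt in to this optimization; the jar name, class name, and paths are hypothetical, the -D override assumes the job parses generic options (e.g. via ToolRunner), and the native task library must be available on the task nodes:

```shell
# Run a hypothetical MapReduce job with the native map output collector
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator \
  /input /output
```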

Fault tolerance in Hadoop 2 is provided by the replication technique: each block of data is copied to create two additional replicas. This means that instead of storing one copy of the data, Hadoop 2 stores three, which raises the problem of wasted disk space.
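For example, the replication factor of an existing file can be inspected and changed from the HDFS shell (the file path here is hypothetical):

```shell
# Print the replication factor of a file (%r in the stat format string)
hdfs dfs -stat "replication=%r" /data/example.csv

# Raise or lower the replication factor; -w waits for re-replication to finish
hdfs dfs -setrep -w 3 /data/example.csv
```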

In Hadoop 3, fault tolerance is provided by erasure coding. This method allows a lost block to be recovered from the remaining data blocks and a parity block. Hadoop 3 creates one parity block for every two blocks of data, which requires only 1.5 times more disk space, compared with 3 times more with replication in Hadoop 2. The level of fault tolerance in Hadoop 3 remains the same, but less disk space is required for its operations.

Spark can recover lost data by recomputing the DAG (Directed Acyclic Graph). A DAG is formed by vertices and edges: vertices represent RDDs, and edges represent the operations applied to them. If some part of the data is lost, Spark can restore it by replaying the sequence of operations on the RDDs. Note that each time an RDD has to be recomputed, you must wait until Spark performs all the necessary calculations. Spark also creates checkpoints to protect against failures.
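On the Hadoop 3 side, erasure coding is managed with the hdfs ec subcommand; a minimal sketch, assuming a hypothetical /data/cold directory:

```shell
# List the erasure coding policies known to the cluster
hdfs ec -listPolicies

# Apply a Reed-Solomon policy (6 data blocks + 3 parity blocks, i.e. 50% overhead)
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

# Verify which policy is now in effect
hdfs ec -getPolicy -path /data/cold
```

On the Spark side, a minimal Scala sketch (run in spark-shell, where sc is predefined; the checkpoint directory is a placeholder) shows how the lineage builds up and how a checkpoint cuts it short:

```scala
// The checkpoint directory is a hypothetical path on reliable storage
sc.setCheckpointDir("/tmp/spark-checkpoints")

// Each transformation adds a vertex and an edge to the DAG; nothing runs yet
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(n => n.toLong * n)

// Ask Spark to persist this RDD so a lost partition can be reloaded
// from the checkpoint instead of recomputed through the whole lineage
squares.checkpoint()

// The first action triggers the computation and materialises the checkpoint
println(squares.sum())
```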

Hadoop 2 uses YARN version 1. YARN (Yet Another Resource Negotiator) is the resource manager: it manages the available resources (CPU, memory, disk) and performs job scheduling. In Hadoop 3, YARN was updated to version 2, which brings several significant changes that improve usability and scalability. YARN 2 supports flows (logical groups of YARN applications) and provides metric aggregation at the flow level. The separation between the collection processes (writing data) and the serving processes (reading data) improves scalability, and YARN 2 uses Apache HBase as the primary backing storage. Spark can operate standalone, on a cluster with YARN, or with Mesos. Hadoop 2 supports a single active NameNode and a single standby NameNode for the entire namespace, while Hadoop 3 can work with multiple standby NameNodes.
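For instance, running Spark on YARN comes down to the --master flag at submission time; a hedged sketch where the application jar, class, and resource sizes are hypothetical:

```shell
# Submit a hypothetical Spark application to a YARN cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  my-app.jar
```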

The main Hadoop 2 file system is HDFS, the Hadoop Distributed File System. The framework is also compatible with several other file systems, blob stores such as Amazon S3 and Azure Storage, and alternative distributed file systems. Hadoop 3 supports all the file systems that Hadoop 2 does and, in addition, is compatible with Microsoft Azure Data Lake and the Aliyun Object Storage System. Spark supports local file systems, Amazon S3, and HDFS.
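In practice, the file system is selected by the URI scheme; a small sketch with hypothetical bucket, account, and container names:

```shell
# Native HDFS
hadoop fs -ls hdfs:///user/alice

# Amazon S3 via the s3a connector (bucket name is a placeholder)
hadoop fs -ls s3a://my-bucket/logs/

# Windows Azure Storage Blobs (account and container are placeholders)
hadoop fs -ls wasb://container@myaccount.blob.core.windows.net/data/
```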

For your convenience, the table below summarises the information above and gives a brief comparison of the key parameters of Hadoop 2.x and Hadoop 3.x.

| Parameter | Hadoop 2.x | Hadoop 3.x |
| --- | --- | --- |
| License | Apache 2.0, open source | Apache 2.0, open source |
| Minimum supported Java version | Java 7 | Java 8 |
| Fault tolerance | Handled by replication, which wastes space | Handled by erasure coding |
| Data balancing | Uses the HDFS balancer | Uses the intra-datanode balancer, invoked via the HDFS disk balancer CLI |
| Storage overhead | 200%: 6 blocks of data occupy 18 blocks of disk space because of the replication scheme | 50%: 6 blocks of data occupy 9 blocks of disk space (6 data + 3 parity) |
| Default ports range | Some default ports fall in the Linux ephemeral port range, so they may fail to bind at startup | These ports have been moved out of the ephemeral range |
| Compatible file systems | HDFS (default), the FTP file system (stores all its data on remotely accessible FTP servers), the Amazon S3 (Simple Storage Service) file system, and Windows Azure Storage Blobs (WASB) | All of the previous ones plus the Microsoft Azure Data Lake filesystem |
| Scalability | Up to 10,000 nodes per cluster | Better scalability: more than 10,000 nodes per cluster |