Limitations of HDFS

Issues with Small Files

Hadoop is not well suited to small data. HDFS cannot efficiently support random reads of small files because of its high-capacity design. A small file is one that is significantly smaller than the HDFS block size (128 MB by default). HDFS cannot handle a huge number of such files, because it was designed to work with a small number of large files storing large data sets rather than a large number of small files. When there are many small files, the NameNode, which holds the HDFS namespace in memory, becomes overloaded.

Solution: 

Hadoop Archives (HAR files) are one solution to the small files problem. A Hadoop archive acts as another layer of file system on top of HDFS. We build HAR files with the hadoop archive command, which runs a MapReduce job in the background to pack the archived files into a small number of HDFS files. However, reading through a HAR file is not much more efficient than reading through HDFS, because each access must go through two index files before reaching the data file. Sequence files are another solution to the small files problem. Here we write a program that merges many small files into one sequence file, which we then process in a streaming fashion. Because sequence files are splittable, MapReduce can break them into chunks and process the chunks in parallel.
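
As a rough sketch of the sequence-file approach, the Scala snippet below uses the Hadoop client API to pack a local directory of small files into a single SequenceFile keyed by file name. The paths and the Text/BytesWritable key-value choice are assumptions made for this example, not part of the original text.

import java.nio.file.{Files, Paths}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

object SmallFileMerger {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val inputDir  = Paths.get("/tmp/small-files")             // hypothetical local folder of small files
    val outputSeq = new Path("hdfs:///user/data/merged.seq")  // hypothetical target SequenceFile

    // One writer produces a single large, splittable file on HDFS.
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(outputSeq),
      SequenceFile.Writer.keyClass(classOf[Text]),            // key: original file name
      SequenceFile.Writer.valueClass(classOf[BytesWritable])  // value: raw file contents
    )

    try {
      Files.list(inputDir)
        .filter(p => Files.isRegularFile(p))
        .forEach { p =>
          writer.append(new Text(p.getFileName.toString), new BytesWritable(Files.readAllBytes(p)))
        }
    } finally {
      writer.close()
    }
  }
}

A downstream MapReduce or Spark job can then read the keys and values back from merged.seq instead of opening thousands of tiny files.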

Slow Processing Speed

In Hadoop, MapReduce processes huge amounts of data by breaking the work into two phases, Map and Reduce, and each phase writes its results to disk. MapReduce therefore requires a lot of time to perform these tasks, which increases latency and reduces processing speed.

Solution:

Spark addresses the slow processing speed of MapReduce. It performs its computations in memory, which can make it up to a hundred times faster than Hadoop for some workloads: while processing, Spark reads data from RAM and writes data back to RAM, making it a fast processing engine. Flink is another technology that is faster than Hadoop MapReduce because it also computes in memory. Flink can be even faster than Spark, since it has a stream processing engine at its core, whereas Spark is built around a batch processing engine.
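
For illustration, here is a minimal Spark sketch in Scala of this in-memory style of processing; the events.csv file and its type column are placeholders invented for the example.

import org.apache.spark.sql.SparkSession

object InMemoryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InMemoryExample").master("local[*]").getOrCreate()

    // Hypothetical input file; any columnar data works the same way.
    val events = spark.read.option("header", "true").csv("events.csv")
    events.cache() // keep the dataset in executor memory after the first action

    // Both actions below reuse the cached, in-memory copy instead of
    // re-reading the file from disk, unlike a chain of MapReduce jobs.
    println(s"rows: ${events.count()}")
    events.groupBy("type").count().show() // assumes a "type" column exists

    spark.stop()
  }
}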

Support for Batch Processing Only

Hadoop supports only batch processing; it is not suitable for streaming data, so overall performance is slower. The MapReduce framework also does not make maximum use of the memory of the Hadoop cluster.

Solution:

Apache Spark solves this problem because it supports stream processing. However, Spark's stream processing is not as efficient as Flink's, since it is based on micro-batch processing. Apache Flink improves overall performance by providing a single runtime for both streaming and batch processing.
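
As a sketch of what stream processing on Flink looks like, here is a minimal Scala word count over a socket source; the host, port, and job name are placeholders, and the same engine also runs batch jobs.

import org.apache.flink.streaming.api.scala._

object FlinkStreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical source: lines of text arriving on a local socket.
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map(word => (word, 1))
      .keyBy(_._1)   // group by word
      .sum(1)        // running count per word

    counts.print()
    env.execute("Streaming WordCount")
  }
}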

No Real-time Processing

Hadoop, with its core MapReduce framework, cannot process real-time data; it processes data in batches. First the user loads a file into HDFS, then runs a MapReduce job with that file as input. This follows the ETL cycle of processing: the user extracts the data from the source, the data is transformed to meet the business requirements, and finally it is loaded into the data warehouse. Users then generate insights from this data, and companies use those insights to improve their business.

Solution:

Spark has emerged as a solution to this problem. Spark supports near-real-time processing: it groups the incoming stream of data into micro-batches and then applies computations to each micro-batch.
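
A minimal Scala sketch of this micro-batch style, using Spark Structured Streaming over a socket source, is shown below; the host and port are placeholders for the example.

import org.apache.spark.sql.SparkSession

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MicroBatchWordCount").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical source: lines of text arriving on a local socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Each micro-batch of lines is split into words and the running counts are updated.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}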

Flink is another solution for real-time processing. It can be even faster than Spark because it has a stream processing engine at its core. Flink is a true streaming engine with adjustable latency and throughput, and it offers a rich set of APIs that exploit its streaming runtime.

Iterative Processing

Core Hadoop does not support iterative processing. Iterative processing requires a cyclic data flow, in which the output of one stage serves as the input to the next. Hadoop MapReduce is built for batch processing and works on the write-once-read-many principle: data is written to disk once and then read multiple times to derive insights. Because MapReduce has a batch processing engine at its core, it cannot iterate over data efficiently.

Solution:

Spark supports iterative processing, although each iteration still has to be scheduled and executed separately. It accomplishes iterative processing through a DAG (Directed Acyclic Graph) of operations. Spark provides RDDs (Resilient Distributed Datasets), collections of elements partitioned across the nodes of the cluster. Spark can create RDDs from HDFS files, and we can cache them so they can be reused. Iterative algorithms apply operations repeatedly over the same data, so they benefit greatly from caching RDDs across iterations.

Flink also supports iterative processing. Flink iterates over data using its streaming architecture, and we can instruct it to process only the data that has changed, which improves performance. Flink implements iterative algorithms by defining a step function and embedding it in a special iteration operator. The two variants of this operator are iterate and delta iterate; both apply the step function over and over again until a termination condition is met.
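
As a toy illustration of caching across iterations in Spark (Scala), the sketch below repeatedly refines a threshold over the same cached data set; the data and the algorithm are invented purely for the example.

import org.apache.spark.sql.SparkSession

object IterativeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IterativeExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Synthetic data set, cached so every iteration reads it from memory
    // instead of recomputing it or re-reading it from disk.
    val values = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    // Toy iterative algorithm: repeatedly raise the threshold to the mean
    // of the values lying above the previous threshold.
    var threshold = 0.0
    for (_ <- 1 to 10) {
      val t = threshold
      threshold = values.filter(_ > t).mean()
    }
    println(s"final threshold: $threshold")

    spark.stop()
  }
}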

Latency

MapReduce in Hadoop is slower because it has to handle data of different formats and structures and in huge volumes. In MapReduce, Map takes a set of data and converts it into another set of data in which individual elements are broken down into key-value pairs. Reduce takes the output of the Map phase as input and processes it further. MapReduce requires a lot of time to perform these tasks, thereby increasing latency.

Solution:

Apache Spark mitigates this issue. Although Spark is also a batch system, it is much faster because it caches a large part of the input data in memory as RDDs. Apache Flink's data streaming achieves low latency and high throughput.
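
To make the Map and Reduce roles concrete, here is a minimal Spark RDD sketch in Scala of the same key-value pattern, with the result cached in memory; the input path is a placeholder.

import org.apache.spark.sql.SparkSession

object KeyValueWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KeyValueWordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("hdfs:///user/data/input.txt") // placeholder path

    val counts = lines
      .flatMap(_.split("\\s+"))   // "map" side: break each line into words
      .map(word => (word, 1))     // emit (key, value) pairs
      .reduceByKey(_ + _)         // "reduce" side: aggregate the values per key
      .cache()                    // keep the result in memory for repeated use

    counts.take(10).foreach(println)
    spark.stop()
  }
}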

No Ease of Use

In Hadoop, we have to hand-code each and every operation. This has two drawbacks: first, it is difficult to use, and second, it increases the number of lines of code. Hadoop MapReduce offers no interactive mode, which also makes it difficult to debug, since everything runs in batch mode. In this mode we have to specify the jar file, the input, and the location of the output file, and if the program fails partway through, it is difficult to find the culprit code.

Solution:

Spark is easier for the user than Hadoop because it provides rich APIs in Java, Scala, Python, and Spark SQL, as well as an interactive shell. Spark performs batch processing, stream processing, and machine learning on the same cluster, which makes life easy for users: they can use the same infrastructure for various workloads. Flink offers a number of high-level operators, which reduces the lines of code needed to achieve the same result.
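
As an example of how the high-level APIs shorten code, the Scala sketch below answers a query with Spark SQL that would otherwise need a hand-written MapReduce job; the people.json file and its name and age fields are assumptions for the example.

import org.apache.spark.sql.SparkSession

object HighLevelApiExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HighLevelApiExample").master("local[*]").getOrCreate()

    // Hypothetical JSON file with "name" and "age" fields.
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // One line of SQL replaces a hand-coded job.
    spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age DESC").show()

    spark.stop()
  }
}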

Security Issue

Hadoop does not implement encryption and decryption at the storage and network levels by default, so it is not very secure. For security, Hadoop relies on Kerberos authentication, which is difficult to maintain.

Solution:

Spark can encrypt temporary data written to local disk. It does not support encryption of the output data that applications generate through APIs such as saveAsHadoopFile or saveAsTable. Spark implements AES-based encryption for RPC connections; to enable this encryption, RPC authentication must also be enabled and properly configured.
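
A minimal sketch of these Spark settings might look like the following; the hard-coded shared secret is a placeholder suitable only for local testing, and a real deployment would manage the secret externally.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SecureSessionExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SecureSessionExample")
      .set("spark.authenticate", "true")               // RPC authentication, a prerequisite for encryption
      .set("spark.authenticate.secret", "change-me")   // shared secret (testing only)
      .set("spark.network.crypto.enabled", "true")     // AES-based encryption for RPC connections
      .set("spark.io.encryption.enabled", "true")      // encrypt temporary data spilled to local disk

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... run jobs as usual ...
    spark.stop()
  }
}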

No Caching

Apache Hadoop is not efficient at caching. MapReduce cannot cache intermediate data in memory for later use, which diminishes Hadoop's performance.

Solution:

Spark and Flink overcome this issue: they cache data in memory for further iterations, which enhances overall performance.
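
For completeness, here is a small Scala sketch of explicit caching in Spark with a chosen storage level; the data and the storage level are arbitrary choices made for the example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PersistExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Intermediate result kept in memory (spilling to disk if it does not fit),
    // so later stages reuse it instead of recomputing it, unlike MapReduce.
    val squares = sc.parallelize(1 to 100000)
      .map(n => n.toDouble * n)
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(squares.sum())   // first action materialises and caches the data
    println(squares.max())   // second action reads the cached copy

    squares.unpersist()      // release the cached blocks when done
    spark.stop()
  }
}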

Lengthy Code

Apache Hadoop has roughly 120,000 lines of code. More lines of code mean more bugs, and it takes more time to execute the programs.

Solution:

Spark and Flink are written in Scala and Java, but the core implementation is in Scala, so the number of lines of code is lower than in Hadoop. Thus, it takes less time to execute the programs.