Hadoop Interview Questions


Write a Word Count Program in Hadoop using MapReduce?

Here we have three classes: one for the Mapper, one for the Reducer, and one main driver class for the word count job.

Map.java

package Hadoop.Practice;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
	// Reusable Writable objects so a new instance is not created for every record
	private final static IntWritable one = new IntWritable(1);
	private final Text word = new Text();

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
	{
		// Split the input line into tokens and emit (word, 1) for each token
		String line = value.toString();
		StringTokenizer tokens = new StringTokenizer(line);

		while (tokens.hasMoreTokens())
		{
			word.set(tokens.nextToken());
			context.write(word, one);
		}
	}
}

Reduce.java

package Hadoop.Practice;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
	{
		// Sum all the counts emitted by the mappers for this word
		int sum = 0;

		for (IntWritable value : values)
		{
			sum += value.get();
		}

		// Emit the word together with its total count
		context.write(key, new IntWritable(sum));
	}
}

WordCount.java

package Hadoop.Practice;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount
{
	public static void main(String[] args)
	{
		try
		{
			// Configure the job: driver class, mapper, reducer, formats and output types
			Job job = Job.getInstance();
			job.setJobName("Word Count Job");
			job.setJarByClass(WordCount.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			job.setMapperClass(Map.class);
			job.setReducerClass(Reduce.class);
			job.setInputFormatClass(TextInputFormat.class);
			job.setOutputFormatClass(TextOutputFormat.class);

			// args[0] is the HDFS input path; args[1] is the output path, which must not already exist
			FileInputFormat.addInputPath(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));

			// Submit the job, wait for completion and exit with a non-zero status if it fails
			System.exit(job.waitForCompletion(true) ? 0 : 1);
		}
		catch (Exception e)
		{
			e.printStackTrace();
		}
	}
}
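
Assuming the three classes are compiled and packaged into a jar (the jar name and HDFS paths below are only placeholders), the job can be submitted from the command line as follows:

$ hadoop jar wordcount.jar Hadoop.Practice.WordCount /user/hduser/input /user/hduser/output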

What are real-time industry applications of Hadoop?

Hadoop is used across almost every industry and sector today. Some of the instances where Hadoop is used:

  • Managing traffic on streets.
  • Stream processing.
  • Content management and archiving of emails.
  • Processing rat brain neuronal signals using a Hadoop computing cluster.
  • Fraud detection and prevention.
  • Advertisement targeting platforms use Hadoop to capture and analyze clickstream, transaction, video, and social media data.
  • Managing content, posts, images, and videos on social media platforms.
  • Analyzing customer data in real time to improve business performance.
  • Public sector fields such as intelligence, defense, cyber security, and scientific research.
  • Financial agencies use Big Data on Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, target marketing campaigns more precisely based on customer segmentation, and improve customer satisfaction.
  • Getting access to unstructured data such as output from medical devices, doctors' notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

How would you compress data while ingesting it into HDFS?

Data can be compressed while it is being ingested by using the compression options available in MapReduce, Pig, Sqoop, and similar tools. Examples of compression formats are Snappy, Deflate, and gzip (.gz). Depending on the data and the codec, these algorithms can reduce files to roughly one-fifth of their original size, which saves a considerable amount of storage space.
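
For example, when data is written to HDFS by a MapReduce job, output compression can be enabled in the driver. The snippet below is a minimal sketch (the class and method names are illustrative, and it assumes Snappy native libraries are installed on the cluster):

package Hadoop.Practice;

import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig
{
	// Turns on compression for a MapReduce job that writes its output to HDFS
	public static void enableCompression(Job job)
	{
		// Compress the final job output with the Snappy codec
		FileOutputFormat.setCompressOutput(job, true);
		FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

		// Optionally compress intermediate map output as well to reduce shuffle traffic
		job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
	}
}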

What is the DistCp command?

In Hadoop, DistCp (distributed copy) is a tool for copying data within a Hadoop cluster (intra-cluster) or between clusters (inter-cluster). DistCp is implemented as a MapReduce job in which the data is copied by map-only tasks that run in parallel across the cluster.

The following command is used to copy data from dir1 to dir2.

$ hadoop distcp dir1 dir2

A very common use case for Distcp is in transferring data efficiently from one Hadoop cluster to another.

$ hadoop distcp hdfs://namenode1/dir1 hdfs://namenode2/dir2

Where

  • namenode1 is the hostname or IP address of the NameNode of the source cluster.
  • namenode2 is the hostname or IP address of the NameNode of the target cluster.

What is HA in YARN ResourceManager?

The YARN ResourceManager is responsible for managing the resources in a cluster and scheduling applications. Prior to Hadoop 2.4, the ResourceManager was a single point of failure in a YARN cluster.

YARN provides High Availability (HA) by running an active-standby pair of ResourceManagers to remove this single point of failure. When the active ResourceManager fails, control switches to the standby ResourceManager, and all halted applications resume from the last state saved in the state store. This allows failover to be handled without significant performance degradation in the following situations:

  • Unplanned events such as machine crashes
  • Planned maintenance events such as software or hardware upgrades on the machine running the ResourceManager

Note that ResourceManager HA requires the ZooKeeper and HDFS services to be running.
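
As a quick check of which ResourceManager is currently active, the yarn rmadmin command can be used (rm1 below is a placeholder for whichever ResourceManager ID is configured in yarn-site.xml):

$ yarn rmadmin -getServiceState rm1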

What is Hadoop Streaming?

Hadoop Streaming is an API that allows Mappers and Reducers to be written in any language. It uses Unix standard streams as the interface between Hadoop and the user application.

Streaming is naturally suited to text processing. The data is viewed line by line, and each line is treated as a key-value pair separated by a tab character. The Reduce function reads lines, sorted by key, from standard input and writes its results to standard output.
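
For illustration only (the location of the streaming jar varies with the Hadoop version and installation, and the HDFS paths below are placeholders), a streaming job that uses standard Unix tools as the Mapper and Reducer could be launched like this:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hduser/input \
    -output /user/hduser/streaming-output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc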

What is HA in a NameNode?

The Standby NameNode introduced in Hadoop 2.x ensures High Availability (HA) in Hadoop clusters, which was not available in Hadoop 1.x. In a Hadoop 1.x cluster (one NameNode, multiple DataNodes), the NameNode was a single point of failure: if the NameNode went down, the entire cluster became unavailable because there was no backup.

Hadoop 2.x solves this problem by adding a Standby NameNode to the cluster, so that a pair of NameNodes runs in an active-standby configuration. The Standby NameNode acts as a backup for the NameNode metadata: it receives block reports from the DataNodes and maintains a synchronized copy of the edit logs of the active NameNode. If the active NameNode goes down, the Standby NameNode takes over and keeps the cluster available.
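
In an HA cluster, the hdfs haadmin utility can be used to inspect or switch the active NameNode (nn1 and nn2 below are placeholders for the NameNode IDs configured in hdfs-site.xml; manual failover is normally used only when automatic failover is not enabled):

$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -failover nn1 nn2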

What is the difference between Data Block and Input Split?

Data Block: HDFS stores a large file by first splitting it into smaller chunks known as blocks (128 MB each by default in Hadoop 2.x). Each file is therefore stored as a set of data blocks, and these blocks are replicated and distributed across multiple DataNodes.

Input Split: An input split is the logical chunk of data that is processed by an individual Mapper at a time. In MapReduce, one Map task is launched per input split, so the split size is what controls the number of Map tasks, as the sketch below shows.
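
As a rough sketch of how this is configured in practice (the class and method names below are illustrative), the maximum split size can be capped on a job, which indirectly controls how many Map tasks are launched:

package Hadoop.Practice;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitConfig
{
	// Caps the input split size; smaller splits mean more Map tasks for the same input
	public static void configureSplits(Job job)
	{
		// No split (and therefore no single Map task) will cover more than 64 MB of input
		FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
	}
}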

Explain the small files problem in Hadoop.

HDFS is designed for storing and processing big data, so it cannot efficiently store or process a very large number of small files. Every file consumes an entry in the NameNode's memory regardless of its size, so numerous small files generate a lot of overhead on the NameNode as well as on the DataNodes. Reading many small files also causes a lot of seeks and hops from one DataNode to another to retrieve each file. All of this adds up to inefficient data read/write operations.