Apache Hadoop: Key Features and Its Advantages

The power of Hadoop lies in parallel access to data that can reside on a single node or on thousands of nodes. When data is loaded into the system it is split into chunks (blocks), typically of 64 MB or 128 MB. MapReduce is the processing framework that runs on each of the nodes in the cluster: a master daemon (the JobTracker, described below) assigns work to the nodes so that each MapReduce task runs locally on the node that holds its share of the data. All of these tasks run in parallel, each on its own part of the complete data set stored in the file system.
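
The classic illustration of this model is word count, sketched below using the Hadoop MapReduce Java API. It is a minimal sketch: the input and output paths come from the command line, the class names are only illustrative, the mapper runs locally against each block of the input, and the reducer totals the counts emitted by all the mappers.

    // Minimal word-count sketch using the Hadoop MapReduce Java API.
    // Input/output paths are passed on the command line; names are illustrative.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs locally on each node's block of the input file.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1) for every token
          }
        }
      }

      // Reduce phase: sums the counts for each word across all mappers.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }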

Key Features

1.) Hadoop is freely available: anyone can go to the Apache Hadoop website, download Hadoop, install it, and start working with it.

2.) Its source code is available, so you can modify and change it as per your requirements.

3.) It can handle Volume, Variety, Velocity, and Value. Hadoop is an approach to handling Big Data, and it does so with the help of an ecosystem of tools.

4.) Hadoop is not just storage and processing; Hadoop is an ecosystem, and that is its main feature. It can acquire data from an RDBMS, arrange it on the cluster with the help of HDFS, then clean the data and make it eligible for analysis using massively parallel processing (MPP) on a shared-nothing architecture, and finally analyze and visualize the data (a sketch of the ingestion step appears after this list). This end-to-end pipeline is why Hadoop is best described as an ecosystem.

5.) Hadoop has a shared-nothing architecture, which means a Hadoop cluster is made up of independent machines (a cluster of nodes), and every node performs its job using its own resources.

6.) Data is distributed across multiple machines in the cluster, and it can be striped and mirrored automatically without any third-party tools; this striping and mirroring capability is built in, which is how Hadoop handles volume. In other words, a bunch of machines are connected together, the data is spread among them, and it is striped and mirrored across those machines.

7.) Hadoop can run on commodity hardware, which means it does not require very high-end servers with large memory and processing power. Hadoop runs on JBOD (just a bunch of disks), so every node in Hadoop is independent.

8.) We do not need to build a large cluster up front; we just keep adding nodes as the data keeps growing.

9.) With the help of distributors we get bundles with built-in packages, so we do not need to install each package individually; we just get the bundle and install what we need.

10.) Cloudera is a US-based company started by early Hadoop engineers from companies such as Facebook, Google, Yahoo, and Oracle. It provides Hadoop and enterprise solutions. Cloudera's product is known as CDH (Cloudera Distribution for Hadoop), a robust package that we can download from Cloudera, install, and work with. Cloudera has also designed a graphical tool called Cloudera Manager, which makes administration easy through a graphical interface.

11.) Hortonworks' product is called HDP (Hortonworks Data Platform); it is not an enterprise-only offering, it is open source and license free. It comes with a tool called Apache Ambari, which is used to build and manage Hortonworks clusters.
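
As a rough illustration of the ingestion step mentioned in point 4 above, the sketch below pulls rows from a relational database over JDBC and lands them in HDFS as a CSV file. In practice this step is usually handled by a dedicated tool such as Apache Sqoop; the JDBC URL, credentials, table, and HDFS path here are placeholders, not values from this article.

    // Hedged sketch of "acquire from an RDBMS, arrange on HDFS".
    // Requires the Hadoop client and a JDBC driver (e.g. MySQL) on the classpath.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RdbmsToHdfs {
      public static void main(String[] args) throws Exception {
        // 1. Acquire rows from the source RDBMS over JDBC (placeholder credentials).
        Connection db = DriverManager.getConnection(
            "jdbc:mysql://dbhost:3306/sales", "user", "password");
        Statement stmt = db.createStatement();
        ResultSet rows = stmt.executeQuery("SELECT id, amount FROM orders");

        // 2. Arrange the data on the cluster by writing it into HDFS.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/staging/orders.csv"), true)) {
          while (rows.next()) {
            out.writeBytes(rows.getLong("id") + "," + rows.getDouble("amount") + "\n");
          }
        }

        rows.close();
        stmt.close();
        db.close();
        fs.close();
        // Downstream, MapReduce (or another engine in the ecosystem) can now
        // clean, analyze, and prepare this data for visualization.
      }
    }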

Components of Hadoop

Hadoop mainly consists of two core components: the Hadoop Distributed File System (HDFS) and the MapReduce processing framework. A set of machines running these two core components is known as a Hadoop cluster, and the more of these machines there are, the better the performance. HDFS is responsible for storing the data on the cluster: data files are split into blocks and each block is stored multiple times across the nodes, with a default replication factor of three, which ensures reliability and availability. MapReduce is the system used to process the data in the Hadoop cluster, and this processing consists of two phases: a map phase and a reduce phase. Hadoop comprises five specific daemons: the NameNode, which holds the metadata of HDFS; the Secondary NameNode, which performs housekeeping functions for the NameNode; the DataNode, which stores the actual HDFS data blocks; the JobTracker, which assigns MapReduce jobs to the DataNodes where the data will be processed; and the TaskTracker, which runs on each DataNode and is responsible for initiating and monitoring individual Map and Reduce tasks.
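
To make the storage side concrete, the sketch below uses the HDFS Java client to write a small file and then ask the NameNode which DataNodes hold its blocks and how many replicas exist. The NameNode address hdfs://namenode:8020 and the path /demo/sample.txt are placeholders for your own cluster.

    // Sketch: write a file to HDFS and inspect its block locations and replicas.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlocksDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/sample.txt");            // placeholder path

        // Write a small file; HDFS splits larger files into blocks automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeUTF("hello hadoop");
        }

        // Ask the NameNode where the blocks of this file live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("Replication factor: " + status.getReplication());
        for (BlockLocation block : blocks) {
          System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
      }
    }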

Advantages of using Hadoop

The main advantages of Hadoop are automation and parallelization. It has strong fault tolerance: a failure may slow a job down, but it does not cause a complete system failure. Even if some of the DataNodes fail there is no loss of data, because the same data is replicated on other DataNodes. Component recovery is another advantage: a failed node can rejoin the system without the complete system being restarted, and component failures while a job is running do not affect the outcome of the job. This design also ensures great scalability, where increasing the resources increases the performance.

Hadoop helps organizations make decisions based on comprehensive analysis of multiple variables and data sets, rather than a small sampling of data or anecdotal incidents. The ability to process large sets of disparate data gives Hadoop users a more comprehensive view of their customers, operations, opportunities, risks, etc. To develop a similar perspective without big data, organizations would need to conduct multiple limited data analyses and then find a way to synthesize the results, which would likely involve a lot of manual effort and subjective analysis.

  • Advanced data analysis can be done in-house – Hadoop makes it practical to work with large data sets and customize the outcome without having to outsource the task to specialist service providers. Keeping operations in-house helps organizations be more agile, while also avoiding the ongoing operational expense of outsourcing.
  • Organizations can fully leverage their data – The alternative to using Hadoop is often simply not using all the data and inputs that are available to support business activity. With Hadoop, organizations can take full advantage of all their data – structured and unstructured, real-time and historical. Doing so adds more value to the data itself and improves the return on investment (ROI) of the legacy systems used to collect, process, and store the data, including ERP and CRM systems, social media programs, sensors, industrial automation systems, etc.
  • Run a commodity vs. custom architecture – Some of the tasks that Hadoop is used for today were formerly run on MPP and other specialized, expensive computer systems, whereas Hadoop commonly runs on commodity hardware. Because it is the de facto big data standard, it is supported by a large and competitive solution-provider community, which protects customers from vendor lock-in.
  • Fast: In HDFS, data is distributed over the cluster and mapped to the nodes that hold it, which helps in faster retrieval. The tools that process the data often run on those same servers, which reduces processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
  • Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
  • Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.
  • Resilient to failure: HDFS replicates data over the network, so if one node is down or some other network failure happens, Hadoop takes another copy of the data and uses it. Normally data is replicated three times, but the replication factor is configurable (see the sketch after this list).
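
As a hedged illustration of that configurability, the sketch below sets a client-side default replication of 2 for new files and then raises the replication factor of one file to 5 through the HDFS Java API. Cluster-wide, the default is normally set with the dfs.replication property in hdfs-site.xml; the NameNode address and file path here are placeholders.

    // Sketch: adjusting the HDFS replication factor from a Java client.
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        conf.set("dfs.replication", "2");                 // replication for files created by this client

        FileSystem fs = FileSystem.get(conf);

        // Create a (placeholder) file, then raise its replication factor to 5;
        // the NameNode schedules the extra block copies in the background.
        Path important = new Path("/demo/important.dat");
        try (FSDataOutputStream out = fs.create(important, true)) {
          out.writeUTF("critical data");
        }
        boolean accepted = fs.setReplication(important, (short) 5);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
      }
    }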