Introduction to Apache Hadoop

Hadoop is an open-source implementation of Google's distributed computing infrastructure. It consists of two parts: the Hadoop Distributed File System (HDFS), which is modeled after Google's GFS, and Hadoop MapReduce, which is modeled after Google's MapReduce. Google's own system is proprietary, so when Google teaches college students the ideas of MapReduce programming, they, too, use Hadoop. To underline the distinction, the Hadoop engineers at Yahoo like to challenge Google's engineers to sorting competitions between Hadoop and Google's MapReduce.

Hadoop provides fast and reliable analysis of both structured and unstructured data. The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale from a single server to thousands of machines, each offering local computation and storage.

1.) At first Hadoop was mainly known for two core products:

  • HDFS: Hadoop Distributed File System
  • MapReduce: Distributed data processing framework

2.) Today, in addition to HDFS and MapReduce, the term also represents a multitude of products:

  • HBase: Hadoop's column-oriented database; supports batch and random reads and limited queries (a minimal client sketch follows this list)
  • ZooKeeper: Highly available coordination service
  • Oozie: Hadoop workflow scheduler and manager
  • Pig: Data processing language and execution environment
  • Hive: Data warehouse with a SQL-like interface
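
As a rough illustration of the random-read point in the HBase bullet above, here is a minimal sketch of writing and then reading back a single cell with the HBase Java client. The table name "clicks", column family "d", and row key "user42" are invented for the example; the code assumes an HBase cluster reachable through an hbase-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        // Reads cluster addresses from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("clicks"))) {

          // Random write: one row keyed by user id, one cell in column family "d".
          Put put = new Put(Bytes.toBytes("user42"));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("lastPage"), Bytes.toBytes("/home"));
          table.put(put);

          // Random read: fetch that single row back by its key.
          Result result = table.get(new Get(Bytes.toBytes("user42")));
          byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("lastPage"));
          System.out.println("lastPage = " + Bytes.toString(value));
        }
      }
    }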

3.) To start building an application, you need a file system

  • In the Hadoop world that would be the Hadoop Distributed File System (HDFS); a short example of using it follows this list
  • In Linux it could be ext3 or ext4
  • The addition of a data store would provide a nicer interface to store and manage your data
  • HBase: a key-value store implemented on top of HDFS
  • Traditionally one could use an RDBMS on top of a local file system
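
To make the file-system comparison concrete, here is a minimal sketch of writing and then reading a file through Hadoop's FileSystem API. The path /demo/hello.txt is invented for the example; the code picks up fs.defaultFS from core-site.xml on the classpath, and with the default configuration it simply falls back to the local file system.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsSketch {
      public static void main(String[] args) throws Exception {
        // Uses fs.defaultFS from the configuration; HDFS in a real cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/hello.txt");  // hypothetical path for the example

        // Write a small file, overwriting it if it already exists.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy its bytes to stdout.
        try (FSDataInputStream in = fs.open(file)) {
          IOUtils.copyBytes(in, System.out, 4096, false);
        }
      }
    }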

4.) For batch processing, you will need to utilize a framework

  • In the Hadoop world that would be MapReduce
  • MapReduce eases the implementation of distributed applications that run on a cluster of commodity hardware (the word-count example sketched after this list shows what such a program looks like)
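
Here are the mapper and reducer of the canonical word-count example, written against the org.apache.hadoop.mapreduce API; the class names are arbitrary, and the driver that submits them as a job is sketched later in this article.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: runs in parallel on the nodes holding the input splits,
    // emitting (word, 1) for every word in its slice of the input.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce phase: receives every count emitted for a given word and sums them.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }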

As Hadoop is open source, the software itself is free. However, running Hadoop does have other cost components.

  • Cost of hardware: Hadoop runs on a cluster of machines. The cluster size can be anywhere from 10 nodes to thousands of nodes. For a large cluster, the hardware costs will be significant.
  • The cost of IT / OPS for standing up a large Hadoop cluster and supporting it will need to be factored in.
  • Since Hadoop is a newer technology, finding people to work on this ecosystem is not easy.

Why Hadoop?

1.) Hadoop provides storage for Big Data at a reasonable cost:

Storing Big Data using traditional storage can be expensive. Hadoop is built around commodity hardware, so it can provide fairly large storage at a reasonable cost, and it has been used in the field at petabyte scale. Just as importantly, Hadoop can process the data in parallel on the nodes where it is located: with 100 drives working at the same time, Hadoop can read 1 TB of data in about 2 minutes (the arithmetic is spelled out after the list below). This matters because storage capacity has grown exponentially while read speed has not kept up:

  • In 1990: a typical drive stored 1,400 MB and transferred data at 4.5 MB/s, so reading the entire drive took about 5 minutes
  • In 2010: a typical drive stored 1 TB and transferred data at 100 MB/s, so reading the entire drive took about 3 hours
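
Those figures also explain the numbers above: 1 TB spread across 100 drives is about 10 GB per drive, and at the 2010-era transfer speed of 100 MB/s each drive reads its 10 GB in roughly 100 seconds, i.e. about 2 minutes; a single drive reading the full 1 TB at the same 100 MB/s needs about 10,000 seconds, which is close to 3 hours.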

2.) Hadoop makes it possible to capture new or more data

Sometimes organizations don't capture a type of data because it is too cost-prohibitive to store. Since Hadoop provides storage at a reasonable cost, this type of data can now be captured and stored. One example is web site click logs. Because the volume of these logs can be very high, not many organizations captured them. Now, with Hadoop, it is possible to capture and store the logs.

3.) With Hadoop, you can store data longer

To manage the volume of data stored, companies periodically purge older data. For example, only the logs for the last 3 months might be kept while older logs are deleted. With Hadoop it is possible to keep historical data longer, which allows new analytics to be run on older data. Take click logs from a web site: a few years ago, these logs were stored for only a brief period of time, just long enough to calculate statistics like popular pages, etc. Now, with Hadoop, it is viable to store these click logs for a much longer period of time.

4.) Hadoop provides scalable analytics

There is no point in storing all this data if we can't analyze it. Hadoop provides not only distributed storage but also distributed processing, which means we can crunch a large volume of data in parallel. The compute framework of Hadoop is called MapReduce, and it has been proven at petabyte scale.
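
To show how that distributed processing is actually kicked off, here is a minimal sketch of a driver that wires the word-count mapper and reducer from the earlier example into a MapReduce Job and submits it to the cluster. The input and output paths are placeholders taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer classes from the sketch earlier in this article.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // local pre-aggregation on each node
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder paths, e.g. hdfs:///data/in and hdfs:///data/out
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Blocks until the job finishes; the map and reduce tasks run in
        // parallel on the cluster nodes that hold the input data.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }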

5.) Hadoop provides rich analytics

Native MapReduce supports Java as its primary programming language, and other languages like Ruby, Python and R can be used as well. Of course, writing custom MapReduce code is not the only way to analyze data in Hadoop: higher-level abstractions over MapReduce are available. For example, a tool named Pig takes an English-like data flow language and translates it into MapReduce jobs. Another tool, Hive, takes SQL-like queries and runs them using MapReduce. Business Intelligence (BI) tools can provide an even higher level of analysis, and quite a few BI tools can work with Hadoop and analyze the data stored in it.