History of Apache Hadoop

Hadoop was started by Doug Cutting to support two of his other well-known projects, Lucene and Nutch. It was inspired by the Google File System (GFS), which was detailed in a paper released by Google in 2003. Hadoop's storage layer was originally called the Nutch Distributed File System (NDFS); the project split from Nutch in 2006 to become a sub-project of Lucene, at which point it was renamed Hadoop.

Hadoop thus began as a sub-project of Apache Nutch, an open-source alternative to Google started by Doug Cutting, whose job was to index the web and expose it for searching. Google published its Google File System (GFS) paper in 2003 and its MapReduce framework paper in 2004, and Cutting and the Nutch team implemented both of Google's frameworks in Nutch. In 2006 Yahoo! hired Cutting to work on Hadoop with a dedicated team, and in 2008 Hadoop became an Apache top-level project.

It all started in 2002 with the Apache Nutch project, Doug Cutting and Mike Cafarella's effort to build an open-source web search engine that would crawl and index websites, written in Java and built on Lucene for the search and index component. Nutch was based on sort/merge processing. In June 2003 it was successfully demonstrated on 4 nodes by crawling 100 million pages. However, the developers realised that their architecture would not scale to the billions of pages on the web. Help came with the publication of Google's 2003 paper describing the architecture of its production distributed filesystem, GFS, which would solve their storage needs for the very large files generated by the web crawling and indexing process and provided a model for storing large datasets in a distributed environment.

Nutch's developers set about writing an open-source implementation, the Nutch Distributed File System (NDFS). Then Google introduced MapReduce to the world by releasing a paper on it, which provided the solution for processing those large datasets. Together with GFS, it gave the Nutch developers a complete design: distributed storage plus a distributed processing model. They began implementing MapReduce in the middle of 2004.
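To give a flavour of the programming model those papers described, here is a minimal sketch of the classic word-count job, written in Java against the modern Hadoop MapReduce API (org.apache.hadoop.mapreduce), which of course postdates the original 2004 Nutch implementation; the class names and command-line paths are illustrative, not part of the historical code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged into a jar and launched with `hadoop jar wordcount.jar WordCount <input dir> <output dir>`, with both paths on the distributed filesystem; the framework handles splitting the input, scheduling map and reduce tasks across the cluster, and recovering from node failures.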

By early 2005 the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run on MapReduce and NDFS. In February 2006 these components moved out of Nutch to form an independent subproject of Lucene called Hadoop.

  • 2004: Initial versions of what are now the Hadoop Distributed File System (HDFS) and MapReduce implemented by Doug Cutting and Mike Cafarella.
  • December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.

The Apache community realized that the MapReduce and NDFS implementations could be used for other tasks as well. In February 2006 they came out of Nutch to form an independent subproject of Lucene called “Hadoop”, named after a toy yellow elephant belonging to Doug Cutting's son. Because the Nutch-era system ran reliably only on clusters of 20 to 40 nodes, Cutting joined Yahoo! that same year, which provided him with a dedicated team and the resources to turn Hadoop into a system that ran at web scale. By 2007, Yahoo! was using Hadoop on a 1,000-node cluster, and in 2008 Hadoop was made an Apache top-level project.

  • February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
  • February 2006: Adoption of Hadoop by Yahoo! Grid Team.
  • April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
  • May 2006: Yahoo! set up a 300-node Hadoop research cluster.
  • May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
  • October 2006: Research cluster reaches 600 nodes.
  • December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
  • January 2007: Research cluster reaches 900 nodes.
  • April 2007: Research clusters – two clusters of 1,000 nodes each.
  • April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
  • October 2008: Loading 10 terabytes of data per day into the research clusters.

Development has continued at full pace ever since. In January 2008, Hadoop confirmed its success by becoming a top-level project at Apache, and many other companies, such as Last.fm, Facebook, and the New York Times, started using Hadoop.

  • March 2009: 17 clusters with a total of 24,000 nodes.
  • April 2009: Won the minute sort by sorting 500 GB in 59 seconds on 1,400 nodes, and the 100 TB sort in 173 minutes on 3,400 nodes.
  • 2011: Yahoo! was running Hadoop across 42,000 nodes.
  • July 2013: Won the Gray sort benchmark, sorting at a rate of 1.42 terabytes per minute.

On 27 December 2011, Apache released Hadoop version 1.0, which included security features and HBase support, among other improvements. On 10 March 2012, release 1.0.1 became available as a bug-fix release for version 1.0. On 23 May 2012, Hadoop 2.0.0-alpha was released; this release introduced YARN. The second alpha version in the Hadoop 2.x series, with a more stable version of YARN, followed on 9 October 2012.

On 13 December 2017, release 3.0.0 became available. On 25 March 2018, Apache released Hadoop 3.0.1, which contained 49 bug fixes for Hadoop 3.0.0. On 6 April 2018 came Hadoop 3.1.0, containing 768 bug fixes, improvements, and enhancements since 3.0.0. Hadoop 3.0.3 followed in May 2018, and Hadoop 3.1.1 on 8 August 2018. At the time of writing, Hadoop 3.1.3 was the most recent release.