Introduction to Apache Hadoop

Written by shiksha.dahiya on 06/30/2021 - 22:22

Hadoop is an open-source implementation of Google's distributed computing stack. It consists of two parts: the Hadoop Distributed File System (HDFS), modeled after Google's GFS, and Hadoop MapReduce, modeled after Google's MapReduce. Google's system is proprietary code, so even when Google teaches college students the ideas behind MapReduce programming, it, too, uses Hadoop. To underline how separate the two codebases are, the Hadoop engineers at Yahoo have been known to challenge Google's engineers to sorting competitions between Hadoop and Google's MapReduce.

Hadoop provides fast and reliable analysis of both structured and unstructured data. The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop scales from a single server up to thousands of machines, each offering local computation and storage.
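To make the "simple programming model" concrete, here is a minimal sketch of the canonical word-count example written against Hadoop's Java MapReduce API. The class and variable names are our own choices for illustration; only the Mapper/Reducer types and Writable classes come from Hadoop itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in the input split.
class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts emitted for each distinct word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The framework handles splitting the input, shuffling the (word, count) pairs between nodes, and retrying failed tasks; the programmer only writes the two functions above.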

1.) At first Hadoop was mainly known for two core products:

  • HDFS: Hadoop Distributed FileSystem
  • MapReduce: Distributed data processing framework

2.) Today, in addition to HDFS and MapReduce, the term also represents a multitude of products:

  • HBase: Hadoop column database; supports batch and random reads and limited queries (see the client sketch after this list)
  • Zookeeper: Highly-Available Coordination Service
  • Oozie: Hadoop workflow scheduler and manager
  • Pig: Data processing language and execution environment
  • Hive: Data warehouse with SQL interface
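As a taste of the "random reads" HBase offers, here is a minimal sketch using the HBase Java client API. The table name `clicks`, the row key, and the column names are hypothetical, and the sketch assumes the table already exists on a reachable cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster addresses.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("clicks"))) { // hypothetical table

            // Random write: one cell addressed by row key, column family, and qualifier.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
            table.put(put);

            // Random read: fetch that single row back by key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] page = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("page"));
            System.out.println(Bytes.toString(page));
        }
    }
}
```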

3.) To start building an application, you need a file system

  • In the Hadoop world, that would be the Hadoop Distributed File System (HDFS); a minimal read/write sketch follows this list
  • In Linux it could be ext3 or ext4
  • Adding a data store on top provides a nicer interface to store and manage your data
  • HBase: a key-value store implemented on top of HDFS
  • Traditionally, one could use an RDBMS on top of a local file system
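Here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The path `/tmp/hello.txt` is hypothetical, and the sketch assumes an HDFS cluster reachable via the usual core-site.xml configuration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hello.txt"); // hypothetical path

        // Write a small file into HDFS (true = overwrite if it exists).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and copy the bytes to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```

The same API works against a local file system, which makes it easy to develop on a laptop and deploy to a cluster unchanged.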

4.) For batch processing, you will need a framework

  • In Hadoop’s world, that would be MapReduce
  • MapReduce eases the implementation of distributed applications that run on a cluster of commodity hardware (a minimal driver sketch follows this list)
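Continuing the word-count sketch from earlier, a minimal driver that wires the mapper and reducer into a MapReduce job might look like this. The input and output paths are taken from the command line, and the class names refer to the hypothetical classes in the earlier sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // mapper from the earlier sketch
        job.setCombinerClass(WordCountReducer.class); // local pre-aggregation on each node
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this runs with `hadoop jar wordcount.jar WordCountDriver <input> <output>`; the cluster schedules map tasks on the nodes that hold the input blocks.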

As Hadoop is open source, the software is free. However, running Hadoop does have other cost components:

  • Cost of hardware: Hadoop runs on a cluster of machines. The cluster size can be anywhere from 10 nodes to thousands of nodes. For a large cluster, the hardware costs will be significant.
  • The cost of IT/OPS for standing up and supporting a large Hadoop cluster needs to be factored in.
  • Since Hadoop is a newer technology, finding people experienced with the ecosystem is not easy.

Why Hadoop?

1.) Hadoop provides storage for Big Data at a reasonable cost:

Storing Big Data using traditional storage can be expensive, but Hadoop is built around commodity hardware, so it can provide fairly large storage for a reasonable cost. Hadoop has been used in the field at petabyte scale. Because Hadoop also processes data in parallel on the nodes where the data is located, read throughput scales with the number of drives: 100 drives working at the same time can read 1 TB of data in about 2 minutes (a quick arithmetic check follows the list below). This parallelism matters because storage capacity has grown exponentially while read speed has not kept up:

  • In 1990: a typical drive stored 1,400 MB with a transfer speed of 4.5 MB/s, so the entire drive could be read in approx. 5 minutes
  • In 2010: a typical drive stored 1 TB with a transfer speed of 100 MB/s, so reading the entire drive took approx. 3 hours
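A quick back-of-the-envelope check of these figures (the constants are the 2010 numbers from the list above):

```java
// Sanity check: time to read 1 TB at 2010-era drive speeds, alone vs. in parallel.
public class DriveScanTimes {
    public static void main(String[] args) {
        double oneTBinMB = 1_000_000.0;           // 1 TB expressed in MB
        double rateMBps = 100.0;                  // ~2010 drive transfer speed
        double single = oneTBinMB / rateMBps;     // ~10,000 s, close to 3 hours
        double parallel = single / 100;           // 100 drives: ~100 s, about 2 minutes
        System.out.printf("1 drive:    %.0f s (~%.1f h)%n", single, single / 3600);
        System.out.printf("100 drives: %.0f s (~%.1f min)%n", parallel, parallel / 60);
    }
}
```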

2.) Hadoop allows you to capture new or more data

Sometimes organizations don't capture a type of data because it is too cost-prohibitive to store. Since Hadoop provides storage at a reasonable cost, this type of data can now be captured and stored. One example is website click logs: because the volume of these logs can be very high, not many organizations captured them. With Hadoop it is now possible to capture and store the logs.

3.) With Hadoop, you can store data longer

To manage the volume of stored data, companies periodically purge older data; for example, only the logs for the last 3 months might be kept while older logs are deleted. With Hadoop it is possible to store historical data longer, which allows new analytics to be run on older data. Take click logs from a website: a few years ago, these logs were stored for only a brief period of time to calculate statistics like popular pages. With Hadoop it is now viable to store these click logs for a much longer period of time.

4.) Hadoop provides scalable analytics

There is no point in storing all this data if we can't analyze it. Hadoop provides not only distributed storage but also distributed processing, meaning we can crunch a large volume of data in parallel. The compute framework of Hadoop is called MapReduce, and it has been proven at petabyte scale.

5.) Hadoop provides rich analytics

Native MapReduce supports Java as its primary programming language; other languages like Ruby, Python, and R can be used as well. Of course, writing custom MapReduce code is not the only way to analyze data in Hadoop: higher-level interfaces are available. For example, Pig provides an English-like data flow language and translates it into MapReduce jobs, while Hive takes SQL-like queries and runs them using MapReduce. Business Intelligence (BI) tools can provide an even higher level of analysis, and quite a few BI tools can work with Hadoop and analyze the data stored in it. A minimal sketch of the Hive route follows.
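As a sketch of that Hive route, the following Java program submits a SQL query to Hive over JDBC; Hive compiles it into distributed jobs behind the scenes. The HiveServer2 address and the `clicks` table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver; assumes HiveServer2 is running at localhost:10000.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn =
                 DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL into distributed jobs over the data in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```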
