Hadoop Distributed Filesystem (HDFS)
Storage component – infinitely scalable and flexible to enable storage and analysis of unlimited amounts and types of data, across clusters of industry-standard commodity hardware in an easily accessible format. HDFS is a distributed file system that is designed for storing very large files with streaming data access patterns running on clusters of commodity hardware. It was originally created and implemented by Google, where it was known as the Google File System (GFS). HDFS is designed such that it can handle large amounts of data and reduces the overall input/output operations on the network. It also increases the scalability and availability of the cluster because of data replication and fault tolerance.
HBase is a supporting component in the Hadoop group of open source projects, a non-relational database that is the de facto standard for working with Hadoop and is frequently included in commercial distribution bundles. Modeled after Google’s BigTable, HBase is capable of hosting extremely large tables (billions of rows, millions of columns) atop clusters of commodity hardware and provides real-time, programmatic and query access to HDFS.
Apache Hbase is a non-relational(NoSQL) wide column database that sits on top of HDFS and is part of the Apache Hadoop Big Data Ecosystem. It runs on top of your Hadoop cluster and provides you random real-time read/write access to your data.
Apache Hadoop and Hbase support both structured and unstructured data. It stores data as key/value pairs in columnar fashion while HDFS can store data in various formats (Flat files, compressed format).
Differences between HDFS & HBase
HDFS is fault-tolerant by design and supports rapid data transfer between nodes even during system failures. HBase is a non-relational and open source Not-Only-SQL database that runs on top of Hadoop. HBase comes under CP type of CAP (Consistency, Availability, and Partition Tolerance) theorem.
HDFS is most suitable for performing batch analytics. However, one of its biggest drawbacks is its inability to perform real-time analysis, the trending requirement of the IT industry. HBase, on the other hand, can handle large data sets and is not appropriate for batch analytics. Instead, it is used to write/read data from Hadoop in real-time.
Both HDFS and HBase are capable of processing structured, semi-structured as well as un-structured data. HDFS lacks an in-memory processing engine slowing down the process of data analysis; as it is using plain old MapReduce to do it. HBase, on the contrary, boasts of an in-memory processing engine that drastically increases the speed of read/write.
HDFS is very transparent in its execution of data analysis. HBase, on the other hand, being a NoSQL database in tabular format, fetches values by sorting them under different key values.
- HDFS is a storage system and HBase is a non-relational column-oriented database
- Apache HBase provides low latency access to small amounts of data within large data sets while HDFS provides high latency operations.
- Apache HBase supports random reads and writes while HDFS supports WORM (Write once Read Many times) and does not support real random reads.
- HDFS is primarily accessed through MapReduce jobs while Apache HBase is accessed through shell commands, Java API, REST, Avro, or Thrift API.
HDFS stores large data sets in a distributed environment and leverages batch processing on that data. E.g. it would help an e-commerce website to store millions of customer’s data in a distributed environment which grew over a long period of time(might be 4-5 years or more). Then it leverages batch processing over that data and analyze customer behaviors, pattern, requirements. Then the company could find out what type of product, customer purchase in which months. It helps to store archived data and execute batch processing over it.
While HBase stores data in a column oriented manner where each column is stored together so that, reading becomes faster leveraging real time processing. E.g. in a similar e-commerce environment, it stores millions of product data. So if you search for a product among millions of products, it optimizes the request and search process, producing the result immediately (or you can say in real time).