HBase Interview Questions

Differentiate between Hive and HBase.

| Hive | HBase |
| --- | --- |
| Hive is data warehouse software that enables you to query and manipulate data using an SQL-like language known as HiveQL. | HBase is a distributed data store built on top of HDFS, so it can leverage the benefits provided by Hadoop and HDFS. |
| Hive abstracts the programming complexity of MapReduce behind HiveQL, a simple SQL-like language for querying data sets. | HBase does not have a native data-processing engine and relies on the MapReduce and Spark APIs for data processing. |
| Hive has a relational DBMS data model. | HBase has a columnar data model. |
| Apache Hive has high latency compared with HBase. Hence, it is not preferred for looking up individual records. | HBase provides fast random lookups on top of HDFS, which allows a user to query individual records. |

What are the limitations of HBase?

The following are some of the limitations of HBase:

  • HBase allows fast random lookups on top of HDFS, which makes it a very resource-intensive database; hence it requires regular maintenance.
  • HBase doesn't support secondary indexes natively. If you search on more than one field, or on any field other than the row key, the scan is very slow. To handle such scenarios efficiently, the MapReduce framework or Apache Phoenix can be used.
  • Unlike an RDBMS, HBase supports only one default sort order per table, namely by row key.
  • It doesn't support SQL operations such as join and group by. These can be provided by integrating HBase with Apache Phoenix or by using MapReduce.
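The gap between a row-key lookup and a search on any other field can be illustrated with a toy model. The sketch below is plain Python, not the HBase client API: the `rows` data and the helpers `get_by_row_key` and `scan_by_column` are hypothetical names made up for illustration, and a sorted in-memory list stands in for a table ordered by row key.

```python
import bisect

# Toy table kept sorted by row key, the only sort order the table has.
rows = sorted(
    [
        ("user#003", {"city": "Pune", "age": "45"}),
        ("user#001", {"city": "Pune", "age": "31"}),
        ("user#002", {"city": "Delhi", "age": "27"}),
    ],
    key=lambda row: row[0],
)
row_keys = [key for key, _ in rows]


def get_by_row_key(key):
    """Binary search on the sorted row keys: O(log n) comparisons."""
    i = bisect.bisect_left(row_keys, key)
    if i < len(row_keys) and row_keys[i] == key:
        return rows[i][1]
    return None


def scan_by_column(column, value):
    """Without a secondary index every row has to be inspected: O(n)."""
    return [key for key, cols in rows if cols.get(column) == value]
```

In this model `get_by_row_key("user#002")` touches only a couple of keys, while `scan_by_column("city", "Pune")` has to walk the whole table. A secondary index is what would avoid that walk.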

How are deletions handled inside HBase?

A delete is a special type of update in HBase: the values for which the delete request is submitted are not removed immediately. Instead, these values are masked by assigning a tombstone marker to them. Every request to read values that carry a tombstone marker returns null to the client, which gives the client the impression that the values are already deleted (consistency).

HBase does this because HFiles are immutable (recall that HDFS doesn't allow modifying the data of a file). All the values with a tombstone marker are permanently removed during the next major compaction.

There are three types of tombstone markers:

  • Version Delete Marker: marks a single version of a column value.
  • Column Delete Marker: marks all versions of a column.
  • Family Delete Marker: marks all versions of all columns in a column family.

Finally, during the next major compaction, the values with tombstone markers (deleted data), along with expired values (whose TTL is over), are removed from HBase.
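The mask-now, remove-later behaviour described above can be sketched with a small append-only store. This is a toy model under stated assumptions, not HBase code: `ToyStore` and its methods are hypothetical, sequence numbers stand in for timestamps, and only the column delete marker is modelled.

```python
import itertools


class ToyStore:
    """Append-only cell log with tombstones (illustrative only)."""

    def __init__(self):
        self._seq = itertools.count(1)
        # Each cell is (row, column, seq, value, is_tombstone).
        self.cells = []

    def put(self, row, column, value):
        self.cells.append((row, column, next(self._seq), value, False))

    def delete(self, row, column):
        # A delete is just another write: a marker that masks older versions.
        self.cells.append((row, column, next(self._seq), None, True))

    def get(self, row, column):
        versions = [c for c in self.cells if c[0] == row and c[1] == column]
        if not versions:
            return None
        newest = max(versions, key=lambda c: c[2])
        # If the newest entry is a tombstone, the value reads as deleted.
        return None if newest[4] else newest[3]

    def major_compact(self):
        # Physically drop tombstones and every older cell they mask.
        def masked(cell):
            return any(
                t[4] and t[0] == cell[0] and t[1] == cell[1] and t[2] > cell[2]
                for t in self.cells
            )

        self.cells = [c for c in self.cells if not c[4] and not masked(c)]
```

After `put` followed by `delete`, `get` returns `None` even though both entries are still stored. Only `major_compact` shrinks `cells`, which mirrors why deleted data keeps occupying disk space until the next major compaction.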

Differentiate between Major and Minor Compaction in HBase.

Here are the differences:

| Minor Compaction | Major Compaction |
| --- | --- |
| In minor compaction, HBase picks only some of the smaller HFiles and rewrites them into a few larger HFiles. | In major compaction, all HFiles of a store are picked and rewritten into a single large HFile. |
| Minor compaction reduces the number of HFiles by rewriting smaller HFiles into fewer but larger HFiles, performing a merge sort. | During major compaction, the values with tombstone markers (deleted data) and expired values (whose TTL is over) are removed from the HFiles. |
| Less resource intensive, hence scheduled more frequently. | Resource intensive, so major compactions are scheduled when the load on the server is minimal. |
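The two compaction types can be sketched as merges over sorted files. This is a simplified illustration, not HBase's actual file-selection policy: each "HFile" is assumed to be a Python list of `(row_key, value)` cells sorted by row key, and the names `minor_compact`, `major_compact`, `max_files` and `is_dead` are hypothetical.

```python
import heapq


def _row_key(cell):
    return cell[0]


def minor_compact(hfiles, max_files=2):
    """Merge the `max_files` smallest files into one and keep the rest as-is."""
    if len(hfiles) < 2:
        return list(hfiles)
    by_size = sorted(hfiles, key=len)
    picked, rest = by_size[:max_files], by_size[max_files:]
    # heapq.merge performs the merge step of a merge sort over sorted inputs.
    merged = list(heapq.merge(*picked, key=_row_key))
    return rest + [merged]


def major_compact(hfiles, is_dead=lambda cell: False):
    """Merge every file into a single file, dropping cells flagged as dead."""
    merged = [c for c in heapq.merge(*hfiles, key=_row_key) if not is_dead(c)]
    return [merged]
```

With three input files, `minor_compact` returns two files and drops nothing, while `major_compact` returns one file and is the only place where the `is_dead` predicate (standing in for tombstoned or expired cells) is applied.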

What are Bloom filters?

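A Bloom filter is an efficient data structure used to test whether an element is a member of a set. It is efficient in both time and space, at the cost of being probabilistic: it can report false positives but never false negatives. In HBase, Bloom filters let a read skip HFiles that cannot contain the requested row.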
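A minimal Bloom filter can be sketched in a few lines. This is an illustrative implementation, not the one HBase uses: the class name, the default sizes and the choice of SHA-256 with a per-hash prefix are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

An item that was added always yields `True` from `might_contain`. An item that was never added usually yields `False`, but it can yield `True` when its bit positions collide with those of other items, which is why a positive answer still needs a real lookup.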

What is the difference between RDBMS and HBase?

Here are the differences:

| Point of Distinction | HBase | RDBMS |
| --- | --- | --- |
| Type of Database | HBase is a distributed database. | RDBMSs are generally single-node databases. |
| Row or Column Oriented | HBase has a column-oriented table schema. | RDBMS offers a row-oriented table schema. |
| Data Type | Good for both semi-structured and structured data. | Good only for structured data. |
| Schema Type | HBase offers a flexible schema, meaning columns can be added on the fly without any overhead. | RDBMS systems have a fixed schema; the schema can be modified, but the ALTER operation is very expensive. |
| Sparse Table | Its columnar schema makes it good with sparse tables, which results in a memory-efficient database. | With a fixed schema, the space for each element in a row is fixed, so space is allocated for null values too; hence it is not optimized for sparse tables. |
| Use Cases | Data discovery, data analytics and OLAP systems. | OLTP (Online Transaction Processing). |
| Native Query Language | Apart from basic DML and lookup commands, HBase doesn't provide a native query language. | RDBMS provides a powerful query language known as SQL. |
| Tables | The columnar table schema results in wide tables. | RDBMS tables are narrow because of the row-based schema. |
| Optimized for Joins | Since HBase doesn't have a query engine, analytics operations such as joins are generally performed using MR and are not optimized compared to RDBMS joins. | Native SQL provides optimized operations such as joins (small, fast ones). |
| Integration with MR | HBase has tight integration with MR for data processing applications. | No integration with MR. |
| Scalability | HBase can easily be scaled horizontally by adding more nodes. | Hard to shard and scale. |
| Cost | It is open source. | MySQL is free; Oracle and Microsoft SQL Server are paid. |
| Data Size | Large; can be up to petabytes (PB) or exabytes (EB). | Comparatively smaller, in gigabytes or terabytes. |
| Consistency and Partition | Provides consistency and partition tolerance. | As a single-node system, it offers consistency and availability (arguably partition tolerance too). |