Hive Interview Questions

Displaying 1 - 10 of 12

How does Hive improve performance with tables in ORC format?

Using the ORC format leads to a reduction in the size of the data stored, as this   file format has high compression ratios. As the data size is reduced, the time to read and write the data is also reduced. The ORC format improves query performance also by the way it stores data in a file. Data is stored in a columnar format and columns that are not needed in a query can be skipped, thus leading to better performance.

What are the limitations of Hive?

Here are few limitations of Hive:

  • Hive does not support insert and update functions at a row-level, which makes it unsuitable for OLTP systems.
  • Hive does not support real-time processing.
  • Hive queries have high latency due to the start-up overhead of the MapReduce job.

Can Hive be used as an OLTP system like MySQL?

Hive does not support insert and update functions at a row-level, which makes it unsuitable for OLTP systems. Note: OLTP is an online transaction-processing system that involves INSERT, UPDATE and DELETE operations.

Differentiate between Hive and HBase.

Hive

HBase

Hive is a ‘data warehouse software’ that enables you to query and manipulate data using an SQL-like language known as HiveQL.

HBase is a distributed data store built on top of HDFS, and it can leverage all the benefits provided by Hadoop or HDFS.

Hive abstracts the programming complexity of MapReduce and provides a simple SQL-like language known as HiveQL for querying data sets.

HBase does not have a native data-processing engine and relies on Map-Reduce and Spark APIs for data processing.

Hive has model.

a

relational

DBMS

data

HBase has a columnar data model.

Apache Hive has high latency as compared with HBase. Hence, it is not preferred for looking up individual records.

HBase provides a random and fast lookup on top of HDFS, which allows a user to query for individual records.

Hive: What are the instances where you can use Indexing?

The key instances where you can use indexing are as follows:

  • When the data set is large
  • When faster query execution is needed
  • For columns that are used more frequently than others
  • For read-heavy applications, where you need to read the data more frequently

Hive: In which scenarios do you use external tables?

We use external tables in the following scenarios:

  • When we need to store data in a custom location
  • Unlike Internal tables, if we delete external tables, they still continue to reside in HDFS
  • Data from external tables should not be owned by Hive

Hive: Explain the difference between External and Internal tables.

External Table: Unlike RDBMSes where data and tables are tightly coupled, data in External Tables is loosely coupled. External Tables reside in HDFS. Even if you drop an external table in Hive, the data mapped to it remains intact inside HDFS.

Internal Table: An Internal Table in Hive is similar to the tables in RDBMSes. The data and the table schema are tightly coupled in internal tables. If you drop an internal table in Hive, the data stored in it will get deleted.

Hive: Explain ORDER BY, CLUSTER BY, SORT BY and DISTRIBUTE BY in brief.

Both ORDER BY and SORT BY are used for sorting query results in ascending   or descending order. However, one of the differences between them is the way they sort results. ORDER BY sorts the entire data using a reducer, whereas SORT BY does not guarantee overall sorting of data. There may be overlapping data and it might need more than one reducer.

Both DISTRIBUTE BY and CLUSTER BY are used for categorising query results on the basis of one or more columns. CLUSTER BY is a shortcut for both DISTRIBUTE BYand SORT BY. Hive uses the columns in DISTRIBUTE BY to distribute the rows among the reducers. All rows with the same DISTRIBUTE BY columns will go to the same reducer. However, DISTRIBUTE BY does not guarantee clustering or sorting of properties.