Hive: A Complete Guide to Big Data Warehousing on Hadoop
What is Hive?
Hive is an open-source data warehouse infrastructure built on top of Hadoop. It provides a SQL-like query language called HiveQL for querying and analyzing large datasets residing in HDFS (Hadoop Distributed File System). Under the hood, Hive translates HiveQL queries into MapReduce, Tez, or Spark jobs.
Key Features of Hive
- SQL-Like Interface: HiveQL supports SQL-like syntax (SELECT, WHERE, GROUP BY, JOIN, etc.)
- Scalability: Efficiently processes petabytes of data stored in HDFS.
- Schema-on-Read: Schema is applied at query time, not during data loading.
- Supports Complex Types: ARRAY, MAP, STRUCT in addition to primitive types.
- Extensibility: Create UDFs, UDAFs, UDTFs.
- Hadoop Ecosystem Integration: Works seamlessly with HDFS, YARN, MapReduce, Tez, HBase.
Hive Architecture and Components
1. Metastore
- Stores metadata about tables, columns, partitions, and their locations.
- Backed by traditional RDBMS (e.g., MySQL, PostgreSQL).
2. HiveServer2 (Thrift Server)
- Exposes JDBC/ODBC interfaces for clients to execute HiveQL queries.
- Supports concurrency, security, and session handling.
3. CLI / Beeline
- CLI: Command Line Interface for local Hive queries (legacy).
- Beeline: JDBC client to connect to HiveServer2.
4. Execution Engine
- Converts HiveQL into DAGs executed by MapReduce (legacy), Tez (the default in recent Hive versions), or Spark.
5. Storage (HDFS)
- All table data is stored in HDFS unless otherwise configured.
6. SerDe (Serializer/Deserializer)
- Converts data between Hive's row representation and the bytes stored in HDFS.
- Supports built-in and custom SerDes (e.g., JsonSerDe, RegexSerDe).
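As an illustration, a table backed by the built-in JsonSerDe might be declared like this (the table name, columns, and HDFS path are hypothetical):

```sql
-- Hypothetical table reading newline-delimited JSON from HDFS.
-- org.apache.hive.hcatalog.data.JsonSerDe ships with Hive/HCatalog.
CREATE EXTERNAL TABLE events_json (
  id INT,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/events';
```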
Hive Table Types
Managed (Internal) Tables
- Data is stored under the Hive warehouse directory (/user/hive/warehouse).
- Dropping the table deletes both the data and the metadata.
External Tables
- Data location is explicitly defined with the LOCATION clause.
- Dropping the table deletes only the metadata, not the actual data.
Example:
CREATE EXTERNAL TABLE external_table (
id INT, name STRING
) LOCATION '/path/to/data';
Partitioning and Bucketing
Partitioning
- Divides table by values of partition key(s).
- Example:
CREATE TABLE sales (
id INT, amount DOUBLE
) PARTITIONED BY (year INT, month INT);
- Improves performance by scanning only relevant partitions (partition pruning).
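For instance, a query that filters on the partition columns lets Hive read only the matching partition directories; a sketch against the sales table above:

```sql
-- Only the HDFS subdirectory year=2023/month=6 is scanned (partition pruning).
SELECT SUM(amount)
FROM sales
WHERE year = 2023 AND month = 6;
```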
Bucketing
- Divides data using hash of a column value.
- Example:
CREATE TABLE employees (
id INT, name STRING
) CLUSTERED BY (id) INTO 10 BUCKETS;
Key Differences
Feature | Partitioning | Bucketing |
---|---|---|
Basis | Column value | Hash function |
Storage | Directories | Files |
Optimization | Pruning | Sampling, Joins |
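One practical payoff of bucketing is efficient sampling; a sketch against the employees table above:

```sql
-- Reads roughly 1/10 of the data: only the first of the 10 buckets.
SELECT id, name
FROM employees
TABLESAMPLE (BUCKET 1 OUT OF 10 ON id);
```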
Hive Query Language (HiveQL)
- SQL-like: Supports SELECT and INSERT; UPDATE and DELETE are supported only on ACID transactional tables.
- Supports complex joins, aggregations, window functions.
- Supports views and materialized views.
Example:
SELECT name, age FROM users WHERE age > 30;
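A window function example in the same vein (the users table and its city column are assumed for illustration):

```sql
-- Rank users by age within each city using a standard HiveQL window function.
SELECT name,
       city,
       RANK() OVER (PARTITION BY city ORDER BY age DESC) AS age_rank
FROM users;
```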
Hive File Formats
Format | Description |
---|---|
TextFile | Default format, plain text |
SequenceFile | Binary key-value pairs |
RCFile | Record columnar format |
ORC | Columnar, supports compression and schema evolution |
Parquet | Columnar format used widely in big data tools |
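The file format is chosen per table with a STORED AS clause; a minimal sketch (table name and columns are illustrative):

```sql
-- ORC table with Snappy compression configured via a table property.
CREATE TABLE sales_orc (
  id INT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```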
Hive Data Types
- Primitive: INT, STRING, BOOLEAN, DOUBLE
- Complex:
STRUCT<height:INT, weight:DOUBLE>,
ARRAY<STRING>,
MAP<STRING, STRING>
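A sketch of declaring and querying these complex types (all names are illustrative):

```sql
CREATE TABLE people (
  name    STRING,
  body    STRUCT<height:INT, weight:DOUBLE>,
  emails  ARRAY<STRING>,
  props   MAP<STRING, STRING>
);

-- Dot notation for STRUCT fields, [] for ARRAY indexes and MAP keys.
SELECT name,
       body.height,
       emails[0],
       props['dept']
FROM people;
```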
User-Defined Functions
- UDF: Row-wise custom logic
- UDAF: Aggregate logic (like custom SUM)
- UDTF: Generates multiple rows from single input
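The built-in explode function is the canonical UDTF; combined with LATERAL VIEW it turns one input row into many output rows (this sketch assumes a hypothetical people table with a name STRING column and an emails ARRAY&lt;STRING&gt; column):

```sql
-- Each element of the emails array becomes its own row.
SELECT name, email
FROM people
LATERAL VIEW explode(emails) e AS email;
```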
Execution Engines
Engine | Characteristics |
---|---|
MapReduce | Legacy engine, slowest |
Tez | Faster DAG-based execution |
Spark | In-memory, faster for iterative tasks |
Query Optimization Techniques
- Partition Pruning: Scans only relevant partitions
- Predicate Pushdown: Applies WHERE filter as early as possible
- Map-Side Join: Join before shuffling
- Bucketing and Sorting: For optimized joins and aggregations
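Most of these optimizations are toggled through session settings; a sketch of commonly used ones:

```sql
-- Convert joins with a small table into map-side (broadcast) joins.
SET hive.auto.convert.join = true;
-- Push WHERE predicates down toward the storage layer.
SET hive.optimize.ppd = true;
-- Use bucket metadata to plan bucket map joins.
SET hive.optimize.bucketmapjoin = true;
```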
ACID Transactions in Hive
- Supported on transactional tables (ORC storage + the 'transactional'='true' table property; older Hive versions also require bucketing)
- Enable with:
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
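With those settings in place, a transactional table can be created and modified in place; a sketch with illustrative names:

```sql
-- ACID tables must use ORC and declare transactional=true.
CREATE TABLE accounts (
  id INT,
  balance DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE accounts SET balance = balance + 100 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
```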
Views and Materialized Views
- Views: Logical representations (no data storage)
- Materialized Views: Precomputed, stored views for faster querying
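A materialized view sketch, reusing the partitioned sales table from earlier (note that in Hive 3 the source table generally must be transactional):

```sql
-- Precomputes the aggregate; Hive can rewrite matching queries to use it.
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT year, month, SUM(amount) AS total
FROM sales
GROUP BY year, month;
```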
Compression Support
Codec | Characteristics |
---|---|
Gzip | High compression, slower |
Snappy | Balanced |
LZO | Fast decompression |
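Compression of query output is controlled through session settings; a sketch using the Snappy codec:

```sql
-- Compress the final query output files with Snappy.
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec =
    org.apache.hadoop.io.compress.SnappyCodec;
```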
Hive and HDFS Relationship
- Hive stores data in HDFS
- Managed table data goes to the Hive warehouse directory: /user/hive/warehouse
- External table data can be anywhere in HDFS
- Hive doesn’t own HDFS—it uses it as a storage backend
Dynamic Partitioning
- Enables creation of partitions dynamically during INSERT
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, year, month FROM temp_sales;
(Nonstrict mode is required when no static partition value is supplied; the default strict mode demands at least one.)
Hive Indexes (Removed in Hive 3.0)
- Compact Index: works with high-cardinality columns
- Bitmap Index: effective on low-cardinality columns
- Note: Hive indexes were removed in Hive 3.0; columnar formats (ORC/Parquet) with built-in indexes and materialized views are the recommended alternatives.
Example:
CREATE INDEX idx ON TABLE sales (year)
AS 'COMPACT' WITH DEFERRED REBUILD;
HCatalog
- Table and storage management layer
- Lets Pig, MapReduce, Hive share metadata and storage
Security in Hive
- Authentication: LDAP, Kerberos
- Authorization: SQL-based or storage-based access control
- Encryption: Data at rest and in motion
Resource Management
- Hive relies on YARN for managing execution resources (CPU, memory)
- Tez/Spark used as optimized engines instead of MapReduce
Summary: How Hive Works Internally
Query Flow
- Parse: Convert query to AST
- Bind: Map AST to metastore schema
- Optimize: Logical plan optimization (filter pushdown, pruning)
- Compile: Convert plan to physical plan (MapReduce/Tez/Spark job)
- Execute: Job runs on Hadoop cluster
- Retrieve Result: Output fetched from HDFS
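The plan produced by these stages can be inspected without running the query by prefixing it with EXPLAIN:

```sql
-- Prints the optimized logical and physical plan for the query.
EXPLAIN SELECT name, age FROM users WHERE age > 30;
```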
Does Partition Pruning Happen in Traditional RDBMS?
Yes, but behavior differs:
- Hive: Prunes HDFS directories.
- RDBMS (Oracle, PostgreSQL, MySQL): Prunes logical partitions internally.
- Both use it to improve performance, but the mechanisms differ.
Final Notes:
- Use ORC/Parquet with predicate pushdown for maximum efficiency.
- Leverage partitioning + bucketing for performance tuning.
- Use external tables when you don’t want Hive to delete or manage data.
- Integrate Hive with BI tools via HiveServer2 (Thrift, JDBC/ODBC).
This comprehensive guide covers everything you need to know about Hive—from basic architecture to advanced optimization techniques. For Big Data engineers, Hive remains an essential tool for structured data analysis over HDFS.