Hive: A Complete Guide to Big Data Warehousing on Hadoop


What is Hive?

Hive is an open-source data warehouse infrastructure built on top of Hadoop. It provides a SQL-like query language called HiveQL for querying and analyzing large datasets residing in HDFS (Hadoop Distributed File System), and it translates those queries into MapReduce, Tez, or Spark jobs for execution.


Key Features of Hive

  • SQL-Like Interface: HiveQL supports SQL-like syntax (SELECT, WHERE, GROUP BY, JOIN, etc.)
  • Scalability: Efficiently processes petabytes of data stored in HDFS.
  • Schema-on-Read: Schema is applied at query time, not during data loading.
  • Supports Complex Types: ARRAY, MAP, STRUCT in addition to primitive types.
  • Extensibility: Create UDFs, UDAFs, UDTFs.
  • Hadoop Ecosystem Integration: Works seamlessly with HDFS, YARN, MapReduce, Tez, HBase.

Hive Architecture and Components

1. Metastore

  • Stores metadata about tables, columns, partitions, and their locations.
  • Backed by traditional RDBMS (e.g., MySQL, PostgreSQL).
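
For example, the metastore connection is normally configured in hive-site.xml. A minimal sketch assuming a MySQL backend; the host, database name, and user are illustrative:

<!-- hive-site.xml; host, database name, and user below are illustrative -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>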

2. HiveServer2 (Thrift Server)

  • Exposes JDBC/ODBC interfaces for clients to execute HiveQL queries.
  • Supports concurrency, security, and session handling.

3. CLI / Beeline

  • CLI: Command Line Interface for local Hive queries (legacy).
  • Beeline: JDBC client to connect to HiveServer2.
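
A typical Beeline connection to HiveServer2 looks like the following; the host, database, and username are placeholders, and 10000 is the default HiveServer2 port:

# host, database, and user are placeholders; 10000 is the default HiveServer2 port
beeline -u "jdbc:hive2://localhost:10000/default" -n hiveuser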

4. Execution Engine

  • Compiles HiveQL into a DAG of tasks executed by MapReduce (legacy), Tez (the default in recent releases), or Spark.

5. Storage (HDFS)

  • All table data is stored in HDFS unless otherwise configured.

6. SerDe (Serializer/Deserializer)

  • Translates records between Hive's table rows and the raw bytes stored in HDFS.
  • Supports built-in and custom SerDes (e.g., JsonSerDe, RegexSerDe).
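
As a sketch, a table backed by JSON files can declare its SerDe explicitly; the table name, columns, and path are illustrative, and the SerDe class name can vary by Hive version and distribution:

-- table name, columns, and LOCATION path are placeholders
CREATE EXTERNAL TABLE events_json (
  id INT,
  payload MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/path/to/json/data';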

Hive Table Types

Managed (Internal) Tables

  • Data is stored under the Hive warehouse directory (/user/hive/warehouse by default).
  • Dropping the table deletes the data and metadata.

External Tables

  • Data location explicitly defined with LOCATION.
  • Dropping the table deletes only metadata, not actual data.

Example:

CREATE EXTERNAL TABLE external_table (
  id INT, name STRING
) LOCATION '/path/to/data';
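
For contrast, a managed table is created without a LOCATION clause; assuming default settings, its data lands under /user/hive/warehouse and is removed when the table is dropped:

CREATE TABLE managed_table (
  id INT, name STRING
);
-- DROP TABLE managed_table; removes both the metadata and the data files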

Partitioning and Bucketing

Partitioning

  • Divides table by values of partition key(s).
  • Example:
CREATE TABLE sales (
  id INT, amount DOUBLE
) PARTITIONED BY (year INT, month INT);
  • Improves performance by scanning only relevant partitions (partition pruning).
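
For example, with the sales table above, a query that filters on the partition columns reads only the matching year/month directories:

-- 2024 and 3 are illustrative partition values
SELECT id, amount
FROM sales
WHERE year = 2024 AND month = 3;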

Bucketing

  • Divides data using hash of a column value.
  • Example:
CREATE TABLE employees (
  id INT, name STRING
) CLUSTERED BY (id) INTO 10 BUCKETS;
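
Bucketing also enables efficient sampling; for example, the following reads roughly one of the ten buckets of the employees table defined above:

SELECT * FROM employees TABLESAMPLE(BUCKET 1 OUT OF 10 ON id);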

Key Differences

Feature      | Partitioning | Bucketing
-------------|--------------|-----------------
Basis        | Column value | Hash function
Storage      | Directories  | Files
Optimization | Pruning      | Sampling, joins

Hive Query Language (HiveQL)

  • SQL-like: Supports SELECT and INSERT; UPDATE and DELETE are supported on ACID (transactional) tables.
  • Supports complex joins, aggregations, window functions.
  • Supports views and materialized views.

Example:

SELECT name, age FROM users WHERE age > 30;
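
Window functions follow standard SQL syntax; a small sketch using the same users table (the ranking logic is purely illustrative):

SELECT name,
       age,
       RANK() OVER (ORDER BY age DESC) AS age_rank
FROM users;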

Hive File Formats

Format       | Description
-------------|-----------------------------------------------------
TextFile     | Default format, plain text
SequenceFile | Binary key-value pairs
RCFile       | Record columnar format
ORC          | Columnar, supports compression and schema evolution
Parquet      | Columnar format used widely in big data tools
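
The file format is chosen at table creation time with STORED AS; a sketch with an illustrative table name, setting ORC compression via a table property:

-- table name is illustrative
CREATE TABLE sales_orc (
  id INT, amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');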

Hive Data Types

  • Primitive: INT, STRING, BOOLEAN, DOUBLE
  • Complex:
STRUCT<height:INT, weight:DOUBLE>,
ARRAY<STRING>,
MAP<STRING, STRING>
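
A small sketch showing how complex types are declared and accessed; the table and column names are illustrative:

-- table and column names are illustrative
CREATE TABLE people (
  name STRING,
  body STRUCT<height:INT, weight:DOUBLE>,
  nicknames ARRAY<STRING>,
  attributes MAP<STRING, STRING>
);

SELECT name,
       body.height,
       nicknames[0],
       attributes['city']
FROM people;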

User-Defined Functions

  • UDF: Row-wise custom logic
  • UDAF: Aggregate logic (like custom SUM)
  • UDTF: Generates multiple rows from single input
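
A custom UDF written in Java is packaged as a JAR and then registered from HiveQL; the JAR path and class name below are hypothetical placeholders:

-- make the JAR available to the session (path is a placeholder)
ADD JAR /path/to/my_udfs.jar;
-- bind a function name to the (hypothetical) UDF class
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpperUDF';
SELECT to_upper(name) FROM users;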

Execution Engines

Engine    | Characteristics
----------|---------------------------------------------------------
MapReduce | Original engine, now legacy; slowest
Tez       | DAG-based execution; the default in recent Hive releases
Spark     | In-memory execution, faster for iterative tasks
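
The engine can be selected per session with a configuration property; valid values are mr, tez, and spark (Hive on Spark must be set up on the cluster for the last one to work):

SET hive.execution.engine = tez;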

Query Optimization Techniques

  • Partition Pruning: Scans only relevant partitions
  • Predicate Pushdown: Applies WHERE filter as early as possible
  • Map-Side Join: Joins small tables in memory on the map side, avoiding a shuffle
  • Bucketing and Sorting: For optimized joins and aggregations
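
Several of these optimizations are driven by session settings; a sketch with two commonly used properties (both are typically enabled by default in recent releases):

-- convert joins against small tables into map-side (broadcast) joins
SET hive.auto.convert.join = true;
-- push WHERE predicates down toward the table scan
SET hive.optimize.ppd = true;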

ACID Transactions in Hive

  • Supported using transactional tables (stored as ORC, bucketed, and created with transactional table properties)
  • Enable with:
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
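
With those settings enabled, a transactional table can be created and modified in place; the table name and values are illustrative:

-- illustrative table and values
CREATE TABLE orders_txn (
  id INT, status STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE orders_txn SET status = 'shipped' WHERE id = 42;
DELETE FROM orders_txn WHERE status = 'cancelled';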

Views and Materialized Views

  • Views: Logical representations (no data storage)
  • Materialized Views: Precomputed, stored views for faster querying
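
A minimal sketch of both, built on the users table used earlier; note that materialized views require Hive 3+ and, depending on the version, transactional source tables:

CREATE VIEW adult_users AS
SELECT name, age FROM users WHERE age > 18;

-- requires Hive 3+; source tables may need to be transactional
CREATE MATERIALIZED VIEW users_by_age AS
SELECT age, COUNT(*) AS user_count
FROM users
GROUP BY age;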

Compression Support

Codec  | Characteristics
-------|--------------------------------
Gzip   | High compression ratio, slower
Snappy | Balanced speed and compression
LZO    | Fast decompression
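
Compression is usually enabled through session settings (for query output) or table properties such as orc.compress shown earlier; the Snappy choice below is illustrative:

-- compress final query output written by the job
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;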

Hive and HDFS Relationship

  • Hive stores data in HDFS
  • Managed table data goes to Hive Warehouse: /user/hive/warehouse
  • External table data can be anywhere in HDFS
  • Hive does not own HDFS; it uses it as a storage backend
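
To see where a given table's data actually lives, inspect its metadata; the Location field in the output shows the HDFS path:

DESCRIBE FORMATTED sales;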

Dynamic Partitioning

  • Enables creation of partitions dynamically during INSERT
SET hive.exec.dynamic.partition = true;
-- nonstrict mode lets every partition column be dynamic
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, year, month FROM temp_sales;

Hive Indexes (Optional Use)

  • Compact Index: Works with high-cardinality columns
  • Bitmap Index: Effective on low-cardinality columns
  • Note: Hive indexes were deprecated and removed in Hive 3.0; columnar formats (ORC/Parquet) and materialized views are the recommended alternatives.

Example:

CREATE INDEX idx ON TABLE sales (year)
AS 'COMPACT' WITH DEFERRED REBUILD;

HCatalog

  • Table and storage management layer
  • Lets Pig, MapReduce, Hive share metadata and storage

Security in Hive

  • Authentication: LDAP, Kerberos
  • Authorization: SQL-based or storage-based access control
  • Encryption: Data at rest and in motion

Resource Management

  • Hive relies on YARN for managing execution resources (CPU, memory)
  • Tez/Spark used as optimized engines instead of MapReduce

Summary: How Hive Works Internally

Query Flow

  1. Parse: Convert query to AST
  2. Analyze (Bind): Resolve tables and columns in the AST against the metastore schema
  3. Optimize: Logical plan optimization (filter pushdown, pruning)
  4. Compile: Convert plan to physical plan (MapReduce/Tez/Spark job)
  5. Execute: Job runs on Hadoop cluster
  6. Retrieve Result: Output fetched from HDFS
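
The compiled plan can be inspected without running the query by prefixing it with EXPLAIN, which prints the stages and operators Hive will execute:

EXPLAIN SELECT name, age FROM users WHERE age > 30;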

Does Partition Pruning Happen in Traditional RDBMS?

Yes, but behavior differs:

  • Hive: Prunes HDFS directories.
  • RDBMS (Oracle, PostgreSQL, MySQL): Prunes logical partitions internally.
  • Both use it to improve performance, but the mechanisms differ.

Final Notes:

  • Use ORC/Parquet with predicate pushdown for maximum efficiency.
  • Leverage partitioning + bucketing for performance tuning.
  • Use external tables when you don’t want Hive to delete or manage data.
  • Integrate Hive with BI tools via HiveServer2 (Thrift, JDBC/ODBC).

This comprehensive guide covers everything you need to know about Hive—from basic architecture to advanced optimization techniques. For Big Data engineers, Hive remains an essential tool for structured data analysis over HDFS.

