Hive: A Complete Guide to Big Data Warehousing on Hadoop
What is Hive?
Hive is an open-source data warehouse infrastructure built on top of Hadoop. It provides a SQL-like query language called HiveQL for querying and analyzing large datasets residing in HDFS (Hadoop Distributed File System). Under the hood, Hive translates HiveQL queries into MapReduce, Tez, or Spark jobs.
Key Features of Hive
- SQL-Like Interface: HiveQL supports SQL-like syntax (SELECT, WHERE, GROUP BY, JOIN, etc.)
- Scalability: Efficiently processes petabytes of data stored in HDFS.
- Schema-on-Read: Schema is applied at query time, not during data loading.
- Supports Complex Types: ARRAY, MAP, STRUCT in addition to primitive types.
- Extensibility: Create UDFs, UDAFs, UDTFs.
- Hadoop Ecosystem Integration: Works seamlessly with HDFS, YARN, MapReduce, Tez, HBase.
Hive Architecture and Components
1. Metastore
- Stores metadata about tables, columns, partitions, and their locations.
- Backed by traditional RDBMS (e.g., MySQL, PostgreSQL).
2. HiveServer2 (Thrift Server)
- Exposes JDBC/ODBC interfaces for clients to execute HiveQL queries.
- Supports concurrency, security, and session handling.
3. CLI / Beeline
- CLI: Command Line Interface for local Hive queries (legacy).
- Beeline: JDBC client to connect to HiveServer2.
4. Execution Engine
- Converts HiveQL into DAGs executed by MapReduce (legacy), Tez (the default in recent Hive versions), or Spark.
5. Storage (HDFS)
- All table data is stored in HDFS unless otherwise configured.
6. SerDe (Serializer/Deserializer)
- Converts data between Hive's row representation and the bytes stored in HDFS.
- Supports built-in and custom SerDes (e.g., JsonSerDe, RegexSerDe).
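As an illustration, a table backed by the built-in JsonSerDe might be declared like this (the table name, columns, and HDFS path are hypothetical):

```sql
-- Hypothetical table reading newline-delimited JSON from HDFS.
-- org.apache.hive.hcatalog.data.JsonSerDe ships with Hive/HCatalog.
CREATE EXTERNAL TABLE events_json (
  id INT,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/events';
```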
Hive Table Types
Managed (Internal) Tables
- Data is stored under the Hive warehouse directory (/user/hive/warehouse).
- Dropping the table deletes both the data and the metadata.
External Tables
- Data location is explicitly defined with the LOCATION clause.
- Dropping the table deletes only the metadata, not the actual data.
Example:
CREATE EXTERNAL TABLE external_table (
id INT, name STRING
) LOCATION '/path/to/data';
Partitioning and Bucketing
Partitioning
- Divides table by values of partition key(s).
- Example:
CREATE TABLE sales (
id INT, amount DOUBLE
) PARTITIONED BY (year INT, month INT);
- Improves performance by scanning only relevant partitions (partition pruning).
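For instance, a query that filters on the partition columns lets Hive read only the matching partition directories; a sketch against the sales table above:

```sql
-- Only the HDFS subdirectory year=2023/month=6 is scanned (partition pruning).
SELECT SUM(amount)
FROM sales
WHERE year = 2023 AND month = 6;
```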
Bucketing
- Divides data using hash of a column value.
- Example:
CREATE TABLE employees (
id INT, name STRING
) CLUSTERED BY (id) INTO 10 BUCKETS;
Key Differences
Feature | Partitioning | Bucketing |
---|---|---|
Basis | Column value | Hash function |
Storage | Directories | Files |
Optimization | Pruning | Sampling, Joins |
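One practical payoff of bucketing is efficient sampling; a sketch against the employees table above:

```sql
-- Reads roughly 1/10 of the data: only the first of the 10 buckets.
SELECT id, name
FROM employees
TABLESAMPLE (BUCKET 1 OUT OF 10 ON id);
```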
Hive Query Language (HiveQL)
- SQL-like: Supports SELECT and INSERT; UPDATE and DELETE are supported only on ACID transactional tables.
- Supports complex joins, aggregations, window functions.
- Supports views and materialized views.
Example:
SELECT name, age FROM users WHERE age > 30;
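A window function example in the same vein (the users table and its city column are assumed for illustration):

```sql
-- Rank users by age within each city using a standard HiveQL window function.
SELECT name,
       city,
       RANK() OVER (PARTITION BY city ORDER BY age DESC) AS age_rank
FROM users;
```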
Hive File Formats
Format | Description |
---|---|
TextFile | Default format, plain text |
SequenceFile | Binary key-value pairs |
RCFile | Record columnar format |
ORC | Columnar, supports compression and schema evolution |
Parquet | Columnar format used widely in big data tools |
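The file format is chosen per table with a STORED AS clause; a minimal sketch (table name and columns are illustrative):

```sql
-- ORC table with Snappy compression configured via a table property.
CREATE TABLE sales_orc (
  id INT,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```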
Hive Data Types
- Primitive: INT, STRING, BOOLEAN, DOUBLE
- Complex:
STRUCT<height:INT, weight:DOUBLE>,
ARRAY<STRING>,
MAP<STRING, STRING>
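A sketch of declaring and querying these complex types (all names are illustrative):

```sql
CREATE TABLE people (
  name    STRING,
  body    STRUCT<height:INT, weight:DOUBLE>,
  emails  ARRAY<STRING>,
  props   MAP<STRING, STRING>
);

-- Dot notation for STRUCT fields, [] for ARRAY indexes and MAP keys.
SELECT name,
       body.height,
       emails[0],
       props['dept']
FROM people;
```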
User-Defined Functions
- UDF: Row-wise custom logic
- UDAF: Aggregate logic (like custom SUM)
- UDTF: Generates multiple rows from single input
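The built-in explode function is the canonical UDTF; combined with LATERAL VIEW it turns one input row into many output rows (this sketch assumes a hypothetical people table with a name STRING column and an emails ARRAY&lt;STRING&gt; column):

```sql
-- Each element of the emails array becomes its own row.
SELECT name, email
FROM people
LATERAL VIEW explode(emails) e AS email;
```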
Execution Engines
Engine | Characteristics |
---|---|
MapReduce | Legacy engine, slowest |
Tez | Faster DAG-based execution |
Spark | In-memory, faster for iterative tasks |
Query Optimization Techniques
- Partition Pruning: Scans only relevant partitions
- Predicate Pushdown: Applies WHERE filter as early as possible
- Map-Side Join: Join before shuffling
- Bucketing and Sorting: For optimized joins and aggregations
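Most of these optimizations are toggled through session settings; a sketch of commonly used ones:

```sql
-- Convert joins with a small table into map-side (broadcast) joins.
SET hive.auto.convert.join = true;
-- Push WHERE predicates down toward the storage layer.
SET hive.optimize.ppd = true;
-- Use bucket metadata to plan bucket map joins.
SET hive.optimize.bucketmapjoin = true;
```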
ACID Transactions in Hive
- Supported on transactional tables (ORC storage + the 'transactional'='true' table property; older Hive versions also require bucketing)
- Enable with:
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
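With those settings in place, a transactional table can be created and modified in place; a sketch with illustrative names:

```sql
-- ACID tables must use ORC and declare transactional=true.
CREATE TABLE accounts (
  id INT,
  balance DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE accounts SET balance = balance + 100 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
```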
Views and Materialized Views
- Views: Logical representations (no data storage)
- Materialized Views: Precomputed, stored views for faster querying
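A materialized view sketch, reusing the partitioned sales table from earlier (note that in Hive 3 the source table generally must be transactional):

```sql
-- Precomputes the aggregate; Hive can rewrite matching queries to use it.
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT year, month, SUM(amount) AS total
FROM sales
GROUP BY year, month;
```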
Compression Support
Codec | Characteristics |
---|---|
Gzip | High compression, slower |
Snappy | Balanced |
LZO | Fast decompression |
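Compression of query output is controlled through session settings; a sketch using the Snappy codec:

```sql
-- Compress the final query output files with Snappy.
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec =
    org.apache.hadoop.io.compress.SnappyCodec;
```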
Hive and HDFS Relationship
- Hive stores data in HDFS
- Managed table data goes to the Hive warehouse directory: /user/hive/warehouse
- External table data can be anywhere in HDFS
- Hive doesn’t own HDFS—it uses it as a storage backend
Dynamic Partitioning
- Enables creation of partitions dynamically during INSERT
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, year, month FROM temp_sales;
(Nonstrict mode is required when no static partition value is supplied; the default strict mode demands at least one.)
Hive Indexes (Removed in Hive 3.0)
- Compact Index: works with high-cardinality columns
- Bitmap Index: effective on low-cardinality columns
- Note: Hive indexes were removed in Hive 3.0; columnar formats (ORC/Parquet) with built-in indexes and materialized views are the recommended alternatives.
Example:
CREATE INDEX idx ON TABLE sales (year)
AS 'COMPACT' WITH DEFERRED REBUILD;
HCatalog
- Table and storage management layer
- Lets Pig, MapReduce, Hive share metadata and storage
Security in Hive
- Authentication: LDAP, Kerberos
- Authorization: SQL-based or storage-based access control
- Encryption: Data at rest and in motion
Resource Management
- Hive relies on YARN for managing execution resources (CPU, memory)
- Tez/Spark used as optimized engines instead of MapReduce
Summary: How Hive Works Internally
Query Flow
- Parse: Convert query to AST
- Bind: Map AST to metastore schema
- Optimize: Logical plan optimization (filter pushdown, pruning)
- Compile: Convert plan to physical plan (MapReduce/Tez/Spark job)
- Execute: Job runs on Hadoop cluster
- Retrieve Result: Output fetched from HDFS
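The plan produced by these stages can be inspected without running the query by prefixing it with EXPLAIN:

```sql
-- Prints the optimized logical and physical plan for the query.
EXPLAIN SELECT name, age FROM users WHERE age > 30;
```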
Does Partition Pruning Happen in Traditional RDBMS?
Yes, but behavior differs:
- Hive: Prunes HDFS directories.
- RDBMS (Oracle, PostgreSQL, MySQL): Prunes logical partitions internally.
- Both use it to improve performance, but the mechanisms differ.
Final Notes:
- Use ORC/Parquet with predicate pushdown for maximum efficiency.
- Leverage partitioning + bucketing for performance tuning.
- Use external tables when you don’t want Hive to delete or manage data.
- Integrate Hive with BI tools via HiveServer2 (Thrift, JDBC/ODBC).
This comprehensive guide covers everything you need to know about Hive—from basic architecture to advanced optimization techniques. For Big Data engineers, Hive remains an essential tool for structured data analysis over HDFS.