HDFS: Hadoop Distributed File System – Complete Guide
Why HDFS?
- Designed for storing very large datasets (hundreds of GB to TB per file, scaling to petabytes per cluster)
- Provides fault-tolerance and high throughput access
- Optimized for write-once, read-many workloads
- Scales horizontally across commodity hardware
Common Terminology
| Term | Description |
|---|---|
| NameNode | Master node; manages metadata (namespace, file locations) |
| DataNode | Worker node; stores the actual data blocks |
| Block | Unit of storage in HDFS (default 128 MB) |
| Replication Factor | Number of block copies (default: 3) |
| Rack Awareness | Data placement strategy across different racks |
| Secondary NameNode | Not a backup! It checkpoints NameNode metadata |
| Standby NameNode | Backup node for HA setups (active/passive failover) |
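To confirm what these defaults actually are on a given cluster, a quick check (assuming a standard client configuration) is to query the configuration keys directly:
# Show the configured block size (bytes) and default replication factor
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication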
HDFS Architecture
+-------------------+         +----------------+
|      Client       | <-----> |    NameNode    |
+-------------------+         +----------------+
         |                            ^
         v                            |
   +-------------+             +-------------+
   |  DataNode1  |  <----->    |  DataNode2  |  ...
   +-------------+             +-------------+
- NameNode stores metadata (file locations, permissions, block mapping)
- DataNodes store actual file blocks
- Client interacts via API/CLI (write/read files)
Blocks in HDFS
- Default block size: 128 MB (configurable)
- Large files are split into blocks and stored across DataNodes
- Each block is replicated (default: 3 times)
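As a sketch, the block size can also be overridden for an individual write using the generic -D option; the 256 MB value and file names below are illustrative:
# Write a file with a 256 MB block size instead of the cluster default
hdfs dfs -D dfs.blocksize=268435456 -put bigfile.csv /user/data/
# Inspect how the file was split into blocks
hdfs fsck /user/data/bigfile.csv -files -blocks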
Replication Factor
- Ensures fault tolerance
- HDFS maintains multiple copies across different DataNodes (and racks)
- Can be configured per file/directory
hdfs dfs -setrep -w 2 /user/hadoop/datafile.txt
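The replication factor can likewise be set at write time instead of changing it afterwards (paths here are illustrative):
# Write a file with replication factor 2 from the start
hdfs dfs -D dfs.replication=2 -put datafile.txt /user/hadoop/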
Rack Awareness
- Prevents data loss due to rack switch failures
- Blocks are placed across different racks
- At least one replica is stored on a different rack
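Rack assignments come from an admin-configured topology script; to see which rack each DataNode has been mapped to (typically requires HDFS superuser rights):
# Print DataNodes grouped by rack
hdfs dfsadmin -printTopology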
Node Failures
DataNode Failure
- NameNode detects missed heartbeats and re-replicates the affected blocks to maintain the replication factor
NameNode Failure
- Legacy setup: single point of failure
- Modern setup: Hadoop HA (High Availability)
Hadoop HA Architecture
- Active NameNode + Standby NameNode
- Shared edit log storage (e.g., NFS or a Quorum Journal Manager with JournalNodes)
- ZooKeeper (via the ZKFailoverController) handles automatic failover
hdfs haadmin -getServiceState nn1
hdfs haadmin -failover nn1 nn2
Data Write in HDFS
- Client contacts the NameNode → NameNode allocates blocks and returns target DataNodes
- Client writes data block by block to the first DataNode
- Data is pipelined from there to the other replicas
- Once all blocks are written → the NameNode's metadata is updated
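A minimal way to see the outcome of the write pipeline, using an illustrative file name, is to write a file and then ask fsck where its replicas ended up:
# Write a file, then list which DataNodes hold each block replica
hdfs dfs -put events.log /user/data/
hdfs fsck /user/data/events.log -files -blocks -locations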
Data Read in HDFS
- Client requests a file → NameNode returns the block locations
- Client reads blocks directly from the nearest DataNodes
- If a replica is missing or corrupt, HDFS retries from another replica
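From the client's side a read is a single command; replica selection happens transparently (file name is illustrative):
# Stream the file; blocks are fetched from the nearest healthy replicas
hdfs dfs -cat /user/data/events.log | head -n 20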
Linux + HDFS Commands
Basic Linux File Commands
ls -l
cat filename.txt
touch file.txt
mkdir new_folder
rm -rf old_folder
Essential HDFS Commands
# Copy from local to HDFS
hdfs dfs -put file.txt /user/data/
# Copy from HDFS to local
hdfs dfs -get /user/data/file.txt .
# List HDFS directory
hdfs dfs -ls /user/data/
# Remove file/directory
hdfs dfs -rm /user/data/file.txt
hdfs dfs -rm -r /user/data/folder/
# Check replication
hdfs fsck /user/data/file.txt -files -blocks -locations
# Change block replication
hdfs dfs -setrep -w 2 /user/data/file.txt
HDFS Scripts (Bash Examples)
Upload Daily File
#!/bin/bash
DATE=$(date +%F)
LOCAL_FILE="/data/input/data_$DATE.csv"
HDFS_PATH="/user/data/input/"
# Quote variables and fail clearly if the local file is missing
if [ -f "$LOCAL_FILE" ]; then
  hdfs dfs -put "$LOCAL_FILE" "$HDFS_PATH"
else
  echo "Missing input file: $LOCAL_FILE" >&2
  exit 1
fi
Check Disk Usage
hdfs dfs -du -h /user/data/
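If quotas are set on the directory, usage can also be checked against them:
# Show name/space quotas and current usage for the directory
hdfs dfs -count -q -h /user/data/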
Use Cases of HDFS
| Use Case | Description |
|---|---|
| Big Data Storage | Store large log files, events, IoT data, images, etc. |
| Batch Processing | Data input/output for Hive, Pig, Spark, and MapReduce jobs |
| Data Lake | Store structured + unstructured data at scale |
| Data Backup | Cost-effective backup of raw/processed data |
Tips
- Use HDFS for append-only, sequential access data
- Avoid random writes or too many small files
- Monitor cluster health via:
hdfs dfsadmin -report
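It is also worth confirming the NameNode is not stuck in safemode during routine checks:
hdfs dfsadmin -safemode get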
With the fundamentals covered, let's put them into practice step by step.
Step 1: HDFS + Spark Integration Mini-Project (ETL Flow)
Project: Log File Analysis from HDFS Using PySpark
Objective:
Read logs from HDFS, clean and transform them using Spark, and write clean data back to HDFS.
Project Setup
Sample Folder Structure:
/data/logs/raw_logs/            (local input)
- access_log_2025-06-12.txt
- access_log_2025-06-11.txt
/user/data/logs/cleaned/        (HDFS output)
Step 2: Mini-Project Plan
Step A: Upload raw log files to HDFS
hdfs dfs -mkdir -p /user/data/logs/raw_logs/
hdfs dfs -put access_log_2025-06-12.txt /user/data/logs/raw_logs/
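If several daily logs need to be uploaded, a simple loop (local path pattern assumed) keeps the step repeatable:
# Upload all local access logs, overwriting any existing copies in HDFS
for f in /data/logs/raw_logs/access_log_*.txt; do
  hdfs dfs -put -f "$f" /user/data/logs/raw_logs/
done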
Step B: PySpark Script to Clean Logs
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col
spark = SparkSession.builder \
.appName("HDFS Log ETL") \
.getOrCreate()
# Read raw logs from HDFS
log_df = spark.read.text("hdfs:///user/data/logs/raw_logs/")
# Extract common log format components
log_cleaned = log_df.select(
regexp_extract('value', r'(^\S+)', 1).alias('ip'),
regexp_extract('value', r'\[(.*?)\]', 1).alias('timestamp'),
regexp_extract('value', r'\"(GET|POST|PUT|DELETE)\s(\S+)', 2).alias('endpoint'),
regexp_extract('value', r'HTTP/\d.\d\"\s(\d+)', 1).alias('status_code')
)
log_cleaned.write.mode("overwrite").parquet("hdfs:///user/data/logs/cleaned/")
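To run this on a cluster, save the script (the file name and master setting below are assumptions) and submit it with spark-submit:
# Submit the ETL job to YARN; adjust --master for a local or standalone setup
spark-submit --master yarn log_etl.py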
Step C: HDFS Validation
# List the cleaned output; Parquet is a binary format, so verify contents by reading it back in Spark rather than with -cat
hdfs dfs -ls /user/data/logs/cleaned/
hdfs dfs -du -h /user/data/logs/cleaned/
Step D: Optional Aggregation
log_cleaned.groupBy("status_code").count().show()
Step 3: Admin Operations
- Monitor health:
hdfs dfsadmin -report
- Block info:
hdfs fsck /user/data/logs/raw_logs/ -files -blocks
- List replication:
hdfs fsck /user/data/logs/raw_logs/ -files -blocks -locations
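If the report shows uneven disk usage across DataNodes, the balancer can redistribute blocks; the 10% threshold below is illustrative:
# Move blocks until each DataNode's usage is within 10% of the cluster average
hdfs balancer -threshold 10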