HDFS: Hadoop Distributed File System – Complete Guide


✅ Why HDFS?

  • Designed for storing very large datasets (hundreds of GB to TB per file, scaling to PB per cluster)
  • Provides fault-tolerance and high throughput access
  • Optimized for write-once, read-many workloads
  • Scales horizontally across commodity hardware

🔑 Common Terminology

Term                  Description
NameNode              Master node; manages metadata (namespace, permissions, file-to-block mapping)
DataNode              Worker node; stores the actual data blocks
Block                 Unit of storage in HDFS (default 128 MB)
Replication Factor    Number of copies of each block (default: 3)
Rack Awareness        Data placement strategy across different racks
Secondary NameNode    Not a backup! It checkpoints NameNode metadata
Standby NameNode      Backup node for an HA setup (active/passive failover)

๐Ÿ—๏ธ HDFS Architecture

+-------------------+       +----------------+
|      Client       | <-->  |    NameNode    |
+-------------------+       +----------------+
          |                        ^
          v                        |
    +-------------+         +-------------+
    |  DataNode1  |  <-->   |  DataNode2  |  ...
    +-------------+         +-------------+

  • NameNode stores metadata (file locations, permissions, block mapping)
  • DataNodes store the actual file blocks
  • Client interacts via API/CLI to write and read files (a quick CLI illustration follows below)
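
As a rough illustration of that split (the paths are hypothetical), metadata-only commands are answered by the NameNode, while commands that move file contents stream blocks to or from DataNodes:

hdfs dfs -mkdir -p /user/demo               # metadata-only: handled by the NameNode
hdfs dfs -put report.csv /user/demo/        # file contents stream to DataNodes
hdfs dfs -ls /user/demo                     # listing served from NameNode metadata
hdfs dfs -cat /user/demo/report.csv         # block data read back from DataNodes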

🧱 Blocks in HDFS

  • Default block size: 128 MB (configurable)
  • Large files are split into blocks and stored across DataNodes
  • Each block is replicated (default: 3 times); a quick way to inspect both settings is shown below
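
To check the configured block size and see how a specific file was split (the file path is hypothetical):

# Cluster default block size, in bytes (134217728 = 128 MB)
hdfs getconf -confKey dfs.blocksize

# Blocks and replica locations for one file
hdfs fsck /user/data/big_dataset.csv -files -blocks -locations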

🔁 Replication Factor

  • Ensures fault tolerance
  • HDFS maintains multiple copies across different DataNodes (and racks)
  • Can be configured per file or directory, for example:

hdfs dfs -setrep -w 2 /user/hadoop/datafile.txt
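
To confirm the new factor took effect (%r prints the replication factor, %o the block size):

hdfs dfs -stat "replication=%r block-size=%o" /user/hadoop/datafile.txt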

🌐 Rack Awareness

  • Prevents data loss due to rack switch failures
  • Blocks are placed across different racks
  • With the default placement policy, at least one replica lands on a different rack (a topology-script sketch follows below)
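
Hadoop learns the rack layout from a topology script referenced by the net.topology.script.file.name property. The mapping below is a made-up sketch (your subnets and rack IDs will differ):

#!/bin/bash
# Hypothetical topology script: print a rack ID for each host/IP Hadoop passes in
for node in "$@"; do
  case "$node" in
    10.0.1.*) echo "/rack-1" ;;
    10.0.2.*) echo "/rack-2" ;;
    *)        echo "/default-rack" ;;
  esac
done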

⚠️ Node Failures

🔹 DataNode Failure

  • The NameNode detects the failure via missed heartbeats and re-replicates the affected blocks on other DataNodes to restore the replication factor (see the checks below)
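
To see the impact of a lost DataNode, check for dead nodes and under-replicated blocks (the grep patterns assume the usual report/fsck wording, which can vary slightly by Hadoop version):

# Live vs. dead DataNodes
hdfs dfsadmin -report | grep -i -A 2 "dead datanodes"

# Block health summary, including under-replicated and missing blocks
hdfs fsck / | grep -iE "under-replicated|missing|corrupt"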

🔹 NameNode Failure

  • Legacy setup: the single NameNode is a single point of failure
  • Modern setup: Hadoop HA (High Availability) with an active/standby NameNode pair

🔁 Hadoop HA Architecture

  • Active NameNode + Standby NameNode
  • Shared edit log (Quorum Journal Manager with JournalNodes, or NFS)
  • ZooKeeper (via the ZKFailoverController) handles automatic failover

hdfs haadmin -getServiceState nn1
hdfs haadmin -failover nn1 nn2
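
To check every configured NameNode in one pass (nn1/nn2 are the HA service IDs from hdfs-site.xml; yours may be named differently):

# List the configured NameNodes, then query each HA service ID
hdfs getconf -namenodes
for nn in nn1 nn2; do
  echo -n "$nn: "
  hdfs haadmin -getServiceState "$nn"
done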

📥 Data Write in HDFS

  1. Client asks the NameNode to create the file → NameNode allocates blocks and returns target DataNodes
  2. Client writes the data block by block to the first DataNode
  3. Each DataNode pipelines the data on to the next replica
  4. Once all blocks are written and acknowledged → the NameNode updates the file metadata (end-to-end example below)
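
End to end, a write followed by a look at where its blocks and replicas landed (the file name is just an example):

hdfs dfs -put sensor_dump.csv /user/data/
hdfs fsck /user/data/sensor_dump.csv -files -blocks -locations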

📤 Data Read in HDFS

  1. Client requests the file → NameNode returns the block locations
  2. Client reads the blocks directly from the nearest DataNodes
  3. If a replica is unavailable or corrupt, the client falls back to another replica of that block

💻 Linux + HDFS Commands

Basic Linux File Commands

ls -l
cat filename.txt
touch file.txt
mkdir new_folder
rm -rf old_folder

Essential HDFS Commands

# Copy from local to HDFS
hdfs dfs -put file.txt /user/data/

# Copy from HDFS to local
hdfs dfs -get /user/data/file.txt .

# List HDFS directory
hdfs dfs -ls /user/data/

# Remove file/directory
hdfs dfs -rm /user/data/file.txt
hdfs dfs -rm -r /user/data/folder/

# Check replication
hdfs fsck /user/data/file.txt -files -blocks -locations

# Change block replication
hdfs dfs -setrep -w 2 /user/data/file.txt

🧾 HDFS Scripts (Bash Examples)

Upload Daily File

#!/bin/bash
# Upload today's CSV drop into HDFS
DATE=$(date +%F)
LOCAL_FILE="/data/input/data_$DATE.csv"
HDFS_PATH="/user/data/input/"

hdfs dfs -put "$LOCAL_FILE" "$HDFS_PATH" || { echo "Upload failed: $LOCAL_FILE" >&2; exit 1; }

Check Disk Usage

hdfs dfs -du -h /user/data/
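
A small companion sketch for capacity monitoring. It assumes the standard hdfs dfs -df column layout (Filesystem, Size, Used, Available, Use%), which can vary slightly by version, and the 80% threshold is arbitrary:

#!/bin/bash
# Warn when overall HDFS usage crosses a threshold (Use% assumed to be column 5)
THRESHOLD=80
USED_PCT=$(hdfs dfs -df / | awk 'NR==2 {gsub("%", "", $5); print int($5)}')

if [ "$USED_PCT" -ge "$THRESHOLD" ]; then
    echo "WARNING: HDFS usage at ${USED_PCT}% (threshold ${THRESHOLD}%)"
fi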

💼 Use Cases of HDFS

Use Case            Description
Big Data Storage    Store large log files, events, IoT data, images, etc.
Batch Processing    Data input/output for Hive, Pig, Spark, and MapReduce jobs
Data Lake           Store structured + unstructured data at scale
Data Backup         Cost-effective backup of raw/processed data

📌 Tips

  • Use HDFS for append-only, sequential access data
  • Avoid random writes (HDFS does not support them) and large numbers of small files, which strain NameNode memory; see the file-count check below
  • Monitor via hdfs dfsadmin -report
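
To gauge whether a directory is drifting toward a small-files problem, count its files and total size (path is illustrative):

# Output columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
hdfs dfs -count -h /user/data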

With the fundamentals covered, let's move forward step by step into a hands-on mini-project:


🔧 Step 1: HDFS + Spark Integration Mini-Project (ETL Flow)

💼 Project: Log File Analysis from HDFS Using PySpark

🔹 Objective:

Read logs from HDFS, clean and transform them using Spark, and write clean data back to HDFS.


✅ Project Setup

📁 Sample Folder Structure:

Local staging area:

/data/logs/raw_logs/
    - access_log_2025-06-12.txt
    - access_log_2025-06-11.txt

Target layout in HDFS:

/user/data/logs/raw_logs/      (raw input, uploaded in Step A)
/user/data/logs/cleaned/       (Parquet output written in Step B)

🚀 Step 2: Mini-Project Plan

🔹 Step A: Upload raw log files to HDFS

hdfs dfs -mkdir -p /user/data/logs/raw_logs/
hdfs dfs -put access_log_2025-06-12.txt /user/data/logs/raw_logs/

🔹 Step B: PySpark Script to Clean Logs

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder \
    .appName("HDFS Log ETL") \
    .getOrCreate()

# Read raw logs from HDFS
log_df = spark.read.text("hdfs:///user/data/logs/raw_logs/")

# Extract common log format components
log_cleaned = log_df.select(
    regexp_extract('value', r'(^\S+)', 1).alias('ip'),
    regexp_extract('value', r'\[(.*?)\]', 1).alias('timestamp'),
    regexp_extract('value', r'"(GET|POST|PUT|DELETE)\s(\S+)', 2).alias('endpoint'),
    regexp_extract('value', r'HTTP/\d\.\d"\s(\d+)', 1).alias('status_code')
)

log_cleaned.write.mode("overwrite").parquet("hdfs:///user/data/logs/cleaned/")
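
To run the script on a cluster (the script name clean_logs.py and the YARN settings are assumptions; adjust for your environment):

spark-submit --master yarn --deploy-mode client clean_logs.py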

🔹 Step C: HDFS Validation

hdfs dfs -ls /user/data/logs/cleaned/

# Parquet is a binary format, so -cat is not useful here; check sizes instead,
# or read the output back with spark.read.parquet(...) to inspect the rows
hdfs dfs -du -h /user/data/logs/cleaned/

🔹 Step D: Optional Aggregation

log_cleaned.groupBy("status_code").count().show()

🛠️ Step 3: Admin Operations

  • Monitor health: hdfs dfsadmin -report (a few more routine checks follow below)
  • Block info: hdfs fsck /user/data/logs/raw_logs/ -files -blocks
  • List replication: hdfs fsck /user/data/logs/raw_logs/ -files -blocks -locations
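
A few more routine checks that come up in day-to-day administration (the 10% balancer threshold is just a common starting point):

# Is the NameNode in safe mode?
hdfs dfsadmin -safemode get

# Rebalance block distribution across DataNodes
hdfs balancer -threshold 10

# Overall capacity view
hdfs dfs -df -h /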
