HDFS: Hadoop Distributed File System – Complete Guide
Why HDFS?
- Designed for storing very large datasets (hundreds of GB to TB per file, scaling to petabytes per cluster)
- Provides fault-tolerance and high throughput access
- Optimized for write-once, read-many workloads
- Scales horizontally across commodity hardware
Common Terminology
| Term | Description |
|---|---|
| NameNode | Master node; manages metadata (namespace, file locations) |
| DataNode | Worker node; stores the actual data blocks |
| Block | Unit of storage in HDFS (default 128 MB) |
| Replication Factor | Number of block copies (default: 3) |
| Rack Awareness | Data placement strategy across different racks |
| Secondary NameNode | Not a backup! It checkpoints NameNode metadata |
| Standby NameNode | Backup node for HA setups (active/passive failover) |
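To confirm what these defaults actually are on a given cluster, a quick check (assuming a standard client configuration) is to query the configuration keys directly:
# Show the configured block size (bytes) and default replication factor
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication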
HDFS Architecture
+-------------------+         +----------------+
|      Client       | <-----> |    NameNode    |
+-------------------+         +----------------+
         |                            ^
         v                            |
   +-------------+             +-------------+
   |  DataNode1  |  <----->    |  DataNode2  |  ...
   +-------------+             +-------------+
- NameNode stores metadata (file locations, permissions, block mapping)
- DataNodes store actual file blocks
- Client interacts via API/CLI (write/read files)
Blocks in HDFS
- Default block size: 128 MB (configurable)
- Large files are split into blocks and stored across DataNodes
- Each block is replicated (default: 3 times)
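As a sketch, the block size can also be overridden for an individual write using the generic -D option; the 256 MB value and file names below are illustrative:
# Write a file with a 256 MB block size instead of the cluster default
hdfs dfs -D dfs.blocksize=268435456 -put bigfile.csv /user/data/
# Inspect how the file was split into blocks
hdfs fsck /user/data/bigfile.csv -files -blocks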
Replication Factor
- Ensures fault tolerance
- HDFS maintains multiple copies across different DataNodes (and racks)
- Can be configured per file/directory
hdfs dfs -setrep -w 2 /user/hadoop/datafile.txt
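The replication factor can likewise be set at write time instead of changing it afterwards (paths here are illustrative):
# Write a file with replication factor 2 from the start
hdfs dfs -D dfs.replication=2 -put datafile.txt /user/hadoop/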
Rack Awareness
- Prevents data loss due to rack switch failures
- Blocks are placed across different racks
- At least one replica is stored on a different rack
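Rack assignments come from an admin-configured topology script; to see which rack each DataNode has been mapped to (typically requires HDFS superuser rights):
# Print DataNodes grouped by rack
hdfs dfsadmin -printTopology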
Node Failures
DataNode Failure
- NameNode detects missed heartbeats and re-replicates the affected blocks to maintain the replication factor
NameNode Failure
- Legacy setup: single point of failure
- Modern setup: Hadoop HA (High Availability)
Hadoop HA Architecture
- Active NameNode + Standby NameNode
- Shared edit log storage (e.g., NFS or a Quorum Journal Manager with JournalNodes)
- ZooKeeper (via the ZKFailoverController) handles automatic failover
hdfs haadmin -getServiceState nn1
hdfs haadmin -failover nn1 nn2
Data Write in HDFS
- Client contacts the NameNode → NameNode allocates blocks and returns target DataNodes
- Client writes data block by block to the first DataNode
- Data is pipelined from there to the other replicas
- Once all blocks are written → the NameNode's metadata is updated
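A minimal way to see the outcome of the write pipeline, using an illustrative file name, is to write a file and then ask fsck where its replicas ended up:
# Write a file, then list which DataNodes hold each block replica
hdfs dfs -put events.log /user/data/
hdfs fsck /user/data/events.log -files -blocks -locations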
Data Read in HDFS
- Client requests a file → NameNode returns the block locations
- Client reads blocks directly from the nearest DataNodes
- If a replica is missing or corrupt, HDFS retries from another replica
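From the client's side a read is a single command; replica selection happens transparently (file name is illustrative):
# Stream the file; blocks are fetched from the nearest healthy replicas
hdfs dfs -cat /user/data/events.log | head -n 20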
Linux + HDFS Commands
Basic Linux File Commands
ls -l
cat filename.txt
touch file.txt
mkdir new_folder
rm -rf old_folder
Essential HDFS Commands
# Copy from local to HDFS
hdfs dfs -put file.txt /user/data/
# Copy from HDFS to local
hdfs dfs -get /user/data/file.txt .
# List HDFS directory
hdfs dfs -ls /user/data/
# Remove file/directory
hdfs dfs -rm /user/data/file.txt
hdfs dfs -rm -r /user/data/folder/
# Check replication
hdfs fsck /user/data/file.txt -files -blocks -locations
# Change block replication
hdfs dfs -setrep -w 2 /user/data/file.txt
HDFS Scripts (Bash Examples)
Upload Daily File
#!/bin/bash
DATE=$(date +%F)
LOCAL_FILE="/data/input/data_$DATE.csv"
HDFS_PATH="/user/data/input/"
# Quote variables and fail clearly if the local file is missing
if [ -f "$LOCAL_FILE" ]; then
  hdfs dfs -put "$LOCAL_FILE" "$HDFS_PATH"
else
  echo "Missing input file: $LOCAL_FILE" >&2
  exit 1
fi
Check Disk Usage
hdfs dfs -du -h /user/data/
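If quotas are set on the directory, usage can also be checked against them:
# Show name/space quotas and current usage for the directory
hdfs dfs -count -q -h /user/data/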
Use Cases of HDFS
| Use Case | Description |
|---|---|
| Big Data Storage | Store large log files, events, IoT data, images, etc. |
| Batch Processing | Data input/output for Hive, Pig, Spark, and MapReduce jobs |
| Data Lake | Store structured + unstructured data at scale |
| Data Backup | Cost-effective backup of raw/processed data |
Tips
- Use HDFS for append-only, sequential access data
- Avoid random writes or too many small files
- Monitor cluster health via:
hdfs dfsadmin -report
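It is also worth confirming the NameNode is not stuck in safemode during routine checks:
hdfs dfsadmin -safemode get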
With the fundamentals covered, let's put them into practice step by step.
Step 1: HDFS + Spark Integration Mini-Project (ETL Flow)
Project: Log File Analysis from HDFS Using PySpark
Objective:
Read logs from HDFS, clean and transform them using Spark, and write clean data back to HDFS.
Project Setup
Sample Folder Structure:
/data/logs/raw_logs/            (local input)
- access_log_2025-06-12.txt
- access_log_2025-06-11.txt
/user/data/logs/cleaned/        (HDFS output)
Step 2: Mini-Project Plan
Step A: Upload raw log files to HDFS
hdfs dfs -mkdir -p /user/data/logs/raw_logs/
hdfs dfs -put access_log_2025-06-12.txt /user/data/logs/raw_logs/
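If several daily logs need to be uploaded, a simple loop (local path pattern assumed) keeps the step repeatable:
# Upload all local access logs, overwriting any existing copies in HDFS
for f in /data/logs/raw_logs/access_log_*.txt; do
  hdfs dfs -put -f "$f" /user/data/logs/raw_logs/
done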
Step B: PySpark Script to Clean Logs
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col
spark = SparkSession.builder \
.appName("HDFS Log ETL") \
.getOrCreate()
# Read raw logs from HDFS
log_df = spark.read.text("hdfs:///user/data/logs/raw_logs/")
# Extract common log format components
log_cleaned = log_df.select(
regexp_extract('value', r'(^\S+)', 1).alias('ip'),
regexp_extract('value', r'\[(.*?)\]', 1).alias('timestamp'),
regexp_extract('value', r'\"(GET|POST|PUT|DELETE)\s(\S+)', 2).alias('endpoint'),
regexp_extract('value', r'HTTP/\d.\d\"\s(\d+)', 1).alias('status_code')
)
log_cleaned.write.mode("overwrite").parquet("hdfs:///user/data/logs/cleaned/")
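To run this on a cluster, save the script (the file name and master setting below are assumptions) and submit it with spark-submit:
# Submit the ETL job to YARN; adjust --master for a local or standalone setup
spark-submit --master yarn log_etl.py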
Step C: HDFS Validation
# List the cleaned output; Parquet is a binary format, so verify contents by reading it back in Spark rather than with -cat
hdfs dfs -ls /user/data/logs/cleaned/
hdfs dfs -du -h /user/data/logs/cleaned/
Step D: Optional Aggregation
log_cleaned.groupBy("status_code").count().show()
Step 3: Admin Operations
- Monitor health:
hdfs dfsadmin -report
- Block info:
hdfs fsck /user/data/logs/raw_logs/ -files -blocks
- List replication:
hdfs fsck /user/data/logs/raw_logs/ -files -blocks -locations
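If the report shows uneven disk usage across DataNodes, the balancer can redistribute blocks; the 10% threshold below is illustrative:
# Move blocks until each DataNode's usage is within 10% of the cluster average
hdfs balancer -threshold 10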