Exploring a Hadoop Cluster involves understanding its architecture, components, setup, and how to interact with and manage the system. Below is a structured guide to help you explore a Hadoop cluster effectively, from basic to advanced tasks.
1. Hadoop Cluster Architecture Overview
Components:
| Component | Description |
|---|---|
| NameNode | Master daemon managing HDFS metadata and the namespace |
| DataNode | Worker daemon storing the actual HDFS blocks |
| ResourceManager (YARN) | Allocates resources and schedules jobs (MapReduce, Spark, etc.) |
| NodeManager | Runs on each worker node to monitor and manage resource usage |
| Secondary NameNode / Checkpoint Node | Assists the NameNode by merging the FSImage and EditLogs |
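A quick way to see which of these daemons a given host is actually running is the JDK's jps tool; the sample output below is illustrative for a typical master node.
# List Hadoop daemons (JVM processes) on this host; each line is "<pid> <DaemonClass>"
jps
# e.g. 2345 NameNode
#      2789 ResourceManager
#      3012 JobHistoryServer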
2. Basic Commands to Explore a Hadoop Cluster
HDFS Exploration
# List files in HDFS
hdfs dfs -ls /
# View file contents
hdfs dfs -cat /user/hadoop/filename.txt
# Disk usage
hdfs dfs -du -h /
# File system report
hdfs dfsadmin -report
NameNode Status
hdfs dfsadmin -report # List DataNodes and storage info
hdfs dfsadmin -metasave metadata.txt # Dump NameNode metadata (written under the NameNode's log directory) for debugging
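Another quick health check is safe mode, the read-only state the NameNode enters at startup or when too many blocks are under-replicated:
# Check whether the NameNode is in safe mode
hdfs dfsadmin -safemode get
# Force it out of safe mode if it is stuck there
hdfs dfsadmin -safemode leave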
3. Cluster Node Information
Check Live/Dead Nodes:
hdfs dfsadmin -report
Useful Web UIs:
| Service | Default Port | URL |
|---|---|---|
| NameNode UI | 9870 (50070 in Hadoop 2.x) | http://namenode-host:9870 |
| ResourceManager | 8088 | http://resourcemanager:8088 |
| History Server | 19888 | http://historyserver:19888 |
| Job Tracker | 50030 | (old MRv1 only) |
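These UIs also expose machine-readable endpoints, useful when no browser is handy; the hostnames below follow the table above and should be adjusted to your cluster:
# NameNode metrics as JSON, via the JMX servlet
curl http://namenode-host:9870/jmx
# Cluster-wide YARN metrics via the ResourceManager REST API
curl http://resourcemanager:8088/ws/v1/cluster/metrics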
4. YARN and Job Monitoring
List Running Applications
yarn application -list
Kill an Application
yarn application -kill <Application-ID>
Check NodeManager Health
yarn node -list
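For a single node's detailed health report, pass the node ID shown by yarn node -list (the ID below is a placeholder):
yarn node -status <Node-ID>   # Health report, containers, and resource usage for one node
yarn node -list -all          # Include nodes in all states (e.g. UNHEALTHY, LOST)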
5. Sample Job Submission
Run a WordCount Example:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
/input /output
6. Cluster Health and Logs
Check File System Health
hdfs fsck / # Check file system consistency
hdfs fsck / -files -blocks -locations # Show per-file blocks and their locations (-blocks must follow -files)
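fsck can also hunt for damaged data directly:
hdfs fsck / -list-corruptfileblocks # List files with corrupt or missing blocks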
View Log Files (locally on a node)
cd $HADOOP_HOME/logs
less hadoop-hadoop-namenode-*.log # Log names follow hadoop-<user>-<daemon>-<host>.log
Fetch aggregated logs for a YARN application:
yarn logs -applicationId <application_id>
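A common pattern is piping the aggregated logs through grep to surface failures (the application ID is a placeholder):
yarn logs -applicationId <application_id> | grep -iE "error|exception"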
7. Advanced Tools for Exploration
| Tool | Purpose |
|---|---|
| Ambari | Cluster management UI |
| Cloudera Manager | Monitoring, managing, and alerting |
| Ganglia / Prometheus | Monitoring and metrics |
| Hue | UI for browsing HDFS and running queries |
8. Troubleshooting Tips
- Disk full on a DataNode? Check usage under /data/dfs (or wherever dfs.datanode.data.dir points); a quick check is sketched below.
- High NameNode memory? Review the edit log and fsimage sizes.
- Node not responding? Check the NodeManager or DataNode logs on that host.
- Slow jobs? Check job counters, task logs, and shuffle statistics.
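A minimal disk-space check for the first tip, assuming block storage lives under /data/dfs; the exact dfsadmin report field names can vary slightly between Hadoop versions:
# Local disk usage on the DataNode host
df -h /data/dfs
# Per-DataNode capacity and remaining space as HDFS sees it
hdfs dfsadmin -report | grep -E "Name:|DFS Used%|DFS Remaining"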
Recommended Exercises
- List top 5 biggest files in HDFS
- Find replication factor of a file
- Submit a sample MapReduce job
- Check the status of all nodes
- Write a file to HDFS and read it back
Here is how to perform each of these exercises using the HDFS and YARN CLI tools.
1. List the Top 5 Biggest Files in HDFS
hdfs dfs -ls -R / | sort -k5 -nr | head -5
Explanation:
- -ls -R /: recursively lists every file under /; the fifth column is the file size in bytes.
- sort -k5 -nr: sorts numerically on the size column, largest first.
- head -5: keeps the top 5 entries.
You can replace / with another directory, such as /user/hadoop/, if needed.
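If you want the biggest top-level directories rather than individual files, a du-based variant works; sizes are printed in bytes, so plain numeric sort applies:
hdfs dfs -du / | sort -nr | head -5 # Largest immediate children of /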
2. Find the Replication Factor of a File
hdfs fsck /path/to/file -files -blocks -locations
Output will include:
- File block info
- Replication count per block
Example:
hdfs fsck /user/hadoop/mydata.txt -files -blocks -locations
You can also get replication using:
hdfs dfs -stat %r /path/to/file
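Related: you can change a file's replication factor with setrep; the -w flag waits until re-replication finishes:
hdfs dfs -setrep -w 2 /path/to/file # Set replication factor to 2 and wait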
3. Submit a Sample MapReduce Job
WordCount Example (ships with the Hadoop distribution):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
/input /output
Make sure /input exists in HDFS and /output does not already exist; MapReduce fails rather than overwrite an existing output directory.
Sample steps:
# Upload input file
hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input/
# Run job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
# View output
hdfs dfs -cat /output/part-r-00000
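To pull the results out of HDFS onto the local filesystem (paths assume the job above):
hdfs dfs -getmerge /output wordcount-result.txt # Merge all part files into one local file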
4. Check the Status of All Nodes
yarn node -list
Shows:
- Node address
- Node status (RUNNING/UNHEALTHY)
- Containers
- Memory usage
You can also view it at:
http://<ResourceManager-host>:8088/cluster/nodes
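The same node list is available as JSON from the ResourceManager REST API, which is handy for scripting:
curl http://<ResourceManager-host>:8088/ws/v1/cluster/nodes # All nodes and their state, as JSON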
5. Write a File to HDFS and Read It Back
Write to HDFS:
echo "Hadoop is powerful" > sample.txt
hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put sample.txt /user/hadoop/demo/
Read from HDFS:
hdfs dfs -cat /user/hadoop/demo/sample.txt
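To verify the round trip, check the stored file's size and replication with stat (%b prints the size in bytes, %r the replication factor):
hdfs dfs -stat "%b bytes, replication %r" /user/hadoop/demo/sample.txt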
Bonus: Clean Up
hdfs dfs -rm -r /input /output /user/hadoop/demo
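If HDFS trash is enabled on the cluster, deleted files linger in the user's .Trash directory; add -skipTrash to reclaim the space immediately:
hdfs dfs -rm -r -skipTrash /input /output /user/hadoop/demo # Delete permanently, bypassing trash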