Exploring a Hadoop Cluster involves understanding its architecture, components, setup, and how to interact with and manage the system. Below is a structured guide to help you explore a Hadoop cluster effectively, from basic to advanced tasks.
1. Hadoop Cluster Architecture Overview
Components:
| Component | Description |
|---|---|
| NameNode | Master daemon managing HDFS metadata and the namespace |
| DataNode | Worker daemon storing the actual HDFS blocks |
| ResourceManager (YARN) | Allocates resources and schedules jobs (MapReduce, Spark, etc.) |
| NodeManager | Runs on each worker node to monitor and manage resource usage |
| Secondary NameNode / Checkpoint Node | Assists the NameNode by merging the FSImage and EditLogs |
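A quick way to see which of these daemons a given host is actually running is the JDK's jps tool; the sample output below is illustrative for a typical master node.
# List Hadoop daemons (JVM processes) on this host; each line is "<pid> <DaemonClass>"
jps
# e.g. 2345 NameNode
#      2789 ResourceManager
#      3012 JobHistoryServer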
2. Basic Commands to Explore a Hadoop Cluster
HDFS Exploration
# List files in HDFS
hdfs dfs -ls /
# View file contents
hdfs dfs -cat /user/hadoop/filename.txt
# Disk usage
hdfs dfs -du -h /
# File system report
hdfs dfsadmin -report
NameNode Status
hdfs dfsadmin -report # List DataNodes and storage info
hdfs dfsadmin -metasave metadata.txt # Dump NameNode metadata (written under the NameNode's log directory) for debugging
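Another quick health check is safe mode, the read-only state the NameNode enters at startup or when too many blocks are under-replicated:
# Check whether the NameNode is in safe mode
hdfs dfsadmin -safemode get
# Force it out of safe mode if it is stuck there
hdfs dfsadmin -safemode leave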
3. Cluster Node Information
Check Live/Dead Nodes:
hdfs dfsadmin -report
Useful Web UIs:
| Service | Default Port | URL |
|---|---|---|
| NameNode UI | 9870 (50070 in Hadoop 2.x) | http://namenode-host:9870 |
| ResourceManager | 8088 | http://resourcemanager:8088 |
| History Server | 19888 | http://historyserver:19888 |
| Job Tracker | 50030 | (old MRv1 only) |
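These UIs also expose machine-readable endpoints, useful when no browser is handy; the hostnames below follow the table above and should be adjusted to your cluster:
# NameNode metrics as JSON, via the JMX servlet
curl http://namenode-host:9870/jmx
# Cluster-wide YARN metrics via the ResourceManager REST API
curl http://resourcemanager:8088/ws/v1/cluster/metrics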
4. YARN and Job Monitoring
List Running Applications
yarn application -list
Kill an Application
yarn application -kill <Application-ID>
Check NodeManager Health
yarn node -list
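For a single node's detailed health report, pass the node ID shown by yarn node -list (the ID below is a placeholder):
yarn node -status <Node-ID>   # Health report, containers, and resource usage for one node
yarn node -list -all          # Include nodes in all states (e.g. UNHEALTHY, LOST)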
5. Sample Job Submission
Run a WordCount Example:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
/input /output
6. Cluster Health and Logs
Check File System Health
hdfs fsck / # Check file system consistency
hdfs fsck / -files -blocks -locations # Show per-file blocks and their locations (-blocks must follow -files)
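fsck can also hunt for damaged data directly:
hdfs fsck / -list-corruptfileblocks # List files with corrupt or missing blocks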
View Log Files (locally on a node)
cd $HADOOP_HOME/logs
less hadoop-hadoop-namenode-*.log # Log names follow hadoop-<user>-<daemon>-<host>.log
Fetch aggregated logs for a YARN application:
yarn logs -applicationId <application_id>
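A common pattern is piping the aggregated logs through grep to surface failures (the application ID is a placeholder):
yarn logs -applicationId <application_id> | grep -iE "error|exception"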
7. Advanced Tools for Exploration
| Tool | Purpose |
|---|---|
| Ambari | Cluster management UI |
| Cloudera Manager | Monitoring, managing, and alerting |
| Ganglia / Prometheus | Monitoring and metrics |
| Hue | UI for browsing HDFS and running queries |
8. Troubleshooting Tips
- Disk full on a DataNode? Check usage under /data/dfs (or wherever dfs.datanode.data.dir points); a quick check is sketched below.
- High NameNode memory? Review the edit log and fsimage sizes.
- Node not responding? Check the NodeManager or DataNode logs on that host.
- Slow jobs? Check job counters, task logs, and shuffle statistics.
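A minimal disk-space check for the first tip, assuming block storage lives under /data/dfs; the exact dfsadmin report field names can vary slightly between Hadoop versions:
# Local disk usage on the DataNode host
df -h /data/dfs
# Per-DataNode capacity and remaining space as HDFS sees it
hdfs dfsadmin -report | grep -E "Name:|DFS Used%|DFS Remaining"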
Recommended Exercises
- List top 5 biggest files in HDFS
- Find replication factor of a file
- Submit a sample MapReduce job
- Check the status of all nodes
- Write a file to HDFS and read it back
Here is how to perform each of these exercises using the HDFS and YARN CLI tools.
1. List the Top 5 Biggest Files in HDFS
hdfs dfs -ls -R / | sort -k5 -nr | head -5
Explanation:
- -ls -R /: recursively lists every file under /; the fifth column is the file size in bytes.
- sort -k5 -nr: sorts numerically on the size column, largest first.
- head -5: keeps the top 5 entries.
You can replace / with another directory, such as /user/hadoop/, if needed.
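If you want the biggest top-level directories rather than individual files, a du-based variant works; sizes are printed in bytes, so plain numeric sort applies:
hdfs dfs -du / | sort -nr | head -5 # Largest immediate children of /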
2. Find the Replication Factor of a File
hdfs fsck /path/to/file -files -blocks -locations
Output will include:
- File block info
- Replication count per block
Example:
hdfs fsck /user/hadoop/mydata.txt -files -blocks -locations
You can also get replication using:
hdfs dfs -stat %r /path/to/file
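Related: you can change a file's replication factor with setrep; the -w flag waits until re-replication finishes:
hdfs dfs -setrep -w 2 /path/to/file # Set replication factor to 2 and wait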
3. Submit a Sample MapReduce Job
WordCount Example (ships with the Hadoop distribution):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
/input /output
Make sure /input exists in HDFS and /output does not already exist; MapReduce fails rather than overwrite an existing output directory.
Sample steps:
# Upload input file
hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input/
# Run job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
# View output
hdfs dfs -cat /output/part-r-00000
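To pull the results out of HDFS onto the local filesystem (paths assume the job above):
hdfs dfs -getmerge /output wordcount-result.txt # Merge all part files into one local file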
4. Check the Status of All Nodes
yarn node -list
Shows:
- Node address
- Node status (RUNNING/UNHEALTHY)
- Containers
- Memory usage
You can also view it at:
http://<ResourceManager-host>:8088/cluster/nodes
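The same node list is available as JSON from the ResourceManager REST API, which is handy for scripting:
curl http://<ResourceManager-host>:8088/ws/v1/cluster/nodes # All nodes and their state, as JSON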
5. Write a File to HDFS and Read It Back
Write to HDFS:
echo "Hadoop is powerful" > sample.txt
hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put sample.txt /user/hadoop/demo/
Read from HDFS:
hdfs dfs -cat /user/hadoop/demo/sample.txt
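To verify the round trip, check the stored file's size and replication with stat (%b prints the size in bytes, %r the replication factor):
hdfs dfs -stat "%b bytes, replication %r" /user/hadoop/demo/sample.txt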
Bonus: Clean Up
hdfs dfs -rm -r /input /output /user/hadoop/demo
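If HDFS trash is enabled on the cluster, deleted files linger in the user's .Trash directory; add -skipTrash to reclaim the space immediately:
hdfs dfs -rm -r -skipTrash /input /output /user/hadoop/demo # Delete permanently, bypassing trash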