Exploring a Hadoop cluster means understanding its architecture, components, and setup, and knowing how to interact with and manage the system. Below is a structured guide to exploring a Hadoop cluster effectively, from basic to advanced tasks.


🔷 1. Hadoop Cluster Architecture Overview

✅ Components:

| Component | Description |
| --- | --- |
| NameNode | Master daemon managing HDFS metadata and the namespace |
| DataNode | Worker daemon storing the actual HDFS blocks |
| ResourceManager (YARN) | Allocates resources and schedules jobs (MapReduce, Spark, etc.) |
| NodeManager | Runs on each worker node to monitor and manage resource usage |
| Secondary NameNode / Checkpoint Node | Assists the NameNode by merging the FSImage and EditLogs |
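
To see which of these daemons are running on a particular node, jps (bundled with the JDK) is the quickest check. A minimal sketch; the process IDs and the exact set of daemons shown are illustrative and depend on the node's role:

# List the JVM processes on this node; Hadoop daemons appear by class name
jps

# Typical output on a combined master node (illustrative):
#   2481 NameNode
#   2756 ResourceManager
#   3012 JobHistoryServer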

🔷 2. Basic Commands to Explore a Hadoop Cluster

🗂 HDFS Exploration

# List files in HDFS
hdfs dfs -ls /

# View file contents
hdfs dfs -cat /user/hadoop/filename.txt

# Disk usage
hdfs dfs -du -h /

# File system report
hdfs dfsadmin -report

🧠 NameNode Status

hdfs dfsadmin -report                 # List DataNodes and storage info
hdfs dfsadmin -metasave metadata.txt  # Dump metadata for debugging (written to the NameNode's log directory)
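
Two more useful status checks: the safe-mode query works on any cluster, while the HA commands apply only when NameNode high availability is configured (the service IDs nn1 and nn2 below are placeholders for your configured NameNode IDs):

# Is the NameNode in safe mode (the read-only startup state)?
hdfs dfsadmin -safemode get

# HA clusters only: which NameNode is active and which is standby?
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2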

🔷 3. Cluster Node Information

👀 Check Live/Dead Nodes:

hdfs dfsadmin -report

🛠 Useful Web UIs:

| Service | Default Port | URL |
| --- | --- | --- |
| NameNode UI | 9870 (50070 in Hadoop 2.x) | http://namenode-host:9870 |
| ResourceManager | 8088 | http://resourcemanager:8088 |
| History Server | 19888 | http://historyserver:19888 |
| JobTracker | 50030 | old MRv1 only (replaced by YARN) |
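
The daemon web servers also expose machine-readable metrics at the /jmx endpoint, which is handy for scripting. A sketch, assuming the default Hadoop 3.x port and web UIs that are not secured with Kerberos/SPNEGO:

# Fetch NameNode filesystem state (capacity, live DataNodes, block counts) as JSON
curl -s "http://namenode-host:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"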

🔷 4. YARN and Job Monitoring

🧾 List Running Applications

yarn application -list

📄 Kill an Application

yarn application -kill <Application-ID>

🔄 Check NodeManager Health

yarn node -list
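
By default, yarn node -list shows only RUNNING nodes. To include every state, or to inspect one NodeManager in detail (the node ID is taken from the -list output):

# Include nodes in all states (RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, ...)
yarn node -list -all

# Detailed report for a single node
yarn node -status <Node-ID>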

🔷 5. Sample Job Submission

🔹 Run a WordCount Example:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
  /input /output

🔷 6. Cluster Health and Logs

🧪 Check File System Health

hdfs fsck /               # Check file system consistency
hdfs fsck / -blocks       # Show block locations
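
If fsck reports corrupt or missing blocks, these follow-ups help you locate and, as a last resort, dispose of the damaged files:

# List the files that have corrupt blocks
hdfs fsck / -list-corruptfileblocks

# Last-resort cleanup (destructive; only after recovery attempts fail)
hdfs fsck / -move     # move corrupt files to /lost+found
hdfs fsck / -delete   # delete corrupt files outright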

📜 View Log Files (locally on a node)

cd $HADOOP_HOME/logs
less hadoop-hadoop-namenode-*.log

🔧 Fetch aggregated logs for a YARN application (requires log aggregation to be enabled):

yarn logs -applicationId <application_id>

🔷 7. Advanced Tools for Exploration

| Tool | Purpose |
| --- | --- |
| Ambari | Cluster management UI |
| Cloudera Manager | Monitoring, managing, and alerting |
| Ganglia / Prometheus | Monitoring and metrics |
| Hue | UI for browsing HDFS and running queries |

🔷 8. Troubleshooting Tips

  • Disk full on a DataNode? Check usage of the dfs.datanode.data.dir directories (e.g., /data/dfs); see the sketch below.
  • High NameNode memory? Review the edit log and FSImage sizes.
  • Node not responding? Check the NodeManager or DataNode logs on that host.
  • Slow jobs? Check job counters, task logs, and shuffle statistics.
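
A quick first pass on the disk-full case, run on the affected DataNode; /data/dfs is the example data directory from above and may differ on your cluster (check dfs.datanode.data.dir in hdfs-site.xml):

# Local disk usage of the DataNode's data directories
df -h /data/dfs

# HDFS's view of per-DataNode capacity, usage, and remaining space
hdfs dfsadmin -report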

✅ Recommended Exercises

  1. List top 5 biggest files in HDFS
  2. Find replication factor of a file
  3. Submit a sample MapReduce job
  4. Check the status of all nodes
  5. Write a file to HDFS and read it back

Here's how to perform each of these exercises on a Hadoop cluster using the HDFS and YARN CLI tools.


✅ 1. List the Top 5 Biggest Files in HDFS

hdfs dfs -du / | sort -nr | head -5

📌 Explanation:

  • -du /: Lists the size in bytes of each entry directly under / (-du does not support -R).
  • sort -nr: Sorts numerically by size in descending order.
  • head -5: Keeps the top 5 biggest entries.

Dropping -h matters here: Hadoop prints human-readable sizes with a space before the unit (e.g., "1.2 G"), which sort -h cannot parse correctly, so sort on the raw byte counts instead.

🔍 You can replace / with another directory, such as /user/hadoop/, if needed.


✅ 2. Find the Replication Factor of a File

hdfs fsck /path/to/file -files -blocks -locations

📌 The output includes:

  • File block info
  • Replication count per block

Example:

hdfs fsck /user/hadoop/mydata.txt -files -blocks -locations

You can also get replication using:

hdfs dfs -stat %r /path/to/file
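
To change the replication factor rather than just read it, hdfs dfs -setrep does the job (-w waits until re-replication completes):

# Set the file's replication factor to 2 and wait for it to take effect
hdfs dfs -setrep -w 2 /path/to/file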

✅ 3. Submit a Sample MapReduce Job

📁 WordCount Example (ships with the Hadoop distribution):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
/input /output

Make sure /input exists in HDFS and /output doesn’t already exist.

🧪 Sample steps:

# Upload input file
hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input/

# Run job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output

# View output
hdfs dfs -cat /output/part-r-00000

✅ 4. Check the Status of All Nodes

yarn node -list

🧾 The output shows:

  • Node address
  • Node status (RUNNING/UNHEALTHY)
  • Containers
  • Memory usage

You can also view it at:
http://<ResourceManager-host>:8088/cluster/nodes
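
The same node information is available as JSON from the ResourceManager REST API, which is convenient for scripting; assuming the default port and an unauthenticated cluster:

# Node list as JSON from the ResourceManager REST API
curl -s "http://<ResourceManager-host>:8088/ws/v1/cluster/nodes"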


✅ 5. Write a File to HDFS and Read It Back

📝 Write to HDFS:

echo "Hadoop is powerful" > sample.txt
hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put sample.txt /user/hadoop/demo/

📖 Read from HDFS:

hdfs dfs -cat /user/hadoop/demo/sample.txt
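
To verify the round trip end to end, copy the file back out of HDFS and compare it with the original:

# Copy back to the local filesystem and diff against the original
hdfs dfs -get /user/hadoop/demo/sample.txt sample_copy.txt
diff sample.txt sample_copy.txt && echo "Round trip OK"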

🧠 Bonus: Clean Up

hdfs dfs -rm -r /input /output /user/hadoop/demo
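
If the HDFS trash feature is enabled (fs.trash.interval > 0), -rm -r only moves files into your .Trash directory. To reclaim the space immediately:

# Bypass the trash and delete permanently
hdfs dfs -rm -r -skipTrash /input /output /user/hadoop/demo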

# 📘 Hadoop HDFS & YARN Hands-on Practice Notebook

# ✅ 1. List Top 5 Biggest Files in HDFS

hdfs dfs -du / | sort -nr | head -5

# ✅ 2. Find Replication Factor of a File

# Replace with your file path
hdfs fsck /user/hadoop/mydata.txt -files -blocks -locations

# OR
hdfs dfs -stat %r /user/hadoop/mydata.txt

# ✅ 3. Submit a Sample MapReduce Job (WordCount)

# Step 1: Upload input file to HDFS
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input/

# Step 2: Run WordCount job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output

# Step 3: View Output
hdfs dfs -cat /output/part-r-00000

# ✅ 4. Check the Status of All Nodes

yarn node -list

# ✅ 5. Write a File to HDFS and Read It Back

echo "Hadoop is powerful" > sample.txt

hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put sample.txt /user/hadoop/demo/

hdfs dfs -cat /user/hadoop/demo/sample.txt

# ✅ 6. Clean Up Test Directories (Optional)

hdfs dfs -rm -r /input /output /user/hadoop/demo

# 🔗 Bonus: View the Hadoop Web UIs (replace <host> with your machine's IP)
# NameNode UI: http://<namenode-host>:9870
# ResourceManager: http://<resourcemanager-host>:8088/cluster
# History Server: http://<host>:19888
