Here’s a detailed expansion of your Hadoop Core Concepts – Interview Q&A, now with conceptual answers, examples, commands, and additional advanced questions covering the HDFS CLI, Python-Hadoop integration, and Hive interaction.
✅ Hadoop Core Interview Questions with Answers & Examples
1. What is Hadoop?
Answer: Hadoop is an open-source framework that allows for distributed storage and parallel processing of large datasets across clusters of computers using simple programming models.
Use Case: Processing 10 TB of server logs spread across 20 nodes using MapReduce.
2. What are the core components of Hadoop?
- HDFS: Distributed storage
- YARN: Cluster resource manager
- MapReduce: Computation engine
3. What is HDFS?
Answer: HDFS stores large files by breaking them into blocks (default: 128 MB) and replicating them across multiple DataNodes for fault tolerance.
Example: A 512 MB file is split into 4 blocks, each stored on 3 nodes.
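As a quick sanity check on that arithmetic, here is the block math in Python (the values are the stock defaults; purely illustrative):
```python
# Illustrative arithmetic only: how a 512 MB file maps onto HDFS blocks.
BLOCK_SIZE_MB = 128   # default block size in Hadoop 2.x+
REPLICATION = 3       # default replication factor
file_mb = 512

blocks = -(-file_mb // BLOCK_SIZE_MB)  # ceiling division -> 4 blocks
copies = blocks * REPLICATION          # 12 physical block copies cluster-wide
print(blocks, copies)                  # 4 12
```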
4. What is the default block size in HDFS? Can it be changed?
Default: 128 MB (Hadoop 2.x and later); it was 64 MB in Hadoop 1.x.
Change via `hdfs-site.xml` (the current property name is `dfs.blocksize`; the older `dfs.block.size` is deprecated):
```xml
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 MB -->
</property>
```
5. What is NameNode and DataNode?
- NameNode: Stores metadata (filename, block locations, permissions)
- DataNode: Stores actual data blocks
6. What is a Secondary NameNode?
Misconception: It is not a backup
Correct: It periodically merges the NameNode’s fsimage and edit logs into a fresh checkpoint, preventing the edit log from growing unboundedly large.
7. What is the role of YARN in Hadoop?
Answer: YARN (Yet Another Resource Negotiator) manages cluster resources and schedules job execution via:
- ResourceManager
- NodeManager
8. What is MapReduce?
Answer: A programming model with:
- Map(): Filters/sorts data
- Reduce(): Aggregates/interprets results
9. What are Mappers and Reducers?
- Mapper: Converts input → intermediate key-value pairs
- Reducer: Aggregates keys
Example: Word Count
- Mapper emits: `("Hadoop", 1)`
- Reducer outputs: `("Hadoop", 5)`
10. What is a Combiner?
A mini-reducer that performs local aggregation of mapper output before the shuffle, reducing network traffic. In word count, the reducer can usually double as the combiner because summation is associative and commutative.
11. What is a Partitioner in Hadoop?
Determines how key-value pairs are distributed to reducers.
Default (`HashPartitioner`): `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`
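A rough Python rendering of that default rule (illustrative only: the real `HashPartitioner` is a Java class, and Python's `hash()` merely stands in for Java's `hashCode()`):
```python
def partition(key: str, num_reducers: int) -> int:
    # Masking with 0x7FFFFFFF mirrors Java's (hashCode() & Integer.MAX_VALUE),
    # keeping the result non-negative so it maps to a valid reducer index.
    return (hash(key) & 0x7FFFFFFF) % num_reducers
```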
12. What is a Hadoop InputSplit?
Logical chunk of data given to a mapper.
13. Difference between InputSplit and HDFS Block?
| InputSplit | HDFS Block |
|---|---|
| Logical division of data | Physical division of data |
| Used by MapReduce | Used by HDFS |
| Can span multiple blocks | Fixed size (default 128 MB) |
14. What file formats are supported in Hadoop?
- Text
- SequenceFile
- Avro
- Parquet
- ORC
15. Difference between SequenceFile and Avro?
- SequenceFile: Binary format for key-value pairs (Hadoop native)
- Avro: Row-based, schema-based, good for serializing structured data
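A minimal Avro write using the third-party `fastavro` package (an assumption; the reference `avro` package works similarly, and the `User` schema here is made up for illustration):
```python
from fastavro import writer, parse_schema

# Hypothetical record schema for illustration.
schema = parse_schema({
    "name": "User", "type": "record",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age", "type": "int"}],
})

# Serialize one record to an Avro container file.
with open("users.avro", "wb") as out:
    writer(out, schema, [{"name": "Ada", "age": 36}])
```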
16. What is speculative execution in Hadoop?
Launches duplicate copies of slow-running tasks on other nodes; whichever copy finishes first is used, avoiding stragglers.
17. What is rack awareness in Hadoop?
Hadoop spreads replicas across racks so a whole-rack failure cannot destroy all copies. With the default replication factor of 3, the first replica goes on the writer’s node, the second on a node in a different rack, and the third on another node in that second rack.
18. How does HDFS ensure fault tolerance?
- Replication (default 3)
- Heartbeats
- Data rebalancing
19. What is the role of JobTracker and TaskTracker?
(For Hadoop 1.x)
- JobTracker: Assigns tasks
- TaskTracker: Executes them
20. Difference between Hadoop 1.x and 2.x?
| Feature | Hadoop 1.x | Hadoop 2.x |
|---|---|---|
| Resource management | JobTracker | YARN |
| Scalability | Limited | Massive |
| Multi-tenancy | No | Yes |
💡 Bonus Practical Questions + Answers
21. How do you load data into HDFS?
```bash
hdfs dfs -put myfile.txt /data/
```
22. Command to list files in HDFS
```bash
hdfs dfs -ls /user/hadoop
```
23. What happens when a DataNode fails?
- The NameNode marks the node dead after missed heartbeats (by default, ~10 minutes without one)
- Blocks that fall below their replication factor are re-replicated on other healthy nodes
24. How do you tune MapReduce performance?
- Set number of mappers/reducers properly
- Use Combiner
- Tune memory and I/O
- Enable compression
25. Difference between Hadoop and Spark?
| Hadoop (MapReduce) | Spark |
|---|---|
| Disk-based | In-memory |
| Slower | Faster |
| No DAG | DAG-based engine |
| Good for batch | Good for batch + streaming |
🔗 Extra Questions: Python, Hive, and CLI
26. How do you connect Hadoop with Python?
a) Using PyDoop:
```python
import pydoop.hdfs as hdfs

print(hdfs.ls('/data/'))  # list the contents of /data/ in HDFS
```
b) Using Snakebite (lightweight HDFS client):
```bash
snakebite ls /data/
```
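Snakebite also exposes a Python API; a minimal sketch, assuming the NameNode RPC endpoint is `localhost:8020` (note that snakebite is unmaintained and targets older Python/Hadoop versions):
```python
from snakebite.client import Client

client = Client('localhost', 8020)   # NameNode host and RPC port -- adjust
for entry in client.ls(['/data']):   # ls() takes a list of paths
    print(entry['path'])
```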
27. How do you connect Hive with Python?
```python
from pyhive import hive

conn = hive.Connection(host='localhost', port=10000, username='hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table LIMIT 10")
print(cursor.fetchall())  # retrieve the rows returned by Hive
```
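This assumes HiveServer2 is running on its default port 10000; PyHive is a third-party package, typically installed with `pip install 'pyhive[hive]'`.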
28. Important HDFS CLI Commands
| Command | Usage |
|---|---|
| `hdfs dfs -ls` | List directory |
| `hdfs dfs -put` | Upload to HDFS |
| `hdfs dfs -get` | Download from HDFS |
| `hdfs dfs -rm -r` | Delete recursively |
| `hdfs dfs -du -h` | Show directory size |
| `hdfs dfs -cat` | View file contents |
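If you need these commands from Python without an extra client library, shelling out works; a sketch assuming the `hdfs` binary is on `PATH`:
```python
import subprocess

def hdfs_ls(path: str) -> str:
    """Run 'hdfs dfs -ls <path>' and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", "-ls", path],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(hdfs_ls("/user/hadoop"))
```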
Below is an enhanced Hadoop Interview Q&A list, expanded to cover important technical terms like `fsimage`, edit logs, and checkpoints, and other key Hadoop ecosystem components: Hive, Sqoop, Flume, HBase, Oozie, and Zookeeper.
✅ Extended Hadoop Interview Questions — Technical Terms & Ecosystem Components
🗂️ Core Hadoop Internals (Technical Terms)
29. What is fsimage in Hadoop?
Answer: `fsimage` is a snapshot of the Hadoop filesystem metadata stored on the NameNode. It contains the entire directory structure and file-to-block mapping at a specific point in time.
30. What is edit log in Hadoop?
Answer:
The edit log records every change made to the HDFS metadata since the last `fsimage` was saved.
31. What is a checkpoint in Hadoop?
Answer:
The process of merging the `fsimage` and edit log to create a new, updated `fsimage`. It is performed by the Secondary NameNode to reduce NameNode startup time.
32. What happens when NameNode restarts?
Answer:
- Loads the `fsimage`
- Applies the edit logs to bring metadata to the current state
- Rebuilds the namespace in memory
33. What is Safe Mode in Hadoop?
Answer:
A read-only mode during NameNode startup. The NameNode waits for block reports from DataNodes until enough blocks are accounted for, then exits Safe Mode and allows writes. You can check or leave it manually with `hdfs dfsadmin -safemode get` / `hdfs dfsadmin -safemode leave`.
34. What is a heartbeat in Hadoop?
Answer:
A signal sent by DataNodes every 3 seconds to inform the NameNode they’re alive. If no heartbeat is received in 10 minutes, the DataNode is considered dead.
35. What is data locality in Hadoop?
Answer:
Moving computation to the data rather than data to the computation: the scheduler tries to run each map task on a node (or at least a rack) that already holds its input block. This reduces network I/O and improves job performance.
🌐 Hadoop Ecosystem Components — Key Interview Questions
36. What is Hive? How does it work with Hadoop?
Answer:
Hive is a SQL-like engine on Hadoop. It converts HiveQL into MapReduce, Tez, or Spark jobs.
Example: `SELECT COUNT(*) FROM sales;` becomes a MapReduce job behind the scenes.
37. What is the difference between Hive and Pig?
| Feature | Hive | Pig |
|---|---|---|
| Language | SQL-like (HiveQL) | Script-based (Pig Latin) |
| Use case | Reporting/BI | Data transformation |
| Learning curve | Easy for SQL users | Easy for programmers |
38. What is Sqoop?
Answer:
A tool for transferring data between Hadoop and RDBMS.
```bash
sqoop import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /hdfs/orders
```
39. What is Flume?
Answer:
A distributed service to collect, aggregate, and move large volumes of log data into HDFS or Hive.
40. What is HBase?
Answer:
A NoSQL, column-oriented database that runs on top of HDFS. Ideal for real-time random read/write workloads.
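From Python, HBase is commonly reached through the third-party `happybase` client; a sketch assuming an HBase Thrift server on the default port 9090 and a pre-existing table named `users` with an `info` column family (both hypothetical):
```python
import happybase

conn = happybase.Connection('localhost')     # Thrift server, default port 9090
table = conn.table('users')                  # hypothetical table
table.put(b'row1', {b'info:name': b'Ada'})   # real-time random write
print(table.row(b'row1'))                    # real-time random read
```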
41. What is the difference between HDFS and HBase?
| Feature | HDFS | HBase |
|---|---|---|
| Type | File system | Database |
| Access | Batch | Real-time |
| Structure | Flat files | Key-column-value |
42. What is Oozie?
Answer:
A workflow scheduler for Hadoop. Helps manage dependencies between jobs like Hive → MapReduce → Pig.
43. What is Zookeeper?
Answer:
A coordination service used in Hadoop ecosystem (like HBase, Kafka) for leader election, configuration, and distributed locking.
44. What is Parquet and ORC file format?
- Parquet: Columnar storage format, supports nested data (best with Spark).
- ORC: Optimized Row Columnar format (best with Hive), better compression and read performance.
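A minimal Parquet round trip using the third-party `pyarrow` package (the column names and values are made up for illustration):
```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table, write it as columnar Parquet, read it back.
table = pa.table({"id": [1, 2, 3], "amount": [9.5, 3.2, 7.1]})
pq.write_table(table, "sales.parquet")
print(pq.read_table("sales.parquet").to_pydict())
```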
45. What is Hadoop Archive (HAR)?
Answer:
A method of packing many small files into a single archive file to mitigate the HDFS small files problem.
46. What is the small files problem in HDFS?
Answer:
Too many small files overwhelm the NameNode’s memory as it stores metadata for each file.
Solution: HAR, SequenceFile, or CombineFileInputFormat.
47. What is a spill in MapReduce?
Answer:
When the in-memory sort buffer fills up during the Map phase (its size is set by `mapreduce.task.io.sort.mb`, 100 MB by default), intermediate data is written (spilled) to disk before being shuffled to reducers.
48. What is input format in MapReduce?
Answer:
Defines how input files are split and read.
Example: `TextInputFormat`, `SequenceFileInputFormat`, `ParquetInputFormat`
49. What are counters in Hadoop?
Answer:
Built-in or custom metrics for tracking job progress (e.g., number of records, skipped lines).
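In Hadoop Streaming, a Python task increments a counter by writing a specially formatted line to stderr; the `MyApp`/`SkippedLines` names below are arbitrary examples:
```python
import sys

# Format documented for Hadoop Streaming:
#   reporter:counter:<group>,<counter>,<amount>
def bump_skipped_counter():
    sys.stderr.write("reporter:counter:MyApp,SkippedLines,1\n")
```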
50. What are shuffle and sort in MapReduce?
Answer:
- Shuffle: Transfer of mapper output to reducers.
- Sort: Sorting mapper output by key before sending to reducer.