A detailed Hadoop Core Concepts interview Q&A, with conceptual answers, examples, commands, and advanced questions covering the HDFS CLI, Python-Hadoop integration, and Hive interaction.


Hadoop Core Interview Questions with Answers & Examples

1. What is Hadoop?

Answer: Hadoop is an open-source framework that allows for distributed storage and parallel processing of large datasets across clusters of computers using simple programming models.

Use Case: Processing 10 TB of server logs spread across 20 nodes using MapReduce.


2. What are the core components of Hadoop?

  • HDFS: Distributed storage
  • YARN: Cluster resource manager
  • MapReduce: Computation engine

3. What is HDFS?

Answer: HDFS stores large files by breaking them into blocks (default: 128 MB) and replicating them across multiple DataNodes for fault tolerance.

Example: A 512 MB file is split into 4 blocks, each stored on 3 nodes.


4. What is the default block size in HDFS? Can it be changed?

Default: 128 MB in Hadoop 2.x and later (64 MB in Hadoop 1.x)
Change via configuration in hdfs-site.xml (dfs.blocksize replaced the deprecated dfs.block.size):

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 MB -->
</property>

5. What is NameNode and DataNode?

  • NameNode: Stores metadata (filename, block locations, permissions)
  • DataNode: Stores actual data blocks

6. What is a Secondary NameNode?

Misconception: It is not a backup or standby NameNode.
Correct: It periodically checkpoints the NameNode’s metadata by merging the fsimage with the edit log, so the edit log does not grow unbounded and NameNode restarts stay fast.


7. What is the role of YARN in Hadoop?

Answer: YARN (Yet Another Resource Negotiator) manages cluster resources and schedules job execution via:

  • ResourceManager: global resource scheduler
  • NodeManager: per-node agent that launches and monitors containers
  • ApplicationMaster: per-application coordinator that negotiates resources

8. What is MapReduce?

Answer: A programming model with:

  • Map(): Filters and transforms input into intermediate key-value pairs
  • Reduce(): Aggregates the values for each key into final results

9. What are Mappers and Reducers?

  • Mapper: Converts input → intermediate key-value pairs
  • Reducer: Aggregates values for each key

Example: Word Count

  • Mapper: ("Hadoop", 1)
  • Reducer: ("Hadoop", 5)
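
A minimal Hadoop Streaming word count sketch in Python (file names are illustrative; the exact streaming jar path and flags vary by Hadoop version):

# mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: input arrives sorted by key, so counts can be summed per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

Submitted with something like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/logs -output /data/wc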

10. What is a Combiner?

A mini-reducer that performs local aggregation on each mapper’s output before the shuffle, cutting the volume of data sent across the network; see the in-mapper variant sketched below.
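
The same saving can be illustrated in Python with "in-mapper combining": aggregate counts locally before emitting, instead of emitting one pair per occurrence. This is a sketch of the idea, not Hadoop's actual Combiner API:

import sys
from collections import Counter

# Sum counts per word inside the mapper so each word is emitted once
# per mapper rather than once per occurrence, shrinking shuffle traffic.
counts = Counter()
for line in sys.stdin:
    counts.update(line.split())
for word, n in counts.items():
    print(f"{word}\t{n}")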


11. What is a Partitioner in Hadoop?

Determines how key-value pairs are distributed to reducers.

Default: hash(key) % numReducers
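
A hypothetical Python analogue of the default HashPartitioner (the real Java implementation is (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks):

def partition(key: str, num_reducers: int) -> int:
    # The same key always lands on the same reducer.
    # Note: Python salts str hashes per process, so this is purely illustrative.
    return abs(hash(key)) % num_reducers

print(partition("Hadoop", 4))  # some index in 0..3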


12. What is a Hadoop InputSplit?

Logical chunk of data given to a mapper.


13. Difference between InputSplit and HDFS Block?

| InputSplit | HDFS Block |
|---|---|
| Logical division of input | Physical division of storage |
| Used by MapReduce | Used by HDFS |
| Can span multiple blocks | Fixed size (default 128 MB) |

14. What file formats are supported in Hadoop?

  • Text
  • SequenceFile
  • Avro
  • Parquet
  • ORC

15. Difference between SequenceFile and Avro?

  • SequenceFile: Binary format for key-value pairs (Hadoop native)
  • Avro: Row-based, schema-based, good for serializing structured data
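
A minimal Avro round trip in Python, assuming the fastavro package is installed (schema and file names are illustrative):

from fastavro import writer, reader, parse_schema

# Avro files embed the schema, so readers need no external definition
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

with open("users.avro", "wb") as out:
    writer(out, schema, [{"name": "Alice", "age": 30}])

with open("users.avro", "rb") as f:
    for record in reader(f):
        print(record)  # {'name': 'Alice', 'age': 30}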

16. What is speculative execution in Hadoop?

Hadoop launches duplicate copies of slow-running tasks on other nodes; whichever copy finishes first wins and the rest are killed. This prevents stragglers from delaying the whole job.


17. What is rack awareness in Hadoop?

Hadoop places block replicas across racks (by default: one replica on the writer’s node, one on a node in a different rack, and one on another node of that second rack) so a full rack failure cannot cause data loss.


18. How does HDFS ensure fault tolerance?

  • Replication (default 3)
  • Heartbeats
  • Data rebalancing

19. What is the role of JobTracker and TaskTracker?

(For Hadoop 1.x)

  • JobTracker: Assigns tasks
  • TaskTracker: Executes them

20. Difference between Hadoop 1.x and 2.x?

| Feature | Hadoop 1.x | Hadoop 2.x |
|---|---|---|
| Resource Mgmt | JobTracker | YARN |
| Scalability | Limited | Massive |
| Multi-tenancy | No | Yes |

💡 Bonus Practical Questions + Answers


21. How do you load data into HDFS?

hdfs dfs -put myfile.txt /data/

22. Command to list files in HDFS

hdfs dfs -ls /user/hadoop

23. What happens when a DataNode fails?

  • The NameNode detects the failure via missed heartbeats
  • Blocks that fall below the replication factor are re-replicated on other healthy nodes

24. How do you tune MapReduce performance?

  • Set number of mappers/reducers properly
  • Use Combiner
  • Tune memory and I/O
  • Enable compression

25. Difference between Hadoop and Spark?

| Hadoop (MapReduce) | Spark |
|---|---|
| Disk-based | In-memory |
| Slower | Faster |
| No DAG | DAG-based engine |
| Good for batch | Good for batch + streaming |

🔗 Extra Questions: Python, Hive, and CLI


26. How do you connect Hadoop with Python?

a) Using PyDoop

import pydoop.hdfs as hdfs

# List the contents of an HDFS directory (requires a configured Hadoop client)
print(hdfs.ls('/data/'))

b) Using Snakebite (lightweight HDFS client)

snakebite ls /data/

27. How do you connect Hive with Python?

from pyhive import hive

# Connect to HiveServer2 (default port 10000)
conn = hive.Connection(host='localhost', port=10000, username='hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table LIMIT 10")
print(cursor.fetchall())

28. Important HDFS CLI Commands

| Command | Usage |
|---|---|
| -ls | List directory |
| -put | Upload to HDFS |
| -get | Download from HDFS |
| -rm -r | Delete recursively |
| -du -h | Show directory size |
| -cat | View file contents |


Below is an extended Hadoop interview Q&A list covering important technical terms such as fsimage, edit logs, and checkpoints, along with key Hadoop ecosystem components: Hive, Sqoop, Flume, HBase, Oozie, and Zookeeper.


Extended Hadoop Interview Questions — Technical Terms & Ecosystem Components

🗂️ Core Hadoop Internals (Technical Terms)


29. What is fsimage in Hadoop?

Answer:
fsimage is a snapshot of the Hadoop filesystem metadata stored on the NameNode. It contains the entire directory structure and file-to-block mapping at a specific point in time.


30. What is edit log in Hadoop?

Answer:
The edit log records every change made to the HDFS metadata since the last fsimage was saved.


31. What is a checkpoint in Hadoop?

Answer:
The process of merging the fsimage and edit log to create a new, up-to-date fsimage. Performed by the Secondary NameNode (or the Standby NameNode in HA clusters) to keep the edit log small and reduce NameNode startup time.


32. What happens when NameNode restarts?

Answer:

  • Loads fsimage
  • Applies edit logs to bring metadata to current state
  • Rebuilds the namespace in memory

33. What is Safe Mode in Hadoop?

Answer:
A read-only mode during NameNode startup. It waits for block reports from DataNodes before exiting safe mode and allowing writes.


34. What is a heartbeat in Hadoop?

Answer:
A signal sent by DataNodes every 3 seconds to inform the NameNode they’re alive. If no heartbeat is received in 10 minutes, the DataNode is considered dead.


35. What is data locality in Hadoop?

Answer:
Moving computation to the data rather than data to computation. This reduces network IO and improves job performance.


🌐 Hadoop Ecosystem Components — Key Interview Questions


36. What is Hive? How does it work with Hadoop?

Answer:
Hive is a data warehouse layer on Hadoop that provides a SQL-like query language (HiveQL). It converts HiveQL into MapReduce, Tez, or Spark jobs.

Example: SELECT COUNT(*) FROM sales; becomes a MapReduce job behind the scenes.


37. What is the difference between Hive and Pig?

| Feature | Hive | Pig |
|---|---|---|
| Language | SQL-like (HiveQL) | Script-based (Pig Latin) |
| Use case | Reporting/BI | Data transformation |
| Learning curve | Easy for SQL users | Easy for programmers |

38. What is Sqoop?

Answer:
A tool for transferring data between Hadoop and RDBMS.

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --target-dir /hdfs/orders

39. What is Flume?

Answer:
A distributed service to collect, aggregate, and move large volumes of log data into HDFS or Hive.


40. What is HBase?

Answer:
A column-oriented NoSQL database that runs on top of HDFS. Ideal for real-time, random read/write access.
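
A minimal sketch of accessing HBase from Python with the happybase client, assuming an HBase Thrift gateway is running and a table named users with column family cf already exists (all names illustrative):

import happybase

# happybase talks to HBase through the Thrift gateway (default port 9090)
connection = happybase.Connection('localhost')
table = connection.table('users')

# Cells are addressed as: row key -> column family:qualifier -> value
table.put(b'row1', {b'cf:name': b'Alice'})
print(table.row(b'row1'))  # {b'cf:name': b'Alice'}
connection.close()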


41. What is the difference between HDFS and HBase?

| Feature | HDFS | HBase |
|---|---|---|
| Type | File system | Database |
| Access | Batch | Real-time |
| Structure | Flat files | Key-column-value |

42. What is Oozie?

Answer:
A workflow scheduler for Hadoop. Helps manage dependencies between jobs like Hive → MapReduce → Pig.


43. What is Zookeeper?

Answer:
A coordination service used across the Hadoop ecosystem (e.g., by HBase and Kafka) for leader election, configuration management, and distributed locking.
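
A minimal Zookeeper sketch in Python using the kazoo client, assuming a Zookeeper server on localhost:2181 (paths and values are illustrative):

from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Znodes form a hierarchical namespace, like a small distributed filesystem
zk.ensure_path('/app/config')
zk.set('/app/config', b'replication=3')
data, stat = zk.get('/app/config')
print(data)  # b'replication=3'

zk.stop()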


44. What is Parquet and ORC file format?

  • Parquet: Columnar storage format, supports nested data (best with Spark).
  • ORC: Optimized Row Columnar format (best with Hive), better compression and read performance.
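
A minimal Parquet round trip in Python, assuming the pyarrow package is installed (recent pyarrow versions expose ORC similarly via pyarrow.orc; column and file names are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# Columnar layout stores each column's values together, which is what
# makes selective scans and compression so effective
table = pa.table({'order_id': [1, 2, 3], 'amount': [10.5, 20.0, 7.25]})
pq.write_table(table, 'orders.parquet')

# Read back only one column: a columnar format makes this cheap
print(pq.read_table('orders.parquet', columns=['amount']))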

45. What is Hadoop Archive (HAR)?

Answer:
A way to pack many small files into a single archive file (.har), relieving the NameNode of per-file metadata and mitigating the HDFS small files problem.


46. What is the small file problem in HDFS?

Answer:
Too many small files overwhelm the NameNode’s memory as it stores metadata for each file.

Solution: HAR, SequenceFile, or CombineFileInputFormat.


47. What is a spill in MapReduce?

Answer:
When the in-memory buffer is full during Map phase, intermediate data is written (spilled) to disk before being shuffled to reducers.


48. What is an InputFormat in MapReduce?

Answer:
Defines how input files are split and read.
Example: TextInputFormat, SequenceFileInputFormat, ParquetInputFormat


49. What are counters in Hadoop?

Answer:
Built-in or custom metrics for tracking job progress (e.g., number of records, skipped lines).
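
In Hadoop Streaming, a Python task can increment a counter by writing a specially formatted line to stderr (group and counter names below are illustrative):

import sys

def record_bad_line():
    # Streaming tasks update counters with stderr lines of the form:
    # reporter:counter:<group>,<counter>,<amount>
    sys.stderr.write("reporter:counter:MyJob,BadRecords,1\n")

record_bad_line()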


50. What are shuffle and sort in MapReduce?

Answer:

  • Shuffle: Transfer of mapper output to reducers.
  • Sort: Sorting mapper output by key before sending to reducer.
