What is Hadoop?

Hadoop is an open-source, distributed computing framework that allows for the processing and storage of large datasets across a cluster of computers. It was created by Doug Cutting and Mike Cafarella and is now maintained by the Apache Software Foundation.

History of Hadoop

Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers, which were published in 2003 and 2004, respectively. The first version of Hadoop, version 0.1.0, was released in April 2006. Since then, Hadoop has become one of the most popular big data processing frameworks in the world.

Benefits of Hadoop

  1. Scalability: Hadoop can handle large amounts of data by distributing the processing across a cluster of nodes.
  2. Flexibility: Hadoop can process a wide variety of data formats, including structured, semi-structured, and unstructured data.
  3. Cost-effective: Hadoop is open-source and can run on commodity hardware, making it a cost-effective solution for big data processing.
  4. Fault-tolerant: Hadoop can detect and recover from node failures, ensuring that data processing continues uninterrupted.

How Hadoop Works

Hadoop consists of two main components:

  1. Hadoop Distributed File System (HDFS): HDFS is a distributed storage system that stores data across a cluster of nodes. It is designed to handle large files and provides high throughput access to data.
  2. MapReduce: MapReduce is a programming model and software framework that allows developers to write applications that process large datasets in parallel across a cluster of nodes.

Here’s a high-level overview of how Hadoop works:

  1. Data Ingestion: Data is ingested into HDFS from various sources, such as log files, social media, or sensors.
  2. Data Processing: MapReduce programs are written to process the data in HDFS. These programs consist of two main components: mappers and reducers (a minimal word-count sketch follows this list).
  3. Mapper: The mapper takes input data, processes it, and produces output data in the form of key-value pairs.
  4. Reducer: The reducer takes the output from the mapper, aggregates it, and produces the final output.
  5. Output: The final output is stored in HDFS or other storage systems.
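
To make the mapper and reducer steps concrete, here is a minimal word-count sketch in Python for Hadoop Streaming. The file name, input/output paths, and the streaming-jar location are illustrative and vary by installation:

#!/usr/bin/env python3
# wordcount.py - run as "wordcount.py map" for the mapper or "wordcount.py reduce" for the reducer
import sys

def mapper():
    # Emit one tab-separated (word, 1) pair per word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive consecutively.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

# Typical submission (the streaming jar path depends on your Hadoop distribution):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -files wordcount.py \
#   -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#   -input /data/input -output /data/wordcount_out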

In summary, Hadoop is a powerful big data processing framework that provides a scalable, flexible, and cost-effective solution for processing large datasets. Its ecosystem of tools and technologies provides additional functionality and support for Hadoop, making it a popular choice for big data processing and analytics.

Key Components of Hadoop

  1. Hadoop Distributed File System (HDFS)
    • A distributed file system designed to run on commodity hardware.
    • Highly fault-tolerant and designed to be deployed on low-cost hardware.
    • Provides high throughput access to application data and is suitable for applications that have large datasets.
  2. MapReduce
    • A programming model for processing large datasets with a parallel, distributed algorithm on a cluster.
    • Composed of two main functions:
      • Map: Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
      • Reduce: Takes the output from a map as input and combines those data tuples into a smaller set of tuples.
  3. Hadoop Common
    • The common utilities that support the other Hadoop modules.
    • Includes libraries and utilities needed by other Hadoop modules.
  4. YARN (Yet Another Resource Negotiator)
    • A resource-management platform responsible for managing computing resources in clusters and using them for scheduling users’ applications.
    • Decouples resource management and job scheduling/monitoring functions into separate daemons.

Hadoop Ecosystem Tools and Technologies

  1. Hive
    • A data warehouse infrastructure built on top of Hadoop.
    • Provides data summarization, query, and analysis.
    • Uses HiveQL, a SQL-like language.
  2. Pig
    • A high-level platform for creating MapReduce programs used with Hadoop.
    • Uses a language called Pig Latin, which abstracts the programming from the Java MapReduce idiom.
  3. HBase
    • A distributed, scalable, big data store.
    • Runs on top of HDFS.
    • Provides random, real-time read/write access to big data.
  4. Sqoop
    • A tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
  5. Flume
    • A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  6. Oozie
    • A workflow scheduler system to manage Hadoop jobs.
  7. Zookeeper
    • A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
  8. Spark
    • A fast and general-purpose cluster-computing system.
    • Provides in-memory processing to speed up big data analysis.

Hadoop Architecture

Hadoop follows a Master-Slave architecture for both data storage and data processing.

  1. HDFS Architecture:
    • NameNode (Master):
      • Manages the file system namespace and regulates access to files by clients.
      • Maintains the file system tree and metadata for all the files and directories in the tree.
    • DataNode (Slave):
      • Responsible for storing the actual data in HDFS.
      • Serves read and write requests from the file system’s clients.
  2. MapReduce Architecture (MapReduce 1; replaced by YARN in Hadoop 2 and later):
    • JobTracker (Master):
      • Manages resources and job scheduling.
    • TaskTracker (Slave):
      • Executes the tasks as directed by the JobTracker.
  3. YARN Architecture:
    • ResourceManager (Master):
      • Manages resources and scheduling.
    • NodeManager (Slave):
      • Manages resources on a single node.

Hadoop Workflow

  1. Data Ingestion:
    • Data can be ingested into HDFS using tools like Flume, Sqoop, or by directly placing data into HDFS.
  2. Data Storage:
    • Data is split into blocks and distributed across the cluster in HDFS.
    • Replication ensures data reliability and fault tolerance.
  3. Data Processing:
    • Using MapReduce or other processing engines like Spark, data is processed in parallel across the cluster.
    • Intermediate data is stored on local disks, and the final output is stored back in HDFS.
  4. Data Analysis:
    • Tools like Hive, Pig, or custom MapReduce jobs are used to query and analyze data.
    • Results can be stored back in HDFS or exported to other systems.

Getting Started with Hadoop

  1. Installation:
    • Hadoop can be installed in different modes:
      • Standalone Mode: Default mode, mainly for debugging and development.
      • Pseudo-Distributed Mode: All Hadoop services run on a single node.
      • Fully Distributed Mode: Hadoop runs on multiple nodes, suitable for production.
  2. Configuration:
    • Key configuration files include core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
    • These files need to be correctly configured for Hadoop to run efficiently.
  3. Running Hadoop Jobs:
    • Hadoop jobs can be written in Java, Python (with Hadoop streaming), or other languages.
    • Jobs are submitted using the Hadoop command line or APIs.

Here’s a structured set of concepts, tutorials, and resources to help you master Hadoop with PySpark and HDFS for Data Engineering:


🧱 Core Concepts to Master

📁 HDFS (Hadoop Distributed File System)

  • Block storage: default size (128MB+), replicated (default 3x)
  • NameNode vs DataNode
  • Commands: hdfs dfs -ls, -put, -get, -rm, -du, etc.
  • Optimized for large files, not millions of small files
  • Permissions and file structure (superuser, ACLs)

⚙️ Hadoop + YARN

  • Hadoop Ecosystem: HDFS, YARN, MapReduce, Hive, HBase, etc.
  • YARN: resource management and job scheduling
  • Job life cycle: Submission → Resource allocation → Execution → Completion
  • Running PySpark on YARN: spark-submit --master yarn

🔥 PySpark with HDFS

  • Reading/writing data from/to HDFS:
    df = spark.read.csv("hdfs:///path/file.csv")
    df.write.parquet("hdfs:///output/path")
  • File formats: Parquet and ORC (columnar), Avro (row-based); all support compression
  • Partitioning: .write.partitionBy("col")
  • Performance tuning: memory, partition size, the small-file problem (see the sketch after this list)
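
A short sketch of these read/write patterns, assuming a running cluster; the paths and the partition column name are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-io-example").getOrCreate()

# Read CSV from HDFS into a DataFrame
df = spark.read.csv("hdfs:///path/file.csv", header=True, inferSchema=True)

# Write columnar, compressed Parquet partitioned by a column; coalescing first
# limits the number of output files (helps with the small-file problem).
(df.coalesce(8)
   .write.mode("overwrite")
   .partitionBy("event_date")
   .parquet("hdfs:///output/path"))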

📘 Hands-On Tutorials & Resources

1. Beginner-Friendly

2. Intermediate

3. Advanced / Project-Based

  • Build a mini data pipeline: HDFS → PySpark ETL → Hive/Parquet Output
  • Use Spark Structured Streaming to process real-time data into HDFS (a minimal sketch follows)
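
Here is a minimal Structured Streaming sketch for the second project idea: it watches an HDFS landing directory for new CSV files and appends them to Parquet. All paths and the schema are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-hdfs").getOrCreate()

# File-based streams require an explicit schema.
events = (spark.readStream
          .schema("id INT, event STRING, ts TIMESTAMP")
          .csv("hdfs:///data/landing/"))

# Continuously append new files to Parquet; the checkpoint directory lets the
# query restart after a failure without reprocessing everything.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/stream_output/")
         .option("checkpointLocation", "hdfs:///data/checkpoints/stream_output/")
         .outputMode("append")
         .start())

query.awaitTermination()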

🛠 Suggested Practice Projects

  1. Log File Analyzer
    • Ingest log files to HDFS
    • Parse and clean with PySpark
    • Store output in Parquet format
  2. Data Lake Simulation
    • Raw data to Bronze (CSV) → Silver (cleaned Parquet) → Gold (aggregated Hive)
  3. Batch + Incremental Processing
    • First run: full data load from HDFS
    • Next runs: only process new/changed files (delta processing; a sketch follows this list)
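
One simple way to structure the incremental run for project 3, assuming new data lands in date-stamped HDFS directories; the paths and layout are hypothetical:

from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# First run: full load of everything under the raw area.
full_df = spark.read.parquet("hdfs:///data/raw/")
full_df.write.mode("overwrite").parquet("hdfs:///data/processed/")

# Later runs: read only the newest daily directory (e.g. /data/raw/dt=2024-01-15/)
# and append it. In practice the "last processed date" would be tracked in a
# small control table or file rather than derived from the clock.
latest = (date.today() - timedelta(days=1)).isoformat()
delta_df = (spark.read
            .option("basePath", "hdfs:///data/raw/")   # keep the dt partition column
            .parquet(f"hdfs:///data/raw/dt={latest}/"))
delta_df.write.mode("append").parquet("hdfs:///data/processed/")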

📌 Common Interview Topics to Review

Topic | Practice
HDFS block replication | Design scenarios
PySpark DataFrame operations | filter, groupBy, agg, join
Partitions vs Buckets | Optimize large-scale data
PySpark tuning | Shuffles, caching, broadcast joins
Fault tolerance | HDFS & Spark retries
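
For the "PySpark DataFrame operations" row above, a quick refresher sketch; the table paths and column names are invented for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-ops-refresher").getOrCreate()

orders = spark.read.parquet("hdfs:///data/orders/")
customers = spark.read.parquet("hdfs:///data/customers/")

summary = (orders
           .filter(F.col("status") == "COMPLETE")              # filter
           .join(customers, on="customer_id", how="inner")     # join
           .groupBy("country")                                 # groupBy
           .agg(F.sum("amount").alias("revenue"),              # agg
                F.countDistinct("customer_id").alias("buyers")))

# Broadcast join hint: useful when one side (here, customers) fits in memory.
hinted = orders.join(F.broadcast(customers), "customer_id")

summary.show()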

Here's a comprehensive guide tailored for Data Engineering interviews, focusing on Hadoop, PySpark, and HDFS:


📘 Step-by-Step Project Guide: PySpark ETL on HDFS

Project Title: Log Analytics Pipeline with PySpark and HDFS

Objective:

Build an ETL pipeline that processes server logs stored in HDFS using PySpark and writes the transformed data back to HDFS in Parquet format.

Steps:

  1. Set Up the Environment:
    • Ensure Hadoop and Spark are installed and configured.
    • Start HDFS and YARN services.
  2. Ingest Data into HDFS:
    • Place raw log files into HDFS:
      hdfs dfs -mkdir -p /data/logs/raw
      hdfs dfs -put access_log.txt /data/logs/raw/
  3. Develop PySpark Script:
    • Read Data:
      from pyspark.sql import SparkSession
      spark = SparkSession.builder.appName("LogProcessor").getOrCreate()
      logs_df = spark.read.text("hdfs:///data/logs/raw/access_log.txt")
    • Parse Logs:
      Use regular expressions to extract fields like IP address, timestamp, request method, URL, and status code (a parsing sketch follows these steps).
    • Transform Data:
      • Filter out irrelevant records.
      • Convert timestamp to proper datetime format.
      • Derive additional columns if needed.
    • Write Transformed Data:
      logs_df.write.mode("overwrite").parquet("hdfs:///data/logs/processed/")
  4. Schedule the Job:
    • Use cron or Apache Airflow to schedule the PySpark job at desired intervals.
  5. Validation:
    • Read the processed data and perform basic checks:
      processed_df = spark.read.parquet("hdfs:///data/logs/processed/")
      processed_df.show(5)
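
A sketch of the parse-and-transform steps above using regexp_extract. The regular expression targets a simplified Common Log Format and will likely need adjusting for real logs:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("LogProcessor").getOrCreate()
logs_df = spark.read.text("hdfs:///data/logs/raw/access_log.txt")

# Pull out IP, timestamp, method, URL, and status code from each raw line.
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})'
parsed = logs_df.select(
    F.regexp_extract("value", pattern, 1).alias("ip"),
    F.regexp_extract("value", pattern, 2).alias("ts_raw"),
    F.regexp_extract("value", pattern, 3).alias("method"),
    F.regexp_extract("value", pattern, 4).alias("url"),
    F.regexp_extract("value", pattern, 5).cast("int").alias("status"),
)

# Drop lines that did not match (regexp_extract returns "" on no match) and
# convert the raw timestamp (e.g. 10/Oct/2023:13:55:36 +0000) to a proper type.
cleaned = (parsed
           .filter(F.col("ip") != "")
           .withColumn("ts", F.to_timestamp("ts_raw", "dd/MMM/yyyy:HH:mm:ss Z")))

cleaned.write.mode("overwrite").parquet("hdfs:///data/logs/processed/")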

🧾 Cheat Sheet PDFs for Commands and Concepts

Enhance your preparation with these comprehensive cheat sheets:

  • PySpark RDD Basics Cheat Sheet: Covers initialization, data loading, transformations, and actions. (University of Washington Courses)
  • Hadoop HDFS Command Cheat Sheet: Provides a quick reference to essential HDFS commands. (GitHub)
  • Big Data Hadoop Cheat Sheet: Summarizes key Hadoop components and commands. (Intellipaat)

🎯 Curated List of Mock Interview Questions with Answers

Prepare for interviews with these commonly asked questions:

HDFS:

  1. What is HDFS and its key features?
    • HDFS (Hadoop Distributed File System) is designed for storing large datasets reliably and streaming data at high bandwidth to user applications. Key features include fault tolerance, high throughput, and scalability.(fdaytalk.com)
  2. Explain the roles of NameNode and DataNode.
    • NameNode manages metadata and namespace operations, while DataNodes store actual data blocks.
  3. How does HDFS ensure fault tolerance?
    • By replicating data blocks across multiple DataNodes (default replication factor is 3).

PySpark:

  1. Differentiate between map() and flatMap() in PySpark.
    • map() applies a function to each element and returns a new RDD with the results. flatMap() applies a function that returns an iterable for each element and flattens the results into a single RDD.
  2. How does caching work in PySpark?
    • Caching stores RDDs in memory to speed up iterative operations. Use rdd.cache() to cache an RDD.
  3. What are broadcast variables?
    • Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task. (A short illustration of map(), flatMap(), caching, and broadcast variables follows this list.)
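
A small illustration of these three answers with the RDD API; the sample data is made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or", "not to be"])

# map(): one output element per input element (here, a list of words per line)
print(lines.map(lambda l: l.split(" ")).collect())   # [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap(): the per-element iterables are flattened into a single RDD of words
words = lines.flatMap(lambda l: l.split(" "))
print(words.collect())                               # ['to', 'be', 'or', 'not', 'to', 'be']

# cache(): keep the RDD in memory so repeated actions avoid recomputation
words.cache()
print(words.count(), words.distinct().count())

# Broadcast variable: read-only lookup data shipped once per executor, not per task
stopwords = sc.broadcast({"to", "or"})
print(words.filter(lambda w: w not in stopwords.value).collect())   # ['be', 'not', 'be']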

For an extensive list of questions and answers, refer to:

  • DataCamp’s Top 24 Hadoop Interview Questions: (DataCamp)
  • Simplilearn’s Top 80 Hadoop Interview Questions: (Simplilearn.com)
  • ProjectPro’s Top 100 Hadoop Interview Questions: (ProjectPro)


Here’s a practical cheat sheet for hdfs dfs commands and a list of the most commonly used HDFS commands that are essential for Data Engineers:


📂 hdfs dfs Command Syntax

hdfs dfs [command] [path]

hdfs dfs is a user-facing shell interface that lets you interact with the Hadoop Distributed File System (HDFS), similar to Unix commands but for distributed data.


✅ Most Used hdfs dfs Commands

Command | Purpose | Example
-ls | List contents of a directory | hdfs dfs -ls /user/data/
-mkdir | Create a directory | hdfs dfs -mkdir /user/new_folder
-put | Upload local file to HDFS | hdfs dfs -put localfile.csv /user/data/
-copyFromLocal | Same as -put | hdfs dfs -copyFromLocal file.txt /data/
-get | Download HDFS file to local | hdfs dfs -get /user/data/file.csv ./
-copyToLocal | Same as -get | hdfs dfs -copyToLocal /data/file.txt ./
-cat | View contents of an HDFS file | hdfs dfs -cat /data/file.txt
-rm | Delete a file | hdfs dfs -rm /data/oldfile.txt
-rm -r | Delete a directory recursively | hdfs dfs -rm -r /data/archive/
-du -h | Show space used by files/directories | hdfs dfs -du -h /data/
-df -h | Show HDFS disk usage summary | hdfs dfs -df -h
-stat | Show metadata info of a file | hdfs dfs -stat /data/file.txt
-tail | Show the last kilobyte of a file (useful for logs) | hdfs dfs -tail /logs/server.log
-moveFromLocal | Move local file to HDFS (removes local copy) | hdfs dfs -moveFromLocal file.txt /data/

📘 Notes and Tips

  • HDFS paths can be absolute or relative; relative paths resolve against your home directory, typically /user/<username>/.
  • HDFS is case-sensitive.
  • Use wildcards like * for pattern matching: hdfs dfs -ls /data/*.csv
  • Permissions can be managed with:
    hdfs dfs -chmod 755 /data
    hdfs dfs -chown user:group /data

🛠 Advanced/Useful Utilities

Command | Description
hdfs dfsadmin -report | Check cluster-wide usage and health
hdfs dfsadmin -safemode get | Check whether the NameNode is in safe mode
hadoop fsck / | File system check and health status


Interactive Hadoop & HDFS Tutorial for Data Engineers


🧠 1. What is Hadoop?

Hadoop is an open-source framework by Apache that enables distributed storage and processing of large data sets using a cluster of commodity hardware.

✨ Core Components:

  1. HDFS (Hadoop Distributed File System)
  2. YARN (Yet Another Resource Negotiator)
  3. MapReduce (Processing Engine)
  4. Hadoop Common (Utilities and libraries)

📁 2. HDFS (Storage Layer)

⚙️ Architecture:

  • NameNode: Master that manages metadata (file names, permissions, block locations).
  • DataNode: Slave nodes that store actual data blocks.
  • Secondary NameNode: Periodically merges FSImage and EditLogs to reduce NameNode load.

🔍 Key Concepts:

  • Block size: Default 128MB, large files split into blocks
  • Replication: Default 3 copies for fault tolerance
  • Write-once-read-many: HDFS is not meant for frequent updates

✅ Real-World Uses:

  • Store raw logs
  • Ingest large CSV/JSON datasets
  • Intermediate storage for Spark/Hive

🚀 3. YARN (Resource Management)

Components:

  • ResourceManager (RM): Allocates resources to applications
  • NodeManager (NM): Monitors resources and containers on each node
  • ApplicationMaster (AM): Manages execution of a single job

Example Workflow:

  1. User submits job to RM
  2. AM is launched on a node
  3. Containers are allocated
  4. Job runs on DataNodes in parallel

⚒️ 4. MapReduce (Processing Engine)

Phases:

  • Map: Processes input and emits intermediate key-value pairs
  • Shuffle & Sort: Groups data by key
  • Reduce: Aggregates the grouped data

Use Cases:

  • Word count
  • Data filtering and summarization

🔧 5. Hadoop Commands Cheat Sheet

File Management (HDFS):

hdfs dfs -ls /path          # List files
hdfs dfs -mkdir /dir        # Create directory
hdfs dfs -put local /hdfs   # Upload file
hdfs dfs -get /hdfs local   # Download file
hdfs dfs -rm -r /dir        # Delete directory recursively

Administration:

hdfs dfsadmin -report         # Cluster report
hdfs dfsadmin -safemode get   # Check safemode status

🤔 6. Common Interview Questions

HDFS:

  • What is the default block size in HDFS?
  • Explain the role of NameNode and DataNode.
  • How does HDFS handle hardware failure?
  • Can HDFS store small files efficiently? (No; many small files bloat the NameNode's in-memory metadata)

Hadoop Architecture:

  • How does Hadoop achieve fault tolerance?
  • What happens when a DataNode fails?
  • Explain the write process in HDFS.
  • Describe the read process in HDFS.

YARN:

  • Difference between ResourceManager and NodeManager?
  • What is the role of ApplicationMaster?

MapReduce:

  • What happens in the shuffle and sort phase?
  • How would you optimize a MapReduce job?

📊 7. Interactive Learning Activities

Exercise 1:

Run the following command to upload a file and view it:

hdfs dfs -put sample.csv /data/input
hdfs dfs -cat /data/input/sample.csv

Exercise 2:

Simulate a fault-tolerant read:

  • Kill a DataNode process
  • Try reading the same file (HDFS should read from a replica)

Exercise 3:

Run a MapReduce job using a word count example (or PySpark equivalent):

rdd = sc.textFile("hdfs:///data/input/sample.txt")
rdd.flatMap(lambda x: x.split(" ")).map(lambda x: (x,1)).reduceByKey(lambda a,b: a+b).collect()

Here's a breakdown of how Hadoop, HDFS, Hive, and PySpark work together under the hood, including a project to tie it all together:


🧠 UNDERSTANDING THE INTERNALS

🗃️ 1. PySpark + HDFS Internals

When you read/write files in HDFS using PySpark:

  • Spark reads HDFS files using Hadoop’s InputFormats (e.g., TextInputFormat, ParquetInputFormat).
  • Data locality is optimized: Spark tries to schedule tasks on nodes where data blocks reside.
  • Spark bypasses MapReduce and uses its own DAG engine, making it much faster than Hive-on-MapReduce.

🐝 2. Hive Internals: How Queries Work

Hive is a SQL abstraction over data in HDFS.

Execution Engine Options:

  • MapReduce (Legacy): Older versions use MapReduce jobs for query execution.
  • Tez: DAG-based engine, faster than MapReduce.
  • Spark: Hive on Spark enables Hive to use Spark’s execution engine.
  • LLAP (Low Latency Analytical Processing): Interactive queries with in-memory data caching and persistent daemons.

Typical Query Flow in Hive:

  1. SQL ➜ Parsed into AST
  2. Optimized by Hive’s query planner (e.g., predicate pushdown, join reordering)
  3. Execution plan is generated (MapReduce, Tez, or Spark job)
  4. Reads from HDFS using SerDe (serializer/deserializer)
  5. Results are written back (e.g., HDFS or stdout)

⚙️ HOW THEY INTERACT

Action | How it's handled
PySpark reads an HDFS file | Uses Hadoop InputFormats
Hive reads the same file | Uses a SerDe plus its execution engine (MapReduce/Tez/Spark)
PySpark writes Parquet | Hive can query it if a table is defined on top
Hive queries a table | Can trigger MapReduce, Tez, or Spark jobs

Hive doesn’t automatically use PySpark. If you want Spark execution, use Hive on Spark, not PySpark. For programmatic querying with Spark, use Spark SQL instead.


🧪 MINI PROJECT: Unified HDFS-Hive-PySpark Example

Objective: Ingest data into HDFS using PySpark, create Hive table on top, and query it.

STEP 1: Upload data via PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DataIngestion") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.read.csv("localfile.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("hdfs:///warehouse/sales_data/")

STEP 2: Create external Hive table

CREATE EXTERNAL TABLE sales (
    id INT,
    product STRING,
    revenue DOUBLE
)
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/sales_data/';

STEP 3: Query in Hive or Spark SQL

SELECT product, SUM(revenue) FROM sales GROUP BY product;

OR in PySpark:

df = spark.sql("SELECT product, SUM(revenue) FROM sales GROUP BY product")
df.show()

🔐 SECURITY, TUNING & ADMIN INSIGHT

🔒 Hive/HDFS Security

  • Kerberos: Authentication mechanism for Hadoop.
  • Ranger or Sentry: Fine-grained access control for Hive/HDFS.
  • ACLs and Permissions: Set on HDFS files via hdfs dfs -setfacl.

⚙️ Performance & Tuning

  • HDFS block size: Larger block sizes improve throughput (default 128MB; clusters often raise it to 256MB).
  • Hive tuning:
    • hive.exec.reducers.max – limits number of reducers
    • hive.vectorized.execution.enabled – enables faster in-memory processing
  • Spark tuning:
    • Partition tuning (repartition(), coalesce())
    • Memory settings (spark.executor.memory, spark.sql.shuffle.partitions); a sketch follows this list
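
A sketch of how these Spark knobs might be set in code; the specific values and paths are illustrative, not recommendations:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning-example")
         .config("spark.executor.memory", "6g")           # per-executor heap
         .config("spark.sql.shuffle.partitions", "400")   # partitions created by joins/aggregations
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.parquet("hdfs:///warehouse/sales_data/")

# repartition() performs a full shuffle to change parallelism (optionally by a column);
# coalesce() only merges partitions, so it is cheaper for reducing file count before a write.
wide = df.repartition(200, "product")
compact = wide.coalesce(20)
compact.write.mode("overwrite").parquet("hdfs:///warehouse/sales_data_compacted/")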


📊 Real-World Data Engineering Project: PySpark + Hive + HDFS


📅 Project Title: E-Commerce Analytics Pipeline

Goal: Analyze real-world e-commerce sales data using HDFS for storage, PySpark for processing, and Hive for querying.


🔧 Step 1: Setup

Prerequisites:

  • Hadoop + HDFS
  • Hive (Metastore configured)
  • PySpark with Hive support enabled
  • Real-world dataset: E-Commerce Sales Data

📂 Step 2: Data Ingestion

Upload raw CSV to HDFS

hdfs dfs -mkdir -p /user/project/ecomm/raw
hdfs dfs -put data.csv /user/project/ecomm/raw/

Read with PySpark and write as Parquet

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Ecomm Analytics") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.read.csv("hdfs:///user/project/ecomm/raw/data.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("hdfs:///user/project/ecomm/processed/")

📊 Step 3: Hive Table Creation

CREATE EXTERNAL TABLE ecomm_sales (
    InvoiceNo STRING,
    StockCode STRING,
    Description STRING,
    Quantity INT,
    InvoiceDate STRING,
    UnitPrice DOUBLE,
    CustomerID STRING,
    Country STRING
)
STORED AS PARQUET
LOCATION 'hdfs:///user/project/ecomm/processed/';

🔎 Step 4: Querying with Hive & PySpark

Example Hive Query:

SELECT Country, SUM(Quantity * UnitPrice) AS Revenue
FROM ecomm_sales
GROUP BY Country
ORDER BY Revenue DESC;

Same with PySpark:

df = spark.sql("""
SELECT Country, SUM(Quantity * UnitPrice) AS Revenue
FROM ecomm_sales
GROUP BY Country
ORDER BY Revenue DESC
""")
df.show()

🕵️ Hive Query Optimization Internals

Query Planning Process:

  1. Parser: Converts SQL to AST
  2. Planner/Optimizer:
    • Predicate Pushdown
    • Column Pruning
    • Join Reordering
    • Aggregation Pushdown
  3. Execution Plan: Generates physical plan for MapReduce/Tez/Spark

Optimizations:

  • hive.vectorized.execution.enabled=true — enables batch processing
  • hive.cbo.enable=true — cost-based optimization
  • hive.exec.dynamic.partition=true — for partitioned tables

⚡ Engine Comparisons: MapReduce vs Tez vs Spark vs LLAP

Feature | MapReduce | Tez | Spark | LLAP
Engine Type | Batch | DAG | DAG / In-Memory | In-Memory Daemons
Startup Latency | High | Medium | Medium | Low
Interactive Queries | No | No | Somewhat | Yes
Caching | No | No | Yes | Yes (Column cache)
Use Case | ETL | Batch SQL | Machine Learning | Dashboards, BI

LLAP = Low Latency Analytical Processing

  • Always-on daemon processes
  • In-memory caching
  • Reduced query startup time

💼 Summary

Component | Purpose
HDFS | Distributed storage
PySpark | Data processing + Hive querying
Hive | Metadata + SQL access to HDFS data
Tez/Spark | Query execution engines for Hive
LLAP | Interactive querying with Hive

📚 Next Steps

  • Add partitioning by Country or InvoiceDate
  • Benchmark Hive-on-Tez vs Hive-on-Spark
  • Secure data using HDFS ACLs and Ranger
  • Visualize results in Superset or Power BI
