The Hadoop Distributed File System (HDFS) is the foundational storage layer in the Hadoop ecosystem, built to store massive datasets reliably across clusters of machines. To understand its performance, fault-tolerance, and scalability, it’s essential to know how HDFS handles write and read operations under the hood.

In this blog, we’ll explore the internal mechanics of HDFS write and read paths, step by step. This knowledge is invaluable for developers, architects, and data engineers working with large-scale distributed systems.


HDFS Architecture Overview

Before diving into the data flow, let’s review the key components of HDFS:

  • NameNode: Manages metadata — file names, permissions, and block locations
  • DataNode: Stores actual file data in blocks
  • Client: The user application interacting with the file system
  • Secondary NameNode / Checkpoint Node: Periodically merges the FsImage with the edit log to produce fresh checkpoints (it is not a hot standby)

Files are split into large blocks (128 MB by default since Hadoop 2.x; 256 MB is a common tuned value) and distributed across DataNodes.
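
Both values are configurable. As a minimal sketch (assuming a client with the Hadoop client libraries on its classpath and a reachable cluster), the standard dfs.blocksize and dfs.replication properties can be overridden before any file is created:

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch: override the cluster defaults for files created by this client.
    // dfs.blocksize and dfs.replication are the standard HDFS property names;
    // the values below are illustrative, not recommendations.
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
    conf.setInt("dfs.replication", 3);                 // three replicas per block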


HDFS Write Path – Step-by-Step

Here’s how HDFS writes data from a client into the system:

Step 1: File Creation Request

  • The client calls create() on the HDFS API (via FileSystem.create()).
  • The NameNode verifies permissions and checks if the file already exists.
  • If the checks pass, the NameNode records the new file in its namespace and hands the client an output stream; blocks are allocated later, as data is written.
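
From application code, this step is a single call. A minimal sketch, reusing the Configuration from the overview (the path is purely illustrative):

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: obtain a FileSystem handle and ask the NameNode to create the file.
    // Assumes it runs inside a method that may throw IOException.
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/data/events/2024/part-0000"));
    // The NameNode has now recorded the file in its namespace; no blocks exist yet.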

Step 2: Block Allocation

  • When the client is ready to write (and each time it needs another block), the NameNode allocates a new block and chooses a set of DataNodes to hold its replicas (default replication factor: 3).
  • It returns a pipeline of DataNodes to the client: D1 → D2 → D3.
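
The replication factor that fixes the pipeline length, and the block size, can also be chosen per file through the long-form create() overload. A sketch with illustrative values:

    // Sketch: create a file with an explicit replication factor and block size,
    // overriding the cluster defaults for this file only. The NameNode will pick
    // that many DataNodes for each block's pipeline.
    FSDataOutputStream customOut = fs.create(
        new Path("/data/events/2024/part-0001"), // illustrative path
        true,                                    // overwrite if it already exists
        4096,                                    // client-side buffer size in bytes
        (short) 3,                               // replication factor -> pipeline of 3
        256L * 1024 * 1024);                     // 256 MB blocks for this file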

Step 3: Data Writing via Pipeline

  • The client splits the outgoing data into packets (typically 64 KB) and sends them to the first DataNode (D1).
  • D1 streams each packet to D2, which forwards it to D3, forming a chain.
  • Each DataNode writes the data and its checksums to local disk; the checksums are used to validate the block on later reads.

Step 4: Acknowledgment Chain

  • Acknowledgments flow back up the pipeline for every packet:
    • D3 sends an ACK to D2 → D2 sends an ACK to D1 → D1 sends an ACK to the client.
  • Only when every packet of the current block has been acknowledged by all nodes in the pipeline does the client move on to the next block. The sketch below shows how little of this is visible to application code.
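
The pipelining and acknowledgments live inside the HDFS client's output stream (DFSOutputStream in the Hadoop source); the application simply writes bytes. A rough sketch, streaming a local file (the name is illustrative) into the stream opened in Step 1:

    import java.io.FileInputStream;
    import java.io.InputStream;

    // Sketch: packet framing, checksums, and the ACK chain all happen inside the
    // stream's client-side machinery while we copy bytes into it.
    InputStream localInput = new FileInputStream("events.local");
    byte[] buffer = new byte[64 * 1024];
    int bytesRead;
    while ((bytesRead = localInput.read(buffer)) != -1) {
        out.write(buffer, 0, bytesRead);
    }
    localInput.close();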

Step 5: Block Finalization

  • Once all blocks are written, the client calls close() on the output stream.
  • The NameNode marks the file as complete and records the change durably in its edit log; the change is folded into the FsImage at the next checkpoint.
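
On the client side, finalization is just the close() call. Continuing the sketch:

    // Sketch: closing the stream flushes any buffered packets, waits for the final
    // pipeline acknowledgments, and asks the NameNode to mark the file complete.
    out.close();
    // Illustrative sanity check: the NameNode now reports the file's full length.
    long length = fs.getFileStatus(new Path("/data/events/2024/part-0000")).getLen();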

Key Features:

  • Write-once-read-many semantics: existing data cannot be overwritten in place; at most, new data can be appended where append is enabled.
  • Pipelined writes ensure efficiency and fault-tolerance.
  • Data in an unclosed file is not guaranteed to be visible or durable until the file is closed or explicitly flushed, as sketched below.
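
For writers that need readers to see data before the file is closed, FSDataOutputStream exposes hflush() and hsync(). A brief sketch with an illustrative path and payload:

    // Sketch: a long-running writer that makes partial data visible without closing.
    FSDataOutputStream logOut = fs.create(new Path("/logs/app/current"));
    logOut.write("one log record\n".getBytes(java.nio.charset.StandardCharsets.UTF_8));
    logOut.hflush(); // data reaches the DataNodes; new readers can now see it
    logOut.hsync();  // additionally asks each DataNode to sync to disk (more costly)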

HDFS Read Path – Step-by-Step

Let’s look at how a file is read from HDFS:

Step 1: File Open Request

  • The client calls open() on the HDFS API.
  • The NameNode returns the file's block list and, for each block, the DataNodes holding its replicas, sorted by proximity to the client.
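
The same metadata is visible to applications: open() returns a stream, and getFileBlockLocations() reports which DataNodes hold which byte ranges. A sketch, reusing the FileSystem handle from the write examples (the path is illustrative):

    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;

    // Sketch: open the file and inspect where its blocks live.
    Path path = new Path("/data/events/2024/part-0000");
    FSDataInputStream in = fs.open(path);

    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
        System.out.printf("offset=%d length=%d hosts=%s%n",
            block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }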

Step 2: Block Selection

  • The client selects the closest DataNode (based on rack-awareness).
  • If the nearest node is unavailable, it picks another node with the block replica.

Step 3: Data Streaming

  • The client opens a TCP connection to the DataNode.
  • Data is streamed back in packets (on the order of 64 KB), and the client verifies the checksums stored with the block as it reads.
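
On the client side this is an ordinary stream read; checksum verification happens inside the input stream. Continuing the sketch:

    // Sketch: sequential read; the stream pulls data from the chosen DataNode and
    // verifies checksums transparently as it arrives.
    byte[] buf = new byte[64 * 1024];
    long totalBytes = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
        totalBytes += n; // stand-in for real processing of buf[0..n)
    }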

Step 4: Block-by-Block Reading

  • After reading one block, the client proceeds to the next, repeating steps 2–3.
  • Different blocks or byte ranges can be read in parallel by multithreaded applications, typically via positioned reads (sketched below).
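
Positioned reads do not move the stream's main cursor and can therefore be issued concurrently against different offsets. A sketch with illustrative offsets:

    // Sketch: positioned read of one region of the file, independent of the stream's
    // current position; several of these can be issued from different threads.
    byte[] region = new byte[1024 * 1024];
    long offset = 256L * 1024 * 1024; // e.g. the start of the third 128 MB block
    int bytesFetched = in.read(offset, region, 0, region.length); // may return fewer bytes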

Step 5: Close the File

  • Once all blocks are read, the client closes the file.

Key Features:

  • Streaming reads: efficient for large-scale analytics
  • Automatic failover: if one DataNode fails, another replica is used
  • Checksums ensure integrity

Write and Read Path Diagrams (Conceptual)

Write Path:

Client → NameNode              (create file / request a new block)
Client → D1 → D2 → D3          (data pipeline, packet by packet)
Client ← D1 ← D2 ← D3          (acknowledgments flow back)
Client → NameNode              (close and finalize the file)

Read Path:

Client → NameNode              (fetch block list and DataNode locations)
Client → D1                    (or D2/D3 if D1 is unavailable)
Read block data → verify checksums → move to the next block

Optimizations and Fault Tolerance

  • Rack Awareness: With the default placement policy and three replicas, one copy lands on the writer's node (or rack) and the other two on a different rack, so losing an entire rack never loses every copy.
  • Heartbeat Mechanism: DataNodes send heartbeats to the NameNode to indicate health.
  • Re-replication: If a DataNode fails, blocks are automatically replicated elsewhere.
  • Checksum Verification: Ensures data integrity during read and write.
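
Most of these mechanisms are governed by standard configuration properties, normally set in hdfs-site.xml and core-site.xml on the cluster nodes. A hedged sketch of how the commonly inspected ones look from the Java Configuration API (the defaults shown are the usual ones):

    // Sketch: the standard property names behind the mechanisms above, read with
    // their usual defaults. These are cluster-side settings, shown here only to
    // put names to the behavior.
    String rackScript = conf.get("net.topology.script.file.name");   // rack-awareness topology script (null if unset)
    long heartbeatSecs = conf.getLong("dfs.heartbeat.interval", 3);  // DataNode heartbeat period in seconds
    long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300000); // feeds the dead-DataNode timeout
    int bytesPerChecksum = conf.getInt("dfs.bytes-per-checksum", 512); // checksum granularity for stored block data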

Best Practices

  • Avoid large numbers of small files; every file and block consumes NameNode memory, and HDFS is optimized for large sequential reads and writes (see the sketch after this list).
  • Use tools like DistCp for bulk data transfer between clusters.
  • Monitor write failures and hdfs fsck reports to catch missing, corrupt, or under-replicated blocks early.
  • Enable append only if absolutely necessary (comes with overhead).
  • Use High Availability (HA) NameNode setup in production clusters.
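
One practical check for the small-file problem is to walk a directory tree and count files smaller than a block. A rough sketch (the threshold and path are illustrative):

    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.RemoteIterator;

    // Sketch: count files smaller than one default block under a directory.
    // A large count usually means NameNode memory pressure and slow scans.
    long threshold = 128L * 1024 * 1024;
    long smallFiles = 0;
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path("/data"), true); // recursive
    while (files.hasNext()) {
        if (files.next().getLen() < threshold) {
            smallFiles++;
        }
    }
    System.out.println("Files smaller than one block: " + smallFiles);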

Conclusion

Understanding the write and read path internals of HDFS gives you the power to build reliable, scalable, and fault-tolerant big data applications. From block replication to pipelined writes and rack-aware reads, HDFS has been optimized to meet the demanding needs of distributed storage in modern data ecosystems.

By designing applications that align with HDFS’s architecture, you ensure better performance, higher availability, and greater resilience in your data infrastructure.