In the Hadoop ecosystem, two of the most critical components — HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator) — work hand in hand to provide scalable, fault-tolerant, and distributed computing power for big data workloads.

While HDFS handles data storage, YARN is responsible for resource allocation and job scheduling across the cluster. Their coordination ensures efficient use of compute and storage resources, especially in data-intensive environments like ETL pipelines, machine learning jobs, and interactive analytics.

In this guide, we’ll dive into how HDFS and YARN interact, why their coordination is essential, and how to optimize them for better performance.


The Role of HDFS in Hadoop

HDFS is the distributed file system layer that:

  • Stores large files across multiple machines
  • Splits files into fixed-size blocks (128 MB by default; often raised to 256 MB)
  • Replicates each block (3 copies by default) for fault tolerance; both settings are easy to verify, as shown after this list
  • Provides high-throughput access for batch processing
  • Supports data locality awareness to reduce network overhead
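
A quick way to check the block size and replication factor from the command line (the path is just a placeholder for one of your own files):

# Print block size (%o), replication factor (%r), and file size in bytes (%b)
hdfs dfs -stat "%o %r %b" /data/events/2024-01-01.log

# Print the cluster-wide default block size from hdfs-site.xml
hdfs getconf -confKey dfs.blocksize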

The Role of YARN in Hadoop

YARN acts as the cluster resource manager, decoupling resource management from processing:

  • ResourceManager (RM): Central authority that tracks node resources and schedules jobs
  • NodeManager (NM): Runs on each worker node, launches and monitors containers, and reports resource usage and health (see the commands after this list)
  • ApplicationMaster (AM): Manages execution of a specific job (e.g., Spark, Hive)
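
A quick, read-only way to see the ResourceManager and NodeManagers at work is the YARN CLI:

# List every NodeManager registered with the ResourceManager, with state and container counts
yarn node -list -all

# List applications currently tracked by the ResourceManager (each runs its own ApplicationMaster)
yarn application -list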

YARN supports multiple execution engines and frameworks, including:

  • MapReduce
  • Apache Spark
  • Hive on Tez
  • Flink, and more

HDFS and YARN: Coordinated Execution

HDFS and YARN coordinate primarily through data locality awareness.

When YARN schedules a task, it:

  1. Queries HDFS to determine the block locations of input files
  2. Attempts to schedule the task on a node that stores the data block (data locality)
  3. Falls back to rack-local or remote execution if locality isn’t available

Why it matters:
Local reads are faster and avoid unnecessary network overhead, which is crucial for high-throughput jobs. You can inspect the same block-to-DataNode mapping that the scheduler relies on, as shown below.
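
The scheduler obtains block locations from the NameNode over RPC, but you can view the same mapping yourself with fsck (again, the path is a placeholder):

# Show every block of a file and the DataNodes that hold its replicas
hdfs fsck /data/events/2024-01-01.log -files -blocks -locations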


Types of Data Locality in YARN

YARN tries to achieve the following locality types, in order of preference:

  1. Node-local: Task runs on the same node where the data resides (best performance)
  2. Rack-local: Task runs on a node in the same rack as the data (acceptable)
  3. Off-rack (remote): Task runs on a node in a different rack (least optimal)

The NameNode exposes block locations to clients, and the ApplicationMaster forwards them to the ResourceManager as locality preferences in its container requests, enabling intelligent scheduling.
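
If rack awareness is configured, you can print the rack-to-DataNode topology exactly as the NameNode sees it, which is the information these locality decisions ultimately rest on (typically requires HDFS admin privileges):

# Print the rack topology reported by the NameNode
hdfs dfsadmin -printTopology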


Tuning YARN for Better Coordination with HDFS

To optimize coordination between HDFS and YARN:

  1. Increase the locality wait time
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value> <!-- Missed scheduling opportunities to wait for a node-local container -->
</property>

This Capacity Scheduler setting (in capacity-scheduler.xml) gives YARN more time to find a node-local container before relaxing to rack-local scheduling.

  2. Configure container sizes properly
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
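
Here, yarn.nodemanager.resource.memory-mb is the total memory each NodeManager offers to YARN, while yarn.scheduler.maximum-allocation-mb caps any single container request. After changing them, you can confirm what a node actually advertises (the node ID is a placeholder in host:port form):

# Show memory capacity and usage for one NodeManager; find IDs with: yarn node -list
yarn node -status <nodemanager-host>:<port>
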
  3. Enable speculative execution carefully

Speculative execution can help with slow tasks but may reduce locality.

<property>
<name>mapreduce.map.speculative</name>
<value>false</value>
</property>
  4. Spread data evenly across DataNodes

Use the HDFS balancer:

hdfs balancer -threshold 10

Balancing block placement across DataNodes increases the chance that YARN can find a node-local container for each task.
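
To see whether balancing is needed in the first place, compare per-DataNode usage before and after the run:

# Report configured capacity, used space, and remaining space for every DataNode
hdfs dfsadmin -report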


Example: Hive Query Execution with YARN and HDFS

When executing a Hive query:

  • The query is compiled into a Tez or Spark DAG
  • HDFS provides file splits and block locations
  • YARN schedules Tez/Spark containers on DataNodes where blocks reside
  • Each task reads data from HDFS and processes locally

This workflow demonstrates the seamless coordination between HDFS storage and YARN compute.
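
You can watch this coordination happen: while the query runs, the Tez or Spark application that Hive launched shows up in YARN like any other application. Filter by whichever engine you use:

# List running Tez applications (Hive on Tez runs its DAGs inside one of these)
yarn application -list -appTypes TEZ -appStates RUNNING

# For Hive on Spark, filter by the SPARK application type instead
yarn application -list -appTypes SPARK -appStates RUNNING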


Monitoring and Debugging

Monitor coordination using:

  • YARN ResourceManager UI: Track job locality percentages
  • HDFS NameNode UI: Monitor block distribution
  • Job History Server: Understand task runtimes and failure points
  • Logs: Use yarn logs -applicationId <application_id> to inspect application and container issues (full commands below)
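
A minimal command-line workflow for inspecting a finished or misbehaving application (the IDs are placeholders; the counter commands apply to classic MapReduce jobs, where DATA_LOCAL_MAPS versus RACK_LOCAL_MAPS tells you how much locality you actually achieved):

# Overall status of an application: state, final status, tracking URL
yarn application -status <application_id>

# Aggregated container logs for the application
yarn logs -applicationId <application_id>

# Locality counters for a MapReduce job
mapred job -counter <job_id> org.apache.hadoop.mapreduce.JobCounter DATA_LOCAL_MAPS
mapred job -counter <job_id> org.apache.hadoop.mapreduce.JobCounter RACK_LOCAL_MAPS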

Best Practices

  • Keep replication factor = 3 for better scheduling flexibility
  • Avoid storing large numbers of small files; batch them into larger files or archives (see the example after this list)
  • Use balanced data distribution to maximize node-local tasks
  • Monitor data locality stats in job counters and adjust scheduler delay if needed
  • Use long-running containers (LLAP) for interactive queries with reduced overhead
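
One common way to batch small files is a Hadoop Archive (HAR); the paths below are placeholders, and the archived data stays readable through the har:// scheme:

# Pack everything under /data/raw/logs into a single archive stored in /data/archives
hadoop archive -archiveName logs.har -p /data/raw/logs /data/archives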

Conclusion

Efficient resource coordination between HDFS and YARN is foundational to a high-performing Hadoop cluster. By understanding how YARN schedules tasks based on HDFS block locations and fine-tuning both systems, you can achieve:

  • Better data locality
  • Faster job execution
  • Improved cluster utilization

Together, HDFS and YARN create a tightly integrated storage-compute framework that continues to power some of the largest big data platforms in the world.