HDFS and YARN: Effective Resource Coordination
Learn how HDFS and YARN work together to power scalable and efficient big data processing
In the Hadoop ecosystem, two of the most critical components — HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator) — work hand in hand to provide scalable, fault-tolerant, and distributed computing power for big data workloads.
While HDFS handles data storage, YARN is responsible for resource allocation and job scheduling across the cluster. Their coordination ensures efficient use of compute and storage resources, especially in data-intensive environments like ETL pipelines, machine learning jobs, and interactive analytics.
In this guide, we’ll dive into how HDFS and YARN interact, why their coordination is essential, and how to optimize them for better performance.
The Role of HDFS in Hadoop
HDFS is the distributed file system layer that:
- Stores large files across multiple machines
- Splits files into blocks (128 MB by default, commonly configured to 256 MB)
- Replicates blocks (default: 3 copies) for fault tolerance
- Provides high-throughput access for batch processing
- Supports data locality awareness to reduce network overhead
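You can see this block layout directly. The following command (the file path is just an example) lists every block of a file along with the DataNodes holding its replicas:
hdfs fsck /data/logs/events-2024-01.log -files -blocks -locations
The host list reported for each block is exactly the information YARN later uses for locality-aware scheduling.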
The Role of YARN in Hadoop
YARN acts as the cluster resource manager, decoupling resource management from processing:
- ResourceManager (RM): Central authority that tracks node resources and schedules jobs
- NodeManager (NM): Runs on each worker node and reports health/status
- ApplicationMaster (AM): Manages execution of a specific job (e.g., Spark, Hive)
YARN supports multiple execution engines and frameworks, including:
- MapReduce
- Apache Spark
- Hive on Tez
- Flink, and more
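You can inspect what the ResourceManager is tracking from the command line:
yarn node -list
yarn application -list
The first command shows the NodeManagers registered with the ResourceManager and their state; the second lists the applications (and therefore the ApplicationMasters) it is currently managing.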
HDFS and YARN: Coordinated Execution
HDFS and YARN coordinate primarily through data locality awareness.
When YARN schedules a task, it:
- Queries HDFS to determine the block locations of input files
- Attempts to schedule the task on a node that stores the data block (data locality)
- Falls back to rack-local or remote execution if locality isn’t available
Why it matters:
Local reads are faster and avoid unnecessary network overhead. This is crucial for high-throughput jobs.
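Under the hood, this block-location lookup goes through the HDFS client API. Below is a minimal Java sketch (the input path is hypothetical) of the call that input formats and schedulers rely on:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/data/events/part-00000"); // hypothetical input file

        // Ask the NameNode which DataNodes hold each block of the file
        FileStatus status = fs.getFileStatus(input);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
The hosts returned per block are what the scheduler compares against available NodeManagers when it tries for node-local placement.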
Types of Data Locality in YARN
YARN tries to achieve the following locality types, in order of preference:
- Node-local: Task runs on the same node where the data resides (best performance)
- Rack-local: Task runs on a node in the same rack as the data (acceptable)
- Off-rack (remote): Task runs on a node in a different rack (least optimal)
HDFS informs YARN about block locations using the NameNode, enabling intelligent scheduling.
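Note that rack-local scheduling only works if Hadoop knows the cluster topology. This is configured in core-site.xml with a script that maps hostnames to racks (the script path below is a placeholder):
<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf/topology.sh</value> <!-- placeholder: script mapping host to rack -->
</property>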
Tuning YARN for Better Coordination with HDFS
To optimize coordination between HDFS and YARN:
- Increase locality wait time
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value> <!-- Scheduling opportunities to wait for a node-local container (Capacity Scheduler, capacity-scheduler.xml) -->
</property>
This gives the Capacity Scheduler more scheduling opportunities to find a node-local container before falling back to rack-local or off-rack placement.
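If you run the Fair Scheduler instead of the Capacity Scheduler, the analogous knob is a locality threshold expressed as a fraction of the cluster (the value here is illustrative):
<property>
<name>yarn.scheduler.fair.locality.threshold.node</name>
<value>0.5</value> <!-- fraction of cluster nodes to pass up before accepting a non-local placement -->
</property>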
- Configure container sizes properly
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
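Here yarn.nodemanager.resource.memory-mb caps the total memory a NodeManager offers to containers, and yarn.scheduler.maximum-allocation-mb caps a single container request. It is also worth setting a minimum allocation so requests round up predictably and containers pack evenly onto nodes (the value is illustrative):
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>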
- Enable speculative execution carefully
Speculative execution can help with slow tasks but may reduce locality.
<property>
<name>mapreduce.map.speculative</name>
<value>false</value>
</property>
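The reduce-side counterpart can be disabled the same way if locality and cluster load matter more than straggler mitigation:
<property>
<name>mapreduce.reduce.speculative</name>
<value>false</value>
</property>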
- Spread data evenly across DataNodes
Use the HDFS balancer:
hdfs balancer -threshold 10
A balanced cluster means more nodes hold the relevant blocks, giving YARN more opportunities to schedule node-local tasks.
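To check how evenly blocks are currently spread across DataNodes, before and after running the balancer:
hdfs dfsadmin -report
The report shows per-DataNode capacity and DFS usage, which makes skewed nodes easy to spot.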
Example: Hive Query Execution with YARN and HDFS
When executing a Hive query:
- The query is compiled into a Tez or Spark DAG
- HDFS provides file splits and block locations
- YARN schedules Tez/Spark containers on DataNodes where blocks reside
- Each task reads data from HDFS and processes locally
This workflow demonstrates the seamless coordination between HDFS storage and YARN compute.
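If the query runs on Tez, its DAG shows up in YARN as a regular application, which you can confirm from the command line:
yarn application -list -appTypes TEZ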
Monitoring and Debugging
Monitor coordination using:
- YARN ResourceManager UI: Track job locality percentages
- HDFS NameNode UI: Monitor block distribution
- Job History Server: Understand task runtimes and failure points
- Logs: Use yarn logs -applicationId <application_id> to inspect job and container issues
Best Practices
- Keep replication factor = 3 for better scheduling flexibility
- Avoid storing too many small files; batch them into larger files (see the archive example after this list)
- Use balanced data distribution to maximize node-local tasks
- Monitor data locality stats in job counters and adjust scheduler delay if needed
- Use long-running containers (e.g., Hive LLAP) for interactive queries to reduce container startup overhead
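For the small-files point above, one option is to pack many small files into a Hadoop archive (the paths and archive name here are hypothetical):
hadoop archive -archiveName logs.har -p /raw/logs 2024-01 /archived/logs
Fewer, larger files mean fewer blocks and splits, which reduces NameNode pressure and gives YARN fewer, better-sized tasks to place.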
Conclusion
Efficient resource coordination between HDFS and YARN is foundational to a high-performing Hadoop cluster. By understanding how YARN schedules tasks based on HDFS block locations and fine-tuning both systems, you can achieve:
- Better data locality
- Faster job execution
- Improved cluster utilization
Together, HDFS and YARN create a tightly integrated storage-compute framework that continues to power some of the largest big data platforms in the world.