In Hadoop HDFS, ensuring balanced storage across the cluster is crucial for performance and fault tolerance. While the hdfs balancer tool balances data across DataNodes, there’s another less-known but powerful utility called Disk Balancer, which balances data within a DataNode across its multiple disks.

Imbalanced disk usage within a DataNode can lead to:

  • Local disk bottlenecks
  • Uneven disk wear
  • Underutilized capacity
  • Reduced I/O throughput

This post explores how the HDFS Disk Balancer works, how to configure and use it, and best practices for maintaining optimal intra-node storage utilization.


What Is HDFS Disk Balancer?

The Disk Balancer is a command-line utility introduced in Hadoop 2.7 to distribute blocks evenly across all disks in a DataNode.

While the regular hdfs balancer operates at the cluster level, the Disk Balancer addresses disk skew on individual nodes — a common issue when disks are added, replaced, or fail.


When to Use Disk Balancer

Use Disk Balancer when:

  • A DataNode contains disks with uneven utilization
  • Disks are replaced or newly added to existing nodes
  • You observe local I/O bottlenecks in monitoring tools
  • Storage policies change (e.g., SSD vs HDD segregation)

Imbalanced disk usage can affect:

  • HDFS write performance (uneven block placement)
  • Read throughput (some disks overutilized)
  • Long-term disk health (uneven wear)

How Disk Balancer Works

Disk Balancer:

  1. Analyzes the usage across all volumes on a DataNode
  2. Generates a plan for redistributing blocks
  3. Moves HDFS blocks from overutilized to underutilized disks
  4. Ensures that no data is lost or corrupted during the move

It operates non-intrusively and throttles I/O usage to avoid impacting active workloads.


Step-by-Step: Using Disk Balancer

Step 1: Analyze Disk Utilization

hdfs diskbalancer -report

This command provides a utilization breakdown of each volume within the DataNode.

Step 2: Generate Disk Balancing Plan

hdfs diskbalancer -plan <DataNode_UUID> -out /tmp/plan.json
  • Replace <DataNode_UUID> with the actual UUID of the DataNode.
  • Use hdfs dfsadmin -report to find UUIDs.

Step 3: Execute the Plan

hdfs diskbalancer -execute /tmp/plan.json

The Disk Balancer begins redistributing blocks per the generated plan.

Step 4: Monitor Progress

Progress is logged to standard output and logs (e.g., /var/log/hadoop/hdfs).


Configuration Parameters

Key tunables (can be added to hdfs-site.xml):

<property>
<name>dfs.disk.balancer.max.disk.throughputInMBperSec</name>
<value>10</value>
</property>

<property>
  <name>dfs.disk.balancer.block.tolerance.percent</name>
  <value>10</value>
</property>

<property>
  <name>dfs.disk.balancer.max.disk.errors</name>
  <value>5</value>
</property>

These control:

  • I/O throttling
  • Block movement sensitivity
  • Tolerance for skipping bad disks

Safety and Performance Considerations

  • Disk Balancer is safe: data is only moved after integrity checks
  • Works online: no downtime required
  • Run during off-peak hours to minimize interference
  • Avoid running Disk Balancer concurrently with large HDFS balancer jobs

You can pause or abort execution if needed:

hdfs diskbalancer -cancel <planID>

Disk Balancer vs HDFS Balancer

Feature HDFS Balancer Disk Balancer
Scope Across DataNodes Within a single DataNode
Goal Equalize node utilization Equalize disk utilization
Trigger Skewed node usage Skewed local disk usage
Common Use Case Cluster-level optimization Local optimization after disk add/remove

Use both tools together for complete HDFS health and optimization.


Best Practices

  • Regularly monitor disk usage with hdfs dfsadmin -report
  • Run Disk Balancer after adding/removing disks
  • Automate health checks for per-disk utilization
  • Document UUIDs and hardware changes per DataNode
  • Set throughput limits to avoid job contention
  • Schedule jobs during maintenance windows if moving large data volumes

Conclusion

As HDFS clusters grow and evolve, disk skew within DataNodes can quietly degrade performance and increase hardware wear. The HDFS Disk Balancer provides a powerful yet lightweight way to keep local disk usage optimized — ensuring better I/O throughput, fault tolerance, and resource efficiency.

Make it part of your cluster tuning toolkit to maintain a high-performing and balanced Hadoop ecosystem.