Optimizing HDFS Performance with Tiered Storage
Leverage tiered storage in HDFS to boost performance, reduce costs, and manage data more efficiently
As data volumes explode, not all data in a Hadoop ecosystem requires the same level of storage performance. While some data is accessed frequently and needs low-latency response (“hot” data), other datasets are rarely accessed and can reside on slower, cheaper storage (“cold” data).
To address this, HDFS supports tiered storage, allowing administrators to classify data based on access frequency and store it across different storage media like SSDs, HDDs, and archival volumes — optimizing both performance and cost.
In this blog, we’ll explore how to use tiered storage in HDFS, configure storage types, apply data placement policies, and implement best practices for performance tuning and lifecycle management.
What is Tiered Storage in HDFS?
Tiered storage is the ability to use different types of storage hardware in the same HDFS cluster. This post groups them into three temperature tiers, each backed by an HDFS storage type:
- Hot (SSD): High-speed, low-latency storage for frequently accessed data
- Warm (DISK): Standard magnetic disks (HDDs) for regular data
- Cold (ARCHIVE): Dense, slower storage for rarely accessed data
These tier names describe hardware, not policy names: the built-in HOT storage policy, despite its name, places replicas on DISK rather than SSD.
HDFS lets you assign storage types to DataNode volumes and apply storage policies to control where data lives.
Configuring Storage Types in HDFS
- Define storage types per DataNode in hdfs-site.xml:
<property>
<name>dfs.datanode.data.dir</name>
<value>[SSD]/data/ssd,[DISK]/data/disk,[ARCHIVE]/data/archive</value>
</property>
Example:
<value>[SSD]/mnt/ssd1,[DISK]/mnt/hdd1,[ARCHIVE]/mnt/archive1</value>
This registers multiple volumes with different performance characteristics. A directory listed without a tag is treated as DISK.
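- Confirm storage policy support (Hadoop 2.6+). The feature is governed by dfs.storage.policy.enabled, which defaults to true, so no extra step is normally needed. List the available policies with: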
hdfs storagepolicies -listPolicies
Available policies:
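- HOT: All replicas on DISK (the default policy)
- WARM: One replica on DISK, the rest on ARCHIVE
- COLD: All replicas on ARCHIVE
- ALL_SSD: All replicas on SSD
- ONE_SSD: One replica on SSD, the rest on DISK
- LAZY_PERSIST: First replica written to RAM_DISK, then lazily persisted to DISK (for single-replica temporary data)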
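Before relying on a policy, it helps to confirm which storage type each volume was actually registered with. The sketch below splits a dfs.datanode.data.dir value into type and path pairs. The function name list_volume_types and the untagged /data/plain directory are illustrative assumptions; only the hdfs getconf call mentioned in the comment touches the cluster.

```shell
# Print "<STORAGE_TYPE> <path>" for each entry of a dfs.datanode.data.dir value.
# Entries without a [TYPE] tag are reported as DISK, the HDFS default.
list_volume_types() (
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
    case "$entry" in
      \[*\]*)
        type=${entry%%\]*}   # "[SSD]/data/ssd" -> "[SSD"
        type=${type#\[}      # "[SSD" -> "SSD"
        path=${entry#*\]}    # "[SSD]/data/ssd" -> "/data/ssd"
        ;;
      *)
        type=DISK
        path=$entry
        ;;
    esac
    printf '%s %s\n' "$type" "$path"
  done
)

# On a DataNode host with the Hadoop client on PATH, you could feed it the
# live setting instead of a literal string:
#   list_volume_types "$(hdfs getconf -confKey dfs.datanode.data.dir)"
list_volume_types "[SSD]/data/ssd,[DISK]/data/disk,/data/plain,[ARCHIVE]/data/archive"
```

If a volume you expected to be SSD or ARCHIVE shows up as DISK here, the tag is missing or misspelled in hdfs-site.xml.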
Assigning Storage Policies to Directories
Use hdfs storagepolicies to assign storage policies to directories:
hdfs storagepolicies -setStoragePolicy -path /user/analytics/hot -policy HOT
hdfs storagepolicies -setStoragePolicy -path /user/analytics/cold -policy COLD
To verify:
hdfs storagepolicies -getStoragePolicy -path /user/analytics/hot
New files written to these directories follow the assigned policy. Blocks that already exist stay where they are until the HDFS mover relocates them.
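When several directories need policies, a small mapping keeps the assignments reviewable. The following is a minimal sketch, not a definitive tool: the function name apply_policies and the DRY_RUN switch are assumptions introduced here, and by default the script only prints the commands it would run.

```shell
# Apply a storage policy to each "<path> <policy>" pair read from stdin.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

apply_policies() {
  while read -r path policy; do
    # Skip blank lines and comment lines.
    case "$path" in ''|\#*) continue ;; esac
    if [ "$DRY_RUN" = 1 ]; then
      echo "hdfs storagepolicies -setStoragePolicy -path $path -policy $policy"
    else
      # </dev/null keeps hdfs from consuming the rest of the mapping on stdin.
      hdfs storagepolicies -setStoragePolicy -path "$path" -policy "$policy" </dev/null
    fi
  done
}

apply_policies <<'EOF'
# path                  policy
/user/analytics/hot     HOT
/user/analytics/cold    COLD
EOF
```

Running it with DRY_RUN=0 on a host with the Hadoop client would issue the real commands; verifying afterwards with -getStoragePolicy is still worthwhile.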
Data Lifecycle Management with Tiered Storage
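- Ingest into the hot tier: New, high-value data (e.g., streaming logs, transactions) is written to a directory with a fast-storage policy. Use ALL_SSD or ONE_SSD for SSD-backed storage; the default HOT policy keeps replicas on DISK.
- Migrate to WARM/COLD tiers: Use lifecycle scripts or automated jobs to move older data. First change the directory's storage policy, then run the mover:
hdfs storagepolicies -setStoragePolicy -path /user/analytics/2023/ -policy COLD
hdfs mover -p /user/analytics/2023/
The hdfs mover tool scans the given paths and migrates block replicas whose current storage type violates the directory's storage policy. It has no bandwidth option; transfers are throttled by the DataNode setting dfs.datanode.balance.bandwidthPerSec, which can be changed at runtime with hdfs dfsadmin -setBalancerBandwidth.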
- Purge or archive cold data: Data no longer needed can be moved to S3, Glacier, or deleted.
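The lifecycle steps above can be sketched as a planning script that maps a dataset's age to a policy and prints the commands to run. The function names policy_for_age and plan_migration, and the 30-day and 180-day thresholds, are assumptions chosen for illustration; they are not HDFS defaults.

```shell
# Map the age of a dataset (whole days) to a storage policy name.
# The 30- and 180-day thresholds are illustrative assumptions.
policy_for_age() {
  if [ "$1" -le 30 ]; then
    echo HOT
  elif [ "$1" -le 180 ]; then
    echo WARM
  else
    echo COLD
  fi
}

# Print the two commands that move a directory to the tier matching its age:
# first change the policy, then let the mover migrate existing replicas.
plan_migration() {
  echo "hdfs storagepolicies -setStoragePolicy -path $1 -policy $(policy_for_age "$2")"
  echo "hdfs mover -p $1"
}

# Example: a directory holding data that is 400 days old.
plan_migration /user/analytics/2023 400
```

Printing the plan rather than executing it lets an operator review the moves before any data is relocated.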
Performance Benefits of Tiered Storage
- Faster analytics: SSDs reduce query latency for time-sensitive data
- Lower I/O contention: Hot workloads no longer compete with cold-data reads for the same disks
- Cost-effective scaling: Use HDDs and archive disks for long-term storage
- Better hardware utilization: Match data with the right storage profile
Monitoring and Tuning Tiered Storage
- Monitor block distribution:
hdfs fsck /user/analytics/ -files -blocks -locations
- Check storage usage:
hdfs dfsadmin -report
- Track mover performance:
Use logs and cluster metrics (via Ambari, Cloudera Manager, Prometheus) to ensure rebalancing and policies are applied as expected.
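To check whether replicas actually sit on the intended tier, the fsck output can be summarized by storage type. This sketch assumes the DatanodeInfoWithStorage[host:port,storageID,TYPE] form that recent Hadoop releases print with -locations; compare it against your own output first. The function name count_storage_types and the sample lines are made up for illustration.

```shell
# Read `hdfs fsck <path> -files -blocks -locations` output on stdin and
# print "<STORAGE_TYPE> <replica count>" for each storage type found.
count_storage_types() {
  grep -o 'DatanodeInfoWithStorage\[[^]]*]' \
    | sed 's/.*,\([A-Z_]*\)]$/\1/' \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Real usage (requires a cluster):
#   hdfs fsck /user/analytics/ -files -blocks -locations | count_storage_types
# Demonstration on two hand-written sample lines:
count_storage_types <<'EOF'
0. blk_1 len=100 Live_repl=2 [DatanodeInfoWithStorage[10.0.0.1:9866,DS-1,DISK], DatanodeInfoWithStorage[10.0.0.2:9866,DS-2,ARCHIVE]]
1. blk_2 len=100 Live_repl=1 [DatanodeInfoWithStorage[10.0.0.3:9866,DS-3,ARCHIVE]]
EOF
```

A COLD directory that still reports DISK replicas after a mover run is a sign that the ARCHIVE volumes are full or that the mover has not finished.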
Best Practices
- Use SSD storage for:
- Recent logs
- Real-time dashboards
- Hive/Presto query targets
- Use DISK/HDD for:
- ETL staging data
- Machine learning feature stores
- Mid-frequency access files
- Use ARCHIVE for:
- Historical snapshots
- Compliance storage
- Monthly or yearly reports
- Schedule the HDFS mover to run during low-usage hours
- Align storage tiers with business SLAs for performance and retention
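One way to keep the mover inside a quiet period is a guard that checks the current hour before launching it. This is a sketch under assumptions: the function name in_window, the 22:00 to 05:00 window, and the target path are all illustrative, and the script only prints what it would do.

```shell
# Succeed when hour $1 (0-23) falls inside the window [$2, $3).
# Handles windows that wrap past midnight, such as 22 -> 5.
in_window() {
  if [ "$2" -le "$3" ]; then
    [ "$1" -ge "$2" ] && [ "$1" -lt "$3" ]
  else
    [ "$1" -ge "$2" ] || [ "$1" -lt "$3" ]
  fi
}

hour=$(date +%H)
hour=${hour#0}   # drop a leading zero so "08" becomes "8"

if in_window "$hour" 22 5; then
  echo "inside window: would run hdfs mover -p /user/analytics/2023/"
else
  echo "outside window: skipping mover"
fi

# A plain crontab entry achieves the same start time without the guard:
#   0 1 * * * hdfs mover -p /user/analytics/2023/
```

The guard is useful when the job is triggered by something other than cron, such as a workflow scheduler that may retry during business hours.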
Conclusion
Tiered storage in HDFS provides a strategic way to optimize performance, control costs, and streamline data access across different temperature layers. By combining high-speed SSDs, reliable HDDs, and inexpensive archive storage — all within a single Hadoop cluster — you can design a scalable, efficient, and future-ready data architecture.
Start implementing tiered storage today to ensure your big data platform is optimized, responsive, and sustainable for the long haul.