As data volumes explode, not all data in a Hadoop ecosystem requires the same level of storage performance. While some data is accessed frequently and needs low-latency response (“hot” data), other datasets are rarely accessed and can reside on slower, cheaper storage (“cold” data).

To address this, HDFS supports tiered storage, allowing administrators to classify data based on access frequency and store it across different storage media like SSDs, HDDs, and archival volumes — optimizing both performance and cost.

In this blog, we’ll explore how to use tiered storage in HDFS, configure storage types, apply data placement policies, and implement best practices for performance tuning and lifecycle management.


What is Tiered Storage in HDFS?

Tiered storage is the ability to use different types of storage hardware in the same HDFS cluster, categorized into:

  • Hot (SSD): High-speed, low-latency storage for frequently accessed data
  • Warm (HDD): Standard magnetic disks for regular data
  • Cold (ARCHIVE/REMOTE): Slow or remote storage for rarely accessed data

HDFS lets you assign storage types to DataNode volumes and apply block placement policies to control where data lives.


Configuring Storage Types in HDFS

  1. Define storage types per DataNode in hdfs-site.xml:
<property>
<name>dfs.datanode.data.dir</name>
<value>[SSD]/data/ssd,[DISK]/data/disk,[ARCHIVE]/data/archive</value>
</property>

Example:

<value>[SSD]/mnt/ssd1,[DISK]/mnt/hdd1,[ARCHIVE]/mnt/archive1</value>

This registers multiple volumes with different performance characteristics.

  1. Enable storage policy support (Hadoop 2.6+):
hdfs storagepolicies -listPolicies

Available policies:

  • HOT: All replicas on DISK
  • WARM: One on SSD, rest on DISK
  • COLD: All on ARCHIVE
  • ALL_SSD: All replicas on SSD
  • ONE_SSD: One replica on SSD
  • LAZY_PERSIST: Uses RAM disk (for temporary data)

Assigning Storage Policies to Directories

Use hdfs storagepolicies to assign storage types to directories:

hdfs storagepolicies -setStoragePolicy -path /user/analytics/hot -policy HOT
hdfs storagepolicies -setStoragePolicy -path /user/analytics/cold -policy COLD

To verify:

hdfs storagepolicies -getStoragePolicy -path /user/analytics/hot

Files written to these directories will follow the associated placement policy.


Data Lifecycle Management with Tiered Storage

  1. Ingest into HOT tier: New, high-value data (e.g., streaming logs, transactions) are written to SSD-backed storage.

  2. Migrate to WARM/COLD tiers: Use lifecycle scripts or automated jobs to move older data:

hdfs mover -p /user/analytics/2023/ -bandwidth 50

The hdfs mover tool rebalances blocks based on the directory’s storage policy.

  1. Purge or archive cold data: Data no longer needed can be moved to S3, Glacier, or deleted.

Performance Benefits of Tiered Storage

  • Faster analytics: SSDs reduce query latency for time-sensitive data
  • Lower IOPS pressure: Separates heavy workloads from cold storage reads
  • Cost-effective scaling: Use HDDs and archive disks for long-term storage
  • Better hardware utilization: Match data with the right storage profile

Monitoring and Tuning Tiered Storage

  1. Monitor block distribution:
hdfs fsck /user/analytics/ -files -blocks -locations
  1. Check storage usage:
hdfs dfsadmin -report
  1. Track mover performance:

Use logs and cluster metrics (via Ambari, Cloudera Manager, Prometheus) to ensure rebalancing and policies are applied as expected.


Best Practices

  • Use SSD storage for:
    • Recent logs
    • Real-time dashboards
    • Hive/Presto query targets
  • Use DISK/HDD for:
    • ETL staging data
    • Machine learning feature stores
    • Mid-frequency access files
  • Use ARCHIVE for:
    • Historical snapshots
    • Compliance storage
    • Monthly or yearly reports
  • Schedule the HDFS mover to rebalance blocks during low-usage hours

  • Align storage tiers with business SLAs for performance and retention

Conclusion

Tiered storage in HDFS provides a strategic way to optimize performance, control costs, and streamline data access across different temperature layers. By combining high-speed SSDs, reliable HDDs, and inexpensive archive storage — all within a single Hadoop cluster — you can design a scalable, efficient, and future-ready data architecture.

Start implementing tiered storage today to ensure your big data platform is optimized, responsive, and sustainable for the long haul.