In big data environments, protecting valuable datasets from accidental deletion, corruption, or system failures is paramount. Hadoop Distributed File System (HDFS) provides a powerful feature called snapshots that can play a vital role in disaster recovery and data protection.

This blog explores how to use HDFS snapshots for disaster recovery scenarios, including how to create, manage, and restore from snapshots effectively in your Hadoop ecosystem.


What Are HDFS Snapshots?

HDFS snapshots are read-only, point-in-time copies of a directory in HDFS. They allow users to:

  • Recover accidentally deleted or modified data
  • Perform backups without copying entire datasets
  • Roll back to a consistent state after a failure or corruption
  • Audit historical states for compliance and governance

Snapshots do not duplicate data. Creating one is a near-instant, O(1) metadata operation: no data blocks are copied, and the NameNode records differences only as files are subsequently modified or deleted. This makes snapshots highly space-efficient.
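As a quick illustration (a hedged sketch; it assumes snapshots are already enabled on /data/warehouse, as shown in the next section, and uses a hypothetical events.csv), overwriting a file after taking a snapshot leaves the original version readable under .snapshot:

# take a snapshot, then overwrite a file in the live directory
hdfs dfs -createSnapshot /data/warehouse before_change
hdfs dfs -put -f updated_events.csv /data/warehouse/events.csv

# the pre-change version is still available from the snapshot
hdfs dfs -cat /data/warehouse/.snapshot/before_change/events.csv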


Enabling Snapshots on HDFS Directories

Before creating snapshots, you must enable snapshot support on the target directory. This is an administrative operation that requires HDFS superuser privileges:

hdfs dfsadmin -allowSnapshot /data/warehouse

Now, you can create snapshots on this path.
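To confirm which directories are snapshottable, list them:

hdfs lsSnapshottableDir

Run as the superuser, this shows all snapshottable directories; other users see only the directories they have permission to snapshot.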


Creating and Listing Snapshots

Use the following command to create a snapshot (the snapshot name is optional; if omitted, HDFS generates a timestamp-based name):

hdfs dfs -createSnapshot /data/warehouse snapshot_2024_11_16

List snapshots:

hdfs dfs -ls /data/warehouse/.snapshot

The .snapshot directory does not appear in normal directory listings; it must be referenced explicitly, and it contains one subdirectory per snapshot.
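Snapshots can also be renamed, which is handy for promoting an ad-hoc snapshot to a longer-lived, meaningful name:

hdfs dfs -renameSnapshot /data/warehouse snapshot_2024_11_16 snapshot_pre_migration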


Accessing and Restoring Data from Snapshots

To access files from a snapshot:

hdfs dfs -cat /data/warehouse/.snapshot/snapshot_2024_11_16/file.csv

To restore a deleted file:

hdfs dfs -cp /data/warehouse/.snapshot/snapshot_2024_11_16/file.csv /data/warehouse/file.csv
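To restore an entire directory, or to keep the original file attributes, hdfs dfs -cp accepts a -p flag (the sales subdirectory below is illustrative):

# restore a subdirectory, preserving timestamps, ownership, permission, ACLs, and XAttrs
hdfs dfs -cp -ptopax /data/warehouse/.snapshot/snapshot_2024_11_16/sales /data/warehouse/sales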

This makes snapshots a reliable way to recover from accidental deletions or data corruption.


Deleting Snapshots

Once a snapshot is no longer needed:

hdfs dfs -deleteSnapshot /data/warehouse snapshot_2024_11_16

Deleted snapshots release the references to old blocks, allowing space to be reclaimed.
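After all snapshots on a directory have been deleted, an administrator can disable snapshot support again (the command fails if any snapshots remain):

hdfs dfsadmin -disallowSnapshot /data/warehouse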


Use Cases for HDFS Snapshots

  1. Disaster Recovery
    Quickly restore files or datasets after accidental deletion or job failure.

  2. Safe Schema Changes
    Back up directories before performing Hive schema updates or partition drops.

  3. ETL Protection
    Take snapshots before executing large ETL jobs so you can roll back on error (see the wrapper sketch after this list).

  4. Backup and Archival
    Use snapshots as part of incremental backup pipelines with tools like DistCp.
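As a minimal sketch of the ETL-protection pattern (the Spark job and paths are placeholders, not part of any standard tooling), a wrapper script can take a safety snapshot before the job and drop it on success:

#!/usr/bin/env bash
# Hypothetical pre-ETL wrapper: adjust DIR and the job command for your environment.
set -euo pipefail

DIR=/data/warehouse
SNAP="pre_etl_$(date +%Y_%m_%d_%H%M%S)"

# take a safety snapshot before the job touches the data
hdfs dfs -createSnapshot "$DIR" "$SNAP"

if spark-submit etl_job.py; then
    # job succeeded: the safety snapshot is no longer needed
    hdfs dfs -deleteSnapshot "$DIR" "$SNAP"
else
    echo "ETL failed; recover from $DIR/.snapshot/$SNAP if needed" >&2
    exit 1
fi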


Integrating Snapshots into Backup Workflows

Combine snapshots with DistCp to replicate data to a secondary cluster. The -diff option (which requires -update) copies only the changes between two snapshots and propagates deletes and renames itself, so -delete is not needed; both source and target must already hold the older snapshot:

hadoop distcp -update -diff snapshot_2024_11_15 snapshot_2024_11_16 /data/warehouse hdfs://backup-cluster/data/warehouse

This enables incremental backups driven by snapshot differences, copying only changed files instead of re-scanning and re-copying the full dataset.
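A typical daily cycle looks like this (a hedged sketch; the cluster address and snapshot names are illustrative, and the target directory must be snapshottable on the backup cluster):

# 1. take today's snapshot on the source
hdfs dfs -createSnapshot /data/warehouse snapshot_2024_11_16

# 2. optionally preview what changed since yesterday
hdfs snapshotDiff /data/warehouse snapshot_2024_11_15 snapshot_2024_11_16

# 3. copy only the differences to the backup cluster
hadoop distcp -update -diff snapshot_2024_11_15 snapshot_2024_11_16 /data/warehouse hdfs://backup-cluster/data/warehouse

# 4. snapshot the target so tomorrow's diff has a common baseline on both sides
hdfs dfs -createSnapshot hdfs://backup-cluster/data/warehouse snapshot_2024_11_16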


Limitations of HDFS Snapshots

  • Snapshots are read-only — no edits allowed inside .snapshot
  • Snapshots are directory-level only, not file-level
  • Excessive snapshots may slow down NameNode metadata operations
  • Not a substitute for off-cluster backups

Use snapshots as a first line of defense, not as a replacement for full disaster recovery planning.


Best Practices

  • Enable snapshots on all critical directories (e.g., Hive warehouse, ingestion paths)
  • Name snapshots using consistent date-based naming (e.g., snapshot_YYYY_MM_DD)
  • Automate snapshot creation and cleanup with Oozie or cron (a sample script follows this list)
  • Limit the number of retained snapshots per directory (e.g., last 7 or 30 days)
  • Monitor snapshot space usage using HDFS metrics
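A minimal cron-driven sketch of that policy (the directory, retention window, and naming convention are assumptions; it relies on GNU date):

#!/usr/bin/env bash
# Create today's snapshot, then expire snapshots older than RETENTION_DAYS.
set -euo pipefail

DIR=/data/warehouse
RETENTION_DAYS=7

hdfs dfs -createSnapshot "$DIR" "snapshot_$(date +%Y_%m_%d)"

cutoff=$(date -d "-${RETENTION_DAYS} days" +%Y_%m_%d)
for path in $(hdfs dfs -ls "$DIR/.snapshot" | awk '/snapshot_/ {print $NF}'); do
    snap=${path##*/}
    # string comparison is safe because the date parts are zero-padded
    if [[ ${snap#snapshot_} < $cutoff ]]; then
        hdfs dfs -deleteSnapshot "$DIR" "$snap"
    fi
done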

Conclusion

HDFS snapshots are an essential tool in any Hadoop administrator’s disaster recovery toolkit. They offer fast, lightweight, and reliable protection against data loss scenarios and allow easy recovery from user or system errors.

By implementing regular snapshot policies and integrating them into your backup workflows, you can ensure business continuity and data integrity in your big data infrastructure.