Modern data architectures increasingly span both on-premises and cloud environments. Enterprises want to retain the control and performance of on-premises HDFS while leveraging the scalability and cost benefits of Amazon S3. A hybrid storage solution that combines HDFS and S3 lets organizations build flexible, elastic, and efficient data lakes.

In this blog, we’ll explore how to architect and implement a hybrid HDFS-S3 storage system, examine integration techniques, and highlight best practices for achieving optimal performance and cost-effectiveness.


Why Combine HDFS and S3?

Each storage system has unique strengths:

Storage System   Benefits
---------------  -------------------------------------------------------------------------
HDFS             High throughput, low-latency I/O, data locality, deep Hadoop integration
Amazon S3        Virtually unlimited storage, low cost, pay-per-use, high availability

A hybrid architecture offers:

  • Separation of compute and storage
  • Elastic scaling with S3 for cold or historical data
  • High-performance processing with HDFS for hot data
  • Cost optimization by tiering data appropriately

Common Hybrid Storage Architecture

+-----------------------+
|     Data Ingestion    |
| (Kafka, NiFi, Flume)  |
+-----------+-----------+
            |
            v
+-----------+-----------+
|  Hadoop YARN / Spark  |
+-----------+-----------+
            |
      +-----+---------------------------+
      |                                 |
+-----v-----+                   +-------v-------+
|   HDFS    |   <-- Hot Data    |      S3       |  <-- Cold / Archive Data
| (on-prem) |                   |    (cloud)    |
+-----------+                   +---------------+

Use Cases for Hybrid HDFS-S3 Storage

  1. Tiered Storage:
    • Hot, frequently accessed data in HDFS
    • Cold, rarely accessed or historical data in S3
  2. ETL Pipelines:
    • Ingest raw data into S3
    • Process it in Spark using HDFS for temp storage
    • Output results to S3 for downstream BI tools
  3. Disaster Recovery:
    • Periodically back up HDFS to S3
    • Enable cross-region restore
  4. Data Archival:
    • Move aged Hive partitions from HDFS to S3
    • Retain HDFS for recent partitions
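
As an illustration of use case 4, here is a hedged sketch of archiving an aged Hive partition: the table name, partition column, and bucket are hypothetical, and it assumes a partitioned Hive table registered in the metastore.

// Hypothetical archival flow for an aged Hive partition (table, column, and paths
// are placeholders). Step 1: copy the partition's files from HDFS to S3, for example
// with DistCp (shown later in this post). Step 2: repoint the Hive partition at the copy:
spark.sql("""
  ALTER TABLE sales PARTITION (dt = '2023-01-01')
  SET LOCATION 's3a://archive-bucket/hive/sales/dt=2023-01-01/'
""")
// Step 3: once the new location is verified, delete the old partition files from HDFS.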

Configuring HDFS + S3 Access in Hadoop/Spark

Use Hadoop’s S3A filesystem connector to access Amazon S3 directly.

Example configuration in core-site.xml:

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
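
If editing core-site.xml on every node is not practical, the same S3A properties can be passed at runtime through Spark. A minimal sketch, assuming the credentials are available as environment variables (the app name is illustrative; never hard-code real secrets in source control):

import org.apache.spark.sql.SparkSession

// Pass S3A settings via the spark.hadoop.* prefix, which Spark copies into the
// underlying Hadoop configuration; credentials come from environment variables here.
val spark = SparkSession.builder()
  .appName("hybrid-hdfs-s3")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()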

Read/write from Spark:

// Read raw data from S3, then write the processed output to on-prem HDFS
val df = spark.read.parquet("s3a://bucket-name/landing/raw_data")
df.write.mode("overwrite").parquet("hdfs:///data/processed/")
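
To complete the ETL pattern described earlier, the processed data can be aggregated and published back to S3 for downstream BI tools. A minimal sketch; the paths, column names, and output location are placeholders:

import org.apache.spark.sql.functions.sum

// Hypothetical final ETL step: aggregate processed data on HDFS and publish to S3.
val processed = spark.read.parquet("hdfs:///data/processed/")
processed
  .groupBy("region")
  .agg(sum("amount").as("total_amount"))
  .write
  .mode("overwrite")
  .parquet("s3a://bucket-name/curated/sales_by_region/")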

Integrating Hive with S3

You can define Hive external tables on S3 paths:

CREATE EXTERNAL TABLE sales (
  id          STRING,
  amount      DOUBLE,
  `timestamp` STRING
)
STORED AS PARQUET
LOCATION 's3a://bucket-name/hive/sales/';

This allows querying archived or cold data in S3 without moving it into HDFS.
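
Once the table is registered in the metastore, it can be queried like any other Hive table, for example from Spark SQL. A short sketch, assuming the SparkSession was created with Hive support enabled and using an illustrative filter:

// Query the S3-backed external table through the Hive metastore; Spark reads the
// Parquet files directly from S3 via the S3A connector.
val archived = spark.sql("SELECT id, amount FROM sales WHERE amount > 1000")
archived.show()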


Data Movement Between HDFS and S3

Use DistCp for large-scale, parallelized copy jobs:

Copy from HDFS to S3:

hadoop distcp hdfs:///data/logs s3a://backup-bucket/logs

Copy from S3 to HDFS:

hadoop distcp s3a://landing-bucket/events hdfs:///staging/events

Schedule these copy jobs with Airflow or Oozie for automated archival workflows.


Performance Considerations

  • S3 has higher latency than HDFS — avoid using it for shuffle-intensive Spark jobs
  • Optimize file formats with ORC or Parquet + Snappy compression
  • Avoid writing many small files to S3 (use compaction strategies; see the sketch after this list)
  • Tune Hadoop settings:
    <property>
      <name>fs.s3a.connection.maximum</name>
      <value>1000</value>
    </property>
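
As a sketch of the compaction point above: repartitioning (or coalescing) a DataFrame before writing reduces the number of objects created in S3. The paths and target file count below are illustrative and should be tuned to your data volume.

// Hypothetical compaction step before writing to S3 (paths and counts are placeholders).
// Fewer, larger Parquet files reduce S3 request and listing overhead.
val events = spark.read.parquet("hdfs:///staging/events")
events
  .repartition(64)   // target a modest number of large output files
  .write
  .mode("overwrite")
  .parquet("s3a://backup-bucket/events_compacted/")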
    

Security and Governance

  • Use IAM roles or instance profiles to manage S3 access securely (see the sketch after this list)
  • Encrypt data at rest using S3 SSE or client-side tools
  • Use Apache Ranger or AWS Lake Formation for access policies
  • Log S3 access with AWS CloudTrail and HDFS access with audit logs
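
A minimal sketch of the IAM-role approach, assuming the compute nodes run in AWS (for example EC2 or EMR) with an attached instance profile that grants S3 permissions; with this setting, no access keys need to appear in core-site.xml at all:

// Tell S3A to obtain credentials from the attached IAM role / instance profile
// instead of static keys. Assumes the cluster nodes run in AWS.
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider"
)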

Best Practices

  • Separate hot and cold paths in your data lake layout
  • Use Hive external tables on S3 for archived data
  • Automate lifecycle policies with DistCp, Airflow, or custom scripts
  • Monitor storage costs and performance regularly
  • Compress and partition data before storing in S3
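
For the last point, a hedged example of writing partitioned, compressed Parquet to S3; the dataset, partition column, and bucket are illustrative:

// Hypothetical example: partition by date and compress with Snappy before landing in S3.
spark.read.parquet("hdfs:///data/processed/")
  .write
  .mode("overwrite")
  .option("compression", "snappy")   // Snappy is the Parquet default in Spark
  .partitionBy("event_date")         // assumes an event_date column in the data
  .parquet("s3a://bucket-name/curated/events/")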

Conclusion

A hybrid storage solution with HDFS and Amazon S3 offers the best of both worlds — high-performance, low-latency access to hot data, and elastic, low-cost storage for archival workloads.

By architecting with tiered storage, leveraging tools like Spark, Hive, and DistCp, and following best practices for integration and governance, you can build a scalable and efficient modern data platform that aligns with business growth and cloud strategy.