Building Hybrid Storage Solutions with HDFS and S3
Combine the strengths of HDFS and Amazon S3 to create scalable and cost-efficient data lake architectures
Modern data architectures increasingly span both on-premises and cloud environments. Enterprises want to retain the control and performance of on-premises HDFS while leveraging the scalability and cost benefits of Amazon S3. A hybrid storage solution that combines the two lets organizations build flexible, elastic, and efficient data lakes.
In this blog, we’ll explore how to architect and implement a hybrid HDFS-S3 storage system, examine integration techniques, and highlight best practices for achieving optimal performance and cost-effectiveness.
Why Combine HDFS and S3?
Each storage system has unique strengths:
Storage System | Benefits
---|---
HDFS | High throughput, low-latency I/O, data locality, deep Hadoop integration
Amazon S3 | Virtually unlimited storage, low cost, pay-per-use, high availability
A hybrid architecture offers:
- Separation of compute and storage
- Elastic scaling with S3 for cold or historical data
- High-performance processing with HDFS for hot data
- Cost optimization by tiering data appropriately
Common Hybrid Storage Architecture
         +-----------------------+
         |    Data Ingestion     |
         | (Kafka, NiFi, Flume)  |
         +----------+------------+
                    |
                    v
         +----------+------------+
         |  Hadoop YARN / Spark  |
         +----------+------------+
                    |
         +----------+-----------------+
         |                            |
   +-----v-----+              +-------v-------+
   |   HDFS    | <-- Hot Data |      S3       | <-- Cold / Archive Data
   | (on-prem) |              |    (cloud)    |
   +-----------+              +---------------+
Use Cases for Hybrid HDFS-S3 Storage
- Tiered Storage:
  - Hot, frequently accessed data in HDFS
  - Cold, rarely accessed or historical data in S3
- ETL Pipelines (see the Spark sketch after this list):
  - Ingest raw data into S3
  - Process it in Spark, using HDFS for intermediate storage
  - Output results to S3 for downstream BI tools
- Disaster Recovery:
  - Periodically back up HDFS to S3
  - Enable cross-region restore
- Data Archival:
  - Move aged Hive partitions from HDFS to S3
  - Retain HDFS for recent partitions
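For example, the ETL pipeline pattern above might look like the following in Spark. This is a minimal sketch: the bucket names, paths, and the event_id deduplication column are illustrative placeholders rather than part of any real deployment.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

val spark = SparkSession.builder().appName("hybrid-etl").getOrCreate()

// 1. Ingest raw data that has landed in S3
val raw = spark.read.json("s3a://landing-bucket/events/raw/")

// 2. Stage cleaned, intermediate results on HDFS, where I/O is fastest
val cleaned = raw.dropDuplicates("event_id").withColumn("load_date", current_date())
cleaned.write.mode("overwrite").parquet("hdfs:///staging/events_cleaned/")

// 3. Publish the final output back to S3 for downstream BI tools
spark.read.parquet("hdfs:///staging/events_cleaned/")
  .write.mode("overwrite")
  .parquet("s3a://analytics-bucket/events_curated/")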
Configuring HDFS + S3 Access in Hadoop/Spark
Use Hadoop’s S3A filesystem connector to access Amazon S3 directly.
Example configuration in core-site.xml:
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_KEY</value>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
Read/write from Spark:
// Read raw Parquet data directly from S3 through the S3A connector
val df = spark.read.parquet("s3a://bucket-name/landing/raw_data")

// Write the processed output to on-prem HDFS (Parquet, Spark's default format)
df.write.mode("overwrite").save("hdfs:///data/processed/")
Integrating Hive with S3
You can define Hive external tables on S3 paths:
CREATE EXTERNAL TABLE sales (
  id          STRING,
  amount      DOUBLE,
  `timestamp` STRING  -- backticked because TIMESTAMP is a reserved word in newer Hive versions
)
STORED AS PARQUET
LOCATION 's3a://bucket-name/hive/sales/';
This allows querying archived or cold data in S3 without moving it into HDFS.
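The same mechanism supports the archival use case described earlier: once a partition's files have been copied to S3 (for example with DistCp), the partition can simply be re-pointed at its new location. A sketch, assuming a hypothetical table sales_partitioned that is partitioned by dt:
-- Re-point an aged partition at S3; recent partitions keep their HDFS locations
ALTER TABLE sales_partitioned PARTITION (dt='2023-01-15')
SET LOCATION 's3a://bucket-name/hive/sales_partitioned/dt=2023-01-15/';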
Data Movement Between HDFS and S3
Use DistCp for large-scale, parallelized copy jobs:
Copy from HDFS to S3:
hadoop distcp hdfs:///data/logs s3a://backup-bucket/logs
Copy from S3 to HDFS:
hadoop distcp s3a://landing-bucket/events hdfs:///staging/events
Schedule with Airflow or Oozie for automated archival workflows.
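For recurring backups, DistCp can also run incrementally so that only new or changed files are transferred. A minimal sketch, assuming the bucket should mirror the source directory exactly:
# Incremental sync: copy only new/changed files and remove files deleted at the source
hadoop distcp -update -delete hdfs:///data/logs s3a://backup-bucket/logs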
Performance Considerations
- S3 has higher latency than HDFS — avoid using it for shuffle-intensive Spark jobs
- Optimize file formats with ORC or Parquet + Snappy compression
- Avoid writing many small files to S3; compact output before writing (see the sketch after this list)
- Tune Hadoop settings:
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>1000</value>
</property>
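To illustrate the small-files point above, one common compaction approach is to rewrite many small files into fewer, larger objects before they land in S3. A minimal Spark sketch, where the paths and the coalesce(32) target are placeholders you would size to your own data volume:
// Rewrite many small HDFS files as fewer, larger Parquet objects in S3.
// coalesce(32) is a placeholder -- aim for output files of roughly 128 MB to 1 GB.
val events = spark.read.parquet("hdfs:///staging/events_cleaned/")
events.coalesce(32)
  .write
  .mode("overwrite")
  .parquet("s3a://analytics-bucket/events_compacted/")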
Security and Governance
- Use IAM roles or instance profiles to manage S3 access securely
- Encrypt data at rest using S3 SSE or client-side tools (example configuration after this list)
- Use Apache Ranger or AWS Lake Formation for access policies
- Log S3 access with AWS CloudTrail and HDFS access with audit logs
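For the encryption-at-rest point, the S3A connector can request server-side encryption on every object it writes. A hedged example for SSE-S3 (AES256); note that recent Hadoop releases rename this setting to fs.s3a.encryption.algorithm:
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <!-- Ask S3 to encrypt objects at rest; use SSE-KMS here for KMS-managed keys -->
  <value>AES256</value>
</property>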
Best Practices
- Separate hot and cold paths in your data lake layout
- Use Hive external tables on S3 for archived data
- Automate lifecycle policies with DistCp, Airflow, or custom scripts
- Monitor storage costs and performance regularly
- Compress and partition data before storing in S3
Conclusion
A hybrid storage solution with HDFS and Amazon S3 offers the best of both worlds — high-performance, low-latency access to hot data, and elastic, low-cost storage for archival workloads.
By architecting with tiered storage, leveraging tools like Spark, Hive, and DistCp, and following best practices for integration and governance, you can build a scalable and efficient modern data platform that aligns with business growth and cloud strategy.