Building Hybrid Storage Solutions with HDFS and S3
Combine the strengths of HDFS and Amazon S3 to create scalable and cost-efficient data lake architectures
Modern data architectures increasingly span both on-premises and cloud environments. Enterprises want to retain the control and performance of on-premises HDFS while leveraging the scalability and cost benefits of Amazon S3. A hybrid storage solution that combines the two lets organizations build flexible, elastic, and efficient data lakes.
In this blog, we’ll explore how to architect and implement a hybrid HDFS-S3 storage system, examine integration techniques, and highlight best practices for achieving optimal performance and cost-effectiveness.
Why Combine HDFS and S3?
Each storage system has unique strengths:
Storage System | Benefits
---|---
HDFS | High throughput, low-latency I/O, data locality, deep Hadoop integration
Amazon S3 | Virtually unlimited storage, low cost, pay-per-use, high availability
A hybrid architecture offers:
- Separation of compute and storage
- Elastic scaling with S3 for cold or historical data
- High-performance processing with HDFS for hot data
- Cost optimization by tiering data appropriately
Common Hybrid Storage Architecture
         +-----------------------+
         |    Data Ingestion     |
         | (Kafka, NiFi, Flume)  |
         +----------+------------+
                    |
                    v
         +----------+------------+
         |  Hadoop YARN / Spark  |
         +----------+------------+
                    |
         +----------+-----------------+
         |                            |
   +-----v-----+              +-------v-------+
   |   HDFS    | <-- Hot Data |      S3       | <-- Cold / Archive Data
   | (on-prem) |              |    (cloud)    |
   +-----------+              +---------------+
Use Cases for Hybrid HDFS-S3 Storage
- Tiered Storage:
  - Hot, frequently accessed data in HDFS
  - Cold, rarely accessed or historical data in S3
- ETL Pipelines (see the Spark sketch after this list):
  - Ingest raw data into S3
  - Process it in Spark, using HDFS for intermediate storage
  - Output results to S3 for downstream BI tools
- Disaster Recovery:
  - Periodically back up HDFS to S3
  - Enable cross-region restore
- Data Archival:
  - Move aged Hive partitions from HDFS to S3
  - Retain HDFS for recent partitions
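For example, the ETL pipeline pattern above might look like the following in Spark. This is a minimal sketch: the bucket names, paths, and the event_id deduplication column are illustrative placeholders rather than part of any real deployment.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

val spark = SparkSession.builder().appName("hybrid-etl").getOrCreate()

// 1. Ingest raw data that has landed in S3
val raw = spark.read.json("s3a://landing-bucket/events/raw/")

// 2. Stage cleaned, intermediate results on HDFS, where I/O is fastest
val cleaned = raw.dropDuplicates("event_id").withColumn("load_date", current_date())
cleaned.write.mode("overwrite").parquet("hdfs:///staging/events_cleaned/")

// 3. Publish the final output back to S3 for downstream BI tools
spark.read.parquet("hdfs:///staging/events_cleaned/")
  .write.mode("overwrite")
  .parquet("s3a://analytics-bucket/events_curated/")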
Configuring HDFS + S3 Access in Hadoop/Spark
Use Hadoop’s S3A filesystem connector to access Amazon S3 directly.
Example configuration in core-site.xml:
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_KEY</value>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
Read/write from Spark:
// Read raw Parquet data directly from S3 through the S3A connector
val df = spark.read.parquet("s3a://bucket-name/landing/raw_data")

// Write the processed output to on-prem HDFS (Parquet, Spark's default format)
df.write.mode("overwrite").save("hdfs:///data/processed/")
Integrating Hive with S3
You can define Hive external tables on S3 paths:
CREATE EXTERNAL TABLE sales (
  id          STRING,
  amount      DOUBLE,
  `timestamp` STRING  -- backticked because TIMESTAMP is a reserved word in newer Hive versions
)
STORED AS PARQUET
LOCATION 's3a://bucket-name/hive/sales/';
This allows querying archived or cold data in S3 without moving it into HDFS.
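The same mechanism supports the archival use case described earlier: once a partition's files have been copied to S3 (for example with DistCp), the partition can simply be re-pointed at its new location. A sketch, assuming a hypothetical table sales_partitioned that is partitioned by dt:
-- Re-point an aged partition at S3; recent partitions keep their HDFS locations
ALTER TABLE sales_partitioned PARTITION (dt='2023-01-15')
SET LOCATION 's3a://bucket-name/hive/sales_partitioned/dt=2023-01-15/';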
Data Movement Between HDFS and S3
Use DistCp for large-scale, parallelized copy jobs:
Copy from HDFS to S3:
hadoop distcp hdfs:///data/logs s3a://backup-bucket/logs
Copy from S3 to HDFS:
hadoop distcp s3a://landing-bucket/events hdfs:///staging/events
Schedule with Airflow or Oozie for automated archival workflows.
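For recurring backups, DistCp can also run incrementally so that only new or changed files are transferred. A minimal sketch, assuming the bucket should mirror the source directory exactly:
# Incremental sync: copy only new/changed files and remove files deleted at the source
hadoop distcp -update -delete hdfs:///data/logs s3a://backup-bucket/logs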
Performance Considerations
- S3 has higher latency than HDFS — avoid using it for shuffle-intensive Spark jobs
- Optimize file formats with ORC or Parquet + Snappy compression
- Avoid writing many small files to S3; compact output before writing (see the sketch after this list)
- Tune Hadoop settings:
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>1000</value>
</property>
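To illustrate the small-files point above, one common compaction approach is to rewrite many small files into fewer, larger objects before they land in S3. A minimal Spark sketch, where the paths and the coalesce(32) target are placeholders you would size to your own data volume:
// Rewrite many small HDFS files as fewer, larger Parquet objects in S3.
// coalesce(32) is a placeholder -- aim for output files of roughly 128 MB to 1 GB.
val events = spark.read.parquet("hdfs:///staging/events_cleaned/")
events.coalesce(32)
  .write
  .mode("overwrite")
  .parquet("s3a://analytics-bucket/events_compacted/")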
Security and Governance
- Use IAM roles or instance profiles to manage S3 access securely
- Encrypt data at rest using S3 SSE or client-side tools (example configuration after this list)
- Use Apache Ranger or AWS Lake Formation for access policies
- Log S3 access with AWS CloudTrail and HDFS access with audit logs
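For the encryption-at-rest point, the S3A connector can request server-side encryption on every object it writes. A hedged example for SSE-S3 (AES256); note that recent Hadoop releases rename this setting to fs.s3a.encryption.algorithm:
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <!-- Ask S3 to encrypt objects at rest; use SSE-KMS here for KMS-managed keys -->
  <value>AES256</value>
</property>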
Best Practices
- Separate hot and cold paths in your data lake layout
- Use Hive external tables on S3 for archived data
- Automate lifecycle policies with DistCp, Airflow, or custom scripts
- Monitor storage costs and performance regularly
- Compress and partition data before storing in S3
Conclusion
A hybrid storage solution with HDFS and Amazon S3 offers the best of both worlds — high-performance, low-latency access to hot data, and elastic, low-cost storage for archival workloads.
By architecting with tiered storage, leveraging tools like Spark, Hive, and DistCp, and following best practices for integration and governance, you can build a scalable and efficient modern data platform that aligns with business growth and cloud strategy.