Using Hudi with AWS Glue for Cloud Based Data Lakes
Build scalable, versioned, and real-time cloud data lakes using Apache Hudi and AWS Glue
Modern data platforms demand scalable, real-time, and cost-effective solutions for ingesting and managing large datasets. Apache Hudi and AWS Glue form a powerful combination to build cloud-native data lakes that support incremental processing, time-travel, and low-latency queries on Amazon S3.
This blog explains how to use Hudi with AWS Glue, covering architecture, configuration, data ingestion pipelines, and integration with the AWS Glue Data Catalog to power cloud-based lakehouse solutions.
Why Use Hudi with AWS Glue?
Apache Hudi adds transactional capabilities to S3 by enabling:
- Upserts and deletes on immutable storage
- Data versioning and time-travel queries
- Efficient incremental data ingestion
- Built-in compaction and clustering
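For instance, Hudi's time-travel reads let you query a table as it existed at a past commit. Below is a minimal PySpark sketch; the S3 path and instant timestamp are illustrative placeholders, assuming a Hudi table like the one built later in this post already exists.
from pyspark.sql import SparkSession

# A SparkSession with the Hudi bundle on the classpath is assumed (see the Glue setup below).
spark = SparkSession.builder.getOrCreate()

# Read the table as of a past commit instant (placeholder timestamp, format yyyyMMddHHmmss).
snapshot_df = spark.read.format("hudi") \
    .option("as.of.instant", "20240101000000") \
    .load("s3://output-bucket/hudi/event_data")
snapshot_df.show()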
AWS Glue provides:
- Serverless Spark-based ETL jobs
- Native integration with the AWS Glue Data Catalog
- Orchestration using Workflows and Triggers
- Access to Amazon S3, Athena, and Redshift Spectrum
Together, they enable real-time, governed, and scalable data lake pipelines.
Architecture Overview
The Hudi-on-AWS-Glue architecture consists of:
- Amazon S3 as the data lake storage layer
- AWS Glue Jobs running Apache Spark with Hudi libraries
- Glue Data Catalog as the Hive Metastore
- Athena, EMR, or Redshift Spectrum for querying Hudi tables
[Data Ingestion] → [Glue ETL Job w/ Hudi] → [Hudi Table on S3]
                                                  ↘ [Glue Data Catalog]
                                                        ↘ [Athena / EMR / Redshift Spectrum]
Setting Up Hudi on AWS Glue
- Use Glue 3.0 or later, which supports Apache Spark 3.x and dynamic frame conversion.
- Add the Hudi dependencies to the Glue job. Under “Job Parameters” (Glue Studio or Console), set --datalake-formats to hudi (supported in Glue 3.0+), which loads the Hudi libraries that ship with Glue. Alternatively, to pin a specific Hudi version (e.g., hudi-spark3-bundle_2.12:0.14.0), attach the bundle JAR from an S3 path via --extra-jars (see the boto3 sketch after this list).
- Configure IAM roles to access:
  - S3 buckets
  - Glue Data Catalog
  - CloudWatch for logging
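As an illustration, the same configuration can be applied programmatically. The following boto3 sketch creates a Glue 4.0 job with the Hudi format enabled; the job name, role, script location, and buckets are hypothetical and should be replaced with values from your account.
import boto3

glue = boto3.client("glue")

# Hypothetical names: adjust the role, script location, and worker sizing for your account.
glue.create_job(
    Name="hudi-events-etl",
    Role="GlueHudiJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://scripts-bucket/hudi_glue_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--datalake-formats": "hudi",          # load the Hudi libraries that ship with Glue
        "--enable-glue-datacatalog": "true",   # use the Glue Data Catalog as the Hive metastore
    },
    NumberOfWorkers=5,
    WorkerType="G.1X",
)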
Writing Hudi Tables with AWS Glue
Example Glue PySpark job to write a Copy-on-Write Hudi table:
from pyspark.sql import SparkSession
# Kryo serialization is required by Hudi
spark = SparkSession.builder \
    .appName("hudi-glue-etl") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

# Read raw JSON events from the source bucket
input_df = spark.read.json("s3://input-bucket/events/")
hudi_options = {
    'hoodie.table.name': 'event_data',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.partitionpath.field': 'event_type',
    'hoodie.datasource.write.table.name': 'event_data',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'event_ts',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    # Sync the table to the Glue Data Catalog (acting as the Hive metastore)
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'event_data',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    # 'hms' sync mode works when the job uses the Glue Data Catalog as its
    # Hive metastore (the --enable-glue-datacatalog job parameter)
    'hoodie.datasource.hive_sync.mode': 'hms'
}
input_df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://output-bucket/hudi/event_data")
Querying Hudi Tables with Athena
After a successful Hive sync to the Glue Data Catalog, your Hudi tables are queryable via Amazon Athena:
SELECT * FROM default.event_data
WHERE event_type = 'login'
ORDER BY event_ts DESC
LIMIT 10;
Athena natively supports querying Copy-on-Write (COW) tables and the read-optimized view of Merge-on-Read (MOR) Hudi tables.
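For MOR tables, Hudi's Hive sync typically registers two tables per dataset, a read-optimized view and a real-time view, commonly suffixed _ro and _rt. Assuming a MOR version of the table above, a read-optimized query would look like:
SELECT event_id, event_type, event_ts
FROM default.event_data_ro
WHERE event_type = 'login'
LIMIT 10;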
Best Practices for Hudi + AWS Glue
- Choose Table Type Wisely
  - Use COW for fast queries and frequent reads
  - Use MOR for heavy write workloads with compaction
- Partition Effectively
  - Use low-cardinality columns (e.g., date, region)
  - Avoid over-partitioning (e.g., user_id)
- Compaction Strategy
  - Use inline compaction for Glue jobs with hoodie.compact.inline=true (see the sketch after this list)
  - Use asynchronous compaction via AWS Glue Workflows
- Monitor Metadata Table
  - Enable the Hudi metadata table (hoodie.metadata.enable=true)
  - Tune compaction intervals to avoid metadata bloat
- Use AWS Glue Version 3.0+
  - Improves compatibility with Spark 3 and Hudi >= 0.10.0
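As a rough sketch, the compaction-related settings for a MOR table could be merged into the hudi_options dict from the write example above; the values shown are illustrative, not tuned recommendations.
# Illustrative MOR + compaction settings, merged into the write options above.
mor_options = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.compact.inline': 'true',                  # compact synchronously inside the Glue job
    'hoodie.compact.inline.max.delta.commits': '5',   # trigger compaction every 5 delta commits
    'hoodie.metadata.enable': 'true',                 # keep the metadata table on for faster file listing
}
hudi_options.update(mor_options)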
Cost Optimization Tips
- Use bucketed partitioning to reduce S3 API costs during query planning
- Compact small files to reduce storage costs
- Use Athena query result caching
- Set S3 lifecycle rules for archived Hudi versions
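On the Hudi side, how many old file versions accumulate on S3 is governed by the cleaner. A hedged example of cleaner options that bound retained history (values are illustrative and should match your time-travel and incremental-read needs):
# Illustrative cleaner settings to bound how much table history is retained on S3.
cleaner_options = {
    'hoodie.clean.automatic': 'true',           # run cleaning as part of writes
    'hoodie.cleaner.commits.retained': '10',    # keep file versions needed for the last 10 commits
}
hudi_options.update(cleaner_options)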
Conclusion
Using Hudi with AWS Glue empowers organizations to build real-time, cloud-native data lakes that are both scalable and cost-effective. With native Glue Catalog support, S3 integration, and upsert capabilities, Hudi unlocks powerful lakehouse features without sacrificing flexibility or performance.
Adopting best practices for partitioning, compaction, and job configuration ensures your pipelines remain fast, efficient, and production-ready in a serverless AWS environment.