Using Hudi with AWS Glue for Cloud Based Data Lakes
Build scalable, versioned, and real-time cloud data lakes using Apache Hudi and AWS Glue
Modern data platforms demand scalable, real-time, and cost-effective solutions for ingesting and managing large datasets. Apache Hudi and AWS Glue form a powerful combination to build cloud-native data lakes that support incremental processing, time-travel, and low-latency queries on Amazon S3.
This blog explains how to use Hudi with AWS Glue, covering architecture, configuration, data ingestion pipelines, and integration with the AWS Glue Data Catalog to power cloud-based lakehouse solutions.
Why Use Hudi with AWS Glue?
Apache Hudi adds transactional capabilities to S3 by enabling:
- Upserts and deletes on immutable storage
- Data versioning and time-travel queries
- Efficient incremental data ingestion
- Built-in compaction and clustering
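For instance, Hudi's time-travel reads let you query a table as it existed at a past commit. Below is a minimal PySpark sketch; the S3 path and instant timestamp are illustrative placeholders, assuming a Hudi table like the one built later in this post already exists.
from pyspark.sql import SparkSession

# A SparkSession with the Hudi bundle on the classpath is assumed (see the Glue setup below).
spark = SparkSession.builder.getOrCreate()

# Read the table as of a past commit instant (placeholder timestamp, format yyyyMMddHHmmss).
snapshot_df = spark.read.format("hudi") \
    .option("as.of.instant", "20240101000000") \
    .load("s3://output-bucket/hudi/event_data")
snapshot_df.show()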
AWS Glue provides:
- Serverless Spark-based ETL jobs
- Native integration with the AWS Glue Data Catalog
- Orchestration using Workflows and Triggers
- Access to Amazon S3, Athena, and Redshift Spectrum
Together, they enable real-time, governed, and scalable data lake pipelines.
Architecture Overview
The Hudi-on-AWS-Glue architecture consists of:
- Amazon S3 as the data lake storage layer
- AWS Glue Jobs running Apache Spark with Hudi libraries
- Glue Data Catalog as the Hive Metastore
- Athena, EMR, or Redshift Spectrum for querying Hudi tables
[Data Ingestion] → [Glue ETL Job w/ Hudi] → [Hudi Table on S3]
                                                  ↘ [Glue Data Catalog]
                                                        ↘ [Athena / EMR / Redshift Spectrum]
Setting Up Hudi on AWS Glue
- Use Glue 3.0 or later, which supports Apache Spark 3.x and dynamic frame conversion.
- Add the Hudi dependencies to the Glue job. Under “Job Parameters” (Glue Studio or Console), set --datalake-formats to hudi (supported in Glue 3.0+), which loads the Hudi libraries that ship with Glue. Alternatively, to pin a specific Hudi version (e.g., hudi-spark3-bundle_2.12:0.14.0), attach the bundle JAR from an S3 path via --extra-jars (see the boto3 sketch after this list).
- Configure IAM roles to access:
  - S3 buckets
  - Glue Data Catalog
  - CloudWatch for logging
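As an illustration, the same configuration can be applied programmatically. The following boto3 sketch creates a Glue 4.0 job with the Hudi format enabled; the job name, role, script location, and buckets are hypothetical and should be replaced with values from your account.
import boto3

glue = boto3.client("glue")

# Hypothetical names: adjust the role, script location, and worker sizing for your account.
glue.create_job(
    Name="hudi-events-etl",
    Role="GlueHudiJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://scripts-bucket/hudi_glue_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--datalake-formats": "hudi",          # load the Hudi libraries that ship with Glue
        "--enable-glue-datacatalog": "true",   # use the Glue Data Catalog as the Hive metastore
    },
    NumberOfWorkers=5,
    WorkerType="G.1X",
)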
Writing Hudi Tables with AWS Glue
Example Glue PySpark job to write a Copy-on-Write Hudi table:
from pyspark.sql import SparkSession
# Kryo serialization is required by Hudi
spark = SparkSession.builder \
    .appName("hudi-glue-etl") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

# Read raw JSON events from the source bucket
input_df = spark.read.json("s3://input-bucket/events/")
hudi_options = {
    'hoodie.table.name': 'event_data',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.partitionpath.field': 'event_type',
    'hoodie.datasource.write.table.name': 'event_data',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'event_ts',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    # Sync the table to the Glue Data Catalog (acting as the Hive metastore)
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'event_data',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    # 'hms' sync mode works when the job uses the Glue Data Catalog as its
    # Hive metastore (the --enable-glue-datacatalog job parameter)
    'hoodie.datasource.hive_sync.mode': 'hms'
}
input_df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://output-bucket/hudi/event_data")
Querying Hudi Tables with Athena
After a successful Hive sync to the Glue Data Catalog, your Hudi tables are queryable via Amazon Athena:
SELECT * FROM default.event_data
WHERE event_type = 'login'
ORDER BY event_ts DESC
LIMIT 10;
Athena natively supports querying Copy-on-Write (COW) tables and the read-optimized view of Merge-on-Read (MOR) Hudi tables.
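For MOR tables, Hudi's Hive sync typically registers two tables per dataset, a read-optimized view and a real-time view, commonly suffixed _ro and _rt. Assuming a MOR version of the table above, a read-optimized query would look like:
SELECT event_id, event_type, event_ts
FROM default.event_data_ro
WHERE event_type = 'login'
LIMIT 10;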
Best Practices for Hudi + AWS Glue
- Choose Table Type Wisely
  - Use COW for fast queries and frequent reads
  - Use MOR for heavy write workloads with compaction
- Partition Effectively
  - Use low-cardinality columns (e.g., date, region)
  - Avoid over-partitioning (e.g., user_id)
- Compaction Strategy
  - Use inline compaction for Glue jobs with hoodie.compact.inline=true (see the sketch after this list)
  - Use asynchronous compaction via AWS Glue Workflows
- Monitor Metadata Table
  - Enable the Hudi metadata table (hoodie.metadata.enable=true)
  - Tune compaction intervals to avoid metadata bloat
- Use AWS Glue Version 3.0+
  - Improves compatibility with Spark 3 and Hudi >= 0.10.0
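As a rough sketch, the compaction-related settings for a MOR table could be merged into the hudi_options dict from the write example above; the values shown are illustrative, not tuned recommendations.
# Illustrative MOR + compaction settings, merged into the write options above.
mor_options = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.compact.inline': 'true',                  # compact synchronously inside the Glue job
    'hoodie.compact.inline.max.delta.commits': '5',   # trigger compaction every 5 delta commits
    'hoodie.metadata.enable': 'true',                 # keep the metadata table on for faster file listing
}
hudi_options.update(mor_options)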
Cost Optimization Tips
- Use bucketed partitioning to reduce S3 API costs during query planning
- Compact small files to reduce storage costs
- Use Athena query result caching
- Set S3 lifecycle rules for archived Hudi versions
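On the Hudi side, how many old file versions accumulate on S3 is governed by the cleaner. A hedged example of cleaner options that bound retained history (values are illustrative and should match your time-travel and incremental-read needs):
# Illustrative cleaner settings to bound how much table history is retained on S3.
cleaner_options = {
    'hoodie.clean.automatic': 'true',           # run cleaning as part of writes
    'hoodie.cleaner.commits.retained': '10',    # keep file versions needed for the last 10 commits
}
hudi_options.update(cleaner_options)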
Conclusion
Using Hudi with AWS Glue empowers organizations to build real-time, cloud-native data lakes that are both scalable and cost-effective. With native Glue Catalog support, S3 integration, and upsert capabilities, Hudi unlocks powerful lakehouse features without sacrificing flexibility or performance.
Adopting best practices for partitioning, compaction, and job configuration ensures your pipelines remain fast, efficient, and production-ready in a serverless AWS environment.