Modern data platforms demand scalable, real-time, and cost-effective solutions for ingesting and managing large datasets. Apache Hudi and AWS Glue form a powerful combination to build cloud-native data lakes that support incremental processing, time-travel, and low-latency queries on Amazon S3.

This blog explains how to use Hudi with AWS Glue, covering architecture, configuration, data ingestion pipelines, and integration with the AWS Glue Data Catalog to power cloud-based lakehouse solutions.


Why Use Hudi with AWS Glue?

Apache Hudi adds transactional capabilities to S3 by enabling:

  • Upserts and deletes on immutable storage
  • Data versioning and time-travel queries
  • Efficient incremental data ingestion
  • Built-in compaction and clustering

AWS Glue provides:

  • Serverless Spark-based ETL jobs
  • Native integration with the AWS Glue Data Catalog
  • Orchestration using Workflows and Triggers
  • Access to Amazon S3, Athena, and Redshift Spectrum

Together, they enable real-time, governed, and scalable data lake pipelines.
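
To make the upsert, incremental, and time-travel capabilities listed above concrete, the short PySpark sketch below reads a Hudi table on S3 in two ways. The table path and commit timestamp are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hudi-read-examples") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

table_path = "s3://output-bucket/hudi/event_data"  # hypothetical Hudi table path

# Incremental query: only records committed after the given instant
incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
    .load(table_path)

# Time-travel query: the table as it looked at a past commit instant
as_of_df = spark.read.format("hudi") \
    .option("as.of.instant", "20240101000000") \
    .load(table_path)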


Architecture Overview

The Hudi-on-AWS-Glue architecture consists of:

  • Amazon S3 as the data lake storage layer
  • AWS Glue Jobs running Apache Spark with Hudi libraries
  • Glue Data Catalog as the Hive Metastore
  • Athena, EMR, or Redshift Spectrum for querying Hudi tables

[Data Ingestion] → [Glue ETL Job w/ Hudi] → [Hudi Table on S3]
                                          ↘ [Glue Data Catalog]
                                                ↘ [Athena / EMR / Redshift]

Setting Up Hudi on AWS Glue

  1. Use AWS Glue 3.0 or later, which runs Apache Spark 3.x and ships native Hudi support via the --datalake-formats job parameter.

  2. Add Hudi dependencies to the Glue job:

Under “Job Parameters” (Glue Studio or Console), set:

Key: --datalake-formats    Value: hudi
Key: --conf                Value: spark.serializer=org.apache.spark.serializer.KryoSerializer

Alternatively, pin a specific Hudi release (for example, hudi-spark3-bundle_2.12:0.14.0) by attaching the bundle JAR from an S3 path with the --extra-jars parameter; a boto3 sketch of these job parameters follows this list.

  3. Configure IAM roles to access:
    • S3 buckets
    • Glue Catalog
    • CloudWatch for logging
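
These job parameters can also be set programmatically when the job is created. The boto3 sketch below illustrates the same setup; the role ARN, script location, and bucket names are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

# Hypothetical role, script location, and JAR path
glue.create_job(
    Name="hudi-event-ingest",
    Role="arn:aws:iam::123456789012:role/GlueHudiJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://scripts-bucket/hudi_glue_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--datalake-formats": "hudi",  # enable the Hudi libraries bundled with Glue
        "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        # Or pin a specific Hudi release instead of the bundled one:
        # "--extra-jars": "s3://jars-bucket/hudi-spark3-bundle_2.12-0.14.0.jar",
    },
)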

Writing Hudi Tables with AWS Glue

Example Glue PySpark job to write a Copy-on-Write Hudi table:

from pyspark.sql import SparkSession

# Hudi requires Kryo serialization
spark = SparkSession.builder \
    .appName("hudi-glue-etl") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

input_df = spark.read.json("s3://input-bucket/events/")

hudi_options = {
    'hoodie.table.name': 'event_data',
    'hoodie.datasource.write.table.name': 'event_data',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.partitionpath.field': 'event_type',
    'hoodie.datasource.write.precombine.field': 'event_ts',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    # Sync table metadata to the AWS Glue Data Catalog (Hive metastore sync)
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'event_data',
    'hoodie.datasource.hive_sync.partition_fields': 'event_type',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'
}

input_df.write \
    .format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://output-bucket/hudi/event_data")

Querying Hudi Tables with Athena

After a successful Glue sync, your Hudi tables are queryable via Amazon Athena:

SELECT * FROM default.event_data  
WHERE event_type = 'login'  
ORDER BY event_ts DESC  
LIMIT 10;

Athena has built-in support for Hudi: Copy-on-Write (COW) tables are read as snapshots, and Merge-on-Read (MOR) tables are read through their read-optimized view.
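
If you need to run such queries from a pipeline rather than the console, the Athena API can be called directly. A minimal boto3 sketch, assuming a hypothetical S3 bucket for query results:

import boto3

athena = boto3.client("athena")

# Submit the query; results land in the (hypothetical) output location
response = athena.start_query_execution(
    QueryString=(
        "SELECT * FROM default.event_data "
        "WHERE event_type = 'login' "
        "ORDER BY event_ts DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/hudi/"},
)
print(response["QueryExecutionId"])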


Best Practices for Hudi + AWS Glue

  1. Choose Table Type Wisely
    • Use COW for fast queries and frequent reads
    • Use MOR for heavy write workloads with compaction
  2. Partition Effectively
    • Use low-cardinality columns (e.g., date, region)
    • Avoid over-partitioning (e.g., user_id)
  3. Compaction Strategy
    • Use inline compaction for Glue jobs with hoodie.compact.inline=true (see the sketch after this list)
    • Use asynchronous compaction via AWS Glue Workflows
  4. Monitor Metadata Table
    • Enable Hudi metadata table (hoodie.metadata.enable=true)
    • Tune compaction intervals to avoid metadata bloat
  5. Use AWS Glue Version 3.0+
    • Glue 3.0 and later run Spark 3 and bundle native Hudi support (Hudi >= 0.10.0)
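
As an illustration of practices 1, 3, and 4, the option set below sketches a Merge-on-Read table with inline compaction and the metadata table enabled; the values are examples to tune per workload, not recommendations.

# Illustrative writer options for a Merge-on-Read table (values are examples)
mor_options = {
    'hoodie.table.name': 'event_data_mor',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.partitionpath.field': 'event_type',
    'hoodie.datasource.write.precombine.field': 'event_ts',
    'hoodie.datasource.write.operation': 'upsert',
    # Practice 3: compact delta logs inline every few commits
    'hoodie.compact.inline': 'true',
    'hoodie.compact.inline.max.delta.commits': '5',
    # Practice 4: keep the metadata table on to speed up file listings
    'hoodie.metadata.enable': 'true'
}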

Cost Optimization Tips

  • Use coarse-grained partitioning and the Hudi metadata table to reduce S3 API calls during query planning
  • Compact small files to cut per-object request overhead and speed up scans (see the sketch below)
  • Use Athena query result caching
  • Set S3 lifecycle rules for archived Hudi versions
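
As a starting point for the file-sizing and version-retention tips above, the options below are illustrative values to tune against your own workload:

# Illustrative file sizing and cleaner settings (tune per workload)
cost_tuning_options = {
    # Fold small files into larger base files on write
    'hoodie.parquet.small.file.limit': '104857600',   # ~100 MB
    'hoodie.parquet.max.file.size': '125829120',      # ~120 MB
    # Automatically clean up old file versions after a number of commits
    'hoodie.clean.automatic': 'true',
    'hoodie.cleaner.commits.retained': '10'
}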

Conclusion

Using Hudi with AWS Glue empowers organizations to build real-time, cloud-native data lakes that are both scalable and cost-effective. With native Glue Catalog support, S3 integration, and upsert capabilities, Hudi unlocks powerful lakehouse features without sacrificing flexibility or performance.

Adopting best practices for partitioning, compaction, and job configuration ensures your pipelines remain fast, efficient, and production-ready in a serverless AWS environment.