Debugging Common Issues in Hudi Workflows
Identify and resolve the most frequent problems encountered in Apache Hudi pipelines
Apache Hudi is a powerful lakehouse framework that supports upserts, incremental processing, and time-travel analytics. But as with any complex system, users can encounter a variety of issues during ingestion, compaction, or querying—especially when dealing with large-scale datasets and streaming pipelines.
This post outlines common issues in Hudi workflows, provides practical debugging tips, and offers guidance on how to monitor and stabilize your data pipelines in production.
1. Compaction Failures in Merge-On-Read Tables
Problem: Compaction fails or times out.
Symptoms:
- Missing base files
- Hudi timeline stuck in inflight state
- Queries return incomplete data
Solutions:
- Check compaction status:
hudi-cli > compactions show all
- Schedule and run compaction manually:
hudi-cli > connect --path s3://my-table
hudi-cli > compaction schedule
hudi-cli > compaction run
- Tune compaction settings:
hoodie.compact.inline = false
hoodie.compact.inline.max.delta.commits = 5
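If you drive Hudi writes from Spark, the same settings can be passed as datasource write options. The following is a minimal PySpark sketch, assuming the Hudi Spark bundle is on the classpath; the table name, source path, and S3 location are placeholders.

# Minimal sketch: a Merge-On-Read write that defers compaction so it can be
# scheduled and run out of band (paths and names are placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-mor-write")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.json("s3://incoming/events/")  # placeholder source

hudi_options = {
    "hoodie.table.name": "events_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Turn off inline compaction and schedule one every 5 delta commits
    "hoodie.compact.inline": "false",
    "hoodie.compact.inline.max.delta.commits": "5",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-table"))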
2. Write Operation Errors
Problem: Ingestion job fails with writer exceptions.
Symptoms:
- NullPointerException
- Tasks killed due to memory pressure
- HoodieWriteClient class not found
Root Causes:
- Mismatch between Hudi version and Spark/Flink version
- Insufficient memory for write buffer
- Incorrect record key or precombine field
Solutions:
- Validate fields in write config:
hoodie.datasource.write.recordkey.field = id
hoodie.datasource.write.precombine.field = ts
- Upgrade to compatible Hudi + engine versions
- Increase Spark executor memory:
--executor-memory 4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
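A small pre-write check can also surface a bad record key or precombine configuration before Hudi throws a writer exception. This is an illustrative helper, not a Hudi API; the function name and the column names id and ts are placeholders matching the config above.

# Illustrative helper (hypothetical name): fail fast if the configured
# record key or precombine column is missing or the key contains nulls.
from pyspark.sql import functions as F

def validate_write_fields(df, record_key="id", precombine="ts"):
    missing = [c for c in (record_key, precombine) if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required Hudi write fields: {missing}")
    if df.filter(F.col(record_key).isNull()).limit(1).count() > 0:
        raise ValueError(f"Null values in record key column '{record_key}'")
    return df

# Usage: validated = validate_write_fields(incoming_df)
#        where incoming_df is the batch you are about to write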
3. Metadata Table Issues
Problem: Table operations slow or hang due to corrupted metadata.
Symptoms:
- Timeouts during file listing
- Metadata compaction backlog
Solutions:
- Enable metadata cleaning:
hoodie.metadata.clean.automatic = true
hoodie.metadata.compact.async = true
- Validate metadata health:
hudi-cli > metadata stats
- In extreme cases, disable metadata temporarily:
hoodie.metadata.enable = false
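As a stop-gap while you investigate, the metadata table can be bypassed on both the write and the read path. A minimal sketch, reusing the spark session, df, and hudi_options from the first example; re-enable the metadata table once it is healthy, since direct file listing is slower on object stores.

# Temporarily bypass the metadata table (falls back to direct file listing).
(df.write.format("hudi")
   .options(**hudi_options)                    # base write options from section 1
   .option("hoodie.metadata.enable", "false")  # skip metadata on the write path
   .mode("append")
   .save("s3://my-table"))

# The same flag works as a read option:
snapshot = (spark.read.format("hudi")
    .option("hoodie.metadata.enable", "false")
    .load("s3://my-table"))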
4. Schema Mismatch and Evolution Failures
Problem: Job fails due to schema incompatibility.
Symptoms:
- Column not found
- Spark AnalysisException or Avro schema mismatch errors
Root Causes:
- Record schema doesn’t match table schema
- Missing nullable fields in incoming data
Solutions:
- Enable schema validation and auto-evolution:
hoodie.avro.schema.validate = true
hoodie.avro.schema.allow.auto.evolution = true
- Maintain a versioned schema registry using tools like AWS Glue, Confluent Schema Registry, or Hive Metastore
- Compare Avro schemas using hudi-cli or diff tools
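When the failure comes from nullable columns missing in the incoming batch, aligning the batch with the current table schema before writing avoids the mismatch. A minimal sketch, again reusing the spark session from the first example; incoming_df stands for the batch you are about to write, and it assumes the missing columns are nullable in the table schema.

# Add table columns missing from the incoming batch as typed nulls,
# skipping Hudi's internal _hoodie_* metadata columns.
from pyspark.sql import functions as F

table_schema = spark.read.format("hudi").load("s3://my-table").schema

aligned_df = incoming_df
for field in table_schema:
    if field.name.startswith("_hoodie"):
        continue
    if field.name not in incoming_df.columns:
        aligned_df = aligned_df.withColumn(field.name, F.lit(None).cast(field.dataType))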
5. Querying Issues with Athena or Presto
Problem: Athena returns empty results or errors.
Symptoms:
- Table appears empty
- Missing partitions
- Incorrect file formats
Solutions:
- Sync the Hudi table to the Glue Data Catalog or Hive Metastore (valid sync modes are hms, jdbc, and hiveql; on EMR, hms can target Glue):
hoodie.datasource.hive_sync.enable = true
hoodie.datasource.hive_sync.mode = hms
- For Merge-On-Read tables, make sure compaction is keeping base files current; engines that read only the read-optimized view will miss data sitting in uncompacted log (delta) files
- Use Copy-on-Write for Athena compatibility, or convert table type
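A minimal sketch of the sync options on the Spark write path, reusing df and hudi_options from the first example; the database, table, and partition field names are placeholders, and on EMR the hms mode reaches the Glue Data Catalog when the cluster uses Glue as its Hive metastore.

# Catalog sync options so Athena/Presto can discover the table and partitions.
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "analytics",   # placeholder
    "hoodie.datasource.hive_sync.table": "events",         # placeholder
    "hoodie.datasource.hive_sync.partition_fields": "dt",  # placeholder
}

(df.write.format("hudi")
   .options(**hudi_options)
   .options(**hive_sync_options)
   .mode("append")
   .save("s3://my-table"))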
6. Timeline Corruption or Stuck States
Problem: Ingestion jobs fail due to a corrupted .hoodie timeline.
Symptoms:
- .inflight commits remain forever
- Rollbacks don’t complete
Solutions:
- Manually roll back stuck commits:
hudi-cli > commit rollback --commit 20241116120010
- Clean up partial files and rerun jobs
- Enable automatic cleaning:
hoodie.clean.automatic = true
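Before rolling anything back, it helps to see exactly which instants are stuck. A minimal sketch that inspects the timeline directory directly, shown for a locally mounted path; for S3, list the prefix with your object-store client instead. Depending on the Hudi version, timeline files live directly under .hoodie or under .hoodie/timeline.

# List timeline instants stuck in requested or inflight state; completed
# instants carry no such suffix.
import os

def find_stuck_instants(base_path):
    timeline_dir = os.path.join(base_path, ".hoodie")
    return sorted(
        f for f in os.listdir(timeline_dir)
        if f.endswith(".requested") or f.endswith(".inflight")
    )

print(find_stuck_instants("/data/my-table"))  # placeholder path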
7. Small Files Explosion
Problem: High number of small files degrades performance and increases S3 costs.
Solutions:
- Increase small file size limit:
hoodie.parquet.small.file.limit = 134217728 # 128MB
- Enable clustering or compaction:
hoodie.clustering.inline = true
hoodie.clustering.plan.strategy.class = org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
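These settings ride along with the normal write options, as sketched below; the sizes and commit counts are illustrative and should be tuned to your own file-size targets.

# Raise the small-file threshold and enable inline clustering so small files
# are periodically rewritten into larger ones.
small_file_options = {
    "hoodie.parquet.small.file.limit": "134217728",   # 128 MB
    "hoodie.parquet.max.file.size": "268435456",      # 256 MB target file size
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",      # cluster every 4 commits
    "hoodie.clustering.plan.strategy.class":
        "org.apache.hudi.client.clustering.plan.strategy."
        "SparkSizeBasedClusteringPlanStrategy",
}

# Merge into the write from section 1:
# df.write.format("hudi").options(**hudi_options).options(**small_file_options) \
#   .mode("append").save("s3://my-table")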
8. Memory Issues in Large Ingest Jobs
Problem: Jobs get killed or OOM errors during write phase.
Tips:
- Use Kryo serializer:
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
- Increase executor memory and cores:
--executor-memory 8g --executor-cores 4
- Tune bulk-insert write parallelism:
hoodie.bulkinsert.shuffle.parallelism = 200
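The same knobs can be set when building the Spark session instead of on spark-submit; depending on deploy mode, executor settings may still need to be supplied at submit time. A minimal sketch with illustrative values, not universal recommendations.

# Memory and serialization settings at session build time, plus Hudi
# bulk-insert parallelism passed as write options.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-bulk-ingest")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

bulk_options = {
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.bulkinsert.shuffle.parallelism": "200",
}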
Conclusion
Debugging Hudi workflows requires a solid understanding of its internal components—metadata table, write operations, compaction, and table timeline. Whether you’re running Spark or Flink, these common errors can be quickly identified and resolved with the right configuration, tooling, and monitoring practices.
Following these best practices helps you stabilize production pipelines, reduce ingestion failures, and deliver reliable lakehouse data at scale.