Apache Hudi is a powerful lakehouse framework that supports upserts, incremental processing, and time-travel analytics. But as with any complex system, users can encounter a variety of issues during ingestion, compaction, or querying—especially when dealing with large-scale datasets and streaming pipelines.

This post outlines common issues in Hudi workflows, provides practical debugging tips, and offers guidance on how to monitor and stabilize your data pipelines in production.


1. Compaction Failures in Merge-On-Read Tables

Problem: Compaction fails or times out.

Symptoms:

  • Missing base files
  • Hudi timeline stuck in inflight state
  • Queries return incomplete data

Solutions:

  • Check compaction status:
    hudi-cli
    > connect --path s3://my-table
    > compactions show all
    
  • Schedule and run compaction manually:
    hudi-cli
    > connect --path s3://my-table
    > compaction schedule
    > compaction run
    
  • Tune compaction settings (a PySpark sketch of these options follows):
    hoodie.compact.inline = false
    hoodie.compact.inline.max.delta.commits = 5
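
To make the settings above concrete, here is a minimal PySpark sketch of a Merge-On-Read write that carries the compaction options. It assumes the Hudi Spark bundle is on the classpath; the table path, table name, source path, and key columns (id, ts) are placeholders.

    # Minimal PySpark sketch of a Merge-On-Read write with the compaction knobs above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()          # assumes the Hudi Spark bundle is available
    df = spark.read.json("s3://my-bucket/incoming/")    # placeholder incoming batch

    mor_options = {
        "hoodie.table.name": "my_table",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.compact.inline": "false",                  # compact asynchronously instead of inline
        "hoodie.compact.inline.max.delta.commits": "5",    # schedule compaction every 5 delta commits
    }

    (df.write.format("hudi")
       .options(**mor_options)
       .mode("append")
       .save("s3://my-table"))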
    

2. Write Operation Errors

Problem: Ingestion job fails with writer exceptions.

Symptoms:

  • NullPointerException in the write path
  • Tasks killed due to memory pressure (OOM)
  • ClassNotFoundException for Hudi classes (e.g., HoodieWriteClient not found)

Root Causes:

  • Mismatch between Hudi version and Spark/Flink version
  • Insufficient memory for write buffer
  • Incorrect record key or precombine field

Solutions:

  • Validate fields in write config (a quick sanity-check sketch follows this list):
    hoodie.datasource.write.recordkey.field = id
    hoodie.datasource.write.precombine.field = ts
    
  • Upgrade to compatible Hudi + engine versions
  • Increase Spark executor memory:
    --executor-memory 4g
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
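
Before tuning memory, it is worth confirming that the configured record key and precombine fields actually exist and are populated in the incoming data. A small PySpark sanity check, using the column names from the config above and a placeholder source path:

    # Pre-write sanity check for the fields referenced in the write config.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("s3://my-bucket/incoming/")   # placeholder source

    for field in ("id", "ts"):
        assert field in df.columns, f"write-config field missing from input: {field}"

    null_keys = df.filter(F.col("id").isNull()).count()
    print(f"records with a null record key: {null_keys}")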
    

3. Metadata Table Issues

Problem: Table operations are slow or hang due to corrupted or backlogged metadata.

Symptoms:

  • Timeouts during file listing
  • Metadata compaction backlog

Solutions:

  • Enable metadata cleaning:
    hoodie.metadata.clean.automatic = true
    hoodie.metadata.compact.async = true
    
  • Validate metadata health:
    hudi-cli
    > connect --path s3://my-table
    > metadata stats
    
  • In extreme cases, disable metadata temporarily to isolate the problem (see the read sketch after this list):
    hoodie.metadata.enable = false
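
One low-risk way to confirm that the metadata table is the bottleneck is to read the same table with metadata disabled and compare listing times. A minimal sketch, assuming a SparkSession with the Hudi bundle available; the table path is a placeholder.

    # Isolation test: read with the metadata table disabled to see whether
    # slow file listings are metadata-related.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.format("hudi")
          .option("hoodie.metadata.enable", "false")   # fall back to direct file listing
          .load("s3://my-table"))
    print(df.count())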
    

4. Schema Mismatch and Evolution Failures

Problem: Job fails due to schema incompatibility.

Symptoms:

  • Column not found
  • SparkAnalysisException or Avro schema mismatch

Root Causes:

  • Record schema doesn’t match table schema
  • Missing nullable fields in incoming data

Solutions:

  • Enable schema validation and auto-evolution:
    hoodie.avro.schema.validate = true
    hoodie.avro.schema.allow.auto.evolution = true
    
  • Maintain a versioned schema registry using tools like AWS Glue, Confluent Schema Registry, or the Hive Metastore

  • Compare Avro schemas using hudi-cli or diff tools; a PySpark schema comparison is sketched below
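
A lightweight way to spot mismatches before they fail a job is to diff the incoming batch schema against the current table schema in PySpark. A sketch, with placeholder paths:

    # Compare the incoming batch schema with the table schema to spot
    # missing or re-typed columns before writing.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    incoming = spark.read.json("s3://my-bucket/incoming/")
    table = spark.read.format("hudi").load("s3://my-table")

    incoming_fields = {f.name: f.dataType for f in incoming.schema.fields}
    table_fields = {f.name: f.dataType for f in table.schema.fields
                    if not f.name.startswith("_hoodie_")}   # skip Hudi meta columns

    print("missing in incoming:", sorted(set(table_fields) - set(incoming_fields)))
    print("new in incoming:", sorted(set(incoming_fields) - set(table_fields)))
    print("type changes:", {name: (table_fields[name], incoming_fields[name])
                            for name in table_fields.keys() & incoming_fields.keys()
                            if table_fields[name] != incoming_fields[name]})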

5. Querying Issues with Athena or Presto

Problem: Athena returns empty results or errors.

Symptoms:

  • Table appears empty
  • Missing partitions
  • Incorrect file formats

Solutions:

  • Sync the Hudi table to AWS Glue or the Hive Metastore (a fuller option set is sketched after this list):
    hoodie.datasource.hive_sync.enable = true
    hoodie.datasource.hive_sync.mode = hms  # valid modes are hms, jdbc, hiveql; a Glue-backed metastore uses hms
    
  • For Merge-On-Read tables, make sure compaction has produced up-to-date base files; read-optimized queries only read base files
  • Prefer Copy-on-Write tables for the simplest Athena compatibility, or convert the table type
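
For reference, here is a sketch of the sync-related writer options, assuming an HMS-compatible endpoint is reachable (for example a Glue-backed metastore on EMR); the database, table, and partition field names are placeholders.

    # Writer options that also sync the table to the catalog so Athena/Presto
    # can see new partitions.
    hive_sync_options = {
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": "analytics",      # placeholder database
        "hoodie.datasource.hive_sync.table": "my_table",           # placeholder table
        "hoodie.datasource.hive_sync.partition_fields": "dt",      # placeholder partition column
    }
    # Merge these into the regular write options, e.g.:
    # df.write.format("hudi").options(**hive_sync_options).mode("append").save("s3://my-table")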

6. Timeline Corruption or Stuck States

Problem: Ingestion jobs fail due to corrupted .hoodie timeline.

Symptoms:

  • .inflight commits remain forever
  • Rollbacks don’t complete

Solutions:

  • Manually roll back stuck commits:
    hudi-cli
    > connect --path s3://my-table
    > commit rollback --commit 20241116120010
    
  • Clean up partial files and rerun jobs (the sketch after this list shows how to spot stuck instants)
  • Keep automatic cleaning enabled:
    hoodie.clean.automatic = true
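
To see which instants are stuck before rolling anything back, you can list the pending timeline files. A small sketch assuming the table path is locally accessible (for S3, list .hoodie/ with your usual S3 tooling instead); note that newer Hudi versions may keep timeline files under .hoodie/timeline/.

    # List timeline instants stuck in requested/inflight state.
    from pathlib import Path

    timeline_dir = Path("/data/my-table/.hoodie")   # placeholder local path
    pending = sorted(p.name for p in timeline_dir.iterdir()
                     if p.suffix in (".inflight", ".requested"))
    print("\n".join(pending) or "no pending instants")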
    

7. Small Files Explosion

Problem: High number of small files degrades performance and increases S3 costs.

Solutions:

  • Increase small file size limit:
    hoodie.parquet.small.file.limit = 134217728  # 128MB
    
  • Enable clustering or compaction to rewrite small files (combined sizing options are sketched after this list):
    hoodie.clustering.inline = true
    hoodie.clustering.plan.strategy.class = org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
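
A sketch of how the sizing and clustering knobs combine as writer options; the byte values and commit counts are illustrative, not tuned recommendations.

    # Illustrative file-sizing and clustering options for a small-files problem.
    file_sizing_options = {
        "hoodie.parquet.max.file.size": str(512 * 1024 * 1024),      # target base file size
        "hoodie.parquet.small.file.limit": str(128 * 1024 * 1024),   # pack new records into files below this size
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",                 # run clustering every 4 commits
    }
    # Pass these along with the rest of your Hudi write options.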
    

8. Memory Issues in Large Ingest Jobs

Problem: Jobs are killed or hit OOM errors during the write phase.

Tips:

  • Use Kryo serializer:
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
    
  • Increase executor memory and cores:
    --executor-memory 8g
    --executor-cores 4
    
  • Tune bulk-insert write parallelism (see the sketch after this list):
    hoodie.bulkinsert.shuffle.parallelism = 200
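
A sketch tying the session-level and write-level knobs together; executor memory and cores are normally set on spark-submit rather than in code, so the values here are illustrative only.

    # Session-level and write-level knobs for a large bulk_insert job.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.executor.memory", "8g")    # usually set via spark-submit
        .config("spark.executor.cores", "4")      # usually set via spark-submit
        .getOrCreate()
    )

    bulk_insert_options = {
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.bulkinsert.shuffle.parallelism": "200",
    }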
    

Conclusion

Debugging Hudi workflows requires a solid understanding of its internal components—metadata table, write operations, compaction, and table timeline. Whether you’re running Spark or Flink, these common errors can be quickly identified and resolved with the right configuration, tooling, and monitoring practices.

Following these best practices helps you stabilize production pipelines, reduce ingestion failures, and deliver reliable lakehouse data at scale.