Apache Hudi is a powerful lakehouse framework that supports upserts, incremental processing, and time-travel analytics. But as with any complex system, users can encounter a variety of issues during ingestion, compaction, or querying—especially when dealing with large-scale datasets and streaming pipelines.

This post outlines common issues in Hudi workflows, provides practical debugging tips, and offers guidance on how to monitor and stabilize your data pipelines in production.


1. Compaction Failures in Merge-On-Read Tables

Problem: Compaction fails or times out.

Symptoms:

  • Missing base files
  • Hudi timeline stuck in inflight state
  • Queries return incomplete data

Solutions:

  • Check compaction status:
    hudi-cli
    > connect --path s3://my-table
    > compactions show all
    
  • Schedule and run compaction manually:
    hudi-cli
    > connect --path s3://my-table
    > compaction schedule
    > compaction run
    
  • Tune compaction settings (a PySpark sketch of these options follows):
    hoodie.compact.inline = false
    hoodie.compact.inline.max.delta.commits = 5
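
To make the settings above concrete, here is a minimal PySpark sketch of a Merge-On-Read write that carries the compaction options. It assumes the Hudi Spark bundle is on the classpath; the table path, table name, source path, and key columns (id, ts) are placeholders.

    # Minimal PySpark sketch of a Merge-On-Read write with the compaction knobs above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()          # assumes the Hudi Spark bundle is available
    df = spark.read.json("s3://my-bucket/incoming/")    # placeholder incoming batch

    mor_options = {
        "hoodie.table.name": "my_table",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.compact.inline": "false",                  # compact asynchronously instead of inline
        "hoodie.compact.inline.max.delta.commits": "5",    # schedule compaction every 5 delta commits
    }

    (df.write.format("hudi")
       .options(**mor_options)
       .mode("append")
       .save("s3://my-table"))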
    

2. Write Operation Errors

Problem: Ingestion job fails with writer exceptions.

Symptoms:

  • NullPointerException in the write path
  • Tasks killed due to memory pressure (OOM)
  • ClassNotFoundException for Hudi classes (e.g., HoodieWriteClient not found)

Root Causes:

  • Mismatch between Hudi version and Spark/Flink version
  • Insufficient memory for write buffer
  • Incorrect record key or precombine field

Solutions:

  • Validate fields in write config (a quick sanity-check sketch follows this list):
    hoodie.datasource.write.recordkey.field = id
    hoodie.datasource.write.precombine.field = ts
    
  • Upgrade to compatible Hudi + engine versions
  • Increase Spark executor memory:
    --executor-memory 4g
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
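
Before tuning memory, it is worth confirming that the configured record key and precombine fields actually exist and are populated in the incoming data. A small PySpark sanity check, using the column names from the config above and a placeholder source path:

    # Pre-write sanity check for the fields referenced in the write config.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("s3://my-bucket/incoming/")   # placeholder source

    for field in ("id", "ts"):
        assert field in df.columns, f"write-config field missing from input: {field}"

    null_keys = df.filter(F.col("id").isNull()).count()
    print(f"records with a null record key: {null_keys}")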
    

3. Metadata Table Issues

Problem: Table operations are slow or hang due to corrupted or backlogged metadata.

Symptoms:

  • Timeouts during file listing
  • Metadata compaction backlog

Solutions:

  • Enable metadata cleaning:
    hoodie.metadata.clean.automatic = true
    hoodie.metadata.compact.async = true
    
  • Validate metadata health:
    hudi-cli
    > connect --path s3://my-table
    > metadata stats
    
  • In extreme cases, disable metadata temporarily to isolate the problem (see the read sketch after this list):
    hoodie.metadata.enable = false
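
One low-risk way to confirm that the metadata table is the bottleneck is to read the same table with metadata disabled and compare listing times. A minimal sketch, assuming a SparkSession with the Hudi bundle available; the table path is a placeholder.

    # Isolation test: read with the metadata table disabled to see whether
    # slow file listings are metadata-related.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.format("hudi")
          .option("hoodie.metadata.enable", "false")   # fall back to direct file listing
          .load("s3://my-table"))
    print(df.count())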
    

4. Schema Mismatch and Evolution Failures

Problem: Job fails due to schema incompatibility.

Symptoms:

  • Column not found
  • SparkAnalysisException or Avro schema mismatch

Root Causes:

  • Record schema doesn’t match table schema
  • Missing nullable fields in incoming data

Solutions:

  • Enable schema validation and auto-evolution:
    hoodie.avro.schema.validate = true
    hoodie.avro.schema.allow.auto.evolution = true
    
  • Maintain a versioned schema registry using tools like AWS Glue, Confluent Schema Registry, or the Hive Metastore

  • Compare Avro schemas using hudi-cli or diff tools; a PySpark schema comparison is sketched below
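
A lightweight way to spot mismatches before they fail a job is to diff the incoming batch schema against the current table schema in PySpark. A sketch, with placeholder paths:

    # Compare the incoming batch schema with the table schema to spot
    # missing or re-typed columns before writing.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    incoming = spark.read.json("s3://my-bucket/incoming/")
    table = spark.read.format("hudi").load("s3://my-table")

    incoming_fields = {f.name: f.dataType for f in incoming.schema.fields}
    table_fields = {f.name: f.dataType for f in table.schema.fields
                    if not f.name.startswith("_hoodie_")}   # skip Hudi meta columns

    print("missing in incoming:", sorted(set(table_fields) - set(incoming_fields)))
    print("new in incoming:", sorted(set(incoming_fields) - set(table_fields)))
    print("type changes:", {name: (table_fields[name], incoming_fields[name])
                            for name in table_fields.keys() & incoming_fields.keys()
                            if table_fields[name] != incoming_fields[name]})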

5. Querying Issues with Athena or Presto

Problem: Athena returns empty results or errors.

Symptoms:

  • Table appears empty
  • Missing partitions
  • Incorrect file formats

Solutions:

  • Sync the Hudi table to AWS Glue or the Hive Metastore (a fuller option set is sketched after this list):
    hoodie.datasource.hive_sync.enable = true
    hoodie.datasource.hive_sync.mode = hms  # valid modes are hms, jdbc, hiveql; a Glue-backed metastore uses hms
    
  • For Merge-On-Read tables, make sure compaction has produced up-to-date base files; read-optimized queries only read base files
  • Prefer Copy-on-Write tables for the simplest Athena compatibility, or convert the table type
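
For reference, here is a sketch of the sync-related writer options, assuming an HMS-compatible endpoint is reachable (for example a Glue-backed metastore on EMR); the database, table, and partition field names are placeholders.

    # Writer options that also sync the table to the catalog so Athena/Presto
    # can see new partitions.
    hive_sync_options = {
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": "analytics",      # placeholder database
        "hoodie.datasource.hive_sync.table": "my_table",           # placeholder table
        "hoodie.datasource.hive_sync.partition_fields": "dt",      # placeholder partition column
    }
    # Merge these into the regular write options, e.g.:
    # df.write.format("hudi").options(**hive_sync_options).mode("append").save("s3://my-table")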

6. Timeline Corruption or Stuck States

Problem: Ingestion jobs fail due to corrupted .hoodie timeline.

Symptoms:

  • .inflight commits remain forever
  • Rollbacks don’t complete

Solutions:

  • Manually roll back stuck commits:
    hudi-cli
    > connect --path s3://my-table
    > commit rollback --commit 20241116120010
    
  • Clean up partial files and rerun jobs (the sketch after this list shows how to spot stuck instants)
  • Keep automatic cleaning enabled:
    hoodie.clean.automatic = true
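
To see which instants are stuck before rolling anything back, you can list the pending timeline files. A small sketch assuming the table path is locally accessible (for S3, list .hoodie/ with your usual S3 tooling instead); note that newer Hudi versions may keep timeline files under .hoodie/timeline/.

    # List timeline instants stuck in requested/inflight state.
    from pathlib import Path

    timeline_dir = Path("/data/my-table/.hoodie")   # placeholder local path
    pending = sorted(p.name for p in timeline_dir.iterdir()
                     if p.suffix in (".inflight", ".requested"))
    print("\n".join(pending) or "no pending instants")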
    

7. Small Files Explosion

Problem: High number of small files degrades performance and increases S3 costs.

Solutions:

  • Increase small file size limit:
    hoodie.parquet.small.file.limit = 134217728  # 128MB
    
  • Enable clustering or compaction to rewrite small files (combined sizing options are sketched after this list):
    hoodie.clustering.inline = true
    hoodie.clustering.plan.strategy.class = org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
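
A sketch of how the sizing and clustering knobs combine as writer options; the byte values and commit counts are illustrative, not tuned recommendations.

    # Illustrative file-sizing and clustering options for a small-files problem.
    file_sizing_options = {
        "hoodie.parquet.max.file.size": str(512 * 1024 * 1024),      # target base file size
        "hoodie.parquet.small.file.limit": str(128 * 1024 * 1024),   # pack new records into files below this size
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",                 # run clustering every 4 commits
    }
    # Pass these along with the rest of your Hudi write options.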
    

8. Memory Issues in Large Ingest Jobs

Problem: Jobs are killed or hit OOM errors during the write phase.

Tips:

  • Use Kryo serializer:
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
    
  • Increase executor memory and cores:
    --executor-memory 8g
    --executor-cores 4
    
  • Tune bulk-insert write parallelism (see the sketch after this list):
    hoodie.bulkinsert.shuffle.parallelism = 200
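
A sketch tying the session-level and write-level knobs together; executor memory and cores are normally set on spark-submit rather than in code, so the values here are illustrative only.

    # Session-level and write-level knobs for a large bulk_insert job.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.executor.memory", "8g")    # usually set via spark-submit
        .config("spark.executor.cores", "4")      # usually set via spark-submit
        .getOrCreate()
    )

    bulk_insert_options = {
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.bulkinsert.shuffle.parallelism": "200",
    }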
    

Conclusion

Debugging Hudi workflows requires a solid understanding of its internal components—metadata table, write operations, compaction, and table timeline. Whether you’re running Spark or Flink, these common errors can be quickly identified and resolved with the right configuration, tooling, and monitoring practices.

Following these best practices helps you stabilize production pipelines, reduce ingestion failures, and deliver reliable lakehouse data at scale.