Advanced Compaction Techniques in Hudi for Efficient Storage
Master Hudi compaction strategies to optimize storage, reduce latency, and maintain data lake performance
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a powerful lakehouse storage engine that enables real-time ingestion, updates, deletes, and incremental queries on data lakes. As data evolves rapidly, especially with frequent inserts and updates, Hudi uses compaction to merge smaller delta files into optimized base files.
Efficient compaction is crucial for maintaining query performance, reducing small files, and lowering storage overhead. In this blog, we explore advanced compaction techniques in Hudi, including inline compaction, asynchronous compaction, and clustering for efficient storage management.
Why Is Compaction Important in Hudi?
In Copy-on-Write (CoW) tables, data is written as new base files. In Merge-on-Read (MoR) tables, updates are written as delta logs and periodically compacted into base files. Without compaction:
- MoR queries degrade over time (many logs to merge)
- Small file problems increase NameNode pressure
- Query latency increases due to excessive merging at read time
Efficient compaction ensures:
- Faster read performance
- Reduced file count
- Efficient use of HDFS or cloud storage
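To make the read-time cost concrete, a MoR table can be queried in snapshot mode (base files merged with un-compacted delta logs at read time) or read-optimized mode (compacted base files only). A minimal PySpark sketch, assuming a SparkSession started with the Hudi bundle and a hypothetical table path:
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is on the classpath, e.g. launched with
# spark-submit --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:<version>
spark = SparkSession.builder.appName("hudi-query-modes").getOrCreate()

table_path = "s3://datalake/hudi/orders"  # placeholder path

# Snapshot query: merges base files with any un-compacted delta logs at read time
snapshot_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load(table_path)
)

# Read-optimized query: reads only compacted base files, skipping delta logs
read_optimized_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(table_path)
)

print(snapshot_df.count(), read_optimized_df.count())
The gap between the two results (and their runtimes) grows with every delta commit that has not yet been compacted.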
Hudi Table Types Recap
- Copy-on-Write (CoW):
  - Updates rewrite entire files
  - No compaction needed
  - Better for read-heavy use cases
- Merge-on-Read (MoR):
  - Writes deltas as log files
  - Requires periodic compaction
  - Ideal for write-heavy workloads with streaming updates
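As a quick illustration, the table type is chosen at write time through the hoodie.datasource.write.table.type option; a minimal PySpark sketch with hypothetical table name, keys, and path:
from pyspark.sql import SparkSession

# Hudi Spark bundle assumed on the classpath
spark = SparkSession.builder.appName("hudi-mor-write").getOrCreate()

df = spark.createDataFrame(
    [(1, "o-100", "2024-01-01")], ["order_id", "order_ref", "order_date"]
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # or COPY_ON_WRITE
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://datalake/hudi/orders")
)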
Inline Compaction
Inline compaction performs compaction immediately after writes as part of the ingestion job.
Enable inline compaction:
hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=3
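These properties are ordinary write options on the Hudi datasource, so they can be added to the same kind of write shown earlier; a minimal sketch, assuming an existing SparkSession (with the Hudi bundle) and an updates DataFrame named updates_df:
# Inline compaction options added to a MoR upsert (names and paths are placeholders).
inline_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.compact.inline": "true",                 # compact as part of the ingest job
    "hoodie.compact.inline.max.delta.commits": "3",  # after every 3 delta commits
}

(
    updates_df.write.format("hudi")
    .options(**inline_options)
    .mode("append")
    .save("s3://datalake/hudi/orders")
)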
Pros:
- Automatic and convenient
- Reduces read amplification quickly
Cons:
- Slower ingestion due to added compaction step
- May impact latency for real-time pipelines
Use inline compaction when:
- Data volumes are manageable
- You prefer simplicity over fine-tuned scheduling
Asynchronous (Scheduled) Compaction
In asynchronous compaction, ingestion is separated from compaction using background jobs.
Steps:
- Ingest data normally
- Run compaction periodically as a separate job, for example with the HoodieCompactor utility:
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle.jar \
  --base-path s3://datalake/hudi/orders \
  --table-name orders \
  ...
Or using CLI:
hudi-cli
connect --path s3://datalake/hudi/orders
compaction schedule
compaction run
Pros:
- No impact on write latency
- Full control over compaction timing
Cons:
- More operational overhead
- Requires job orchestration (Airflow, Oozie, etc.)
Use asynchronous compaction for large-scale streaming ingestion, or whenever write latency is critical.
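If Airflow is the orchestrator, the offline compaction job can simply be scheduled as a recurring task; a minimal sketch, assuming Airflow 2.x, a hypothetical jar location, and the same HoodieCompactor invocation as above:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Runs offline compaction nightly via spark-submit; the jar path, table path,
# and schedule are placeholders to adapt to your environment.
with DAG(
    dag_id="hudi_orders_compaction",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # low-traffic hours
    catchup=False,
) as dag:
    compact_orders = BashOperator(
        task_id="compact_orders",
        bash_command=(
            "spark-submit "
            "--class org.apache.hudi.utilities.HoodieCompactor "
            "/opt/jars/hudi-utilities-bundle.jar "
            "--base-path s3://datalake/hudi/orders "
            "--table-name orders"
        ),
    )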
Compaction Trigger Strategies
Tune compaction frequency and scope using properties such as:
hoodie.compact.inline.max.delta.commits=5
hoodie.parquet.small.file.limit=104857600   # 100 MB small-file threshold
hoodie.compaction.strategy=org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy
Other strategies include:
- BoundedIOCompactionStrategy: Caps the total I/O performed in a single compaction run
- DayBasedCompactionStrategy: Compacts a configurable number of the latest day-based partitions first
- LogFileSizeBasedCompactionStrategy: Prioritizes file groups with the largest accumulated log file size
Choose a strategy based on ingestion rate, partitioning scheme, and query latency SLAs.
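These knobs are also plain write options, so a strategy can be swapped in per table; a minimal sketch using the day-based strategy (the class name and package are an assumption to verify against your Hudi version, since they have moved between releases):
# Compaction tuning options merged into a MoR write.
compaction_tuning = {
    "hoodie.compact.inline.max.delta.commits": "5",
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # 100 MB
    "hoodie.compaction.strategy": (
        "org.apache.hudi.table.action.compact.strategy.DayBasedCompactionStrategy"
    ),
}

# e.g. updates_df.write.format("hudi").options(**inline_options, **compaction_tuning)...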
Compaction with Clustering (Hudi 0.7+)
Clustering is a layout optimization that reorganizes data files based on sorting and sizing.
Use clustering to:
- Sort data by frequently queried columns
- Coalesce small files across partitions
Enable clustering:
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
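As with compaction, clustering settings ride along as write options; a minimal sketch that also sorts by a hypothetical order_date column via hoodie.clustering.plan.strategy.sort.columns:
# Inline clustering options for a Hudi write (a SparkSession with the Hudi bundle
# and a DataFrame `df` are assumed; names and path are placeholders).
clustering_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "order_date",  # frequently queried column
}

(
    df.write.format("hudi")
    .options(**clustering_options)
    .mode("append")
    .save("s3://datalake/hudi/orders")
)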
Alternatively, schedule and run clustering separately from hudi-cli:
clustering schedule
clustering run
Use clustering when:
- Query performance is more critical than ingest speed
- You experience skewed or fragmented data
Best Practices for Efficient Compaction
- Use asynchronous compaction for high-throughput streaming ingestion
- Limit file size to reduce I/O overhead (hoodie.parquet.max.file.size)
- Set log file size thresholds so that compaction is triggered in a timely manner
- Combine clustering with compaction to improve read locality
- Monitor compaction lag via Hudi metadata and dashboards
- Schedule compaction during low-traffic hours
Monitoring and Metrics
Track compaction metrics using:
- Hudi CLI (compactions show all and related compaction commands)
- Spark job logs
- Hudi timeline metadata (the .hoodie/ directory)
- Integrations with Prometheus, Grafana, or Hadoop UIs
Focus on:
- Number of pending compactions
- Average compaction duration
- File sizes before and after compaction
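A lightweight way to watch the number of pending compactions is to inspect the timeline under the .hoodie/ directory, where each scheduled-but-unexecuted compaction appears as a *.compaction.requested instant (layout as in recent 0.x releases); a minimal sketch for a table on a locally mounted or HDFS-fuse path (object stores would need the corresponding filesystem client):
import os

def pending_compactions(table_path: str) -> int:
    # Scheduled compactions show up as "<instant>.compaction.requested"
    # in the table's .hoodie/ timeline until they are executed.
    timeline_dir = os.path.join(table_path, ".hoodie")
    return sum(
        1 for name in os.listdir(timeline_dir)
        if name.endswith(".compaction.requested")
    )

print(pending_compactions("/data/hudi/orders"))  # placeholder path
A steadily growing count is an early warning that compaction is lagging behind ingestion.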
Conclusion
Apache Hudi’s compaction framework is essential for maintaining efficient storage, low-latency queries, and manageable file systems in modern data lakes. By using the right combination of inline, asynchronous, and clustering-based compaction techniques, you can balance write throughput and read performance effectively.
Master these advanced compaction techniques to ensure your Hudi-powered lakehouse remains fast, scalable, and cost-efficient as your data grows.