As data lakes scale, the small files problem and poor data organization can drastically degrade query performance and inflate storage and metadata costs. Apache Hudi addresses this challenge through clustering, a storage optimization feature that reorganizes data files for improved query performance and reduced metadata overhead.

This blog explores how Hudi’s clustering works, when to use it, and how to configure it effectively to improve the performance, manageability, and efficiency of your data lake storage.


What is Clustering in Apache Hudi?

Clustering in Hudi is a process that reorganizes data files based on defined criteria such as sort order or file size. It helps:

  • Combine small files into larger ones
  • Organize files for better query performance
  • Optimize partition layout without rewriting full tables

Unlike compaction, which rewrites delta log files into base files on Merge-on-Read (MOR) tables, clustering is available for both Copy-on-Write (COW) and MOR tables and targets the physical organization of data files.
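
To make the distinction concrete, here is a minimal PySpark sketch of write options for a hypothetical MERGE_ON_READ table that enables both mechanisms; the table type and thresholds shown are illustrative assumptions, not recommendations:

# Illustrative Hudi write options; the values are assumptions.
mor_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Compaction: fold delta log files into base files every 5 delta commits
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: reorganize the base files themselves every 4 commits
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
}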


Why Use Clustering?

Benefits of clustering:

  • Improves query scan performance by reducing file count and enabling sorted access
  • Reduces metadata size and S3/DFS list cost
  • Maintains query performance in long-running ingestion pipelines
  • Avoids the overhead of table rewrite or re-partitioning

Common triggers:

  • After many incremental upserts or writes
  • Following ingestion bursts creating many small files
  • Periodic table optimization for analytics

How Clustering Works

Clustering works in two stages:

  1. Scheduling clustering plans
    Identifies file groups that need reorganization and records a clustering plan on the Hudi timeline (as a requested replacecommit instant)

  2. Executing clustering plans
    Launches Spark jobs that rewrite the planned file groups into new, larger files according to the configured sort and size strategies

Clustering is non-blocking for ingestion: plans can also be scheduled and executed asynchronously alongside ongoing writes (via hoodie.clustering.async.enabled=true), as in the sketch below.
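
For example, on recent Hudi releases a Spark Structured Streaming ingest can turn on asynchronous clustering purely through write options. A minimal sketch, where the rate source, table name, and field names are stand-ins for a real pipeline:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-async-clustering").getOrCreate()

# Toy streaming source standing in for a real ingest (hypothetical schema)
events = (spark.readStream.format("rate").load()
          .withColumn("event_id", F.col("value"))
          .withColumn("event_ts", F.col("timestamp"))
          .withColumn("event_date", F.to_date("timestamp")))

(events.writeStream.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.partitionpath.field", "event_date")
    # Schedule clustering every 4 commits and execute it in the background,
    # without blocking incoming micro-batches
    .option("hoodie.clustering.async.enabled", "true")
    .option("hoodie.clustering.async.max.commits", "4")
    .option("checkpointLocation", "/tmp/hudi_ckpt")  # placeholder path
    .outputMode("append")
    .start("s3://bucket/path"))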


Configuring Clustering in Hudi

To enable clustering:

hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.target.file.max.bytes=134217728  # 128MB
hoodie.clustering.plan.strategy.sort.columns=event_ts

Inline clustering runs automatically as part of the writer after every N commits (configured by hoodie.clustering.inline.max.commits), blocking that write while the planned file groups are rewritten.
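
The same properties can also be passed as Spark datasource options. A minimal PySpark sketch of a batch upsert with inline clustering, where the table name, fields, and path are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-inline-clustering").getOrCreate()

# Toy batch of events; schema and values are illustrative
df = spark.createDataFrame(
    [(1, "2024-01-01 00:00:00", "us-east"),
     (2, "2024-01-01 00:05:00", "eu-west")],
    ["event_id", "event_ts", "region"])

(df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.partitionpath.field", "region")
    .option("hoodie.datasource.write.operation", "upsert")
    # Every 4th commit pauses to rewrite small files, sorted by event_ts
    .option("hoodie.clustering.inline", "true")
    .option("hoodie.clustering.inline.max.commits", "4")
    .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "134217728")
    .option("hoodie.clustering.plan.strategy.sort.columns", "event_ts")
    .mode("append")
    .save("s3://bucket/path"))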

To schedule clustering manually, use the standalone HoodieClusteringJob shipped in the Hudi utilities bundle (the jar path and table name below are placeholders):

spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle.jar \
  --mode schedule \
  --base-path s3://bucket/path \
  --table-name my_table

To execute the scheduled plans:

spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle.jar \
  --mode execute \
  --base-path s3://bucket/path \
  --table-name my_table

Passing --mode scheduleAndExecute does both in a single run.
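
On recent Hudi releases (with the Hudi Spark bundle and SQL extensions enabled), clustering can also be scheduled and run through a Spark SQL procedure. A sketch assuming a hypothetical catalog table named events:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sql-clustering").getOrCreate()

# Schedules pending clustering plans for the table and executes them
spark.sql("CALL run_clustering(table => 'events')").show(truncate=False)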

Clustering Strategies

  1. Size-based Clustering
    Combines small files into optimal-size chunks

  2. Sort and Size Clustering
    Sorts records by column(s) before rewriting (e.g., by event_ts)

  3. Custom Clustering
    You can implement your own strategy by extending Hudi’s clustering interfaces
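
Swapping strategies is purely a configuration change. A sketch of the sort-and-size pairing, ordering records on two hypothetical columns chosen to match common query predicates:

# Illustrative strategy configuration; the sort columns are assumptions
strategy_options = {
    "hoodie.clustering.plan.strategy.class":
        "org.apache.hudi.client.clustering.plan.strategy."
        "SparkSizeBasedClusteringPlanStrategy",
    "hoodie.clustering.execution.strategy.class":
        "org.apache.hudi.client.clustering.run.strategy."
        "SparkSortAndSizeExecutionStrategy",
    # Sort by region first, then event_ts, before rewriting files
    "hoodie.clustering.plan.strategy.sort.columns": "region,event_ts",
}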


Monitoring Clustering Jobs

Use the Hudi CLI to check clustering plans and status:

hudi-cli.sh
> connect --path s3://your-table
> clustering show schedule
> clustering show run

In logs, look for:

  • Number of small files rewritten into larger ones
  • Partitions clustered
  • Duration and size improvements
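
On recent Hudi releases, clustering history can also be inspected from Spark SQL. A minimal sketch, again assuming a hypothetical catalog table named events:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-clustering-status").getOrCreate()

# Lists clustering instants and their state for the table
spark.sql("CALL show_clustering(table => 'events')").show(truncate=False)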

Query Performance Improvements

Clustering improves:

  • Athena and Presto scan performance (fewer, larger files to list and open)
  • Predicate pushdown effectiveness in Hive and Spark
  • File skipping and partition filtering, since sorting yields tighter min/max column statistics per file

Especially beneficial for:

  • Append-heavy workloads (logs, events)
  • Partitioned tables with high ingestion frequency
  • Long-running datasets in COW mode
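
For example, once files are sorted by event_ts, a time-range scan only needs to open the file groups whose event_ts min/max ranges overlap the filter. An illustrative sketch, where the path and column names are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-range-query").getOrCreate()

# Range predicate on the clustering sort column enables file-level skipping
recent = (spark.read.format("hudi").load("s3://bucket/path")
          .filter("event_ts >= '2024-01-01' AND event_ts < '2024-01-02'"))
recent.explain()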

Best Practices

  • Use inline clustering for continuous optimization in streaming jobs
  • For batch pipelines, use scheduled clustering during off-peak hours
  • Choose sort columns aligned with query filters (e.g., event_ts, region)
  • Monitor table size and plan clustering after every X commits or Y GB written
  • Use compaction to merge delta logs on MOR tables; use clustering (on both COW and MOR) to optimize file layout

Conclusion

Hudi’s clustering feature provides an elegant way to keep your data lake organized, performant, and storage-efficient without full table rewrites. By applying clustering strategies based on size and sort order, you can tame the small files problem, accelerate queries, and simplify storage management, all of which are critical to maintaining production-grade data lakes at scale.

Whether you’re running on-premises HDFS or an S3-based lakehouse, clustering is a must-have for long-term Hudi optimization.