As data lakes scale, the small files problem and poor data organization can drastically degrade query performance and inflate storage and metadata costs. Apache Hudi addresses this challenge through clustering, a storage optimization feature that reorganizes data files for improved query performance and reduced metadata overhead.

This blog explores how Hudi’s clustering works, when to use it, and how to configure it effectively to improve the performance, manageability, and efficiency of your data lake storage.


What is Clustering in Apache Hudi?

Clustering in Hudi is a process that reorganizes data files based on defined criteria such as sort order or file size. It helps:

  • Combine small files into larger ones
  • Organize files for better query performance
  • Optimize partition layout without rewriting full tables

Unlike compaction, which rewrites delta log files into base files on Merge-on-Read (MOR) tables, clustering is available for both Copy-on-Write (COW) and MOR tables and targets the physical organization of data files.
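
To make the distinction concrete, here is a minimal PySpark sketch of write options for a hypothetical MERGE_ON_READ table that enables both mechanisms; the table type and thresholds shown are illustrative assumptions, not recommendations:

# Illustrative Hudi write options; the values are assumptions.
mor_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Compaction: fold delta log files into base files every 5 delta commits
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: reorganize the base files themselves every 4 commits
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
}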


Why Use Clustering?

Benefits of clustering:

  • Improves query scan performance by reducing file count and enabling sorted access
  • Reduces metadata size and S3/DFS list cost
  • Maintains query performance in long-running ingestion pipelines
  • Avoids the overhead of table rewrite or re-partitioning

Common triggers:

  • After many incremental upserts or writes
  • Following ingestion bursts creating many small files
  • Periodic table optimization for analytics

How Clustering Works

Clustering works in two stages:

  1. Scheduling clustering plans
    Identifies file groups that need reorganization and records a clustering plan on the Hudi timeline (as a requested replacecommit instant)

  2. Executing clustering plans
    Launches Spark jobs that rewrite the planned file groups into new, larger files according to the configured sort and size strategies

Clustering is non-blocking for ingestion: plans can also be scheduled and executed asynchronously alongside ongoing writes (via hoodie.clustering.async.enabled=true), as in the sketch below.
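
For example, on recent Hudi releases a Spark Structured Streaming ingest can turn on asynchronous clustering purely through write options. A minimal sketch, where the rate source, table name, and field names are stand-ins for a real pipeline:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-async-clustering").getOrCreate()

# Toy streaming source standing in for a real ingest (hypothetical schema)
events = (spark.readStream.format("rate").load()
          .withColumn("event_id", F.col("value"))
          .withColumn("event_ts", F.col("timestamp"))
          .withColumn("event_date", F.to_date("timestamp")))

(events.writeStream.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.partitionpath.field", "event_date")
    # Schedule clustering every 4 commits and execute it in the background,
    # without blocking incoming micro-batches
    .option("hoodie.clustering.async.enabled", "true")
    .option("hoodie.clustering.async.max.commits", "4")
    .option("checkpointLocation", "/tmp/hudi_ckpt")  # placeholder path
    .outputMode("append")
    .start("s3://bucket/path"))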


Configuring Clustering in Hudi

To enable clustering:

hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.target.file.max.bytes=134217728  # 128MB
hoodie.clustering.plan.strategy.sort.columns=event_ts

Inline clustering runs automatically as part of the writer after every N commits (configured by hoodie.clustering.inline.max.commits), blocking that write while the planned file groups are rewritten.
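
The same properties can also be passed as Spark datasource options. A minimal PySpark sketch of a batch upsert with inline clustering, where the table name, fields, and path are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-inline-clustering").getOrCreate()

# Toy batch of events; schema and values are illustrative
df = spark.createDataFrame(
    [(1, "2024-01-01 00:00:00", "us-east"),
     (2, "2024-01-01 00:05:00", "eu-west")],
    ["event_id", "event_ts", "region"])

(df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.partitionpath.field", "region")
    .option("hoodie.datasource.write.operation", "upsert")
    # Every 4th commit pauses to rewrite small files, sorted by event_ts
    .option("hoodie.clustering.inline", "true")
    .option("hoodie.clustering.inline.max.commits", "4")
    .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "134217728")
    .option("hoodie.clustering.plan.strategy.sort.columns", "event_ts")
    .mode("append")
    .save("s3://bucket/path"))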

To schedule clustering manually, use the standalone HoodieClusteringJob shipped in the Hudi utilities bundle (the jar path and table name below are placeholders):

spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle.jar \
  --mode schedule \
  --base-path s3://bucket/path \
  --table-name my_table

To execute the scheduled plans:

spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle.jar \
  --mode execute \
  --base-path s3://bucket/path \
  --table-name my_table

Passing --mode scheduleAndExecute does both in a single run.
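
On recent Hudi releases (with the Hudi Spark bundle and SQL extensions enabled), clustering can also be scheduled and run through a Spark SQL procedure. A sketch assuming a hypothetical catalog table named events:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sql-clustering").getOrCreate()

# Schedules pending clustering plans for the table and executes them
spark.sql("CALL run_clustering(table => 'events')").show(truncate=False)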

Clustering Strategies

  1. Size-based Clustering
    Combines small files into optimal-size chunks

  2. Sort and Size Clustering
    Sorts records by column(s) before rewriting (e.g., by event_ts)

  3. Custom Clustering
    You can implement your own strategy by extending Hudi’s clustering interfaces
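
Swapping strategies is purely a configuration change. A sketch of the sort-and-size pairing, ordering records on two hypothetical columns chosen to match common query predicates:

# Illustrative strategy configuration; the sort columns are assumptions
strategy_options = {
    "hoodie.clustering.plan.strategy.class":
        "org.apache.hudi.client.clustering.plan.strategy."
        "SparkSizeBasedClusteringPlanStrategy",
    "hoodie.clustering.execution.strategy.class":
        "org.apache.hudi.client.clustering.run.strategy."
        "SparkSortAndSizeExecutionStrategy",
    # Sort by region first, then event_ts, before rewriting files
    "hoodie.clustering.plan.strategy.sort.columns": "region,event_ts",
}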


Monitoring Clustering Jobs

Use the Hudi CLI to check clustering plans and status:

hudi-cli.sh
> connect --path s3://your-table
> clustering show schedule
> clustering show run

In logs, look for:

  • Number of small files rewritten into larger ones
  • Partitions clustered
  • Duration and size improvements
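
On recent Hudi releases, clustering history can also be inspected from Spark SQL. A minimal sketch, again assuming a hypothetical catalog table named events:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-clustering-status").getOrCreate()

# Lists clustering instants and their state for the table
spark.sql("CALL show_clustering(table => 'events')").show(truncate=False)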

Query Performance Improvements

Clustering improves:

  • Athena and Presto scan performance (fewer, larger files to list and open)
  • Predicate pushdown effectiveness in Hive and Spark
  • File skipping and partition filtering, since sorting yields tighter min/max column statistics per file

Especially beneficial for:

  • Append-heavy workloads (logs, events)
  • Partitioned tables with high ingestion frequency
  • Long-running datasets in COW mode
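
For example, once files are sorted by event_ts, a time-range scan only needs to open the file groups whose event_ts min/max ranges overlap the filter. An illustrative sketch, where the path and column names are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-range-query").getOrCreate()

# Range predicate on the clustering sort column enables file-level skipping
recent = (spark.read.format("hudi").load("s3://bucket/path")
          .filter("event_ts >= '2024-01-01' AND event_ts < '2024-01-02'"))
recent.explain()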

Best Practices

  • Use inline clustering for continuous optimization in streaming jobs
  • For batch pipelines, use scheduled clustering during off-peak hours
  • Choose sort columns aligned with query filters (e.g., event_ts, region)
  • Monitor table size and plan clustering after every X commits or Y GB written
  • Use compaction to merge delta logs on MOR tables; use clustering (on both COW and MOR) to optimize file layout

Conclusion

Hudi’s clustering feature provides an elegant way to keep your data lake organized, performant, and storage-efficient without full table rewrites. By applying clustering strategies based on size and sort order, you can tame the small files problem, accelerate queries, and simplify storage management, all of which are critical to maintaining production-grade data lakes at scale.

Whether you’re running on-premises HDFS or an S3-based lakehouse, clustering is a must-have for long-term Hudi optimization.