Optimizing Hudi Metadata Table for Large Datasets
Improve Hudi query performance and scalability by tuning the metadata table for large-scale data lakes
Apache Hudi is a popular data lakehouse platform that enables incremental data processing, efficient upserts, and streaming ingestion on top of storage systems like HDFS and cloud object stores such as S3.
One of the key components of Hudi is the Metadata Table, which stores file listings and partition information to speed up query planning and avoid expensive filesystem operations.
However, with large datasets, the metadata table itself can become a bottleneck if not configured and maintained properly.
In this post, we’ll dive into optimizing the Hudi metadata table for high-scale environments and discuss tuning strategies, compaction configs, and maintenance practices to ensure long-term performance.
What is the Hudi Metadata Table?
The Hudi Metadata Table is an internal component that stores metadata about:
- Partitions
- Files
- Column stats
- Bloom filters
It eliminates the need to perform expensive directory listings during queries and compactions.
By default, it’s enabled for Copy-on-Write (COW) and Merge-on-Read (MOR) tables and is stored within the same file system under the .hoodie/metadata folder.
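To make this concrete, here is a minimal PySpark write sketch (the table name, fields, and S3 path are hypothetical) that enables the metadata table explicitly; after the write, the metadata table appears under the base path at .hoodie/metadata:

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle for your Spark version is on the classpath.
spark = (SparkSession.builder
         .appName("hudi-metadata-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame(
    [("id1", "2024-01-01", 42)],
    ["record_key", "partition_path", "value"])

(df.write.format("hudi")
   .option("hoodie.table.name", "demo_table")
   .option("hoodie.datasource.write.recordkey.field", "record_key")
   .option("hoodie.datasource.write.partitionpath.field", "partition_path")
   .option("hoodie.datasource.write.precombine.field", "value")
   # On by default in recent Hudi releases; set explicitly for clarity.
   .option("hoodie.metadata.enable", "true")
   .mode("append")
   .save("s3://my-bucket/demo_table"))  # metadata table lands under .hoodie/metadata
```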
Why Optimize the Metadata Table?
As the number of partitions or files grows (often into millions), metadata-dependent operations can degrade if the metadata table is not optimized:
- Query planning
- File listing
- Incremental reads
Key benefits of optimization:
- Faster query and write operations
- Reduced memory usage
- Lower latency for partition discovery
- Efficient compaction and file pruning
Key Configuration Options for Metadata Table
- Enable the Metadata Table
hoodie.metadata.enable=true
This enables metadata table management for file listings and other operations.
- Enable the Partition Stats Index (optional)
hoodie.metadata.index.partition.stats.enable=true
Useful for faster partition pruning during queries.
- Enable the Bloom Filter and Column Stats Indexes
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.enable=true
The bloom filter index speeds up record-key lookups during upserts; the column stats index enables predicate-based file skipping during queries.
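As a hedged sketch (config keys follow recent Hudi releases; verify them against the config reference for your version), these options can be grouped and passed to the writer from the earlier example:

```python
# Index-related options from this section, collected as writer options.
metadata_index_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.partition.stats.enable": "true",  # partition pruning
    "hoodie.metadata.index.bloom.filter.enable": "true",     # record-key lookups
    "hoodie.metadata.index.column.stats.enable": "true",     # file skipping
}

(df.write.format("hudi")
   .option("hoodie.table.name", "demo_table")
   .option("hoodie.datasource.write.recordkey.field", "record_key")
   .option("hoodie.datasource.write.partitionpath.field", "partition_path")
   .option("hoodie.datasource.write.precombine.field", "value")
   .options(**metadata_index_opts)
   .mode("append")
   .save("s3://my-bucket/demo_table"))
```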
Compaction Tuning for Metadata Table
The metadata table is itself a Hudi Merge-on-Read table, so its delta log files require periodic compaction.
Compaction of the metadata table is carried out by the writer and is triggered by the number of delta commits accumulated since the last compaction:
hoodie.metadata.compact.max.delta.commits=10
The default is 10. Lower values compact more often (more work on the write path, faster metadata reads), while higher values let delta logs accumulate and slow metadata reads.
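A hedged sketch of tightening the trigger (5 is an illustrative value, not a recommendation):

```python
# Compact the metadata table every 5 delta commits instead of the default 10:
# more compaction work on the write path, shorter delta logs to merge on reads.
compaction_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.compact.max.delta.commits": "5",
}
```

Pass these to the writer via .options(**compaction_opts), as in the previous sketch.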
Best practices:
- Keep the compaction trigger low enough that the metadata table’s delta logs stay short
- Monitor compaction duration and frequency using the writer logs
- Account for metadata compaction overhead when sizing latency-sensitive ingestion pipelines
Scaling Metadata Table for Large Partition Counts
For datasets with >100K partitions, consider:
- Increasing executor memory for metadata-heavy operations (for example, via spark.executor.memory)
- Raising the file-listing parallelism:
hoodie.file.listing.parallelism=1000
- Storing metadata in a dedicated storage path (optional in cloud-native setups)
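A hedged sketch of both knobs together (the memory values and parallelism are illustrative, and hoodie.file.listing.parallelism should be checked against your Hudi version):

```python
from pyspark.sql import SparkSession

# Give executors extra headroom for metadata-heavy listing and pruning work.
spark = (SparkSession.builder
         .appName("hudi-large-table")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.memoryOverhead", "2g")
         .getOrCreate())

# Raise listing parallelism for tables with very large partition counts.
df = (spark.read.format("hudi")
      .option("hoodie.metadata.enable", "true")
      .option("hoodie.file.listing.parallelism", "1000")
      .load("s3://my-bucket/demo_table"))
```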
Monitoring Metadata Table Health
Use Hudi CLI or APIs to inspect metadata table performance:
hudi-cli
> connect --path s3://my-hudi-table
> metadata stats
Check for:
- Lagging compactions
- Stale entries
- High memory usage in Spark executors during metadata operations
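Beyond the CLI, the metadata table’s own timeline can be inspected directly. The sketch below uses a hypothetical path; spark._jvm and spark._jsc are internal but widely used handles, and newer Hudi releases move timeline files into a .hoodie/timeline subfolder. It counts completed delta commits against compaction commits:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-health").getOrCreate()
jvm = spark._jvm

# The metadata table keeps its own timeline under <base_path>/.hoodie/metadata/.hoodie.
timeline_path = jvm.org.apache.hadoop.fs.Path(
    "s3://my-bucket/demo_table/.hoodie/metadata/.hoodie")
fs = timeline_path.getFileSystem(spark._jsc.hadoopConfiguration())

names = [status.getPath().getName() for status in fs.listStatus(timeline_path)]
delta_commits = sum(n.endswith(".deltacommit") for n in names)
compactions = sum(n.endswith(".commit") for n in names)  # MOR compactions complete as .commit

# Many .deltacommit files alongside few .commit files suggests compaction is lagging.
print(f"deltacommits={delta_commits}, compaction commits={compactions}")
```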
Avoiding Common Pitfalls
- Don’t disable the metadata table in large-scale environments — it’s crucial for performance
- Avoid small commit intervals (e.g., one commit per minute) without compaction tuning
- Monitor metadata table size and avoid small-file explosion within .hoodie/metadata (a quick check is sketched below)
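A quick way to check the last pitfall, reusing the jvm and fs handles from the previous sketch (hypothetical path; the 10 MB threshold is arbitrary):

```python
# Recursively list .hoodie/metadata and compute the average file size.
meta_path = jvm.org.apache.hadoop.fs.Path("s3://my-bucket/demo_table/.hoodie/metadata")
files = fs.listFiles(meta_path, True)  # recursive RemoteIterator

count = total_bytes = 0
while files.hasNext():
    f = files.next()
    count += 1
    total_bytes += f.getLen()

avg_mb = (total_bytes / count) / (1024 * 1024) if count else 0.0
if count and avg_mb < 10:
    print(f"{count} files, avg {avg_mb:.1f} MB each: possible small-file explosion")
```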
Recommended Practices for Large Datasets
- Use async compaction to isolate metadata overhead
- Run metadata validation jobs periodically (via Spark or CLI)
- Adjust reader and writer parallelism for cloud object stores (e.g., S3, GCS)
- Enable column stats indexing for faster predicate evaluation
- Configure file size thresholds to avoid metadata bloat
Conclusion
The Hudi metadata table is a powerful tool for scaling data lake operations — but like any system, it requires tuning and maintenance as datasets grow. By optimizing compaction, managing memory usage, and enabling relevant indexes, you can ensure fast, scalable, and reliable performance even in petabyte-scale Hudi deployments.
Follow these best practices to keep your metadata operations lean and your data lake query-ready.