Optimizing Hudi Metadata Table for Large Datasets
Improve Hudi query performance and scalability by tuning the metadata table for large-scale data lakes
Apache Hudi is a popular data lakehouse platform that enables incremental data processing, efficient upserts, and streaming ingestion on top of storage systems like HDFS and cloud object stores such as S3.
One of the key components of Hudi is the Metadata Table, which stores file listings and partition information to speed up query planning and avoid expensive filesystem operations.
However, with large datasets, the metadata table itself can become a bottleneck if not configured and maintained properly.
In this post, we’ll dive into optimizing the Hudi metadata table for high-scale environments and discuss tuning strategies, compaction configs, and maintenance practices to ensure long-term performance.
What is the Hudi Metadata Table?
The Hudi Metadata Table is an internal component that stores metadata about:
- Partitions
- Files
- Column stats
- Bloom filters
It eliminates the need to perform expensive directory listings during queries and compactions.
By default, it’s enabled for Copy-on-Write (COW) and Merge-on-Read (MOR) tables and is stored within the same file system under the .hoodie/metadata folder.
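To make this concrete, here is a minimal PySpark write sketch (the table name, fields, and S3 path are hypothetical) that enables the metadata table explicitly; after the write, the metadata table appears under the base path at .hoodie/metadata:

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle for your Spark version is on the classpath.
spark = (SparkSession.builder
         .appName("hudi-metadata-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame(
    [("id1", "2024-01-01", 42)],
    ["record_key", "partition_path", "value"])

(df.write.format("hudi")
   .option("hoodie.table.name", "demo_table")
   .option("hoodie.datasource.write.recordkey.field", "record_key")
   .option("hoodie.datasource.write.partitionpath.field", "partition_path")
   .option("hoodie.datasource.write.precombine.field", "value")
   # On by default in recent Hudi releases; set explicitly for clarity.
   .option("hoodie.metadata.enable", "true")
   .mode("append")
   .save("s3://my-bucket/demo_table"))  # metadata table lands under .hoodie/metadata
```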
Why Optimize the Metadata Table?
As the number of partitions or files grows (often into millions), metadata-dependent operations can degrade if the metadata table is not optimized:
- Query planning
- File listing
- Incremental reads
Key benefits of optimization:
- Faster query and write operations
- Reduced memory usage
- Lower latency for partition discovery
- Efficient compaction and file pruning
Key Configuration Options for Metadata Table
- Enable the Metadata Table
hoodie.metadata.enable=true
This enables metadata table management for file listings and other operations.
- Enable the Partition Stats Index (optional)
hoodie.metadata.index.partition.stats.enable=true
Useful for faster partition pruning during queries.
- Enable the Bloom Filter and Column Stats Indexes
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.enable=true
The bloom filter index speeds up record-key lookups during upserts; the column stats index enables predicate-based file skipping during queries.
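As a hedged sketch (config keys follow recent Hudi releases; verify them against the config reference for your version), these options can be grouped and passed to the writer from the earlier example:

```python
# Index-related options from this section, collected as writer options.
metadata_index_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.partition.stats.enable": "true",  # partition pruning
    "hoodie.metadata.index.bloom.filter.enable": "true",     # record-key lookups
    "hoodie.metadata.index.column.stats.enable": "true",     # file skipping
}

(df.write.format("hudi")
   .option("hoodie.table.name", "demo_table")
   .option("hoodie.datasource.write.recordkey.field", "record_key")
   .option("hoodie.datasource.write.partitionpath.field", "partition_path")
   .option("hoodie.datasource.write.precombine.field", "value")
   .options(**metadata_index_opts)
   .mode("append")
   .save("s3://my-bucket/demo_table"))
```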
Compaction Tuning for Metadata Table
The metadata table is itself a Hudi Merge-on-Read table, so its delta log files require periodic compaction.
Compaction of the metadata table is carried out by the writer and is triggered by the number of delta commits accumulated since the last compaction:
hoodie.metadata.compact.max.delta.commits=10
The default is 10. Lower values compact more often (more work on the write path, faster metadata reads), while higher values let delta logs accumulate and slow metadata reads.
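A hedged sketch of tightening the trigger (5 is an illustrative value, not a recommendation):

```python
# Compact the metadata table every 5 delta commits instead of the default 10:
# more compaction work on the write path, shorter delta logs to merge on reads.
compaction_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.compact.max.delta.commits": "5",
}
```

Pass these to the writer via .options(**compaction_opts), as in the previous sketch.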
Best practices:
- Keep the compaction trigger low enough that the metadata table’s delta logs stay short
- Monitor compaction duration and frequency using the writer logs
- Account for metadata compaction overhead when sizing latency-sensitive ingestion pipelines
Scaling Metadata Table for Large Partition Counts
For datasets with >100K partitions, consider:
- Increasing executor memory for metadata-heavy operations (for example, via spark.executor.memory)
- Raising the file-listing parallelism:
hoodie.file.listing.parallelism=1000
- Storing metadata in a dedicated storage path (optional in cloud-native setups)
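A hedged sketch of both knobs together (the memory values and parallelism are illustrative, and hoodie.file.listing.parallelism should be checked against your Hudi version):

```python
from pyspark.sql import SparkSession

# Give executors extra headroom for metadata-heavy listing and pruning work.
spark = (SparkSession.builder
         .appName("hudi-large-table")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.memoryOverhead", "2g")
         .getOrCreate())

# Raise listing parallelism for tables with very large partition counts.
df = (spark.read.format("hudi")
      .option("hoodie.metadata.enable", "true")
      .option("hoodie.file.listing.parallelism", "1000")
      .load("s3://my-bucket/demo_table"))
```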
Monitoring Metadata Table Health
Use Hudi CLI or APIs to inspect metadata table performance:
hudi-cli
> connect --path s3://my-hudi-table
> metadata stats
Check for:
- Lagging compactions
- Stale entries
- High memory usage in Spark executors during metadata operations
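Beyond the CLI, the metadata table’s own timeline can be inspected directly. The sketch below uses a hypothetical path; spark._jvm and spark._jsc are internal but widely used handles, and newer Hudi releases move timeline files into a .hoodie/timeline subfolder. It counts completed delta commits against compaction commits:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-health").getOrCreate()
jvm = spark._jvm

# The metadata table keeps its own timeline under <base_path>/.hoodie/metadata/.hoodie.
timeline_path = jvm.org.apache.hadoop.fs.Path(
    "s3://my-bucket/demo_table/.hoodie/metadata/.hoodie")
fs = timeline_path.getFileSystem(spark._jsc.hadoopConfiguration())

names = [status.getPath().getName() for status in fs.listStatus(timeline_path)]
delta_commits = sum(n.endswith(".deltacommit") for n in names)
compactions = sum(n.endswith(".commit") for n in names)  # MOR compactions complete as .commit

# Many .deltacommit files alongside few .commit files suggests compaction is lagging.
print(f"deltacommits={delta_commits}, compaction commits={compactions}")
```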
Avoiding Common Pitfalls
- Don’t disable the metadata table in large-scale environments — it’s crucial for performance
- Avoid small commit intervals (e.g., one commit per minute) without compaction tuning
- Monitor metadata table size and avoid small-file explosion within .hoodie/metadata (a quick check is sketched below)
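A quick way to check the last pitfall, reusing the jvm and fs handles from the previous sketch (hypothetical path; the 10 MB threshold is arbitrary):

```python
# Recursively list .hoodie/metadata and compute the average file size.
meta_path = jvm.org.apache.hadoop.fs.Path("s3://my-bucket/demo_table/.hoodie/metadata")
files = fs.listFiles(meta_path, True)  # recursive RemoteIterator

count = total_bytes = 0
while files.hasNext():
    f = files.next()
    count += 1
    total_bytes += f.getLen()

avg_mb = (total_bytes / count) / (1024 * 1024) if count else 0.0
if count and avg_mb < 10:
    print(f"{count} files, avg {avg_mb:.1f} MB each: possible small-file explosion")
```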
Recommended Practices for Large Datasets
- Use async compaction to isolate metadata overhead
- Run metadata validation jobs periodically (via Spark or CLI)
- Adjust reader and writer parallelism for cloud object stores (e.g., S3, GCS)
- Enable column stats indexing for faster predicate evaluation
- Configure file size thresholds to avoid metadata bloat
Conclusion
The Hudi metadata table is a powerful tool for scaling data lake operations — but like any system, it requires tuning and maintenance as datasets grow. By optimizing compaction, managing memory usage, and enabling relevant indexes, you can ensure fast, scalable, and reliable performance even in petabyte-scale Hudi deployments.
Follow these best practices to keep your metadata operations lean and your data lake query-ready.