Optimizing Hudi Write Performance with Indexing Strategies
Enhance Apache Hudi ingestion performance by leveraging the right indexing configurations
Apache Hudi has become a cornerstone of modern data lakes and lakehouse architectures by supporting incremental ingestion, upserts, streaming ETL, and record-level updates on top of HDFS or cloud object stores.
But writing to Hudi, especially in upsert-heavy workloads, can become a bottleneck if not configured properly. One of the most effective ways to boost Hudi’s write performance is by using the right indexing strategy.
In this blog, we’ll explore various Hudi indexing techniques, how they work, and how to tune them for better write throughput and consistency.
Why Indexing Matters in Hudi
During upserts, Hudi must locate existing records to determine whether to update or insert. This requires an efficient indexing mechanism.
Without an optimized index, every write may trigger:
- Excessive file lookups
- Slower commit times
- Higher memory usage
- Performance degradation under large datasets
Types of Indexes in Apache Hudi
Apache Hudi supports multiple indexing types, each optimized for different workloads.
Index Type | Lookup Cost | Write Cost | Update Performance | Use Case |
---|---|---|---|---|
Bloom | Medium | Low | Good | Default, general use |
HBase | Low | High | Excellent | High-frequency updates |
Simple | High | Low | Poor | Small datasets |
Global_Bloom | Medium | Medium | Fair | Updates across partitions |
Bucket | Low | Medium | Excellent | Pre-partitioned writes |
Record-Level Index | Varies | Varies | Varies | Advanced use cases (Beta) |
Let’s explore the most commonly used ones in more detail.
Bloom Index (Default)
The Bloom index is the default and most widely used. It leverages Bloom filters embedded in Parquet files to check for record keys.
hoodie.index.type=BLOOM
hoodie.bloom.index.parallelism=200
hoodie.bloom.index.filter.type=dynamic_v0
Pros:
- No external system required
- Fast lookups for moderate-scale datasets
Cons:
- Slower on large datasets with wide partitions
- False positives may slightly increase update overhead
Tuning Tips:
- Adjust
hoodie.bloom.index.parallelism
- Reduce false positives via
hoodie.bloom.index.filter.type
Global Bloom Index
Global Bloom extends regular Bloom across all partitions.
hoodie.index.type=GLOBAL_BLOOM
Use case: When record keys may move across partitions.
Trade-off: More expensive lookups across all files but ensures data correctness in dynamic partitioning scenarios.
HBase Index
Uses Apache HBase as an external store for indexing.
hoodie.index.type=HBASE
hoodie.index.hbase.zkquorum=zk-host:2181
hoodie.index.hbase.zkport=2181
Pros:
- Very fast lookups even with large datasets
- Ideal for high update frequency
Cons:
- Requires external HBase cluster
- Operational overhead
Use in production environments that prioritize low-latency updates.
Simple Index
Performs full file scans to find matching keys. Best for small or testing datasets.
hoodie.index.type=SIMPLE
Not recommended for large datasets due to poor scalability.
Bucket Index (Stable in Hudi 0.11+)
The bucket index precomputes record distribution, similar to bucketing in Hive.
hoodie.index.type=BUCKET
hoodie.bucket.index.num.buckets=16
Pros:
- Deterministic write path
- Efficient for bulk inserts and streaming
Cons:
- Less flexible for dynamic partitioning
- Requires upfront tuning of bucket count
Best for deduplicated streams or when primary key distribution is predictable.
Tips for Optimizing Write Performance
- Use Bulk Insert mode (
bulk_insert
) for initial ingestion - Enable metadata table for file listing acceleration:
hoodie.metadata.enable=true
- Use async compaction (
scheduleAndExecuteCompaction
) in MOR tables - Cache partition paths when using Bloom:
hoodie.bloom.index.use.caching=true
- For Spark pipelines, optimize parallelism:
hoodie.bloom.index.parallelism=300 hoodie.upsert.shuffle.parallelism=300
Choosing the Right Index
Use Case | Recommended Index |
---|---|
General upsert workload | Bloom |
Records updating across partitions | Global Bloom |
High-frequency updates | HBase |
Streaming pipeline with known skew | Bucket |
Testing or small-scale ingestion | Simple |
Always benchmark with representative data volumes and write patterns before choosing.
Conclusion
Apache Hudi’s ability to perform efficient upserts and incremental writes relies heavily on the indexing mechanism used. By selecting the right indexing strategy and tuning it properly, you can maximize ingestion throughput, minimize lookup cost, and ensure consistency across your data lake.
Whether you’re building a real-time pipeline or onboarding historical data, Hudi’s flexible indexing framework gives you the tools to fine-tune performance and reliability in high-scale environments.