Apache Hudi has become a cornerstone of modern data lakes and lakehouse architectures by supporting incremental ingestion, upserts, streaming ETL, and record-level updates on top of HDFS or cloud object stores.

But writing to Hudi, especially in upsert-heavy workloads, can become a bottleneck if not configured properly. One of the most effective ways to boost Hudi’s write performance is by using the right indexing strategy.

In this blog, we’ll explore various Hudi indexing techniques, how they work, and how to tune them for better write throughput and consistency.


Why Indexing Matters in Hudi

During an upsert, Hudi must map each incoming record key to the file group that already contains it, or conclude the record is new. This tagging step is what the index performs, and its efficiency directly bounds write throughput.

Without an optimized index, every write may trigger:

  • Excessive file lookups
  • Slower commit times
  • Higher memory usage
  • Performance degradation under large datasets
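
To make this concrete, here is a minimal PySpark upsert sketch. The table path and the id/ts/dt column names are placeholders, and df is assumed to be an existing DataFrame with those columns; the record key field is what the index must resolve to an existing file group on every upsert.

# Minimal upsert sketch (illustrative; the path and the id/ts/dt columns are placeholders).
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",     # the key the index must locate
    "hoodie.datasource.write.precombine.field": "ts",    # newest record wins on key collisions
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",       # triggers the index lookup
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/events"))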

Types of Indexes in Apache Hudi

Apache Hudi supports multiple indexing types, each optimized for different workloads.

Index Type          Lookup Cost  Write Cost  Update Performance  Use Case
Bloom               Medium       Low         Good                Default, general use
HBase               Low          High        Excellent           High-frequency updates
Simple              High         Low         Poor                Small datasets
Global Bloom        Medium       Medium      Fair                Updates across partitions
Bucket              Low          Medium      Excellent           Pre-partitioned writes
Record-Level Index  Varies       Varies      Varies              Advanced use cases (Beta)

Let’s explore the most commonly used ones in more detail.


Bloom Index (Default)

The Bloom index is the default and most widely used. It leverages Bloom filters stored in the footers of base (Parquet) files, together with key-range pruning, to shortlist the files that might contain a given record key.

hoodie.index.type=BLOOM
hoodie.bloom.index.parallelism=200
hoodie.bloom.index.filter.type=DYNAMIC_V0
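
In a Spark pipeline these properties are passed as per-write options. A sketch building on the hudi_options dict from the upsert example above (the values are illustrative, not tuned recommendations):

# Bloom index tuning as per-write options (values illustrative).
hudi_options.update({
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.parallelism": "200",
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
})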

Pros:

  • No external system required
  • Fast lookups for moderate-scale datasets

Cons:

  • Slower on large datasets with wide partitions
  • False positives may slightly increase update overhead

Tuning Tips:

  • Raise hoodie.bloom.index.parallelism to match the parallelism of your write stage
  • Reduce false positives with hoodie.bloom.index.filter.type=DYNAMIC_V0, which sizes filters to the number of keys actually written per file

Global Bloom Index

The Global Bloom index extends the regular Bloom index to enforce key uniqueness across all partitions: each incoming key is looked up table-wide rather than only within its own partition.

hoodie.index.type=GLOBAL_BLOOM

Use case: When record keys may move across partitions.

Trade-off: lookups fan out across every partition, which makes them more expensive, but correctness is preserved when a record’s partition value can change.
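
A related knob worth knowing is hoodie.bloom.index.update.partition.path, which controls what happens when a known key arrives under a new partition path (its default has varied across Hudi releases, so check the docs for your version):

hoodie.index.type=GLOBAL_BLOOM
hoodie.bloom.index.update.partition.path=true

When true, the record is deleted from its old partition and inserted into the new one; when false, the existing record is updated in place under its original partition path.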


HBase Index

Uses Apache HBase as an external store for indexing.

hoodie.index.type=HBASE
hoodie.index.hbase.zkquorum=zk-host
hoodie.index.hbase.zkport=2181
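
You will typically also name the HBase table that backs the index; a minimal sketch, with an illustrative table name:

hoodie.index.hbase.table=hudi_record_index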

Pros:

  • Very fast lookups even with large datasets
  • Ideal for high update frequency

Cons:

  • Requires external HBase cluster
  • Operational overhead

Use in production environments that prioritize low-latency updates.


Simple Index

Joins incoming record keys against the keys read from existing base files, effectively scanning all affected partitions. Best for small datasets or testing.

hoodie.index.type=SIMPLE

Not recommended for large datasets due to poor scalability.


Bucket Index (Stable in Hudi 0.11+)

The bucket index hashes each record key to a fixed number of buckets per partition, similar to bucketing in Hive, so a record’s location is computed rather than looked up.

hoodie.index.type=BUCKET
hoodie.bucket.index.num.buckets=16
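
Two related settings are worth knowing: the field used for hashing, which must be the record key or a subset of it, and the bucket engine. A sketch with illustrative values:

hoodie.bucket.index.hash.field=id
hoodie.index.bucket.engine=SIMPLE

Newer releases also offer a CONSISTENT_HASHING engine that relaxes the fixed-bucket-count constraint, at the cost of extra operational complexity.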

Pros:

  • Deterministic write path
  • Efficient for bulk inserts and streaming

Cons:

  • Less flexible for dynamic partitioning
  • Requires upfront tuning of bucket count

Best for deduplicated streams or when primary key distribution is predictable.


Tips for Optimizing Write Performance

  • Use bulk insert mode (bulk_insert) for initial ingestion, since it skips index lookups entirely (see the sketch after this list)
  • Enable the metadata table for file listing acceleration:
    hoodie.metadata.enable=true

  • Use async compaction on MOR tables (for example, by running the compactor in scheduleAndExecute mode) so that compaction does not block ingestion
  • Cache partition paths when using Bloom:
    hoodie.bloom.index.use.caching=true
    
  • For Spark pipelines, optimize parallelism:
    hoodie.bloom.index.parallelism=300
    hoodie.upsert.shuffle.parallelism=300
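
A bulk-insert sketch for the initial load, combining several of the tips above (paths and columns are placeholders, as in the earlier examples, and historical_df is assumed to exist):

# Initial-load sketch: bulk_insert skips index lookups entirely.
bulk_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.metadata.enable": "true",
    "hoodie.bulkinsert.shuffle.parallelism": "300",
}

(historical_df.write.format("hudi")
    .options(**bulk_options)
    .mode("overwrite")
    .save("s3://my-bucket/hudi/events"))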
    

Choosing the Right Index

Use Case                            Recommended Index
General upsert workload             Bloom
Records updating across partitions  Global Bloom
High-frequency updates              HBase
Streaming pipeline with known skew  Bucket
Testing or small-scale ingestion    Simple

Always benchmark with representative data volumes and write patterns before choosing.


Conclusion

Apache Hudi’s ability to perform efficient upserts and incremental writes relies heavily on the indexing mechanism used. By selecting the right indexing strategy and tuning it properly, you can maximize ingestion throughput, minimize lookup cost, and ensure consistency across your data lake.

Whether you’re building a real-time pipeline or onboarding historical data, Hudi’s flexible indexing framework gives you the tools to fine-tune performance and reliability in high-scale environments.