Apache Hudi has become a cornerstone of modern data lakes and lakehouse architectures by supporting incremental ingestion, upserts, streaming ETL, and record-level updates on top of HDFS or cloud object stores.

But writing to Hudi, especially in upsert-heavy workloads, can become a bottleneck if not configured properly. One of the most effective ways to boost Hudi’s write performance is by using the right indexing strategy.

In this blog, we’ll explore various Hudi indexing techniques, how they work, and how to tune them for better write throughput and consistency.


Why Indexing Matters in Hudi

During an upsert, Hudi must map each incoming record key to the file group that already contains it, or conclude the record is new. This tagging step is what the index performs, and its efficiency directly bounds write throughput.

Without an optimized index, every write may trigger:

  • Excessive file lookups
  • Slower commit times
  • Higher memory usage
  • Performance degradation under large datasets
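
To make this concrete, here is a minimal PySpark upsert sketch. The table path and the id/ts/dt column names are placeholders, and df is assumed to be an existing DataFrame with those columns; the record key field is what the index must resolve to an existing file group on every upsert.

# Minimal upsert sketch (illustrative; the path and the id/ts/dt columns are placeholders).
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",     # the key the index must locate
    "hoodie.datasource.write.precombine.field": "ts",    # newest record wins on key collisions
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",       # triggers the index lookup
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/events"))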

Types of Indexes in Apache Hudi

Apache Hudi supports multiple indexing types, each optimized for different workloads.

Index Type          Lookup Cost  Write Cost  Update Performance  Use Case
Bloom               Medium       Low         Good                Default, general use
HBase               Low          High        Excellent           High-frequency updates
Simple              High         Low         Poor                Small datasets
Global Bloom        Medium       Medium      Fair                Updates across partitions
Bucket              Low          Medium      Excellent           Pre-partitioned writes
Record-Level Index  Varies       Varies      Varies              Advanced use cases (Beta)

Let’s explore the most commonly used ones in more detail.


Bloom Index (Default)

The Bloom index is the default and most widely used. It leverages Bloom filters stored in the footers of base (Parquet) files, together with key-range pruning, to shortlist the files that might contain a given record key.

hoodie.index.type=BLOOM
hoodie.bloom.index.parallelism=200
hoodie.bloom.index.filter.type=DYNAMIC_V0
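
In a Spark pipeline these properties are passed as per-write options. A sketch building on the hudi_options dict from the upsert example above (the values are illustrative, not tuned recommendations):

# Bloom index tuning as per-write options (values illustrative).
hudi_options.update({
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.parallelism": "200",
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
})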

Pros:

  • No external system required
  • Fast lookups for moderate-scale datasets

Cons:

  • Slower on large datasets with wide partitions
  • False positives may slightly increase update overhead

Tuning Tips:

  • Raise hoodie.bloom.index.parallelism to match the parallelism of your write stage
  • Reduce false positives with hoodie.bloom.index.filter.type=DYNAMIC_V0, which sizes filters to the number of keys actually written per file

Global Bloom Index

The Global Bloom index extends the regular Bloom index to enforce key uniqueness across all partitions: each incoming key is looked up table-wide rather than only within its own partition.

hoodie.index.type=GLOBAL_BLOOM

Use case: When record keys may move across partitions.

Trade-off: lookups fan out across every partition, which makes them more expensive, but correctness is preserved when a record’s partition value can change.
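
A related knob worth knowing is hoodie.bloom.index.update.partition.path, which controls what happens when a known key arrives under a new partition path (its default has varied across Hudi releases, so check the docs for your version):

hoodie.index.type=GLOBAL_BLOOM
hoodie.bloom.index.update.partition.path=true

When true, the record is deleted from its old partition and inserted into the new one; when false, the existing record is updated in place under its original partition path.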


HBase Index

Uses Apache HBase as an external store for indexing.

hoodie.index.type=HBASE
hoodie.index.hbase.zkquorum=zk-host
hoodie.index.hbase.zkport=2181
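
You will typically also name the HBase table that backs the index; a minimal sketch, with an illustrative table name:

hoodie.index.hbase.table=hudi_record_index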

Pros:

  • Very fast lookups even with large datasets
  • Ideal for high update frequency

Cons:

  • Requires external HBase cluster
  • Operational overhead

Use in production environments that prioritize low-latency updates.


Simple Index

Joins incoming record keys against the keys read from existing base files, effectively scanning all affected partitions. Best for small datasets or testing.

hoodie.index.type=SIMPLE

Not recommended for large datasets due to poor scalability.


Bucket Index (Stable in Hudi 0.11+)

The bucket index hashes each record key to a fixed number of buckets per partition, similar to bucketing in Hive, so a record’s location is computed rather than looked up.

hoodie.index.type=BUCKET
hoodie.bucket.index.num.buckets=16
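
Two related settings are worth knowing: the field used for hashing, which must be the record key or a subset of it, and the bucket engine. A sketch with illustrative values:

hoodie.bucket.index.hash.field=id
hoodie.index.bucket.engine=SIMPLE

Newer releases also offer a CONSISTENT_HASHING engine that relaxes the fixed-bucket-count constraint, at the cost of extra operational complexity.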

Pros:

  • Deterministic write path
  • Efficient for bulk inserts and streaming

Cons:

  • Less flexible for dynamic partitioning
  • Requires upfront tuning of bucket count

Best for deduplicated streams or when primary key distribution is predictable.


Tips for Optimizing Write Performance

  • Use bulk insert mode (bulk_insert) for initial ingestion, since it skips index lookups entirely (see the sketch after this list)
  • Enable the metadata table for file listing acceleration:
    hoodie.metadata.enable=true

  • Use async compaction on MOR tables (for example, by running the compactor in scheduleAndExecute mode) so that compaction does not block ingestion
  • Cache partition paths when using Bloom:
    hoodie.bloom.index.use.caching=true
    
  • For Spark pipelines, optimize parallelism:
    hoodie.bloom.index.parallelism=300
    hoodie.upsert.shuffle.parallelism=300
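
A bulk-insert sketch for the initial load, combining several of the tips above (paths and columns are placeholders, as in the earlier examples, and historical_df is assumed to exist):

# Initial-load sketch: bulk_insert skips index lookups entirely.
bulk_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.metadata.enable": "true",
    "hoodie.bulkinsert.shuffle.parallelism": "300",
}

(historical_df.write.format("hudi")
    .options(**bulk_options)
    .mode("overwrite")
    .save("s3://my-bucket/hudi/events"))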
    

Choosing the Right Index

Use Case                            Recommended Index
General upsert workload             Bloom
Records updating across partitions  Global Bloom
High-frequency updates              HBase
Streaming pipeline with known skew  Bucket
Testing or small-scale ingestion    Simple

Always benchmark with representative data volumes and write patterns before choosing.


Conclusion

Apache Hudi’s ability to perform efficient upserts and incremental writes relies heavily on the indexing mechanism used. By selecting the right indexing strategy and tuning it properly, you can maximize ingestion throughput, minimize lookup cost, and ensure consistency across your data lake.

Whether you’re building a real-time pipeline or onboarding historical data, Hudi’s flexible indexing framework gives you the tools to fine-tune performance and reliability in high-scale environments.