Apache Hudi brings powerful transactional capabilities to modern data lakes, enabling real-time ingestion, updates, deletes, and time-travel queries on top of otherwise immutable storage such as Amazon S3, HDFS, or Azure Data Lake Storage (ADLS). But successfully scaling Hudi in a production environment requires thoughtful architecture, resource management, and configuration tuning.

This guide outlines best practices for scaling Apache Hudi in production, covering ingestion performance, metadata management, compaction optimization, and reliability strategies to build robust lakehouse platforms at scale.


1. Choose the Right Table Type: COW vs MOR

Table Type            When to Use
Copy-on-Write (COW)   Frequent reads, low update rates
Merge-on-Read (MOR)   High ingestion rates, frequent upserts/deletes
  • Use COW for Athena, Presto, and low-latency querying
  • Use MOR for streaming pipelines, high-write workloads, and large updates; the table type is set at write time (see the sketch below)
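The table type is fixed when the table is first written and is hard to change later. A minimal PySpark sketch, assuming an active Spark session, an input DataFrame df, and placeholder table, key, and path names:

hudi_options = {
    "hoodie.table.name": "orders",                              # placeholder table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",      # or "COPY_ON_WRITE"
    "hoodie.datasource.write.recordkey.field": "order_id",      # placeholder record key
    "hoodie.datasource.write.precombine.field": "updated_at",   # placeholder ordering field
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://data-lake/hudi/orders")

Later sketches in this guide reuse hudi_options as the base write configuration.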

2. Optimize Partitioning Strategy

  • Use low-cardinality columns like date, region, or event_type
  • Avoid over-partitioning (e.g., using user_id or UUID)
  • Enable Hive-style partitioning:
hoodie.datasource.write.hive_style_partitioning=true

Partition pruning improves query speed and reduces scan overhead.
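As a sketch, partitioning the same orders table by an event_date column with Hive-style paths (reusing hudi_options from the previous example; field names are illustrative):

df.write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.write.partitionpath.field", "event_date") \
    .option("hoodie.datasource.write.hive_style_partitioning", "true") \
    .mode("append") \
    .save("s3://data-lake/hudi/orders")

This yields partition paths such as event_date=2024-04-01/, which query engines can prune against.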


3. Tune File Sizes to Avoid Small File Problems

  • Set minimum file size limits:
    hoodie.parquet.small.file.limit=134217728  # 128MB
    
  • Enable file size-based clustering:
    hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy
    
  • Cluster periodically or inline:
    hoodie.clustering.inline=true
    hoodie.clustering.inline.max.commits=4
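These settings can be passed together as write options. A sketch reusing hudi_options from the first example, with the values shown above:

clustering_options = {
    "hoodie.parquet.small.file.limit": "134217728",   # treat files under 128 MB as small
    "hoodie.clustering.inline": "true",               # run clustering as part of the write
    "hoodie.clustering.inline.max.commits": "4",      # cluster every 4 commits
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .options(**clustering_options) \
    .mode("append") \
    .save("s3://data-lake/hudi/orders")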
    

4. Manage Compaction and Cleaning Efficiently

For MOR tables:

  • Use asynchronous compaction in high-throughput scenarios (see the streaming sketch after this list):
    hoodie.compact.inline=false
    hoodie.compact.schedule.inline=true
    
  • Tune the delta-commit threshold that triggers compaction scheduling:
    hoodie.compact.inline.max.delta.commits=5
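With the Spark Structured Streaming sink, compaction can instead run asynchronously inside the same application. A sketch assuming a streaming DataFrame stream_df and placeholder paths; the async-compaction option name follows recent Hudi releases and should be verified for your version:

# Option names below are assumptions to verify against your Hudi version's docs.
stream_df.writeStream.format("hudi") \
    .option("hoodie.table.name", "orders") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.compact.inline", "false") \
    .option("hoodie.datasource.compaction.async.enable", "true") \
    .option("checkpointLocation", "s3://data-lake/checkpoints/orders") \
    .outputMode("append") \
    .start("s3://data-lake/hudi/orders")

If you prefer to execute compaction in a separate job, keep hoodie.compact.schedule.inline=true so the streaming writer only schedules compaction plans.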
    

For both COW and MOR:

  • Clean older versions and rollback files:
    hoodie.cleaner.policy=KEEP_LATEST_COMMITS
    hoodie.cleaner.commits.retained=10
    

5. Enable and Maintain the Metadata Table

The metadata table speeds up file listing and compaction by avoiding direct storage scans.

Enable metadata:

hoodie.metadata.enable=true
hoodie.metadata.compact.max.delta.commits=10
hoodie.metadata.cleaner.commits.retained=3

Monitor metadata table health with the Hudi CLI or logs.


6. Leverage Incremental Querying

Use incremental pull mode to improve ETL efficiency:

df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240401000000") \
    .load("s3://data-lake/hudi/orders")

  • Track the last consumed commit time in a small state/metadata store and pass it as the next run's begin instant (see the sketch below)
  • Use in streaming or DAG-based workflows (Airflow, Spark, Glue)
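One simple way to obtain that watermark is from Hudi's _hoodie_commit_time meta column; where you persist it (a state table, a parameter store, an Airflow variable) is left to your orchestration layer:

from pyspark.sql import functions as F

# Latest committed instant across the table, to feed into the next incremental read.
latest_commit = spark.read.format("hudi") \
    .load("s3://data-lake/hudi/orders") \
    .agg(F.max("_hoodie_commit_time")) \
    .collect()[0][0]

On very large tables, reading the latest instant from the timeline (for example via the Hudi CLI) avoids scanning the data files.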

7. Scale Write Throughput with Parallelism

  • Increase Spark parallelism settings:
spark.sql.shuffle.partitions=400
hoodie.bulkinsert.shuffle.parallelism=400
hoodie.upsert.shuffle.parallelism=400
  • Use bulk_insert for initial loads or backfills (see the sketch below)
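A backfill sketch using bulk_insert, reusing hudi_options from earlier; the sort mode and parallelism values are illustrative:

df.write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.write.operation", "bulk_insert") \
    .option("hoodie.bulkinsert.shuffle.parallelism", "400") \
    .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT") \
    .mode("append") \
    .save("s3://data-lake/hudi/orders")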

8. Monitor, Audit, and Debug Effectively

  • Use Hudi CLI for inspecting timelines, commits, compactions, and rollbacks
  • Export Hudi metrics via Prometheus and visualize in Grafana (see the config sketch after this list)
  • Track:
    • commit.duration
    • upsert.records.count
    • compaction.duration
    • metadata.table.size
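To ship these metrics to a Prometheus Pushgateway, the writer can be configured roughly as follows; the property names follow recent Hudi releases and the host is a placeholder to adapt to your setup:

hoodie.metrics.on=true
hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY
hoodie.metrics.pushgateway.host=pushgateway.internal.example
hoodie.metrics.pushgateway.port=9091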

Set up alerts for:

  • Failed commits
  • Lagging compactions
  • Growing number of small files

9. Ensure Compatibility with Downstream Query Engines

  • Use Hive sync to register tables in the Hive Metastore, AWS Glue Data Catalog, or other external catalogs (see the sketch below):
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=hms  # hms, jdbc, or hiveql; for AWS Glue, point hms at a Glue-backed metastore
hoodie.datasource.hive_sync.partition_fields=event_date
  • Use COW tables for the broadest Athena/Presto compatibility; for MOR, read-optimized queries only reflect compacted base files, and snapshot queries need engines with MOR snapshot support
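A sketch of enabling Hive sync as part of the Spark write, reusing hudi_options; the database and table names are placeholders:

df.write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.hive_sync.enable", "true") \
    .option("hoodie.datasource.hive_sync.mode", "hms") \
    .option("hoodie.datasource.hive_sync.database", "analytics") \
    .option("hoodie.datasource.hive_sync.table", "orders") \
    .option("hoodie.datasource.hive_sync.partition_fields", "event_date") \
    .mode("append") \
    .save("s3://data-lake/hudi/orders")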

10. Secure and Govern Your Data Lake

  • Enable AWS IAM or Ranger-based access control
  • Enable audit logging via Spark/Glue logs and SIEM integrations
  • Mask or encrypt sensitive columns before writing to Hudi

Use tags or metadata (e.g., JSON schema with classifications) to enforce governance policies.


Conclusion

Scaling Apache Hudi in production is not just about ingesting data faster—it’s about maintaining consistency, ensuring reliability, and optimizing performance across workloads. With the right configurations, monitoring tools, and architecture choices, Hudi enables modern data lakehouses to handle billions of records, support real-time updates, and serve critical analytics with confidence.

Apply these best practices to ensure your Hudi pipelines remain resilient, efficient, and production-grade as your data footprint grows.