Apache Hudi is a powerful framework for building data lakehouses with upserts, incremental processing, and time-travel capabilities. While Hudi has traditionally run on YARN or Spark Standalone, running it on Kubernetes offers significant advantages in scalability, cloud-native architecture, and resource efficiency.

In this blog, we explore how to deploy and manage Hudi workloads on Kubernetes, leveraging Spark on K8s, Helm charts, and best practices to support real-time and batch data lake processing.


Why Run Hudi on Kubernetes?

  • Scalability: Easily scale pods and Spark executors on demand
  • Portability: Run on any cloud or on-prem K8s cluster
  • Cost Efficiency: Use autoscaling and spot instances
  • Isolation: Containerized apps prevent resource contention
  • Cloud-native Ops: Seamless CI/CD, monitoring, and rollbacks

Running Hudi on K8s aligns with modern infrastructure standards, especially for hybrid and multi-cloud architectures.


Architecture Overview

To run Hudi on Kubernetes:

  • Apache Spark 3.x with Kubernetes support is required
  • Hudi jobs are submitted using Spark-on-Kubernetes mode
  • Data is stored in S3, GCS, or HDFS
  • Optionally sync table metadata to Hive Metastore, AWS Glue Data Catalog, or another catalog
  • Deploy using kubectl, Helm, or the Spark Operator

[Dockerized Spark App] → [K8s Cluster] → [S3/HDFS]
                              ↘              ↘
                    [Hive Metastore]   [Glue Catalog]

Prerequisites

  • Kubernetes Cluster (EKS, GKE, AKS, or local minikube)
  • Spark 3.1+ built with K8s support and Hudi compatibility
  • Hudi utilities bundle JAR
  • Docker or Podman to build custom Spark images
  • Access to object store (S3, GCS, etc.)

Step 1: Build a Hudi-Compatible Spark Docker Image

Create a Dockerfile:

FROM gcr.io/spark-operator/spark:v3.3.0

# Hudi Spark bundle for Spark DataSource reads/writes
ADD hudi-spark3.3-bundle_2.12-0.14.0.jar /opt/spark/jars/
# Hudi utilities slim bundle providing HoodieDeltaStreamer (used in Step 2)
ADD hudi-utilities-slim-bundle_2.12-0.14.0.jar /opt/spark/jars/

Then build and push:

docker build -t myrepo/spark-hudi:latest .
docker push myrepo/spark-hudi:latest

Step 2: Submit Hudi Jobs on Kubernetes

Use spark-submit with the k8s:// master URL and point it at your custom image:

spark-submit \
  --master k8s://https://<k8s-api-server> \
  --deploy-mode cluster \
  --name hudi-job \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=myrepo/spark-hudi:latest \
  --conf spark.kubernetes.namespace=default \
  --conf spark.hadoop.fs.s3a.access.key=XXX \
  --conf spark.hadoop.fs.s3a.secret.key=XXX \
  local:///opt/spark/jars/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
  --table-type COPY_ON_WRITE \
  --target-base-path s3a://my-datalake/hudi/orders \
  --target-table orders \
  --props /etc/hudi/orders.properties
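
The Hudi write settings come from the --props file referenced above. As a rough sketch, assuming DeltaStreamer's default JSON-on-DFS source (a Kafka source would need different source properties plus a --source-class argument) and illustrative field names, /etc/hudi/orders.properties might contain:

hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.partitionpath.field=order_date
hoodie.datasource.write.precombine.field=updated_at
# Root path polled for new input files by the default DFS source (illustrative)
hoodie.deltastreamer.source.dfs.root=s3a://my-datalake/raw/orders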

Step 3: Deploy via Helm (Optional)

Use Helm for templated deployment:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install spark bitnami/spark \
  --set image.repository=myrepo/spark-hudi \
  --set image.tag=latest \
  --set worker.replicaCount=5

This provisions a standalone Spark cluster inside Kubernetes from a templated chart, simplifying repeatable deployment of Spark/Hudi workloads and exposing chart-level options for worker scaling and monitoring.
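
Assuming the release name spark and the default namespace from the command above, you can verify the workers and rescale the cluster with standard kubectl and Helm commands:

kubectl get pods -l app.kubernetes.io/instance=spark
helm upgrade spark bitnami/spark --reuse-values --set worker.replicaCount=10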


Configurations for Performance

  • Shuffle tuning:
    spark.sql.shuffle.partitions=200
    
  • Resource allocation:
    spark.executor.memory=4g  
    spark.driver.memory=2g  
    spark.executor.cores=2  
    
  • Filesystem compatibility: Ensure correct S3A/GCS/HDFS credentials in spark.hadoop.* properties.
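
Putting these together, a spark-defaults.conf fragment for a Hudi job on S3 might look like the following sketch (the Kryo serializer follows Hudi's general recommendation; the credentials provider is illustrative and depends on how you supply keys):

spark.serializer                              org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions                  200
spark.executor.memory                         4g
spark.driver.memory                           2g
spark.executor.cores                          2
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.DefaultAWSCredentialsProviderChain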

Best Practices

  • Use the Spark Operator for managed, declarative job orchestration (see the SparkApplication sketch after this list)
  • Enable Hudi metadata table for file pruning and faster planning
  • Store checkpoints in object storage or persistent volume
  • Use initContainers to load configuration dynamically
  • Monitor jobs using Prometheus + Grafana, or Spark UI on K8s Dashboard
  • Rotate logs and retain job history with volume mounts
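
As noted in the first bullet, the Spark Operator lets you declare the DeltaStreamer job from Step 2 as a Kubernetes resource instead of running spark-submit by hand. A minimal SparkApplication sketch, assuming the image from Step 1 and an existing spark service account (names and namespace are illustrative):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: hudi-orders-job
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: myrepo/spark-hudi:latest
  sparkVersion: "3.3.0"
  mainClass: org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
  mainApplicationFile: local:///opt/spark/jars/hudi-utilities-slim-bundle_2.12-0.14.0.jar
  arguments:
    - --table-type
    - COPY_ON_WRITE
    - --target-base-path
    - s3a://my-datalake/hudi/orders
    - --target-table
    - orders
    - --props
    - /etc/hudi/orders.properties
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark
  executor:
    instances: 5
    cores: 2
    memory: 4g

Apply it with kubectl apply -f, and the operator handles submission and surfaces job status through kubectl.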

Security Considerations

  • Store credentials in Kubernetes Secrets rather than passing them as plain-text spark-submit flags (see the sketch after this list)
  • Enable TLS encryption between Spark driver and executor pods
  • Use RBAC policies to limit access to Spark/K8s resources
  • Integrate with OIDC or IRSA (for AWS) for fine-grained access
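
For the first bullet, one common pattern is to create a Secret and expose it to the driver and executor pods as environment variables through Spark's secretKeyRef settings, so keys never appear on the spark-submit command line. A sketch, assuming illustrative secret and key names:

kubectl create secret generic s3-creds \
  --from-literal=access-key=XXX \
  --from-literal=secret-key=XXX

Then replace the plain-text fs.s3a.* flags from Step 2 with:

--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=s3-creds:access-key \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=s3-creds:secret-key \
--conf spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=s3-creds:access-key \
--conf spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=s3-creds:secret-key \

S3A's default credential chain can then pick the keys up from the standard AWS environment variables.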

Conclusion

Deploying Apache Hudi on Kubernetes unlocks scalable, elastic, and cloud-native data processing workflows. With containerized Spark jobs, seamless orchestration, and resource isolation, organizations can modernize their ETL pipelines and support real-time lakehouse operations at scale.

Whether you’re building a cloud-native analytics platform or transitioning from legacy YARN infrastructure, integrating Hudi with Kubernetes is a future-proof step toward efficient and modular big data processing.