Integrating HDFS with Kubernetes for Cloud Native Applications
Learn how to run HDFS in Kubernetes environments to support scalable cloud-native big data workloads
As organizations modernize their infrastructure, Kubernetes has become the de facto standard for container orchestration. Meanwhile, HDFS (Hadoop Distributed File System) remains a trusted foundation for scalable, high-throughput storage in big data environments.
But what if you want to bring the power of HDFS into your cloud-native workflows?
This post explores how to integrate HDFS with Kubernetes, enabling you to combine Kubernetes agility with HDFS durability and performance — unlocking hybrid and cloud-native big data processing.
Why Integrate HDFS with Kubernetes?
Kubernetes is designed for stateless workloads, while HDFS is a stateful, distributed storage system. Integrating the two allows you to:
- Run data-intensive applications (e.g., Spark, Hive, Presto) in containers
- Support persistent, high-throughput storage within Kubernetes
- Build hybrid cloud and edge storage solutions
- Migrate legacy big data apps into containerized environments
Deployment Strategies
There are three main approaches to using HDFS with Kubernetes:
- Deploy HDFS inside Kubernetes
- Access external HDFS from Kubernetes apps
- Use CSI drivers to provision HDFS as volumes
Each method has trade-offs in terms of complexity, performance, and fault tolerance.
1. Deploying HDFS on Kubernetes
You can containerize the HDFS components:
- NameNode
- DataNode
- JournalNode (for HA)
- ZooKeeper (optional, needed for automatic HA failover)
Use a Kubernetes StatefulSet to manage each node:
Example: namenode-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-namenode
spec:
  serviceName: "hdfs-namenode"
  replicas: 1
  selector:
    matchLabels:
      app: hdfs-namenode
  template:
    metadata:
      labels:
        app: hdfs-namenode
    spec:
      containers:
        - name: namenode
          image: hadoop:latest          # replace with your organization's Hadoop image
          ports:
            - containerPort: 8020       # NameNode RPC
            - containerPort: 50070      # NameNode web UI (9870 in Hadoop 3.x)
          volumeMounts:
            - name: hdfs-nn-data
              mountPath: /hadoop/dfs/name   # dfs.namenode.name.dir
  volumeClaimTemplates:
    - metadata:
        name: hdfs-nn-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
Repeat a similar setup for the DataNode StatefulSet, mounting persistent volumes at the directories configured in dfs.datanode.data.dir. A minimal sketch follows.
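For illustration only: the image name, replica count, port, and storage size below are placeholders to adapt to your cluster, and the pod expects a matching headless Service.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode
spec:
  serviceName: "hdfs-datanode"          # expose via a matching headless Service
  replicas: 3                           # scale to match your capacity needs
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: hadoop:latest          # replace with your Hadoop image
          ports:
            - containerPort: 9866       # DataNode data transfer (Hadoop 3.x)
          volumeMounts:
            - name: hdfs-dn-data
              mountPath: /hadoop/dfs/data   # dfs.datanode.data.dir
  volumeClaimTemplates:
    - metadata:
        name: hdfs-dn-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi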
2. Accessing External HDFS from Kubernetes
If you already have an HDFS cluster outside Kubernetes, you can mount it as a client in your pods.
Steps:
- Add the HDFS config files (core-site.xml, hdfs-site.xml) to your container images, or distribute them as ConfigMaps (a sketch follows this list)
- Mount them inside pods via a volume
- Ensure pods can resolve and reach the HDFS NameNode hostname and ports
- Use hadoop fs or HDFS client libraries in your apps
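As a minimal sketch, assuming an external NameNode reachable at namenode.example.com:8020 (a placeholder address), the client configuration could be injected like this; names and mount paths are illustrative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hdfs-client-config
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <!-- placeholder NameNode address -->
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>
---
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-client
spec:
  containers:
    - name: client
      image: hadoop:latest              # replace with your Hadoop client image
      command: ["sleep", "infinity"]
      env:
        - name: HADOOP_CONF_DIR
          value: /etc/hadoop/conf       # points Hadoop tooling at the mounted config
      volumeMounts:
        - name: hdfs-config
          mountPath: /etc/hadoop/conf
  volumes:
    - name: hdfs-config
      configMap:
        name: hdfs-client-config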
Useful for:
- Analytics platforms (e.g., Spark on K8s) reading from centralized HDFS
- Sharing datasets across cloud and on-prem environments
3. HDFS CSI Driver (Experimental)
A Container Storage Interface (CSI) driver for HDFS allows dynamic provisioning of HDFS-backed Persistent Volumes.
Projects like Fluid and csi-hdfs aim to bridge this gap.
Example PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hdfs-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: hdfs.csi.hadoop.com
    volumeHandle: hdfs-volume
    volumeAttributes:
      hdfs.path: "/user/data"
      hdfs.namenode: "hdfs://namenode-service:8020"
Note: CSI for HDFS is still evolving and may require customization.
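To consume such a volume, a claim and pod mount would look roughly like the sketch below. This shows static binding to the hdfs-pv above; whether dynamic provisioning or a StorageClass is supported depends on the specific driver.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hdfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""            # bind directly to the pre-created hdfs-pv
  volumeName: hdfs-pv
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-consumer
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: hdfs-data
          mountPath: /data          # HDFS path surfaced as a filesystem mount
  volumes:
    - name: hdfs-data
      persistentVolumeClaim:
        claimName: hdfs-pvc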
Networking and Configuration Considerations
- Use headless Services to expose StatefulSets with stable per-pod DNS names (a sketch follows this list)
- Set the HADOOP_CONF_DIR environment variable inside containers
- Use ConfigMaps to inject core-site.xml and hdfs-site.xml
- Ensure persistent volumes back the NameNode and DataNode data directories
- Enable Kerberos authentication in secure environments
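For example, a headless Service matching the NameNode StatefulSet above might look like this (the ports mirror the StatefulSet and depend on your Hadoop version):
apiVersion: v1
kind: Service
metadata:
  name: hdfs-namenode
spec:
  clusterIP: None             # headless: gives each pod a stable DNS name
  selector:
    app: hdfs-namenode
  ports:
    - name: rpc
      port: 8020
    - name: http
      port: 50070             # 9870 for the web UI in Hadoop 3.x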
Monitoring and Maintenance
Use Prometheus-compatible exporters and sidecars:
- JMX Exporter for NameNode and DataNode metrics (a sidecar sketch follows below)
- Integrate with Grafana, Datadog, or OpenTelemetry
Logs can be routed via Fluentd or Logstash to ELK or Loki for observability.
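One common pattern, sketched below with a placeholder exporter image, is to run a metrics container alongside the NameNode and let Prometheus scrape it; the annotation-based discovery shown here only works if your Prometheus configuration honours those annotations.
# Fragment of the NameNode StatefulSet pod template: a metrics sidecar plus
# scrape annotations (exporter image and port are placeholders).
template:
  metadata:
    labels:
      app: hdfs-namenode
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "9100"
  spec:
    containers:
      - name: namenode
        image: hadoop:latest                            # replace with your Hadoop image
      - name: jmx-exporter
        image: example/jmx-prometheus-exporter:latest   # placeholder exporter image
        ports:
          - containerPort: 9100                         # Prometheus scrape endpoint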
Real-World Use Cases
- Kubernetes-native Spark on HDFS: run Spark jobs on K8s using HDFS as the backing storage
- Hybrid Cloud Data Lakes: keep active data on on-prem HDFS and archive to S3, all accessed from K8s
- Edge-to-Core Pipelines: run lightweight HDFS clusters on edge nodes (e.g., K3s) and sync to a central HDFS
- CI/CD for Data Pipelines: build test environments with ephemeral HDFS clusters inside Kubernetes for safe iteration
Best Practices
- Use StatefulSets with volumeClaimTemplates for HDFS persistence
- Isolate DataNodes per availability zone to improve fault tolerance
- Mount HDFS as an external service for simpler cluster ops
- Backup critical NameNode metadata regularly
- Use anti-affinity rules to spread DataNodes across worker nodes (see the snippet below)
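For example, a preferred anti-affinity rule in the DataNode pod template keeps replicas on separate worker nodes where possible; the label follows the DataNode sketch earlier in this post.
# Fragment of the DataNode pod template spec
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: hdfs-datanode
          topologyKey: kubernetes.io/hostname   # spread DataNodes across worker nodes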
Conclusion
Bringing HDFS into Kubernetes environments opens the door to cloud-native big data workflows. Whether you’re running Spark, Presto, Hive, or ML pipelines, HDFS provides the high-throughput storage layer many applications need.
By choosing the right integration method — internal deployment, external access, or CSI — you can modernize your storage architecture while retaining the scalability and durability HDFS is known for.