Integrating HDFS with Kubernetes for Cloud Native Applications
Learn how to run HDFS in Kubernetes environments to support scalable cloud-native big data workloads
As organizations modernize their infrastructure, Kubernetes has become the de facto standard for container orchestration. Meanwhile, HDFS (Hadoop Distributed File System) remains a trusted foundation for scalable, high-throughput storage in big data environments.
But what if you want to bring the power of HDFS into your cloud-native workflows?
This post explores how to integrate HDFS with Kubernetes, enabling you to combine Kubernetes agility with HDFS durability and performance — unlocking hybrid and cloud-native big data processing.
Why Integrate HDFS with Kubernetes?
Kubernetes is designed for stateless workloads, while HDFS is a stateful, distributed storage system. Integrating the two allows you to:
- Run data-intensive applications (e.g., Spark, Hive, Presto) in containers
- Support persistent, high-throughput storage within Kubernetes
- Build hybrid cloud and edge storage solutions
- Migrate legacy big data apps into containerized environments
Deployment Strategies
There are three main approaches to using HDFS with Kubernetes:
- Deploy HDFS inside Kubernetes
- Access external HDFS from Kubernetes apps
- Use CSI drivers to provision HDFS as volumes
Each method has trade-offs in terms of complexity, performance, and fault tolerance.
1. Deploying HDFS on Kubernetes
You can containerize the HDFS components:
- NameNode
- DataNode
- JournalNode (for HA)
- ZooKeeper (optional, needed for automatic HA failover)
Use a Kubernetes StatefulSet to manage each node:
Example: namenode-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-namenode
spec:
  serviceName: "hdfs-namenode"
  replicas: 1
  selector:
    matchLabels:
      app: hdfs-namenode
  template:
    metadata:
      labels:
        app: hdfs-namenode
    spec:
      containers:
        - name: namenode
          image: hadoop:latest          # replace with your organization's Hadoop image
          ports:
            - containerPort: 8020       # NameNode RPC
            - containerPort: 50070      # NameNode web UI (9870 in Hadoop 3.x)
          volumeMounts:
            - name: hdfs-nn-data
              mountPath: /hadoop/dfs/name   # dfs.namenode.name.dir
  volumeClaimTemplates:
    - metadata:
        name: hdfs-nn-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
Repeat a similar setup for the DataNode StatefulSet, mounting persistent volumes at the directories configured in dfs.datanode.data.dir. A minimal sketch follows.
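For illustration only: the image name, replica count, port, and storage size below are placeholders to adapt to your cluster, and the pod expects a matching headless Service.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode
spec:
  serviceName: "hdfs-datanode"          # expose via a matching headless Service
  replicas: 3                           # scale to match your capacity needs
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: hadoop:latest          # replace with your Hadoop image
          ports:
            - containerPort: 9866       # DataNode data transfer (Hadoop 3.x)
          volumeMounts:
            - name: hdfs-dn-data
              mountPath: /hadoop/dfs/data   # dfs.datanode.data.dir
  volumeClaimTemplates:
    - metadata:
        name: hdfs-dn-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi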
2. Accessing External HDFS from Kubernetes
If you already have an HDFS cluster outside Kubernetes, you can mount it as a client in your pods.
Steps:
- Add the HDFS config files (core-site.xml, hdfs-site.xml) to your container images, or distribute them as ConfigMaps (a sketch follows this list)
- Mount them inside pods via a volume
- Ensure pods can resolve and reach the HDFS NameNode hostname and ports
- Use hadoop fs or HDFS client libraries in your apps
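As a minimal sketch, assuming an external NameNode reachable at namenode.example.com:8020 (a placeholder address), the client configuration could be injected like this; names and mount paths are illustrative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hdfs-client-config
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <!-- placeholder NameNode address -->
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>
---
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-client
spec:
  containers:
    - name: client
      image: hadoop:latest              # replace with your Hadoop client image
      command: ["sleep", "infinity"]
      env:
        - name: HADOOP_CONF_DIR
          value: /etc/hadoop/conf       # points Hadoop tooling at the mounted config
      volumeMounts:
        - name: hdfs-config
          mountPath: /etc/hadoop/conf
  volumes:
    - name: hdfs-config
      configMap:
        name: hdfs-client-config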
Useful for:
- Analytics platforms (e.g., Spark on K8s) reading from centralized HDFS
- Sharing datasets across cloud and on-prem environments
3. HDFS CSI Driver (Experimental)
A Container Storage Interface (CSI) driver for HDFS allows dynamic provisioning of HDFS-backed Persistent Volumes.
Projects like Fluid and csi-hdfs aim to bridge this gap.
Example PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hdfs-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: hdfs.csi.hadoop.com
    volumeHandle: hdfs-volume
    volumeAttributes:
      hdfs.path: "/user/data"
      hdfs.namenode: "hdfs://namenode-service:8020"
Note: CSI for HDFS is still evolving and may require customization.
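To consume such a volume, a claim and pod mount would look roughly like the sketch below. This shows static binding to the hdfs-pv above; whether dynamic provisioning or a StorageClass is supported depends on the specific driver.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hdfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""            # bind directly to the pre-created hdfs-pv
  volumeName: hdfs-pv
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-consumer
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: hdfs-data
          mountPath: /data          # HDFS path surfaced as a filesystem mount
  volumes:
    - name: hdfs-data
      persistentVolumeClaim:
        claimName: hdfs-pvc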
Networking and Configuration Considerations
- Use headless Services to expose StatefulSets with stable per-pod DNS names (a sketch follows this list)
- Set the HADOOP_CONF_DIR environment variable inside containers
- Use ConfigMaps to inject core-site.xml and hdfs-site.xml
- Ensure persistent volumes back the NameNode and DataNode data directories
- Enable Kerberos authentication in secure environments
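For example, a headless Service matching the NameNode StatefulSet above might look like this (the ports mirror the StatefulSet and depend on your Hadoop version):
apiVersion: v1
kind: Service
metadata:
  name: hdfs-namenode
spec:
  clusterIP: None             # headless: gives each pod a stable DNS name
  selector:
    app: hdfs-namenode
  ports:
    - name: rpc
      port: 8020
    - name: http
      port: 50070             # 9870 for the web UI in Hadoop 3.x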
Monitoring and Maintenance
Use Prometheus-compatible exporters and sidecars:
- JMX Exporter for NameNode and DataNode metrics (a sidecar sketch follows below)
- Integrate with Grafana, Datadog, or OpenTelemetry
Logs can be routed via Fluentd or Logstash to ELK or Loki for observability.
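One common pattern, sketched below with a placeholder exporter image, is to run a metrics container alongside the NameNode and let Prometheus scrape it; the annotation-based discovery shown here only works if your Prometheus configuration honours those annotations.
# Fragment of the NameNode StatefulSet pod template: a metrics sidecar plus
# scrape annotations (exporter image and port are placeholders).
template:
  metadata:
    labels:
      app: hdfs-namenode
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "9100"
  spec:
    containers:
      - name: namenode
        image: hadoop:latest                            # replace with your Hadoop image
      - name: jmx-exporter
        image: example/jmx-prometheus-exporter:latest   # placeholder exporter image
        ports:
          - containerPort: 9100                         # Prometheus scrape endpoint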
Real-World Use Cases
- Kubernetes-native Spark on HDFS: run Spark jobs on K8s using HDFS as the backing storage
- Hybrid Cloud Data Lakes: keep active data on on-prem HDFS and archive to S3, all accessed from K8s
- Edge-to-Core Pipelines: run lightweight HDFS clusters on edge nodes (e.g., K3s) and sync to a central HDFS
- CI/CD for Data Pipelines: build test environments with ephemeral HDFS clusters inside Kubernetes for safe iteration
Best Practices
- Use StatefulSets with volumeClaimTemplates for HDFS persistence
- Isolate DataNodes per availability zone to improve fault tolerance
- Mount HDFS as an external service for simpler cluster ops
- Backup critical NameNode metadata regularly
- Use anti-affinity rules to spread DataNodes across worker nodes (see the snippet below)
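For example, a preferred anti-affinity rule in the DataNode pod template keeps replicas on separate worker nodes where possible; the label follows the DataNode sketch earlier in this post.
# Fragment of the DataNode pod template spec
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: hdfs-datanode
          topologyKey: kubernetes.io/hostname   # spread DataNodes across worker nodes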
Conclusion
Bringing HDFS into Kubernetes environments opens the door to cloud-native big data workflows. Whether you’re running Spark, Presto, Hive, or ML pipelines, HDFS provides the high-throughput storage layer many applications need.
By choosing the right integration method — internal deployment, external access, or CSI — you can modernize your storage architecture while retaining the scalability and durability HDFS is known for.