As organizations modernize their infrastructure, Kubernetes has become the de facto standard for container orchestration. Meanwhile, HDFS (Hadoop Distributed File System) remains a trusted foundation for scalable, high-throughput storage in big data environments.

But what if you want to bring the power of HDFS into your cloud-native workflows?

This post explores how to integrate HDFS with Kubernetes, enabling you to combine Kubernetes agility with HDFS durability and performance — unlocking hybrid and cloud-native big data processing.


Why Integrate HDFS with Kubernetes?

Kubernetes was designed primarily for stateless workloads, while HDFS is a stateful, distributed storage system. Integrating the two allows you to:

  • Run data-intensive applications (e.g., Spark, Hive, Presto) in containers
  • Support persistent, high-throughput storage within Kubernetes
  • Build hybrid cloud and edge storage solutions
  • Migrate legacy big data apps into containerized environments

Deployment Strategies

There are three main approaches to using HDFS with Kubernetes:

  1. Deploy HDFS inside Kubernetes
  2. Access external HDFS from Kubernetes apps
  3. Use CSI drivers to provision HDFS as volumes

Each method has trade-offs in terms of complexity, performance, and fault tolerance.


1. Deploying HDFS on Kubernetes

You can containerize the HDFS components:

  • NameNode
  • DataNode
  • JournalNode (for HA)
  • ZooKeeper (optional, for automatic HA failover)

Use a Kubernetes StatefulSet to manage each node:

Example: namenode-statefulset.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-namenode
spec:
  serviceName: "hdfs-namenode"
  replicas: 1
  selector:
    matchLabels:
      app: hdfs-namenode
  template:
    metadata:
      labels:
        app: hdfs-namenode
    spec:
      containers:
        - name: namenode
          image: hadoop:latest
          ports:
            - containerPort: 8020
            - containerPort: 50070
          volumeMounts:
            - name: hdfs-nn-data
              mountPath: /hadoop/dfs/name
  volumeClaimTemplates:
    - metadata:
        name: hdfs-nn-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi

Repeat a similar setup for the DataNode StatefulSet, pointing its volume mount at the path configured in dfs.datanode.data.dir (a sketch follows below).
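As a rough illustration, a DataNode StatefulSet might look like the following (a minimal sketch: the image name, replica count, port, and mount path are assumptions and should match your own Hadoop image and dfs.datanode.data.dir setting):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode
spec:
  serviceName: "hdfs-datanode"          # assumes a matching headless Service exists
  replicas: 3
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: hadoop:latest          # placeholder; use your own Hadoop/HDFS image
          ports:
            - containerPort: 9866       # DataNode data transfer port (Hadoop 3.x default)
          volumeMounts:
            - name: hdfs-dn-data
              mountPath: /hadoop/dfs/data   # must match dfs.datanode.data.dir
  volumeClaimTemplates:
    - metadata:
        name: hdfs-dn-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi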


2. Accessing External HDFS from Kubernetes

If you already have an HDFS cluster running outside Kubernetes, your pods can access it as clients.

Steps:

  • Add the HDFS config files (core-site.xml, hdfs-site.xml) to your container images or ship them as ConfigMaps
  • Mount them inside pods via a volume (see the sketch after this list)
  • Ensure pods can resolve the HDFS NameNode hostname and reach its RPC/HTTP ports
  • Use hadoop fs or HDFS client libraries in your apps
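For example, the configuration files can be shipped as a ConfigMap and mounted read-only into a client pod (a minimal sketch; the ConfigMap name, mount path, and NameNode address are placeholders for illustration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: hdfs-client-conf
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>
  hdfs-site.xml: |
    <configuration>
      <!-- site-specific HDFS client settings -->
    </configuration>
---
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-client
spec:
  containers:
    - name: client
      image: hadoop:latest              # placeholder; use your Hadoop client image
      command: ["sleep", "infinity"]
      env:
        - name: HADOOP_CONF_DIR         # point Hadoop tooling at the mounted config
          value: /etc/hadoop/conf
      volumeMounts:
        - name: hdfs-conf
          mountPath: /etc/hadoop/conf
          readOnly: true
  volumes:
    - name: hdfs-conf
      configMap:
        name: hdfs-client-conf

From inside the pod, hadoop fs -ls / should then resolve against the external NameNode.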

Useful for:

  • Analytics platforms (e.g., Spark on K8s) reading from centralized HDFS
  • Sharing datasets across cloud and on-prem environments

3. HDFS CSI Driver (Experimental)

A Container Storage Interface (CSI) driver for HDFS allows dynamic provisioning of HDFS-backed Persistent Volumes.

Projects such as Fluid and csi-hdfs aim to bridge this gap.

Example PersistentVolume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hdfs-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: hdfs.csi.hadoop.com
    volumeHandle: hdfs-volume
    volumeAttributes:
      hdfs.path: "/user/data"
      hdfs.namenode: "hdfs://namenode-service:8020"
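To consume such a volume, a workload binds a PersistentVolumeClaim to it. A sketch, assuming static binding to the hdfs-pv above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hdfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""          # empty class so the claim binds to the pre-created PV
  volumeName: hdfs-pv
  resources:
    requests:
      storage: 500Gi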

Note: CSI for HDFS is still evolving and may require customization.


Networking and Configuration Considerations

  • Use headless Services to expose StatefulSets (see the example after this list)
  • Set the HADOOP_CONF_DIR environment variable inside containers
  • Use ConfigMaps to inject core-site.xml and hdfs-site.xml
  • Back NameNode and DataNode data directories with persistent volumes
  • Enable Kerberos authentication in secure environments
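For reference, a headless Service matching the NameNode StatefulSet above could look like this (a minimal sketch; the port numbers assume the defaults used earlier):

apiVersion: v1
kind: Service
metadata:
  name: hdfs-namenode
spec:
  clusterIP: None               # headless: gives each StatefulSet pod a stable DNS name
  selector:
    app: hdfs-namenode
  ports:
    - name: rpc
      port: 8020
    - name: http
      port: 50070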

Monitoring and Maintenance

Use Prometheus-compatible exporters and sidecars:

  • JMX Exporter for HDFS metrics
  • Integrate with Grafana, Datadog, or OpenTelemetry

Logs can be routed via Fluentd or Logstash to ELK or Loki for observability.
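As one example, the Prometheus JMX Exporter can be attached to the NameNode as a Java agent via an environment variable in its container spec (a sketch, not a complete manifest; the agent path, port, and config file are assumptions, and HDFS_NAMENODE_OPTS applies to Hadoop 3, with older versions using HADOOP_NAMENODE_OPTS):

# Fragment of the NameNode container spec: expose JMX metrics on port 7071
env:
  - name: HDFS_NAMENODE_OPTS
    value: "-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/jmx-exporter.yaml"
ports:
  - containerPort: 7071         # Prometheus scrape target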


Real-World Use Cases

  1. Kubernetes-native Spark on HDFS
    Run Spark jobs in K8s using HDFS as the backend storage

  2. Hybrid Cloud Data Lakes
    Keep active data on-prem HDFS, archive to S3 — all accessed from K8s

  3. Edge-to-Core Pipelines
    Use lightweight HDFS clusters on edge nodes (e.g., K3s) and sync to central HDFS

  4. CI/CD for Data Pipelines
    Build test environments with ephemeral HDFS clusters inside Kubernetes for safe iteration


Best Practices

  • Use StatefulSets with volumeClaimTemplates for HDFS persistence
  • Spread DataNodes across availability zones to improve fault tolerance
  • Treat HDFS as an external service when you want simpler cluster operations
  • Backup critical NameNode metadata regularly
  • Use anti-affinity rules to spread DataNodes across nodes (see the sketch after this list)
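For example, a podAntiAffinity rule in the DataNode pod template keeps replicas on separate nodes (a sketch, using the app: hdfs-datanode label assumed earlier):

# Goes under spec.template.spec of the DataNode StatefulSet
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: hdfs-datanode
        topologyKey: kubernetes.io/hostname   # at most one DataNode per node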

Conclusion

Bringing HDFS into Kubernetes environments opens the door to cloud-native big data workflows. Whether you’re running Spark, Presto, Hive, or ML pipelines, HDFS provides the high-throughput storage layer many applications need.

By choosing the right integration method — internal deployment, external access, or CSI — you can modernize your storage architecture while retaining the scalability and durability HDFS is known for.