In the world of real-time analytics, two open-source tools reign supreme: Apache Kafka for message streaming and Apache Spark for in-memory processing. Combine them with Docker, and you unlock an agile, scalable environment for building and testing data pipelines with consistency across machines and stages.

This post will guide you through using Dockerized Spark and Kafka to create real-time data processing pipelines, including key configurations, architecture, Docker Compose setup, and performance tips.


Why Use Docker for Real-Time Processing?

Using Docker for tools like Kafka and Spark offers:

  • Fast and repeatable development environments
  • Simplified local testing of distributed systems
  • Easy CI/CD integration for data pipelines
  • Portability across dev, staging, and prod

With Docker Compose or Kubernetes, you can emulate large-scale event-driven architectures without complex bare-metal setups.


Architecture Overview

[ Producers ] → [ Kafka (Docker) ] → [ Spark Structured Streaming (Docker) ] → [ Sink: DB / Data Lake / Dashboard ]
  • Kafka: Captures real-time event streams
  • Spark: Reads from Kafka, processes data in micro-batches or continuous mode
  • Docker: Encapsulates all components for repeatable deployment

Step 1: Define Docker Compose for Kafka and Spark

version: '3.8'

services:
  zookeeper:
    image: bitnami/zookeeper:3.8
    ports:
      - "2181:2181"
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"

  kafka:
    image: bitnami/kafka:3.5
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_CFG_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      ALLOW_PLAINTEXT_LISTENER: "yes"

  spark:
    image: bitnami/spark:3.4
    ports:
      - "4040:4040"
    environment:
      - SPARK_MODE=master

The file above defines only a Spark master. Add a spark-worker service if you need one (a minimal sketch follows the command below), then scale it manually:

docker-compose up --scale spark-worker=2
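
A minimal worker definition could look like the following; it relies on the Bitnami image's SPARK_MODE and SPARK_MASTER_URL variables and assumes the master service keeps the name spark with Spark's default master port 7077:

  spark-worker:
    image: bitnami/spark:3.4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
    depends_on:
      - spark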

Step 2: Produce Data to Kafka

Use a Python or shell producer to simulate real-time ingestion. Because the broker above advertises itself as kafka:9092, a producer running on the host may fail to resolve that hostname; either run the producer from a container on the same Docker network (pointing it at kafka:9092) or add a second listener advertised as localhost for host access.

Python Kafka Producer Example

from kafka import KafkaProducer
import json, time

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

while True:
    message = {"timestamp": time.time(), "event": "click"}
    producer.send("events", message)
    time.sleep(1)
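
Kafka brokers usually auto-create topics on first write, but it can be cleaner to create the events topic explicitly. A small sketch using kafka-python's admin client (the partition count is illustrative):

from kafka.admin import KafkaAdminClient, NewTopic

# Create the "events" topic up front so consumers don't race topic auto-creation
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="events", num_partitions=3, replication_factor=1)])
admin.close()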

Step 3: Read Kafka Stream in Dockerized Spark

Spark Structured Streaming Read

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStream").getOrCreate()

df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

parsed = df.selectExpr("CAST(value AS STRING)")

query = parsed.writeStream.format("console").start()
query.awaitTermination()
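
If you want the JSON payload as typed columns rather than a raw string, one optional extension of the parsing step (before starting the query) is from_json with a schema matching the example producer's messages:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# Schema mirroring the producer's {"timestamp": ..., "event": ...} payload
schema = StructType([
    StructField("timestamp", DoubleType()),
    StructField("event", StringType()),
])

events = parsed.select(from_json(col("value"), schema).alias("data")).select("data.*")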

Mount this script inside the Spark container or build a custom image.
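
Note that the Kafka source ships separately from Spark itself, so the job usually needs the spark-sql-kafka connector on its classpath. A sketch of submitting the job from the master container, assuming the script is mounted at /opt/app/stream.py (path is illustrative):

docker-compose exec spark spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1 \
  /opt/app/stream.py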


Step 4: Handle Output Sinks

Write Spark-processed data to:

  • PostgreSQL / MySQL
  • MongoDB
  • Delta Lake / HDFS
  • Redis
  • REST APIs

Example output to PostgreSQL: Structured Streaming has no built-in streaming "jdbc" sink, so write each micro-batch with foreachBatch. This assumes a postgres service reachable at postgres:5432 and the PostgreSQL JDBC driver on the Spark classpath (for example via --packages org.postgresql:postgresql:42.6.0).

def write_to_postgres(batch_df, batch_id):
    # Write one micro-batch to PostgreSQL using the standard batch JDBC writer
    batch_df.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://postgres:5432/streamdb") \
        .option("dbtable", "events") \
        .option("user", "postgres") \
        .option("password", "pass") \
        .mode("append") \
        .save()

query = parsed.writeStream \
    .foreachBatch(write_to_postgres) \
    .start()

Monitoring and Tuning

  • Access the Spark UI at http://localhost:4040
  • Monitor Kafka with tools like Kafdrop or Kafka Manager
  • Use checkpointing to recover Spark streaming jobs (see the snippet after this list)
  • Enable metrics and logging for container diagnostics
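
Checkpointing is configured per query via the checkpointLocation option; a minimal sketch (the path is illustrative and should point at a mounted volume so state survives container restarts):

query = parsed.writeStream \
    .format("console") \
    .option("checkpointLocation", "/opt/spark/checkpoints/events") \
    .start()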

Performance Tips

  • Use partitioned Kafka topics for parallel Spark consumption
  • Mount Docker volumes to persist checkpoints
  • Configure Spark memory and executor settings via environment variables (sketch after this list)
  • Deploy in Kubernetes with autoscaling for production setups
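
As a rough illustration of the volume and memory tips above, a worker entry in the Compose file might look like this; the variable names follow the Bitnami image, and the values and paths are examples:

  spark-worker:
    image: bitnami/spark:3.4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2g   # memory available to this worker
      - SPARK_WORKER_CORES=2     # cores available to this worker
    volumes:
      - ./checkpoints:/opt/spark/checkpoints   # persists streaming checkpoints across restarts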

Conclusion

By combining Docker, Apache Kafka, and Apache Spark, you gain a modular and scalable foundation for real-time stream processing pipelines. Whether you’re analyzing user behavior, detecting fraud, or triggering workflows, Dockerized environments offer the flexibility and reproducibility you need.

Start small with Compose and evolve to orchestration platforms like Kubernetes as you scale your architecture.