In the world of real-time analytics, two open-source tools reign supreme: Apache Kafka for message streaming and Apache Spark for in-memory processing. Combine them with Docker, and you unlock an agile, scalable environment for building and testing data pipelines with consistency across machines and stages.

This post will guide you through using Dockerized Spark and Kafka to create real-time data processing pipelines, including key configurations, architecture, Docker Compose setup, and performance tips.


Why Use Docker for Real-Time Processing?

Using Docker for tools like Kafka and Spark offers:

  • Fast and repeatable development environments
  • Simplified local testing of distributed systems
  • Easy CI/CD integration for data pipelines
  • Portability across dev, staging, and prod

With Docker Compose or Kubernetes, you can emulate large-scale event-driven architectures without complex bare-metal setups.


Architecture Overview

[ Producers ] → [ Kafka (Docker) ] → [ Spark Structured Streaming (Docker) ] → [ Sink: DB / Data Lake / Dashboard ]
  • Kafka: Captures real-time event streams
  • Spark: Reads from Kafka, processes data in micro-batches or continuous mode
  • Docker: Encapsulates all components for repeatable deployment

Step 1: Define Docker Compose for Kafka and Spark

version: '3.8'

services:
  zookeeper:
    image: bitnami/zookeeper:3.8
    ports:
      - "2181:2181"
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"

  kafka:
    image: bitnami/kafka:3.5
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_CFG_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      ALLOW_PLAINTEXT_LISTENER: "yes"

  spark:
    image: bitnami/spark:3.4
    ports:
      - "4040:4040"
    environment:
      - SPARK_MODE=master

The file above defines only a Spark master. Add a spark-worker service if you need one (a minimal sketch follows the command below), then scale it manually:

docker-compose up --scale spark-worker=2
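
A minimal worker definition could look like the following; it relies on the Bitnami image's SPARK_MODE and SPARK_MASTER_URL variables and assumes the master service keeps the name spark with Spark's default master port 7077:

  spark-worker:
    image: bitnami/spark:3.4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
    depends_on:
      - spark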

Step 2: Produce Data to Kafka

Use a Python or shell producer to simulate real-time ingestion. Because the broker above advertises itself as kafka:9092, a producer running on the host may fail to resolve that hostname; either run the producer from a container on the same Docker network (pointing it at kafka:9092) or add a second listener advertised as localhost for host access.

Python Kafka Producer Example

from kafka import KafkaProducer
import json, time

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

while True:
    message = {"timestamp": time.time(), "event": "click"}
    producer.send("events", message)
    time.sleep(1)
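
Kafka brokers usually auto-create topics on first write, but it can be cleaner to create the events topic explicitly. A small sketch using kafka-python's admin client (the partition count is illustrative):

from kafka.admin import KafkaAdminClient, NewTopic

# Create the "events" topic up front so consumers don't race topic auto-creation
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="events", num_partitions=3, replication_factor=1)])
admin.close()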

Step 3: Read Kafka Stream in Dockerized Spark

Spark Structured Streaming Read

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStream").getOrCreate()

df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

parsed = df.selectExpr("CAST(value AS STRING)")

query = parsed.writeStream.format("console").start()
query.awaitTermination()
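
If you want the JSON payload as typed columns rather than a raw string, one optional extension of the parsing step (before starting the query) is from_json with a schema matching the example producer's messages:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# Schema mirroring the producer's {"timestamp": ..., "event": ...} payload
schema = StructType([
    StructField("timestamp", DoubleType()),
    StructField("event", StringType()),
])

events = parsed.select(from_json(col("value"), schema).alias("data")).select("data.*")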

Mount this script inside the Spark container or build a custom image.
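
Note that the Kafka source ships separately from Spark itself, so the job usually needs the spark-sql-kafka connector on its classpath. A sketch of submitting the job from the master container, assuming the script is mounted at /opt/app/stream.py (path is illustrative):

docker-compose exec spark spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1 \
  /opt/app/stream.py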


Step 4: Handle Output Sinks

Write Spark-processed data to:

  • PostgreSQL / MySQL
  • MongoDB
  • Delta Lake / HDFS
  • Redis
  • REST APIs

Example output to PostgreSQL: Structured Streaming has no built-in streaming "jdbc" sink, so write each micro-batch with foreachBatch. This assumes a postgres service reachable at postgres:5432 and the PostgreSQL JDBC driver on the Spark classpath (for example via --packages org.postgresql:postgresql:42.6.0).

def write_to_postgres(batch_df, batch_id):
    # Write one micro-batch to PostgreSQL using the standard batch JDBC writer
    batch_df.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://postgres:5432/streamdb") \
        .option("dbtable", "events") \
        .option("user", "postgres") \
        .option("password", "pass") \
        .mode("append") \
        .save()

query = parsed.writeStream \
    .foreachBatch(write_to_postgres) \
    .start()

Monitoring and Tuning

  • Access the Spark UI at http://localhost:4040
  • Monitor Kafka with tools like Kafdrop or Kafka Manager
  • Use checkpointing to recover Spark streaming jobs (see the snippet after this list)
  • Enable metrics and logging for container diagnostics
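
Checkpointing is configured per query via the checkpointLocation option; a minimal sketch (the path is illustrative and should point at a mounted volume so state survives container restarts):

query = parsed.writeStream \
    .format("console") \
    .option("checkpointLocation", "/opt/spark/checkpoints/events") \
    .start()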

Performance Tips

  • Use partitioned Kafka topics for parallel Spark consumption
  • Mount Docker volumes to persist checkpoints
  • Configure Spark memory and executor settings via environment variables (sketch after this list)
  • Deploy in Kubernetes with autoscaling for production setups
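
As a rough illustration of the volume and memory tips above, a worker entry in the Compose file might look like this; the variable names follow the Bitnami image, and the values and paths are examples:

  spark-worker:
    image: bitnami/spark:3.4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2g   # memory available to this worker
      - SPARK_WORKER_CORES=2     # cores available to this worker
    volumes:
      - ./checkpoints:/opt/spark/checkpoints   # persists streaming checkpoints across restarts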

Conclusion

By combining Docker, Apache Kafka, and Apache Spark, you gain a modular and scalable foundation for real-time stream processing pipelines. Whether you’re analyzing user behavior, detecting fraud, or triggering workflows, Dockerized environments offer the flexibility and reproducibility you need.

Start small with Compose and evolve to orchestration platforms like Kubernetes as you scale your architecture.