Apache Kafka has become the de facto standard for real-time data streaming, but its robust architecture also makes it a solid backbone for batch processing pipelines. In many modern data ecosystems, organizations rely on hybrid architectures that support both streaming and batch workflows — whether it’s for ETL jobs, real-time analytics, or event-driven applications.

This blog explores how to optimize Kafka for both batch and streaming use cases, providing configuration strategies, architectural patterns, and performance tuning tips to ensure high throughput, low latency, and data consistency across your platform.


The Challenge of Hybrid Integration

While Kafka is inherently built for streaming, it also serves as a reliable buffer and transport mechanism for batch-oriented systems like:

  • Data warehouses (e.g., Snowflake, BigQuery)
  • ETL tools (e.g., Apache NiFi, Airflow)
  • Data lakes (e.g., HDFS, S3)

Challenges of integrating both include:

  • Handling different latency expectations
  • Managing offsets and checkpointing
  • Tuning retention policies for long-term batch access
  • Avoiding resource contention between fast and slow consumers

Architecture Overview

          [Producers]
       (apps, CDC, logs)
               ↓
         [Kafka Topics]
          ↙              ↘
[Streaming Consumers]    [Batch Jobs]
(Flink, Spark, ksqlDB)   (NiFi, Airflow, Hive)

To optimize this architecture, you need to balance throughput, durability, and latency for each consumer type.


Tip 1: Use Separate Topics for Batch and Streaming

To isolate workloads:

  • Create dedicated topics for streaming and batch reads
  • Use Kafka Streams or ksqlDB to fan out and transform data (a sketch follows the example below)
  • Apply compression and compaction policies tailored to the workload

Example:

  • orders.raw – Raw stream from producers
  • orders.analytics – Stream-processed topic for dashboards
  • orders.archive – Compacted topic for batch processing
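
As a sketch of the fan-out step, the Kafka Streams topology below reads orders.raw and feeds both downstream topics. The application id, the localhost broker, String serdes, and the completed-order filter are illustrative assumptions, not a prescribed pipeline.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrdersFanout {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-fanout");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("orders.raw");

        // Low-latency feed for dashboards: keep only completed orders (toy filter).
        raw.filter((orderId, json) -> json.contains("\"status\":\"COMPLETED\""))
           .to("orders.analytics");

        // Full keyed copy for the compacted archive that batch jobs read.
        raw.to("orders.archive");

        new KafkaStreams(builder.build(), props).start();  // runs until the JVM stops
    }
}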

Tip 2: Tune Retention for Batch Compatibility

Kafka’s default retention (7 days) may not line up with your batch schedule: jobs that replay or backfill history need data kept longer, while high-volume topics may need a shorter window to keep disk usage in check.

Adjust topic-level retention explicitly; the example below keeps orders.archive for 48 hours (172800000 ms):

kafka-configs.sh --alter \
--entity-type topics --entity-name orders.archive \
--add-config retention.ms=172800000 \
--bootstrap-server localhost:9092

You can also combine time-based (retention.ms) and size-based (retention.bytes) retention so that whichever limit is reached first triggers cleanup.
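
If you manage topic settings programmatically, the same change can be made with the Java AdminClient. The sketch below combines time- and size-based retention on orders.archive; the 7-day and roughly 100 GB per-partition values are placeholders to match your batch cadence and disk budget.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders.archive");
            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),       // 7 days
                                  AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("retention.bytes", "107374182400"), // ~100 GB per partition
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}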


Tip 3: Optimize Batching on the Producer Side

To serve both streaming and batch consumers efficiently, configure producers with:

acks=all
linger.ms=10
batch.size=32768
compression.type=snappy

These settings:

  • Improve write throughput through micro-batching (linger.ms, batch.size)
  • Shrink network and storage footprint with snappy compression
  • Keep writes durable with acks=all
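
Pulling the settings above together, here is a minimal Java producer sketch; the topic name, String serialization, and the localhost broker are assumptions for illustration.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrdersProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all");           // durable writes
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);         // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);     // 32 KB batches per partition
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders.raw", "order-1001", "{\"status\":\"CREATED\"}"));
        }  // close() flushes any records still sitting in a batch
    }
}

With a 10 ms linger and 32 KB batches, the producer trades a few milliseconds of latency for fewer, better-compressed requests per partition.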

Tip 4: Leverage Consumer Group Tuning

Batch jobs can tolerate higher latency, so favor larger fetches and longer waits per poll (a consumer sketch follows the streaming notes below):

fetch.min.bytes=1048576
fetch.max.wait.ms=500
max.poll.records=1000

Streaming apps benefit from:

  • Low-latency polling
  • Fast offset commits (use auto or async)
  • Commit intervals tuned to avoid lag
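
For the batch side, a minimal Java consumer sketch with the settings above might look like the following; the group id, topic, and single poll-then-commit flow are illustrative assumptions (a real job would loop until it reaches its target offsets).

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ArchiveBatchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "etl-batch");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1048576);   // wait for ~1 MB per fetch
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);     // ...or 500 ms, whichever comes first
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);     // large batches per poll
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);  // commit only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.archive"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Hand each record to the batch load step here.
            }
            consumer.commitSync();  // advance offsets only once the batch has landed
        }
    }
}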

Tip 5: Use Kafka Connect for Scheduled Integration

Use Kafka Connect Sink Connectors to push data in batch mode:

  • JDBC Sink → Data warehouse
  • HDFS Sink → Data lake
  • S3 Sink → Object storage

Connectors run continuously, but you can drive batch-style delivery by tuning flush sizes and intervals, or by pausing and resuming connectors through the Connect REST API from an orchestrator such as Airflow or Oozie, while streaming apps consume the same topics without interruption.
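
For example, an S3 sink configuration might look like the sketch below. It assumes the Confluent S3 sink connector plugin is installed on the Connect workers; the name, bucket, region, task count, and flush size are placeholders.

name=orders-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=2
topics=orders.archive
s3.bucket.name=my-data-lake-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=10000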


Tip 6: Apply Topic Compaction for Batch Reprocessing

For keyed records where only the latest value per key matters (keys such as user ID or order ID), enable log compaction:

cleanup.policy=compact

This ensures the topic retains the latest state per key, ideal for rehydrating data into batch systems without replaying all history.
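
For example, the compacted archive topic from Tip 1 could be created up front with the Java AdminClient as sketched below; the partition count, replication factor, and dirty-ratio setting are assumptions to tune for your cluster.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateArchiveTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Single-broker dev values; use a higher replication factor in production.
            NewTopic archive = new NewTopic("orders.archive", 6, (short) 1)
                .configs(Map.of(
                    "cleanup.policy", "compact",
                    "min.cleanable.dirty.ratio", "0.1"));   // compact more aggressively
            admin.createTopics(List.of(archive)).all().get();
        }
    }
}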


Tip 7: Monitor for Lag and Throughput

Use tools like:

  • Prometheus + Kafka Exporter
  • Kafka Manager / Confluent Control Center
  • Burrow for consumer lag

Track:

  • Consumer lag by group
  • Broker disk usage
  • Message throughput
  • Time to first byte for consumers

Set alerts for batch consumers falling behind or streaming apps underperforming.
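
Consumer lag by group can also be spot-checked without extra tooling. The Java AdminClient sketch below compares the committed offsets of a hypothetical etl-batch group against the latest log offsets, which is essentially the lag figure tools like Burrow report.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the batch consumer group (hypothetical group name).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("etl-batch")
                     .partitionsToOffsetAndMetadata().get();

            // Latest log end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(request).all().get();

            // Lag = log end offset minus committed offset, per partition.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}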


Best Practices

  • Use idempotent writes (enable.idempotence=true) and deduplication keys
  • Configure isolation.level=read_committed on consumers so they only read committed transactional writes; combined with transactional producers this gives end-to-end exactly-once delivery (sketched below)
  • Define consumer groups by workload type (e.g., etl-batch, flink-streaming)
  • Store offsets externally for batch jobs (e.g., checkpoint DB)
  • Partition topics based on throughput and parallelism needs
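
As a sketch of the idempotence and read_committed items above, the transactional producer below writes without duplicates and commits atomically, so consumers running with isolation.level=read_committed never see aborted or partial writes; the transactional id and topic are illustrative.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);          // no duplicates on retry
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-etl-1");  // stable id per producer instance

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders.archive", "order-1001", "{\"status\":\"SHIPPED\"}"));
                producer.commitTransaction();   // visible to read_committed consumers only now
            } catch (Exception e) {
                producer.abortTransaction();    // read_committed consumers never see these records
                throw e;
            }
        }
    }
}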

Conclusion

Apache Kafka can power both low-latency streaming pipelines and reliable batch data integrations — if configured thoughtfully. By applying topic isolation, tuning producer/consumer settings, and leveraging Kafka Connect or Streams, you can unify batch and real-time workflows on a single platform.

As organizations evolve toward lakehouse and event-driven architectures, Kafka will continue to serve as the core integration bus — supporting timely insights and durable data pipelines alike.