Apache Kafka is a distributed, high-throughput messaging platform that powers many real-time applications and data pipelines. While Kafka is robust, maintaining a healthy Kafka cluster can be challenging, especially at scale.

This guide provides a practical approach to troubleshooting common Kafka cluster issues, covering symptoms, root causes, and actionable fixes. Whether you’re dealing with broker crashes, consumer lag, or replication failures, this post will help you keep your Kafka system running smoothly.


1. Broker Is Down or Unresponsive

Symptoms:

  • Producer errors: Connection refused
  • Consumer failures: Broker not available
  • Missing metrics in monitoring tools

Causes:

  • JVM crashes or out-of-memory errors
  • Misconfigured listeners or advertised.listeners
  • Port conflicts or firewall blocks

Fix:

  • Check broker logs (e.g., /var/log/kafka/server.log) for errors
  • Validate Zookeeper connectivity
  • Confirm correct listener and advertised.listener settings
  • Restart broker and monitor with kafka-broker-api-versions.sh

2. Under-Replicated Partitions

Symptoms:

  • UnderReplicatedPartitions metric > 0
  • Alerts from monitoring tools
  • High ISR shrink/expand rate

Causes:

  • Slow broker recovery
  • Network latency between brokers
  • Disk IO bottlenecks

Fix:

  • Use CLI:
    kafka-topics.sh --describe --topic your-topic --bootstrap-server broker:9092
    
  • Check for broker logs indicating replication delay
  • Tune replica.fetch.max.bytes and replica.fetch.wait.max.ms
  • Restart slow brokers gracefully

3. Consumer Lag Growing

Symptoms:

  • Lag metrics rising (check with Burrow or Prometheus)
  • Consumer apps falling behind
  • Out-of-order processing or late events

Causes:

  • Slow consumers
  • Not enough partitions for parallelism
  • Long GC pauses or insufficient memory

Fix:

  • Scale consumer group horizontally
  • Increase topic partitions (use Admin API cautiously)
  • Tune consumer configs:
    max.poll.records=500
    fetch.max.bytes=52428800
    
  • Profile consumer app for bottlenecks

4. Message Loss or Duplicate Delivery

Symptoms:

  • Events missing from sinks (e.g., DB, S3)
  • Duplicate writes to external systems
  • Unexpected behavior in streaming applications

Causes:

  • Producers not using idempotent writes
  • Consumers not committing offsets properly
  • Non-transactional sink operations

Fix:

  • Enable idempotent producer:
    enable.idempotence=true
    
  • Use enable.auto.commit=false and manually commit after successful processing
  • For exactly-once, use Kafka Transactions with proper transaction boundaries

5. ZooKeeper or Controller Election Issues

Symptoms:

  • Brokers not joining the cluster
  • No active controller elected
  • Frequent controller failovers

Causes:

  • Network issues between brokers and ZooKeeper
  • Clock skew or jitter
  • High ZooKeeper latency

Fix:

  • Ensure ZooKeeper ensemble is healthy (zkServer.sh status)
  • Reduce number of partitions to limit controller workload
  • Use kafka-metadata-quorum.sh (Kafka KRaft mode) if on Kafka ≥ 3.3

6. SSL/TLS or Authentication Failures

Symptoms:

  • TLS handshake errors in logs
  • Clients fail to connect securely
  • Kafka Connect or MirrorMaker showing auth errors

Causes:

  • Invalid or expired certificates
  • Incorrect keystore/truststore configs
  • SASL or ACL misconfigurations

Fix:

  • Validate keystore and truststore paths
  • Ensure all brokers share the correct CA chain
  • Check client logs for:
    javax.net.ssl.SSLHandshakeException
    
  • Rotate expiring certs before renewal deadlines

7. Disk Usage or Log Retention Problems

Symptoms:

  • Disk space alerts
  • Kafka log cleaner stalling
  • Topics not respecting retention policies

Causes:

  • Retention period too long
  • Log compaction misconfigured
  • Partitions not getting deleted

Fix:

  • Set retention policy in topic config:
    kafka-configs.sh --alter --topic your-topic \
    --add-config retention.ms=604800000 \
    --bootstrap-server broker:9092
    
  • Enable log cleaner for compacted topics
  • Set quotas for producer to prevent overproduction

8. Partition Skew or Hot Partitions

Symptoms:

  • Uneven broker load
  • High throughput or CPU on a subset of brokers
  • Consumer lag only on certain partitions

Causes:

  • Poor partition key design (low cardinality)
  • Uneven message distribution from producers

Fix:

  • Use high-cardinality partition keys
  • Monitor partition distribution:
    kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list broker:9092 --topic your-topic --time -1
    
  • Repartition topics with Kafka Streams or Flink if needed

Tools for Kafka Troubleshooting

  • Kafka CLI Tools: kafka-topics.sh, kafka-consumer-groups.sh, kafka-configs.sh
  • Monitoring: Prometheus, Grafana, Burrow, Confluent Control Center
  • Logging: Check logs under /var/log/kafka/
  • JMX Exporter: For broker metrics
  • Cruise Control: For broker rebalancing and optimization

Conclusion

Apache Kafka is a production-grade system, but it’s not immune to misconfigurations, network failures, and performance bottlenecks. By learning how to diagnose and fix common issues, you can ensure maximum uptime, data integrity, and operational efficiency in your Kafka-based systems.

Always combine proactive monitoring, best practices, and automated alerts to detect issues early and resolve them before they impact users.