Apache Kafka has become the de facto standard for real-time data streaming and ingestion pipelines. To support modern applications that require high-throughput, low-latency, and scalable messaging, designing an efficient Kafka cluster architecture is critical.

In this post, we’ll explore how to architect a scalable Kafka cluster for high throughput, covering hardware choices, broker and topic configurations, partitioning, replication, producer tuning, and best practices for performance.


Key Components of a Kafka Cluster

A Kafka cluster consists of:

  • Brokers: Store and serve messages
  • Producers: Publish data to topics
  • Consumers: Read data from topics
  • ZooKeeper (Kafka ≤ 2.x): Manages metadata and leader election (superseded by KRaft, production-ready since 3.3 and the only mode in 4.0)
  • Topics: Logical streams of data divided into partitions

Each partition is the unit of parallelism and can be replicated for fault tolerance.


Cluster Sizing and Hardware Recommendations

CPU: Multi-core (8–32 cores per broker)
Memory: 32GB–128GB per broker (heap: ~6–8 GB; rest used by OS cache)
Disk: SSDs preferred; RAID-10 for redundancy
Network: 10 Gbps+ recommended for high-throughput clusters

Cluster sizing depends on:

  • Throughput (MBps or events/sec)
  • Message size
  • Retention policies
  • Replication factor
  • Consumer lag tolerance
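As a back-of-the-envelope illustration of how these factors interact, disk capacity scales with ingest rate, retention, and replication factor. A minimal sketch in Python, using hypothetical numbers (the 30% headroom factor is an assumption, not a Kafka default):

```python
def required_disk_gb(throughput_mb_s: float, retention_hours: float,
                     replication_factor: int, headroom: float = 1.3) -> float:
    """Rough cluster-wide disk estimate: ingest rate x retention window,
    multiplied by the replication factor, plus operational headroom."""
    raw_gb = throughput_mb_s * 3600 * retention_hours / 1024
    return raw_gb * replication_factor * headroom

# Example: 100 MB/s ingest, 7-day (168 h) retention, RF=3
total = required_disk_gb(100, 168, 3)
print(f"~{total:,.0f} GB across the cluster")
```

Estimates like this only bound the steady state; bursty traffic and compacted topics change the picture.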

Partitioning for Scalability

Kafka achieves horizontal scalability through topic partitions:

  • More partitions = more parallelism
  • Partitions must be distributed evenly across brokers
  • Each partition has exactly one leader; the remaining replicas are followers

Example:

Topic: orders  
Partitions: 24  
Replication Factor: 3  
→ 24 x 3 = 72 total replicas distributed across brokers

Guidelines:

  • Keep the partition count a multiple of the consumer count so load divides evenly
  • Avoid having too few partitions (limits throughput) or too many (adds overhead)
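The 24 × 3 = 72 replica distribution above can be sketched with a simplified round-robin assignment. Kafka's real algorithm also staggers follower positions and accounts for racks, so treat this purely as an illustration:

```python
from collections import Counter

def assign_replicas(num_partitions: int, rf: int, brokers: list[int]) -> dict[int, list[int]]:
    """Round-robin sketch: the leader of partition p lands on broker
    p % n, followers on the next rf-1 brokers in sequence."""
    n = len(brokers)
    return {p: [brokers[(p + i) % n] for i in range(rf)]
            for p in range(num_partitions)}

assignment = assign_replicas(24, 3, brokers=[0, 1, 2, 3, 4, 5])
replica_load = Counter(b for replicas in assignment.values() for b in replicas)
# 24 partitions x 3 replicas = 72 replicas, 12 per broker on 6 brokers
print(replica_load)
```

With an even layout like this, no broker carries more than its share of leaders or followers, which is exactly what the "distribute evenly" guideline is after.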

Replication and Durability

Replication factor (RF) ensures fault tolerance.

Replication Factor = 3  
min.insync.replicas = 2  
acks = all

  • If a broker fails, an in-sync follower takes over as leader
  • min.insync.replicas = 2 requires at least two in-sync replicas to acknowledge each write
  • Set acks=all for the strongest durability guarantee

Trade-off: Higher RF = better durability, but more network and disk usage
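The interplay between RF, min.insync.replicas, and broker failures can be checked with a tiny helper (a reasoning aid, not the broker's actual ISR logic, which tracks replica lag dynamically):

```python
def writes_available(rf: int, min_isr: int, failed_brokers: int) -> bool:
    """With acks=all, a write succeeds only while the number of
    surviving in-sync replicas stays at or above min.insync.replicas."""
    return (rf - failed_brokers) >= min_isr

# RF=3, min.insync.replicas=2: writes survive one broker failure, not two
print(writes_available(3, 2, 1))  # True
print(writes_available(3, 2, 2))  # False
```

This is why RF=3 with min.insync.replicas=2 is such a common pairing: it tolerates a single broker outage without losing write availability.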


Optimizing Producers for Throughput

Tuning producer settings can significantly impact throughput:

batch.size=65536             # Increase batch size (bytes)
linger.ms=5                  # Small delay to allow batching
compression.type=snappy      # Compress messages to reduce bandwidth
acks=all                     # Strong delivery guarantee
max.in.flight.requests.per.connection=5   # Ordering-safe with idempotence enabled
retries=3

Send asynchronously for throughput, and rely on keyed partitioning only when per-key ordering is required.
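The settings above map directly onto a client configuration. A sketch using the kafka-python client (the broker address is a placeholder, and snappy compression requires the python-snappy package):

```python
# Producer settings from above, expressed as kafka-python keyword arguments.
producer_config = {
    "bootstrap_servers": "broker1:9092",   # placeholder address
    "batch_size": 65536,          # larger batches amortize request overhead
    "linger_ms": 5,               # brief wait lets batches fill up
    "compression_type": "snappy",
    "acks": "all",
    "retries": 3,
    "max_in_flight_requests_per_connection": 5,
}

# With a reachable cluster, the config plugs in directly:
# from kafka import KafkaProducer
# producer = KafkaProducer(**producer_config)
# producer.send("orders", key=b"order-42", value=b"...")
# producer.flush()
```

Keeping the config in one dict makes it easy to diff tuning changes between environments.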


Broker Configurations for Performance

Key broker settings:

num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.segment.bytes=1073741824         # 1 GB segment files
log.retention.hours=168              # Keep data for 7 days
log.retention.check.interval.ms=300000
log.cleaner.enable=true              # Enable log compaction if needed
num.replica.fetchers=4
replica.fetch.max.bytes=10485760

Tune these based on disk I/O, message size, and network throughput.


Kafka 3.x and KRaft Mode

With Kafka 3.x+, you can run in KRaft (Kafka Raft Metadata Mode) without ZooKeeper.

Benefits:

  • Simplified deployment
  • Lower operational overhead
  • Improved metadata scaling

Enable by setting:

process.roles=broker,controller
node.id=1
controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093
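Quorum misconfiguration is a common KRaft rollout mistake, so it can help to sanity-check the controller.quorum.voters string before deploying. A small parser sketch (a convenience helper, not part of Kafka itself):

```python
def parse_quorum_voters(voters: str) -> dict[int, tuple[str, int]]:
    """Split a controller.quorum.voters value ("id@host:port,...")
    into a map of node id -> (host, port)."""
    result = {}
    for entry in voters.split(","):
        node_id, endpoint = entry.split("@")
        host, port = endpoint.rsplit(":", 1)
        result[int(node_id)] = (host, int(port))
    return result

voters = parse_quorum_voters("1@host1:9093,2@host2:9093,3@host3:9093")
print(voters[1])  # ('host1', 9093)
```

An odd voter count (3 or 5) is the norm, since Raft needs a majority to elect a controller.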

Monitoring and Observability

Use tools like:

  • Prometheus + Grafana
  • Kafka Exporter (consumer lag, broker health)
  • Confluent Control Center
  • JMX metrics

Monitor:

  • Broker disk usage
  • Request latency
  • Consumer lag
  • Partition imbalance
  • Under-replicated partitions
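Consumer lag in particular is simple to reason about: per partition, it is the log end offset minus the group's committed offset. A minimal sketch with hypothetical offsets:

```python
def total_consumer_lag(end_offsets: dict[int, int],
                       committed: dict[int, int]) -> int:
    """Lag per partition = log end offset - committed offset;
    total lag sums across all partitions of the topic."""
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

# Hypothetical offsets for a 3-partition topic
end = {0: 1_500, 1: 1_480, 2: 1_510}
done = {0: 1_450, 1: 1_480, 2: 1_400}
print(total_consumer_lag(end, done))  # 160
```

In practice tools like Kafka Exporter surface this same computation as a per-group, per-partition metric.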

Set alerts for:

  • ISR shrinkage
  • Broker failures
  • Request queue size spikes

Best Practices

  • Keep brokers evenly balanced (rebalance partitions regularly)
  • Use dedicated ZooKeeper or KRaft controller nodes
  • Monitor consumer lag continuously
  • Use rack awareness to spread replicas across zones
  • Tune OS-level TCP parameters for low-latency networking
  • Be cautious with auto topic creation; most production clusters disable auto.create.topics.enable
  • Use schema registries for data governance

Conclusion

Architecting a Kafka cluster for high throughput requires thoughtful planning across hardware, partitioning, replication, and configuration tuning. With the right strategy, Kafka can scale to handle millions of events per second with minimal latency and maximum reliability.

By following the practices outlined above, you can build a Kafka architecture that’s not just fast — but also resilient, observable, and production-ready.