Architecting a Scalable Kafka Cluster for High Throughput

Apache Kafka has become the de facto standard for real-time data streaming and ingestion pipelines. To support modern applications that require high-throughput, low-latency, and scalable messaging, designing an efficient Kafka cluster architecture is critical.

In this post, we’ll explore how to architect a scalable Kafka cluster for high throughput, covering hardware choices, broker and topic configurations, partitioning, replication, producer tuning, and best practices for performance.

Key Components of a Kafka Cluster

A Kafka cluster consists of:

Brokers: Store and serve messages
Producers: Publish data to topics
Consumers: Read data from topics
ZooKeeper (Kafka ≤ 2.x): Manages metadata and leader election (replaced with KRaft in 3.x+)
Topics: Logical streams of data divided into partitions

Each partition is the unit of parallelism and can be replicated for fault tolerance.

Cluster Sizing and Hardware Recommendations

CPU: Multi-core (8–32 cores per broker)
Memory: 32GB–128GB per broker (heap: ~6–8 GB; rest used by OS cache)
Disk: SSDs preferred; RAID-10 for redundancy
Network: 10 Gbps+ recommended for high-throughput clusters

Cluster sizing depends on:

Throughput (MBps or events/sec)
Message size
Retention policies
Replication factor
Consumer lag tolerance

Partitioning for Scalability

Kafka achieves horizontal scalability through topic partitions:

More partitions = more parallelism
Partitions must be distributed evenly across brokers
Each partition can have one leader and multiple followers

Example:

Topic: orders  
Partitions: 24  
Replication Factor: 3  
→ 24 x 3 = 72 total replicas distributed across brokers

Guidelines:

Keep partition count divisible by the number of consumers
Avoid having too few partitions (limits throughput) or too many (adds overhead)

Replication and Durability

Replication factor (RF) ensures fault tolerance.

Replication Factor = 3  
min.insync.replicas = 2  
acks = all

If a broker fails, followers can take over as leaders
min.insync.replicas ensures quorum for writes
Set acks=all for strongest durability

Trade-off: Higher RF = better durability, but more network and disk usage

Optimizing Producers for Throughput

Tuning producer settings can significantly impact throughput:

batch.size=65536             # Increase batch size (bytes)
linger.ms=5                  # Small delay to allow batching
compression.type=snappy     # Compress messages to reduce bandwidth
acks=all                     # Strong delivery guarantee
max.in.flight.requests.per.connection=5
retries=3

Use asynchronous sending and keyed partitioning only when ordering is needed.

Broker Configurations for Performance

Key broker settings:

num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.segment.bytes=1073741824         # 1 GB segment files
log.retention.hours=168              # Keep data for 7 days
log.retention.check.interval.ms=300000
log.cleaner.enable=true              # Enable log compaction if needed
num.replica.fetchers=4
replica.fetch.max.bytes=10485760

Tune these based on disk I/O, message size, and network throughput.

Kafka 3.x and KRaft Mode

With Kafka 3.x+, you can run in KRaft (Kafka Raft Metadata Mode) without ZooKeeper.

Benefits:

Simplified deployment
Lower operational overhead
Improved metadata scaling

Enable by setting:

process.roles=broker,controller
node.id=1
controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093

Monitoring and Observability

Use tools like:

Prometheus + Grafana
Kafka Exporter (consumer lag, broker health)
Confluent Control Center
JMX metrics

Monitor:

Broker disk usage
Request latency
Consumer lag
Partition imbalance
Under-replicated partitions

Set alerts for:

ISR shrinkage
Broker failures
Queue size spikes

Best Practices

Keep brokers evenly balanced (rebalance partitions regularly)
Use dedicated ZooKeeper/KRaft nodes
Monitor consumer lag continuously
Use rack awareness to spread replicas across zones
Tune OS-level TCP parameters for low-latency networking
Enable auto topic creation cautiously in production
Use schema registries for data governance

Conclusion

Architecting a Kafka cluster for high throughput requires thoughtful planning across hardware, partitioning, replication, and configuration tuning. With the right strategy, Kafka can scale to handle millions of events per second with minimal latency and maximum reliability.

By following the practices outlined above, you can build a Kafka architecture that’s not just fast — but also resilient, observable, and production-ready.