Architecting a Scalable Kafka Cluster for High Throughput
Design a robust and scalable Apache Kafka cluster for high-volume real-time data processing
Apache Kafka has become the de facto standard for real-time data streaming and ingestion pipelines. To support modern applications that require high-throughput, low-latency, and scalable messaging, designing an efficient Kafka cluster architecture is critical.
In this post, we’ll explore how to architect a scalable Kafka cluster for high throughput, covering hardware choices, broker and topic configurations, partitioning, replication, producer tuning, and best practices for performance.
Key Components of a Kafka Cluster
A Kafka cluster consists of:
- Brokers: Store and serve messages
- Producers: Publish data to topics
- Consumers: Read data from topics
- ZooKeeper (Kafka ≤ 2.x): Manages metadata and leader election (replaced with KRaft in 3.x+)
- Topics: Logical streams of data divided into partitions
Each partition is the unit of parallelism and can be replicated for fault tolerance.
Cluster Sizing and Hardware Recommendations
CPU: Multi-core (8–32 cores per broker)
Memory: 32GB–128GB per broker (heap: ~6–8 GB; rest used by OS cache)
Disk: SSDs preferred; RAID-10 for redundancy
Network: 10 Gbps+ recommended for high-throughput clusters
Cluster sizing depends on:
- Throughput (MBps or events/sec)
- Message size
- Retention policies
- Replication factor
- Consumer lag tolerance
Partitioning for Scalability
Kafka achieves horizontal scalability through topic partitions:
- More partitions = more parallelism
- Partitions must be distributed evenly across brokers
- Each partition can have one leader and multiple followers
Example:
Topic: orders
Partitions: 24
Replication Factor: 3
→ 24 x 3 = 72 total replicas distributed across brokers
Guidelines:
- Keep partition count divisible by the number of consumers
- Avoid having too few partitions (limits throughput) or too many (adds overhead)
Replication and Durability
Replication factor (RF) ensures fault tolerance.
Replication Factor = 3
min.insync.replicas = 2
acks = all
- If a broker fails, followers can take over as leaders
- min.insync.replicas ensures quorum for writes
- Set acks=all for strongest durability
Trade-off: Higher RF = better durability, but more network and disk usage
Optimizing Producers for Throughput
Tuning producer settings can significantly impact throughput:
batch.size=65536 # Increase batch size (bytes)
linger.ms=5 # Small delay to allow batching
compression.type=snappy # Compress messages to reduce bandwidth
acks=all # Strong delivery guarantee
max.in.flight.requests.per.connection=5
retries=3
Use asynchronous sending and keyed partitioning only when ordering is needed.
Broker Configurations for Performance
Key broker settings:
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.segment.bytes=1073741824 # 1 GB segment files
log.retention.hours=168 # Keep data for 7 days
log.retention.check.interval.ms=300000
log.cleaner.enable=true # Enable log compaction if needed
num.replica.fetchers=4
replica.fetch.max.bytes=10485760
Tune these based on disk I/O, message size, and network throughput.
Kafka 3.x and KRaft Mode
With Kafka 3.x+, you can run in KRaft (Kafka Raft Metadata Mode) without ZooKeeper.
Benefits:
- Simplified deployment
- Lower operational overhead
- Improved metadata scaling
Enable by setting:
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093
Monitoring and Observability
Use tools like:
- Prometheus + Grafana
- Kafka Exporter (consumer lag, broker health)
- Confluent Control Center
- JMX metrics
Monitor:
- Broker disk usage
- Request latency
- Consumer lag
- Partition imbalance
- Under-replicated partitions
Set alerts for:
- ISR shrinkage
- Broker failures
- Queue size spikes
Best Practices
- Keep brokers evenly balanced (rebalance partitions regularly)
- Use dedicated ZooKeeper/KRaft nodes
- Monitor consumer lag continuously
- Use rack awareness to spread replicas across zones
- Tune OS-level TCP parameters for low-latency networking
- Enable auto topic creation cautiously in production
- Use schema registries for data governance
Conclusion
Architecting a Kafka cluster for high throughput requires thoughtful planning across hardware, partitioning, replication, and configuration tuning. With the right strategy, Kafka can scale to handle millions of events per second with minimal latency and maximum reliability.
By following the practices outlined above, you can build a Kafka architecture that’s not just fast — but also resilient, observable, and production-ready.