Optimizing Pulsar for High Availability and Fault Tolerance
Best practices for configuring Apache Pulsar to ensure resilient and highly available messaging systems
As businesses increasingly rely on real-time data platforms, high availability (HA) and fault tolerance become critical requirements for any messaging system. Apache Pulsar, a distributed pub-sub messaging platform, is architected with multi-layered resilience in mind — but to fully leverage its capabilities, careful configuration and operational planning are essential.
This post explores how to optimize Pulsar for high availability and fault tolerance, including deployment architectures, broker replication, BookKeeper tuning, and best practices for operational reliability.
Pulsar’s Built-In Architecture for HA
Pulsar separates concerns across distinct components:
- Pulsar Brokers: Stateless serving layer that handles client connections, topic lookups, and message dispatch
- Apache BookKeeper: Stores the actual message data in ledgers
- ZooKeeper: Manages cluster coordination and service discovery
This separation allows Pulsar to:
- Scale storage independently
- Achieve high message durability via replicated ledgers
- Support transparent broker failover
Key High Availability Features
| Feature | Description |
| --- | --- |
| Replication factor | Replicates message ledgers across multiple Bookies |
| Persistent Subscriptions | Ensure message delivery resumes after broker or client restarts |
| Geo-Replication | Replicates topics across clusters and regions for disaster recovery |
| Load Balancing | Brokers dynamically assign topic bundles to spread traffic |
| Dead Letter Topics | Retain undeliverable messages for retry and auditing (see the client sketch after this table) |
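As an illustration of persistent subscriptions and dead letter topics in practice, here is a minimal Java client sketch; the service URL, topic, and subscription names are placeholders:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.DeadLetterPolicy;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ResilientConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // A named subscription is durable: its cursor survives broker and client restarts.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://my-tenant/my-ns/orders")   // placeholder topic
                .subscriptionName("order-processor")
                .subscriptionType(SubscriptionType.Shared)      // DLQ requires Shared or Key_Shared
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(5)                   // after 5 failed deliveries...
                        .deadLetterTopic("persistent://my-tenant/my-ns/orders-dlq")
                        .build())                               // ...route the message to the DLQ
                .subscribe();

        Message<byte[]> msg = consumer.receive();
        try {
            // process(msg) ...
            consumer.acknowledge(msg);
        } catch (Exception e) {
            consumer.negativeAcknowledge(msg); // schedules redelivery; eventually hits the DLQ
        }

        consumer.close();
        client.close();
    }
}
```

Because the subscription is named and durable, its cursor survives broker failover and client restarts, while the dead-letter policy caps redeliveries so a poison message cannot block the subscription.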
Broker Configuration for HA
Ensure Pulsar brokers are configured (in broker.conf) with:

```properties
loadBalancerEnabled=true
loadBalancerPlacementStrategy=leastLoadedServer
loadBalancerAutoBundleSplitEnabled=true
loadBalancerAutoUnloadSplitBundlesEnabled=true
loadBalancerSheddingEnabled=true
```
This configuration helps:
- Distribute bundles evenly across brokers
- Automatically rebalance traffic in case of node failure
- Prevent hotspots from overwhelming single brokers
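To verify that bundles are in fact spreading, the admin CLI can show the active brokers and their load; a quick sketch (the cluster and namespace names are placeholders):

```bash
# List the active brokers registered in a cluster
bin/pulsar-admin brokers list us-west

# Dump the load report of the broker behind the admin endpoint
bin/pulsar-admin broker-stats load-report

# Manually unload a hot namespace so the load manager reassigns its bundles
bin/pulsar-admin namespaces unload my-tenant/my-ns
```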
BookKeeper Replication Settings
These managed ledger settings in broker.conf are critical for fault tolerance:

```properties
managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=2
```
Explanation:
- Ensemble size: Number of Bookies used per ledger
- Write quorum: Number of Bookies written to per entry
- Ack quorum: Minimum Bookies that must acknowledge to consider a write successful
With these settings, every entry is written to three Bookies and a write completes once two of them acknowledge it, so data survives the failure of any single Bookie.
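These broker-level defaults can also be overridden per namespace, which is useful when one tenant needs stronger durability than the rest of the cluster. A sketch with placeholder names:

```bash
bin/pulsar-admin namespaces set-persistence my-tenant/my-ns \
  --bookkeeper-ensemble 3 \
  --bookkeeper-write-quorum 3 \
  --bookkeeper-ack-quorum 2 \
  --ml-mark-delete-max-rate 0
```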
ZooKeeper for Metadata Consistency
ZooKeeper is the backbone of:
- Cluster coordination
- Metadata management
- Leader elections
Best practices:
- Deploy ZooKeeper as an odd-sized ensemble (3 or 5 nodes; a zoo.cfg sketch follows this list)
- Monitor ZooKeeper latency and leader-election churn
- Protect against split-brain scenarios with quorum-based writes
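For reference, a minimal zoo.cfg sketch for a three-node ensemble (hostnames are placeholders); with three servers, quorum is two, so the ensemble tolerates one node failure:

```properties
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```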
Enabling Geo-Replication for Disaster Recovery
Pulsar supports built-in geo-replication between clusters. First, register each cluster so that its peers know how to reach it:

```bash
bin/pulsar-admin clusters create us-west --url http://us-west-pulsar:8080
bin/pulsar-admin clusters create us-east --url http://us-east-pulsar:8080
```

Then enable replication for a namespace:

```bash
bin/pulsar-admin namespaces set-clusters my-tenant/my-ns --clusters us-west,us-east
```
Messages published to a topic in one region are then asynchronously replicated to the other configured clusters, enabling multi-region failover.
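To confirm that replication is active, check the namespace's cluster list and a topic's stats, whose output includes a replication section for replicated topics (names here are placeholders):

```bash
bin/pulsar-admin namespaces get-clusters my-tenant/my-ns
bin/pulsar-admin topics stats persistent://my-tenant/my-ns/my-topic
```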
Monitoring and Recovery Tools
Use Pulsar Manager or Prometheus/Grafana to monitor:
- Broker health and throughput
- BookKeeper ledger replication status
- Backlog growth and consumer lag
- ZooKeeper session state
Alert on (an example alert rule follows this list):
- Bookie disk failures
- Broker JVM heap spikes
- Ledger replication lag
- Namespace unavailability
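As one way to wire these up, here is a Prometheus alerting-rule sketch for backlog growth; metric and label names can vary by Pulsar version and exporter setup, so treat it as a template rather than a drop-in rule:

```yaml
groups:
  - name: pulsar-ha
    rules:
      - alert: PulsarBacklogGrowing
        # pulsar_msg_backlog is exposed by the broker's Prometheus endpoint
        expr: sum(pulsar_msg_backlog) by (namespace) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Message backlog above 100k in namespace {{ $labels.namespace }}"
```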
Run the BookKeeper autorecovery daemon so that replicas lost to Bookie failures are rebuilt automatically.
Handling Broker and Bookie Failures
- Brokers are stateless: a failed broker can be restarted or replaced, and its topics are reassigned to the surviving brokers
- Bookies can be added or removed dynamically; new ledger ensembles automatically avoid unavailable Bookies
- Use `bin/pulsar-admin bookies list-bookies` to check which Bookies are available
- Re-replicate under-replicated ledgers by running the autorecovery service:

```bash
bin/bookkeeper autorecovery
```
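Two related BookKeeper shell commands are worth knowing during incidents; note that the second effectively retires the bookie after its ledgers have been copied elsewhere:

```bash
# List ledger fragments currently flagged as under-replicated
bin/bookkeeper shell listunderreplicated

# Retire a failed bookie once its data has been re-replicated
bin/bookkeeper shell decommissionbookie
```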
Best Practices
- Use separate disks for journal and ledger storage in BookKeeper
- Isolate Pulsar components on different machines or containers
- Enable message deduplication so producer retries don't create duplicates
- Set message TTLs and backlog quotas to prevent resource exhaustion (example commands follow this list)
- Regularly test failover scenarios in staging environments
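A sketch of the deduplication, TTL, and backlog-quota settings from this list, using placeholder names:

```bash
# Per-namespace deduplication (can also be enabled cluster-wide with
# brokerDeduplicationEnabled=true in broker.conf)
bin/pulsar-admin namespaces set-deduplication my-tenant/my-ns --enable

# Expire messages after 24 hours, and hold producers once the backlog hits 10G
bin/pulsar-admin namespaces set-message-ttl my-tenant/my-ns --messageTTL 86400
bin/pulsar-admin namespaces set-backlog-quota my-tenant/my-ns \
  --limit 10G --policy producer_request_hold
```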
Conclusion
Apache Pulsar offers a rich, resilient architecture designed for high availability and fault tolerance. But to realize these benefits, it’s important to configure replication, quorum, geo-replication, and broker balancing thoughtfully.
By following the practices in this post, you can ensure that your Pulsar cluster is not only performant but also resilient to node failures, network partitions, and regional outages — making it ready for production-grade, mission-critical event streaming workloads.