As businesses increasingly rely on real-time data platforms, high availability (HA) and fault tolerance become critical requirements for any messaging system. Apache Pulsar, a distributed pub-sub messaging platform, is architected with multi-layered resilience in mind — but to fully leverage its capabilities, careful configuration and operational planning are essential.

This post explores how to optimize Pulsar for high availability and fault tolerance, including deployment architectures, broker replication, BookKeeper tuning, and best practices for operational reliability.


Pulsar’s Built-In Architecture for HA

Pulsar separates concerns across distinct components:

  • Pulsar Brokers: Handle client connections, topic ownership, and message dispatch
  • Apache BookKeeper: Stores the actual message data in ledgers
  • ZooKeeper: Manages cluster coordination and service discovery

This separation allows Pulsar to:

  • Scale storage independently
  • Achieve high message durability via replicated ledgers
  • Support transparent broker failover
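In a typical self-managed deployment, each of these layers runs as its own process, so each can be scaled and restarted independently. A minimal sketch using the scripts shipped in the Pulsar distribution, assuming conf/bookkeeper.conf and conf/broker.conf already point at the ZooKeeper ensemble:

# On the metadata nodes
bin/pulsar-daemon start zookeeper

# On the storage nodes
bin/pulsar-daemon start bookie

# On the serving nodes
bin/pulsar-daemon start broker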

Key High Availability Features

  • Replication factor: Replicates message ledgers across multiple Bookies
  • Persistent subscriptions: Ensure message delivery even after broker or client restarts (illustrated below)
  • Geo-replication: Replicates topics across clusters and regions for disaster recovery
  • Load balancing: Brokers dynamically assign topic bundles to spread traffic
  • Dead letter topics: Retain undeliverable messages for retry and auditing
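As a quick illustration of persistent subscriptions, a subscription can be created ahead of time with pulsar-admin so that messages are retained for it across broker and client restarts. The tenant, namespace, topic, and subscription names below are placeholders:

bin/pulsar-admin topics create-subscription \
  --subscription my-sub \
  persistent://my-tenant/my-ns/my-topic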

Broker Configuration for HA

Ensure each broker's conf/broker.conf includes:

loadBalancerEnabled=true
loadBalancerPlacementStrategy=least_loaded
loadBalancerAutoBundleSplitEnabled=true
loadBalancerAutoUnloadSplitBundlesEnabled=true
loadBalancerSheddingEnabled=true

This configuration helps:

  • Distribute bundles evenly across brokers
  • Automatically rebalance traffic in case of node failure
  • Prevent hotspots from overwhelming single brokers
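Even with automatic shedding enabled, the manual equivalents are useful during incident response. Both commands below are standard pulsar-admin operations; the namespace and bundle range are placeholders:

# Unload a hot namespace so the load balancer reassigns its bundles
bin/pulsar-admin namespaces unload my-tenant/my-ns

# Split one bundle that has become a hotspot
bin/pulsar-admin namespaces split-bundle my-tenant/my-ns --bundle 0x00000000_0xffffffff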

BookKeeper Replication Settings

These conf/broker.conf settings are critical for fault tolerance:

managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=2

Explanation:

  • Ensemble size: Number of Bookies used per ledger
  • Write quorum: Number of Bookies written to per entry
  • Ack quorum: Minimum Bookies that must acknowledge to consider a write successful

With these settings, every entry is written to three Bookies and acknowledged once two of them confirm it, so an acknowledged write survives the failure of any single Bookie.
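These broker-wide defaults can also be overridden per namespace, which helps when only some namespaces need the stronger durability guarantee. A sketch with placeholder names (a mark-delete rate of 0 means unthrottled):

bin/pulsar-admin namespaces set-persistence my-tenant/my-ns \
  --bookkeeper-ensemble 3 \
  --bookkeeper-write-quorum 3 \
  --bookkeeper-ack-quorum 2 \
  --ml-mark-delete-max-rate 0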


ZooKeeper for Metadata Consistency

ZooKeeper is the backbone for:

  • Cluster coordination
  • Metadata management
  • Leader elections

Best practices:

  • Deploy ZooKeeper as an odd-sized ensemble (3 or 5 nodes) so a majority can always be formed (a minimal ensemble config follows this list)
  • Monitor ZooKeeper request latency and leader-election churn
  • Rely on ZooKeeper's quorum-based writes to guard against split-brain scenarios
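For reference, a minimal conf/zookeeper.conf sketch for a three-node ensemble; the host names are placeholders, and each node also needs a myid file matching its server number:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888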

Enabling Geo-Replication for Disaster Recovery

Pulsar supports built-in geo-replication between clusters:

bin/pulsar-admin namespaces set-clusters my-tenant/my-ns --clusters us-west,us-east

Each cluster must also be registered (on both sides) with the other's connection details:

bin/pulsar-admin clusters create us-west --url http://us-west-pulsar:8080 --broker-url pulsar://us-west-pulsar:6650
bin/pulsar-admin clusters create us-east --url http://us-east-pulsar:8080 --broker-url pulsar://us-east-pulsar:6650

Messages published to the topic in one region are replicated asynchronously to the other configured clusters, enabling multi-region failover.
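To verify that replication is in effect, the namespace's cluster list can be read back, and per-topic replication state (including replication backlog) appears in topic stats. The names are placeholders:

# List the clusters this namespace replicates to
bin/pulsar-admin namespaces get-clusters my-tenant/my-ns

# Inspect replication state for a specific topic
bin/pulsar-admin topics stats persistent://my-tenant/my-ns/my-topic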


Monitoring and Recovery Tools

Use Pulsar Manager or Prometheus/Grafana to monitor:

  • Broker health and throughput
  • BookKeeper ledger replication status
  • Backlog growth and consumer lag
  • ZooKeeper session state

Alert on:

  • Bookie disk failures
  • Broker JVM heap spikes
  • Ledger replication lag
  • Namespace unavailability
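Brokers expose these metrics in Prometheus format on their HTTP port, so a quick spot check is possible even without a dashboard. The broker host below is a placeholder, and exact metric names can vary between Pulsar versions:

# Spot-check backlog metrics straight from a broker
curl -s http://broker-1:8080/metrics | grep pulsar_msg_backlog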

Enable the BookKeeper AutoRecovery daemon so lost replicas are detected and rebuilt automatically (a sketch follows).
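Assuming the Bookies run from the Pulsar distribution, AutoRecovery can either run inside each Bookie or as a standalone process:

# Option 1: enable the daemon inside each Bookie (conf/bookkeeper.conf)
autoRecoveryDaemonEnabled=true

# Option 2: run AutoRecovery as its own process
bin/bookkeeper autorecovery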


Handling Broker and Bookie Failures

  • Brokers are stateless: a failed broker can be restarted or replaced, and its topics are reassigned to surviving brokers automatically
  • Bookies can be added or removed dynamically, and new writes are spread across the remaining Bookies
  • Use pulsar-admin bookies list-bookies to check which Bookies are currently available
  • Re-replicate the ledgers of a failed Bookie (the placeholder is the failed Bookie's address):
bin/bookkeeper shell recover <failed-bookie-host:port>
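Before and after a manual recovery, the set of under-replicated ledgers can be listed directly from the BookKeeper shell:

bin/bookkeeper shell listunderreplicated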

Best Practices

  • Use separate disks for journal and ledger storage in BookKeeper
  • Isolate Pulsar components on different machines or containers
  • Enable message deduplication so producer retries do not create duplicates (see the sketch after this list)
  • Set message TTLs and backlog quotas to prevent resource exhaustion
  • Regularly test failover scenarios in staging environments
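Several of these practices are plain namespace policies. A sketch of the deduplication, TTL, and backlog settings with placeholder names and limits (disk separation is configured separately, typically via the journalDirectories and ledgerDirectories settings in conf/bookkeeper.conf):

# Enable deduplication for a namespace
bin/pulsar-admin namespaces set-deduplication my-tenant/my-ns --enable

# Expire unacknowledged messages after 24 hours
bin/pulsar-admin namespaces set-message-ttl my-tenant/my-ns --messageTTL 86400

# Cap the backlog and make producers wait once the limit is hit
bin/pulsar-admin namespaces set-backlog-quota my-tenant/my-ns \
  --limit 10G --policy producer_request_hold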

Conclusion

Apache Pulsar offers a rich, resilient architecture designed for high availability and fault tolerance. But to realize these benefits, it’s important to configure replication, quorum, geo-replication, and broker balancing thoughtfully.

By following the practices in this post, you can ensure that your Pulsar cluster is not only performant but also resilient to node failures, network partitions, and regional outages — making it ready for production-grade, mission-critical event streaming workloads.