As businesses increasingly rely on real-time data platforms, high availability (HA) and fault tolerance become critical requirements for any messaging system. Apache Pulsar, a distributed pub-sub messaging platform, is architected with multi-layered resilience in mind — but to fully leverage its capabilities, careful configuration and operational planning are essential.

This post explores how to optimize Pulsar for high availability and fault tolerance, including deployment architectures, broker replication, BookKeeper tuning, and best practices for operational reliability.


Pulsar’s Built-In Architecture for HA

Pulsar separates concerns across distinct components:

  • Pulsar Brokers: Handle client connections, topic ownership, and message dispatch
  • Apache BookKeeper: Stores the actual message data in ledgers
  • ZooKeeper: Manages cluster coordination and service discovery

This separation allows Pulsar to:

  • Scale storage independently
  • Achieve high message durability via replicated ledgers
  • Support transparent broker failover
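In a typical self-managed deployment, each of these layers runs as its own process, so each can be scaled and restarted independently. A minimal sketch using the scripts shipped in the Pulsar distribution, assuming conf/bookkeeper.conf and conf/broker.conf already point at the ZooKeeper ensemble:

# On the metadata nodes
bin/pulsar-daemon start zookeeper

# On the storage nodes
bin/pulsar-daemon start bookie

# On the serving nodes
bin/pulsar-daemon start broker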

Key High Availability Features

  • Replication factor: Replicates message ledgers across multiple Bookies
  • Persistent subscriptions: Ensure message delivery even after broker or client restarts (illustrated below)
  • Geo-replication: Replicates topics across clusters and regions for disaster recovery
  • Load balancing: Brokers dynamically assign topic bundles to spread traffic
  • Dead letter topics: Retain undeliverable messages for retry and auditing
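As a quick illustration of persistent subscriptions, a subscription can be created ahead of time with pulsar-admin so that messages are retained for it across broker and client restarts. The tenant, namespace, topic, and subscription names below are placeholders:

bin/pulsar-admin topics create-subscription \
  --subscription my-sub \
  persistent://my-tenant/my-ns/my-topic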

Broker Configuration for HA

Ensure each broker's conf/broker.conf includes:

loadBalancerEnabled=true
loadBalancerPlacementStrategy=least_loaded
loadBalancerAutoBundleSplitEnabled=true
loadBalancerAutoUnloadSplitBundlesEnabled=true
loadBalancerSheddingEnabled=true

This configuration helps:

  • Distribute bundles evenly across brokers
  • Automatically rebalance traffic in case of node failure
  • Prevent hotspots from overwhelming single brokers
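Even with automatic shedding enabled, the manual equivalents are useful during incident response. Both commands below are standard pulsar-admin operations; the namespace and bundle range are placeholders:

# Unload a hot namespace so the load balancer reassigns its bundles
bin/pulsar-admin namespaces unload my-tenant/my-ns

# Split one bundle that has become a hotspot
bin/pulsar-admin namespaces split-bundle my-tenant/my-ns --bundle 0x00000000_0xffffffff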

BookKeeper Replication Settings

These conf/broker.conf settings are critical for fault tolerance:

managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=2

Explanation:

  • Ensemble size: Number of Bookies used per ledger
  • Write quorum: Number of Bookies written to per entry
  • Ack quorum: Minimum Bookies that must acknowledge to consider a write successful

With these settings, every entry is written to three Bookies and acknowledged once two of them confirm it, so an acknowledged write survives the failure of any single Bookie.
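These broker-wide defaults can also be overridden per namespace, which helps when only some namespaces need the stronger durability guarantee. A sketch with placeholder names (a mark-delete rate of 0 means unthrottled):

bin/pulsar-admin namespaces set-persistence my-tenant/my-ns \
  --bookkeeper-ensemble 3 \
  --bookkeeper-write-quorum 3 \
  --bookkeeper-ack-quorum 2 \
  --ml-mark-delete-max-rate 0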


ZooKeeper for Metadata Consistency

ZooKeeper is the backbone for:

  • Cluster coordination
  • Metadata management
  • Leader elections

Best practices:

  • Deploy ZooKeeper as an odd-sized ensemble (3 or 5 nodes) so a majority can always be formed (a minimal ensemble config follows this list)
  • Monitor ZooKeeper request latency and leader-election churn
  • Rely on ZooKeeper's quorum-based writes to guard against split-brain scenarios
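For reference, a minimal conf/zookeeper.conf sketch for a three-node ensemble; the host names are placeholders, and each node also needs a myid file matching its server number:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888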

Enabling Geo-Replication for Disaster Recovery

Pulsar supports built-in geo-replication between clusters:

bin/pulsar-admin namespaces set-clusters my-tenant/my-ns --clusters us-west,us-east

Each cluster must also be registered (on both sides) with the other's connection details:

bin/pulsar-admin clusters create us-west --url http://us-west-pulsar:8080 --broker-url pulsar://us-west-pulsar:6650
bin/pulsar-admin clusters create us-east --url http://us-east-pulsar:8080 --broker-url pulsar://us-east-pulsar:6650

Messages published to the topic in one region are replicated asynchronously to the other configured clusters, enabling multi-region failover.
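To verify that replication is in effect, the namespace's cluster list can be read back, and per-topic replication state (including replication backlog) appears in topic stats. The names are placeholders:

# List the clusters this namespace replicates to
bin/pulsar-admin namespaces get-clusters my-tenant/my-ns

# Inspect replication state for a specific topic
bin/pulsar-admin topics stats persistent://my-tenant/my-ns/my-topic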


Monitoring and Recovery Tools

Use Pulsar Manager or Prometheus/Grafana to monitor:

  • Broker health and throughput
  • BookKeeper ledger replication status
  • Backlog growth and consumer lag
  • ZooKeeper session state

Alert on:

  • Bookie disk failures
  • Broker JVM heap spikes
  • Ledger replication lag
  • Namespace unavailability
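Brokers expose these metrics in Prometheus format on their HTTP port, so a quick spot check is possible even without a dashboard. The broker host below is a placeholder, and exact metric names can vary between Pulsar versions:

# Spot-check backlog metrics straight from a broker
curl -s http://broker-1:8080/metrics | grep pulsar_msg_backlog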

Enable the BookKeeper AutoRecovery daemon so lost replicas are detected and rebuilt automatically (a sketch follows).
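Assuming the Bookies run from the Pulsar distribution, AutoRecovery can either run inside each Bookie or as a standalone process:

# Option 1: enable the daemon inside each Bookie (conf/bookkeeper.conf)
autoRecoveryDaemonEnabled=true

# Option 2: run AutoRecovery as its own process
bin/bookkeeper autorecovery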


Handling Broker and Bookie Failures

  • Brokers are stateless: a failed broker can be restarted or replaced, and its topics are reassigned to surviving brokers automatically
  • Bookies can be added or removed dynamically, and new writes are spread across the remaining Bookies
  • Use pulsar-admin bookies list-bookies to check which Bookies are currently available
  • Re-replicate the ledgers of a failed Bookie (the placeholder is the failed Bookie's address):
bin/bookkeeper shell recover <failed-bookie-host:port>
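Before and after a manual recovery, the set of under-replicated ledgers can be listed directly from the BookKeeper shell:

bin/bookkeeper shell listunderreplicated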

Best Practices

  • Use separate disks for journal and ledger storage in BookKeeper
  • Isolate Pulsar components on different machines or containers
  • Enable message deduplication so producer retries do not create duplicates (see the sketch after this list)
  • Set message TTLs and backlog quotas to prevent resource exhaustion
  • Regularly test failover scenarios in staging environments
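Several of these practices are plain namespace policies. A sketch of the deduplication, TTL, and backlog settings with placeholder names and limits (disk separation is configured separately, typically via the journalDirectories and ledgerDirectories settings in conf/bookkeeper.conf):

# Enable deduplication for a namespace
bin/pulsar-admin namespaces set-deduplication my-tenant/my-ns --enable

# Expire unacknowledged messages after 24 hours
bin/pulsar-admin namespaces set-message-ttl my-tenant/my-ns --messageTTL 86400

# Cap the backlog and make producers wait once the limit is hit
bin/pulsar-admin namespaces set-backlog-quota my-tenant/my-ns \
  --limit 10G --policy producer_request_hold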

Conclusion

Apache Pulsar offers a rich, resilient architecture designed for high availability and fault tolerance. But to realize these benefits, it’s important to configure replication, quorum, geo-replication, and broker balancing thoughtfully.

By following the practices in this post, you can ensure that your Pulsar cluster is not only performant but also resilient to node failures, network partitions, and regional outages — making it ready for production-grade, mission-critical event streaming workloads.