Understanding Apache Pulsar Architecture Brokers Topics and Storage

Apache Pulsar is a next-generation distributed messaging and streaming platform that offers high throughput, low latency, and multi-tenancy out of the box. What sets Pulsar apart from traditional messaging systems like Apache Kafka is its decoupled architecture, where compute (brokers) and storage (BookKeeper) are separated.

In this post, we’ll explore the core components of Apache Pulsar’s architecture — including brokers, topics, namespaces, and storage layers — and how they work together to provide a scalable, resilient, and flexible messaging system.

Pulsar Architectural Overview

Apache Pulsar is made up of three primary components:

Brokers – Route messages and manage client connections
Topics – Logical streams for publishing/subscribing
BookKeeper – Persistent storage layer for messages

This design enables elastic scalability, high availability, and geographic distribution.

Pulsar Broker

The broker is the stateless component of Pulsar that:

Accepts connections from producers and consumers
Handles subscription management
Authenticates and authorizes client requests
Routes messages to the appropriate storage layer (BookKeeper)
Loads topic metadata from ZooKeeper and Topic Ownership Service

Because brokers are stateless, they can be horizontally scaled without data loss or migration complexity.

Topics in Pulsar

Pulsar organizes data into topics, which are:

Logical append-only logs
Partitioned for parallelism (optional)
Hosted within namespaces, which are grouped under tenants

Topic naming convention:

persistent://<tenant>/<namespace>/<topic>

Example:

persistent://acme/iot/temperature

There are two types of topics:

Persistent: Stored durably using BookKeeper
Non-persistent: Stored in memory, best-effort delivery

Each partition of a topic is independently distributed and replicated across brokers.

Message Storage with Apache BookKeeper

Unlike Kafka, which stores data on the broker’s local disk, Pulsar uses Apache BookKeeper as a dedicated storage layer.

How it works:

Messages are written to a ledger, which is a sequence of entries stored in BookKeeper
Ledgers are replicated across bookies for durability
Once a ledger is full, a new ledger is created, and the metadata is updated

Benefits of this model:

True separation of compute and storage
Better fault tolerance (data survives broker restarts)
Enables tiered storage (offload to S3, GCS, etc.)

Pulsar Namespaces and Tenants

To support multi-tenancy, Pulsar uses a hierarchical resource model:

Tenant: A logical grouping (e.g., a company or app)
Namespace: A resource space for topics, policies, and quotas
Topics: The actual message streams

Admins can apply quotas, authentication rules, and retention policies at the namespace level, enabling isolated and secure usage per tenant.

Message Retention and Acknowledgments

Pulsar supports advanced retention strategies:

Message retention: Keep acknowledged messages for delayed consumers
Acknowledgment levels:
- Individual ack
- Cumulative ack (efficient for ordered consumption)

Combined with subscriptions, this allows flexible delivery models:

Exclusive
Shared (Round Robin)
Failover
Key_Shared (order-preserving for keys)

Geo-Replication

Pulsar supports built-in geo-replication between clusters using Pulsar Replication Service.

Advantages:

No external tools needed
Simple configuration at namespace level
Active-active or active-passive patterns

Example config:

bin/pulsar-admin namespaces set-clusters \
--clusters us-west,eu-central my-tenant/my-namespace

Scaling and Fault Tolerance

Brokers scale independently from bookies
BookKeeper handles replication and durability
Brokers use ZooKeeper for service discovery and metadata management
Failed brokers do not lose data — leadership transfers to another broker

This model ensures resilient message delivery and elastic scaling across compute and storage layers.

Key Benefits of Pulsar’s Architecture

Feature	Description
Decoupled Compute/Storage	Independent scaling of brokers and bookies
Stateless Brokers	Simplifies elasticity and failover
Multi-Tenancy	Namespace and tenant-level isolation
Durability via BookKeeper	High-availability storage layer
Built-in Geo-Replication	Simplified DR and multi-region deployments
Flexible Subscriptions	Multiple consumer models for different use cases
Tiered Storage	Offload old data to cloud storage (S3, GCS)

Conclusion

Apache Pulsar’s architecture was designed from the ground up to support cloud-native, multi-tenant, and real-time messaging applications. By separating brokers and storage, Pulsar enables better scalability, resilience, and operational simplicity.

Whether you’re building IoT platforms, financial data pipelines, or enterprise messaging systems, understanding Pulsar’s core architecture is the first step toward designing high-performance and future-proof streaming applications.