As data pipelines scale to handle billions of events per day, infrastructure costs can become a significant concern. Apache Pulsar’s cloud-native architecture offers unique opportunities for cost optimization without compromising throughput or reliability.

In this blog, we’ll explore strategies for optimizing Apache Pulsar to support low-cost, high-volume data pipelines. From topic design and compaction to tiered storage and efficient consumer patterns — we’ll cover everything needed to run Pulsar at scale without breaking the budget.


Key Cost Drivers in Pulsar

To optimize costs, you first need to understand where Pulsar’s resource usage comes from:

  • Storage (BookKeeper disks, tiered cloud storage)
  • Compute (brokers and bookies)
  • Network (data transfer between components)
  • Replication and redundancy overhead
  • Message retention settings

Tuning these areas effectively can help drive down cost per GB and cost per message.


1. Use Tiered Storage to Offload Cold Data

Pulsar supports tiered storage through offloaders for Amazon S3, Google Cloud Storage, Azure Blob Storage, and other object stores.

Benefits:

  • Store hot data on fast (expensive) BookKeeper disks
  • Offload cold segments to object storage (cheap and scalable)
  • Reduce BookKeeper disk requirements

Enable tiered offload:

bin/pulsar-admin topics offload \
persistent://tenant/namespace/topic \
--size-threshold 10M

Configure in broker.conf:

managedLedgerOffloadDriver=s3
s3ManagedLedgerOffloadBucket=my-bucket
s3ManagedLedgerOffloadRegion=us-east-1
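
Offload can also be applied automatically at the namespace level instead of being triggered by hand. Here is a minimal sketch using the Java admin client; the service URL, namespace, and threshold are placeholder assumptions:

import org.apache.pulsar.client.admin.PulsarAdmin;

public class OffloadThresholdExample {
    public static void main(String[] args) throws Exception {
        // Admin client pointed at the broker's HTTP service port (placeholder URL)
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Offload ledgers to tiered storage once a topic in this namespace
        // exceeds roughly 100 MB on BookKeeper (value is in bytes)
        admin.namespaces().setOffloadThreshold("my-tenant/my-namespace", 100L * 1024 * 1024);

        admin.close();
    }
}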

2. Optimize Topic Partitions and Compaction

  • Avoid excessive partition counts — each partition adds metadata overhead.
  • Use log compaction to retain only the latest state per key instead of full history.

Enable compaction:

bin/pulsar-admin topics compact persistent://my-tenant/my-ns/my-topic

Use compaction for:

  • User profiles
  • Product catalogs
  • IoT device states
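
Compaction only pays off if readers actually ask for the compacted view. A minimal Java consumer sketch, with placeholder topic and subscription names:

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;

public class CompactedConsumerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // readCompacted(true) serves the latest value per key from the
        // compacted ledger instead of replaying the full history
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://my-tenant/my-ns/my-topic")
                .subscriptionName("state-reader")
                .readCompacted(true)
                .subscribe();

        consumer.close();
        client.close();
    }
}

Note that reading compacted data is only supported on persistent topics with exclusive or failover subscriptions.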

3. Tune Retention Policies

Unnecessary message retention = wasted storage = higher cost.

Set practical per-topic retention policies:

bin/pulsar-admin namespaces set-retention \
--size 1G \
--time 1h \
my-tenant/my-namespace

Best practices:

  • Store long-term data in data lakes (S3, Hadoop)
  • Keep Pulsar focused on recent, high-throughput data
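
The same policy can be set programmatically, which helps when retention is provisioned per team or per environment. A minimal sketch with the Java admin client; the URL and namespace are placeholders:

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Retain acknowledged data for up to 60 minutes and up to 1024 MB per topic
        admin.namespaces().setRetention("my-tenant/my-namespace",
                new RetentionPolicies(60, 1024));

        admin.close();
    }
}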

4. Use Batching and Compression

Reduce network and disk I/O by batching and compressing messages.

Producer options:

// Batch up to 100 messages (or 10 ms of delay) and LZ4-compress each batch
Producer<byte[]> producer = client.newProducer()
        .topic("high-volume")
        .enableBatching(true)
        .compressionType(CompressionType.LZ4)
        .batchingMaxMessages(100)
        .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS)
        .create();

Compression types supported:

  • LZ4 (best for speed)
  • ZLIB / ZSTD (best for ratio)

5. Reduce Replication Overhead (Where Safe)

  • By default, each message is written to multiple bookies (a write quorum of 2 or 3), which is safe but resource-heavy.
  • For non-critical topics, smaller ensemble and quorum sizes save disk and replication bandwidth.

Example (per-namespace persistence policy):

bin/pulsar-admin namespaces set-persistence \
--bookkeeper-ensemble 2 \
--bookkeeper-write-quorum 2 \
--bookkeeper-ack-quorum 1 \
my-tenant/my-namespace

Use caution: this reduces fault tolerance.


6. Scale Using Shared Subscriptions

Avoid spinning up too many dedicated consumers. Use Shared subscriptions for:

  • Load balancing
  • Efficient resource usage
  • Horizontal scaling

They also reduce consumer lag by increasing concurrency.
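
As a rough sketch, several copies of the consumer below can attach to the same subscription name and Pulsar will spread messages across them (topic and subscription names are placeholders):

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SharedConsumerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Every instance that subscribes with the same name shares the work;
        // adding instances increases concurrency without adding topics
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://my-tenant/my-ns/high-volume")
                .subscriptionName("workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive();
            // ... process the message ...
            consumer.acknowledge(msg);
        }
    }
}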


7. Monitor and Right-Size Infrastructure

Use Prometheus + Grafana or Pulsar Manager to:

  • Track disk usage per namespace
  • Identify unused or underused topics
  • Monitor backlog and consumer lag

Use metrics like:

  • pulsar_storage_used_storage_size
  • pulsar_publish_rate
  • pulsar_consumer_ack_rate

Remove idle topics, scale down unused brokers/bookies, and automate cleanup.
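
Idle topics can also be flagged programmatically rather than by eyeballing dashboards. A rough sketch against a recent Java admin client; the namespace and the "zero traffic right now" heuristic are assumptions, so in practice run the check over a window:

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class IdleTopicReport {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // List topics in a namespace and report the ones with no current traffic
        for (String topic : admin.topics().getList("my-tenant/my-namespace")) {
            TopicStats stats = admin.topics().getStats(topic);
            if (stats.getMsgRateIn() == 0 && stats.getMsgRateOut() == 0) {
                System.out.printf("Idle topic: %s (storage: %d bytes)%n",
                        topic, stats.getStorageSize());
            }
        }

        admin.close();
    }
}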


8. Use Function-as-a-Service for Processing

Instead of dedicated compute clusters, use Pulsar Functions or AWS Lambda-style microservices:

  • Stateless processing
  • No additional cluster to manage
  • Event-driven billing in serverless environments

Pulsar Functions are ideal for simple filtering, routing, and transformation tasks.
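
For illustration, a stateless filter that drops empty events could look like the sketch below (class name and wiring are assumptions, not a prescribed layout):

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Returning null drops the event; any other value is published
// to the function's configured output topic
public class NonEmptyFilterFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        if (input == null || input.isEmpty()) {
            return null;
        }
        return input;
    }
}

Deploy it with pulsar-admin functions create, pointing --inputs and --output at the topics involved.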


9. Consolidate Namespaces and Reduce Isolation Overhead

Overusing tenants/namespaces increases memory and metadata costs.

Best practices:

  • Use per-application or per-team namespaces — not per-feature
  • Consolidate under shared tenants with fine-grained permissions
  • Reuse topics wherever logical (e.g., by event type)
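
As an illustration of the shared-tenant approach, a consolidated namespace with per-role permissions can be set up via the Java admin client (the tenant, namespace, and role names below are made up, and the tenant is assumed to exist):

import java.util.EnumSet;

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.AuthAction;

public class SharedNamespaceSetup {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // One namespace shared by a team, rather than one per feature
        admin.namespaces().createNamespace("shared/data-platform");

        // Fine-grained access: this role may only produce and consume
        admin.namespaces().grantPermissionOnNamespace(
                "shared/data-platform", "orders-service",
                EnumSet.of(AuthAction.produce, AuthAction.consume));

        admin.close();
    }
}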

Conclusion

Apache Pulsar offers a flexible, scalable foundation for high-volume data pipelines, but without the right tuning, costs can grow quickly. By leveraging features like tiered storage, compaction, batching, and shared subscriptions, you can reduce infrastructure usage while maintaining top-tier performance.

Start optimizing your Pulsar stack today to build cost-effective, real-time systems ready for production-scale workloads.