Building a Scalable Log Aggregation System with Pulsar and Elasticsearch

As distributed systems grow in complexity, managing and analyzing logs becomes a crucial part of maintaining observability, performance, and security. Traditional logging approaches often fail to scale with modern cloud-native workloads.

Enter Apache Pulsar and Elasticsearch — a powerful duo for building real-time log aggregation systems that can handle massive throughput while offering fast search and analytics.

In this blog, we’ll show how to build a scalable log aggregation pipeline using Pulsar as the log ingestion platform and Elasticsearch as the search and visualization engine.

Why Pulsar and Elasticsearch?

Apache Pulsar:

Pub-sub messaging with high throughput
Multi-tenancy and geo-replication
Tiered storage and backpressure handling
Native support for queue and topic-based ingestion

Elasticsearch:

Distributed search engine
Real-time log indexing and analytics
Kibana integration for visualization
JSON-friendly schema-less design

Together, they provide a pipeline that is:

Scalable and resilient
Real-time and searchable
Cloud-native and extensible

Architecture Overview

[Applications / Microservices]
↓
[Log Emitters: Fluent Bit / Filebeat / Custom]
↓
[Apache Pulsar Topics (logs)]
↓
[Pulsar IO Elasticsearch Sink Connector]
↓
[Elasticsearch]
↓
[Kibana Dashboards]

Step 1: Ingest Logs into Pulsar

You can emit logs to Pulsar using:

Fluent Bit with HTTP or Pulsar output plugin
Filebeat with Kafka → Pulsar bridge
Custom log appender (e.g., Log4j2 Pulsar Appender)

Example using Pulsar HTTP Producer:

curl -X POST http://localhost:8080/admin/v3/persistent/public/default/logs \
-H "Content-Type: application/json" \
-d '{"level":"INFO", "msg":"User login successful", "service":"auth-service", "ts":"2024-04-13T10:15:00Z"}'

Step 2: Set Up the Pulsar Elasticsearch Sink Connector

Use Pulsar IO to push log messages to Elasticsearch automatically.

Deploy the connector:

bin/pulsar-admin sink create \
--archive connectors/pulsar-io-elastic-search-<version>.nar \
--sink-config-file es-sink-config.json \
--name elasticsearch-sink \
--inputs logs \
--sink-type elastic_search

Sample es-sink-config.json:

{
"elasticSearchUrl": "http://elasticsearch:9200",
"indexName": "logs-index",
"typeName": "_doc",
"username": "elastic",
"password": "changeme",
"schemaEnable": false,
"batchSize": 200
}

Step 3: Define Index Mappings in Elasticsearch (Optional)

Define a custom index mapping for structured logs:

PUT /logs-index
{
"mappings": {
"properties": {
"level": { "type": "keyword" },
"msg": { "type": "text" },
"service": { "type": "keyword" },
"ts": { "type": "date" }
}
}
}

This ensures optimized search performance and proper timestamp parsing.

Step 4: Visualize Logs in Kibana

Launch Kibana and connect to your Elasticsearch instance
Create an index pattern for logs-index
Use the Discover tab to view live logs
Build visualizations for error rates, service activity, and log levels

Real-World Use Cases

DevOps Monitoring: Track application errors, warnings, and slow requests
Security Analytics: Detect login anomalies, DDoS patterns
SRE Dashboards: Real-time latency breakdown and service traces
Compliance Auditing: Long-term retention and searchability of access logs

Performance and Scaling Tips

Enable Pulsar batching to reduce ingestion overhead
Use dedicated Pulsar topics per service or environment
Deploy multiple sink instances for parallel writes
Set index lifecycle policies in Elasticsearch to manage storage costs
Tune heap and indexing buffer sizes for high-throughput ingestion

Monitoring and Alerts

Monitor Pulsar and Elasticsearch using:

Prometheus + Grafana
Elastic APM
Pulsar Admin CLI
Log alerts via ElastAlert or Watcher

Set alerts for:

Spike in error logs
Services with no logs for extended periods
High indexing delay or search latency

Conclusion

Apache Pulsar and Elasticsearch offer a highly scalable, real-time solution for log aggregation, search, and visualization. By leveraging Pulsar IO connectors, structured log formats, and Elasticsearch analytics, you can build an end-to-end log observability stack that meets modern DevOps, SRE, and compliance needs.

This architecture scales from single microservices to global multi-tenant environments — giving you full control over your logs, from ingestion to insights.