In an era where decisions are increasingly driven by data, businesses need to act on insights in real time. Apache Pulsar and Apache Spark together form a robust foundation for real-time analytics pipelines: Pulsar excels at scalable messaging and event delivery, while Spark handles complex streaming computation at scale.

This post walks through building a real-time analytics pipeline using Apache Pulsar as the message broker and Apache Spark Structured Streaming for processing — providing a scalable, fault-tolerant framework to derive actionable insights instantly.


Why Apache Pulsar and Apache Spark?

Feature                  | Apache Pulsar                      | Apache Spark
-------------------------|------------------------------------|---------------------------------------
Message Delivery         | High-throughput pub-sub + queuing  | Native Pulsar integration
Multi-Tenancy & Geo-Rep  | Built-in                           | Compatible with large-scale ingestion
Event-Time Support       | Yes                                | Yes (with watermarking)
Streaming Computation    | No                                 | Yes (Structured Streaming)
Scalability              | Horizontal (brokers + bookies)     | Horizontal (executors + drivers)

By combining Pulsar and Spark, you get real-time messaging and distributed stream processing with rich SQL-like transformations.


Architecture Overview

[Producers] → [Apache Pulsar] → [Spark Structured Streaming] → [Data Lake / Dashboard]
  • Producers: Emit events (e.g., user clicks, IoT metrics)
  • Pulsar: Buffers and streams events to consumers
  • Spark: Transforms, aggregates, and persists data
  • Sinks: Amazon S3, Delta Lake, Elasticsearch, or BI tools

Step 1: Setting Up Apache Pulsar

Start Pulsar in standalone mode (for local testing):

bin/pulsar standalone

Create a topic:

bin/pulsar-admin topics create persistent://public/default/user-events
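
If you plan to fan consumption out across multiple parallel Spark readers (see Best Practices below), create a partitioned topic instead, e.g. with four partitions:

bin/pulsar-admin topics create-partitioned-topic \
  persistent://public/default/user-events -p 4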

Send sample data:

bin/pulsar-client produce persistent://public/default/user-events \
-m '{"user_id": 101, "action": "click", "ts": 1699999999}'
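
The same event can also be produced programmatically. Below is a minimal Scala sketch using the Pulsar Java client; it assumes the org.apache.pulsar:pulsar-client artifact is on the classpath and the standalone broker above is running:

import org.apache.pulsar.client.api.{PulsarClient, Schema}

// Connect to the standalone broker started above
val client = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650")
  .build()

// Send the JSON payload as a plain string
val producer = client.newProducer(Schema.STRING)
  .topic("persistent://public/default/user-events")
  .create()

producer.send("""{"user_id": 101, "action": "click", "ts": 1699999999}""")

producer.close()
client.close()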

Step 2: Integrate Spark with Pulsar

Spark reads from Pulsar as a Structured Streaming source via the pulsar-spark connector, which is maintained by StreamNative and published under the io.streamnative.connectors group.

Add the Maven dependency:

<dependency>
  <groupId>io.streamnative.connectors</groupId>
  <artifactId>pulsar-spark-connector_2.12</artifactId>
  <!-- pick a release that matches your Spark and Scala versions -->
  <version>3.0.0</version>
</dependency>
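
Alternatively, you can pull the connector in at launch time with --packages, which is convenient for trying the Step 3 code interactively (this assumes SPARK_HOME points at your Spark installation; the coordinates mirror the dependency above):

$SPARK_HOME/bin/spark-shell \
  --packages io.streamnative.connectors:pulsar-spark-connector_2.12:3.0.0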

Step 3: Create Spark Structured Streaming Job

Example Scala code:

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions._

// Case class describing the JSON payload produced in Step 1
case class Event(user_id: Long, action: String, ts: Long)

val spark = SparkSession.builder
  .appName("Pulsar-Spark Analytics")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val df = spark.readStream
  .format("pulsar")
  .option("service.url", "pulsar://localhost:6650")
  // some connector versions also require: .option("admin.url", "http://localhost:8080")
  .option("topic", "persistent://public/default/user-events")
  .load()

// The Pulsar payload arrives in the `value` column as bytes
val parsed = df.selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", Encoders.product[Event].schema).as("data"))
  .select("data.*")
  .withColumn("event_time", $"ts".cast("timestamp")) // epoch seconds -> timestamp

val agg = parsed
  .withWatermark("event_time", "2 minutes")
  .groupBy(window($"event_time", "1 minute"), $"action")
  .count()

agg.writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()

This example:

  • Reads from a Pulsar topic
  • Parses the JSON event messages into typed columns
  • Converts the epoch-second ts field into an event-time timestamp
  • Applies a 2-minute watermark to bound late data
  • Aggregates by action type over 1-minute windows

Step 4: Sink to External Systems

You can write the output to:

  • S3 / HDFS (via Parquet or Delta Lake)
  • Elasticsearch for real-time dashboards
  • Kafka or another Pulsar topic for chaining jobs

Example (write to file):

agg.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/data/analytics/")
  .option("checkpointLocation", "/data/checkpoints/")
  .start()

Real-World Use Cases

  • E-Commerce: Analyze user actions for personalization
  • IoT: Aggregate sensor data across devices in real time
  • Finance: Monitor fraud patterns using streaming anomaly detection
  • Telecom: Track usage patterns and latency anomalies live

Monitoring & Fault Tolerance

  • Use Pulsar Manager UI to monitor topic throughput and subscriptions
  • Configure Spark Checkpointing to resume streams after failure
  • Use structured logs and metrics sinks (e.g., Prometheus + Grafana); a per-batch listener sketch follows this list
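
One lightweight way to surface per-batch metrics is a StreamingQueryListener. The sketch below just prints progress; in practice you would forward these values to whatever metrics sink you use:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    // Per-micro-batch stats: rows read from Pulsar and processing rate
    println(s"Batch ${event.progress.batchId}: " +
      s"${event.progress.numInputRows} rows, " +
      s"${event.progress.processedRowsPerSecond} rows/s")

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
})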

Best Practices

  • Use dedicated Pulsar partitions for parallel Spark readers
  • Set processingTime triggers (e.g., every 10s) for predictable latency; see the snippet after this list
  • Design stateless streaming logic when possible
  • Tune executor memory and cores for backpressure handling
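
The trigger recommendation maps directly onto the Structured Streaming API; for example, reusing the aggregation from Step 3:

import org.apache.spark.sql.streaming.Trigger

agg.writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()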

Conclusion

Apache Pulsar and Apache Spark form a powerful stack for building real-time analytics pipelines. Pulsar’s distributed messaging guarantees and Spark’s streaming engine allow you to build pipelines that are scalable, fault-tolerant, and production-ready.

Whether you’re analyzing user activity, monitoring IoT devices, or processing transactions — this combo equips you with the tools to derive insights as data arrives.