Pulsar with Apache Beam: Real-Time ETL Processing in the Cloud
Leverage Apache Pulsar and Apache Beam for scalable, serverless, real-time ETL pipelines in the cloud
Modern data pipelines require the ability to process and transform events in real time with low latency, high scalability, and cloud-native flexibility. While Apache Pulsar serves as a powerful messaging backbone with built-in multi-tenancy and geo-replication, Apache Beam provides a unified model for defining batch and streaming ETL jobs that can run on multiple execution engines.
In this post, we explore how to use Apache Pulsar with Apache Beam to implement cloud-native, real-time ETL pipelines, enabling organizations to stream, enrich, and load data into analytical systems efficiently.
Why Pulsar + Apache Beam?
| Feature | Apache Pulsar | Apache Beam |
|---|---|---|
| Core engine | Distributed pub-sub, queues, and topic storage | Streaming/batch data processing model |
| Strengths | Multi-tenancy, geo-replication, tiered storage | Unified model, portability, advanced transforms |
| Use case | Event ingestion and transport | Transformation, enrichment, windowing |
| Cloud native | Yes (K8s-ready, serverless Pulsar Functions) | Yes (runs on Dataflow, Flink, Spark) |
Together, Pulsar and Beam allow for a decoupled, end-to-end streaming ETL system with flexibility and fault tolerance.
Architecture Overview
[Producers (Apps, APIs, DBs)]
↓
[Apache Pulsar Topics]
↓
[Apache Beam Pipelines on Flink/Dataflow]
↓
[Cloud Targets: BigQuery, Snowflake, S3, Hudi]
This architecture supports:
- Real-time ingestion and transformation
- Stateless or stateful operations
- Pluggable runners (Flink, Spark, Google Dataflow)
- Dynamic scalability and error recovery
Reading from Pulsar in Apache Beam
Apache Beam can read from Pulsar through its PulsarIO connector (still experimental in recent Beam releases) or through community-maintained connectors.
Sample pipeline to read from Pulsar:
// Note: Beam's PulsarIO connector is still evolving; the builder methods
// below may vary slightly across Beam versions.
PulsarIO.Read<String> pulsarRead = PulsarIO.read()
    .withServiceUrl("pulsar://pulsar-broker:6650")          // broker service URL
    .withAdminUrl("http://pulsar-admin:8080")               // admin REST endpoint
    .withTopic("persistent://public/default/transactions") // fully qualified topic
    .withSubscriptionName("beam-subscription")
    .withSchema(Schema.STRING);

Pipeline pipeline = Pipeline.create(options);
pipeline
    .apply("ReadFromPulsar", pulsarRead)
    .apply("TransformData", ParDo.of(new TransformFn()))
    .apply("WriteToSink", ...); // e.g. BigQueryIO or TextIO, shown below

pipeline.run();
Note: Use a compatible runner such as Apache Flink, Google Dataflow, or Spark to execute the pipeline in a streaming context.
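For instance, here is a minimal sketch of selecting the Flink runner via pipeline options (it assumes the Flink runner artifact is on the classpath and that args comes from your main method):

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

FlinkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
options.setStreaming(true); // treat the Pulsar source as an unbounded, streaming input
Pipeline pipeline = Pipeline.create(options);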
Transforming Data with Beam
Use Beam’s ParDo, Windowing, and GroupByKey APIs to apply ETL logic:
static class TransformFn extends DoFn<String, KV<String, Integer>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String line = c.element();
        // Example transformation: emit one occurrence per extracted key
        String key = extractKey(line);
        c.output(KV.of(key, 1));
    }

    // Illustrative helper: assumes comma-separated records with the key first.
    private static String extractKey(String line) {
        return line.split(",", 2)[0];
    }
}
For time-based aggregation:
.apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5))))
.apply(Sum.integersPerKey())
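If late events matter, make the watermark handling explicit. A sketch with an on-watermark trigger and one minute of allowed lateness (both values are illustrative):

.apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5)))
    .triggering(AfterWatermark.pastEndOfWindow())     // emit when the watermark passes the window
    .withAllowedLateness(Duration.standardMinutes(1)) // accept events up to 1 minute late
    .discardingFiredPanes())                          // do not re-accumulate across firings
.apply(Sum.integersPerKey())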
Writing Output to Cloud Sinks
Beam supports writing to:
- BigQuery
- Google Cloud Storage (GCS) / S3
- Apache Hudi or Iceberg
- JDBC / NoSQL stores
Example: Writing to BigQuery
.apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
.via((KV<String, Integer> kv) -> new TableRow().set("user", kv.getKey()).set("count", kv.getValue())))
.apply("WriteToBQ", BigQueryIO.writeTableRows()
.to("project:dataset.table")
.withSchema(schema)
.withWriteDisposition(WRITE_APPEND));
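Writing windowed results to object storage follows the same pattern. A sketch using TextIO (the bucket path and shard count are placeholders; S3 works the same way once the AWS filesystem is registered):

.apply("FormatAsCsv", MapElements.into(TypeDescriptors.strings())
    .via((KV<String, Integer> kv) -> kv.getKey() + "," + kv.getValue()))
.apply("WriteToGCS", TextIO.write()
    .to("gs://my-bucket/output/counts") // placeholder output prefix
    .withWindowedWrites()               // required for unbounded, windowed input
    .withNumShards(4));                 // fixed sharding keeps file counts predictable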
Deployment in the Cloud
You can run Beam pipelines on:
- Google Cloud Dataflow (fully managed, serverless)
- Apache Flink on Kubernetes
- Amazon EMR with the SparkRunner
Use CI/CD for pipeline versioning and integrate with Pulsar Functions or IO connectors for managed I/O handling.
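For lightweight pre-processing before events ever reach Beam, a Pulsar Function can run serverlessly alongside the broker. A minimal sketch implementing Pulsar's Function interface (the normalization logic is illustrative):

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Sketch: normalize raw events on the input topic before Beam consumes them.
public class NormalizeFn implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        // Illustrative cleanup; real logic depends on your event format.
        return input == null ? null : input.trim().toLowerCase();
    }
}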
Monitoring and Error Handling
Monitor:
- Pulsar message lag with Prometheus/Grafana
- Beam job metrics on your runner’s UI (Dataflow/Flink Dashboard)
- Throughput and retries
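Custom counters defined with Beam's Metrics API surface automatically in the runner UI; a sketch:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

static class CountingFn extends DoFn<String, String> {
    private final Counter processed = Metrics.counter(CountingFn.class, "events-processed");

    @ProcessElement
    public void processElement(ProcessContext c) {
        processed.inc(); // visible in the Dataflow or Flink metrics view
        c.output(c.element());
    }
}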
Use:
- Dead-letter queues (DLQs) in Pulsar
- Retry mechanisms in Beam (DoFn with try/catch or side outputs)
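The side-output pattern looks like this in practice. A sketch, assuming input is the PCollection<String> read from Pulsar; the tag names and comma-split parsing are illustrative:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<KV<String, Integer>> mainTag = new TupleTag<KV<String, Integer>>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = input.apply("SafeTransform",
    ParDo.of(new DoFn<String, KV<String, Integer>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            try {
                String[] fields = c.element().split(",");
                c.output(KV.of(fields[0], Integer.parseInt(fields[1])));
            } catch (Exception e) {
                c.output(deadLetterTag, c.element()); // route bad records aside
            }
        }
    }).withOutputTags(mainTag, TupleTagList.of(deadLetterTag)));

// results.get(deadLetterTag) can be published to a Pulsar DLQ topic for replay.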
Best Practices
✅ Use schema validation (Avro/Protobuf) in Pulsar
✅ Deploy Beam pipelines as containers for portability
✅ Apply windowing and watermarks for accurate aggregation
✅ Monitor Pulsar topic lag and Beam job health
✅ Use Stateful DoFns for session-based processing (see the sketch after this list)
✅ Scale Beam pipelines dynamically with runner configs
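As referenced above, here is a minimal sketch of a stateful DoFn keeping a per-key running count (Beam state requires keyed input; the state name is illustrative):

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

static class RunningCountFn extends DoFn<KV<String, Integer>, KV<String, Integer>> {
    @StateId("count")
    private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value();

    @ProcessElement
    public void processElement(ProcessContext c,
                               @StateId("count") ValueState<Integer> count) {
        Integer current = count.read(); // null for a key's first element
        int updated = (current == null ? 0 : current) + c.element().getValue();
        count.write(updated);
        c.output(KV.of(c.element().getKey(), updated));
    }
}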
Conclusion
By combining Apache Pulsar and Apache Beam, you can build flexible, cloud-native ETL pipelines that process and deliver data in real time. Pulsar ensures durable, scalable messaging, while Beam provides a powerful programming model for transformation and enrichment.
Together, they enable modern streaming architectures that are resilient, scalable, and future-proof, making them a strong fit for cloud-native, real-time analytics platforms.