Modern data pipelines require the ability to process and transform events in real time with low latency, high scalability, and cloud-native flexibility. While Apache Pulsar serves as a powerful messaging backbone with built-in multi-tenancy and geo-replication, Apache Beam provides a unified model for defining batch and streaming ETL jobs that can run on multiple execution engines.

In this post, we explore how to use Apache Pulsar with Apache Beam to implement cloud-native, real-time ETL pipelines, enabling organizations to stream, enrich, and load data into analytical systems efficiently.


Why Pulsar + Apache Beam?

Feature      | Apache Pulsar                                  | Apache Beam
------------ | ---------------------------------------------- | ------------------------------------------------
Role         | Distributed pub-sub, queue, and topic storage  | Streaming/batch data processing model
Strengths    | Multi-tenancy, geo-replication, tiered storage | Unified model, portability, advanced transforms
Use Case     | Event ingestion and transport                  | Transformation, enrichment, windowing
Cloud Native | Yes (K8s-ready, serverless Pulsar Functions)   | Yes (runs on Dataflow, Flink, Spark)

Together, Pulsar and Beam enable a decoupled, end-to-end streaming ETL system that is both flexible and fault-tolerant.


Architecture Overview

[Producers (Apps, APIs, DBs)]  
↓  
[Apache Pulsar Topics]  
↓  
[Apache Beam Pipelines on Flink/Dataflow]  
↓  
[Cloud Targets: BigQuery, Snowflake, S3, Hudi]  

This architecture supports:

  • Real-time ingestion and transformation
  • Stateless or stateful operations
  • Pluggable runners (Flink, Spark, Google Dataflow); see the bootstrap sketch after this list
  • Dynamic scalability and error recovery
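
To make the "pluggable runners" point concrete, here is a minimal bootstrap sketch using Beam's standard options API. The class name PipelineBootstrap is illustrative; the execution engine is chosen purely by the --runner flag at launch time:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class PipelineBootstrap {
    public static void main(String[] args) {
        // Pick the engine at launch time, e.g. --runner=FlinkRunner or --runner=DataflowRunner
        StreamingOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StreamingOptions.class);
        options.setStreaming(true); // force unbounded (streaming) execution

        Pipeline pipeline = Pipeline.create(options);
        // ... apply the read/transform/write stages shown below, then:
        pipeline.run();
    }
}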

Reading from Pulsar in Apache Beam

Apache Beam can read from Pulsar through its PulsarIO connector in the Java SDK (still experimental at the time of writing) or through community-maintained connectors.

Sample pipeline to read from Pulsar:

PulsarIO.Read<String> pulsarRead = PulsarIO.read()
    .withServiceUrl("pulsar://pulsar-broker:6650")
    .withAdminUrl("http://pulsar-admin:8080")
    .withTopic("persistent://public/default/transactions")
    .withSubscriptionName("beam-subscription")
    .withSchema(Schema.STRING);

Pipeline pipeline = Pipeline.create(options);

pipeline
    .apply("ReadFromPulsar", pulsarRead)
    .apply("TransformData", ParDo.of(new TransformFn()))
    .apply("WriteToSink", ...);

Note: Execute the pipeline with a streaming-capable runner such as Apache Flink, Google Dataflow, or Spark. The PulsarIO connector is still evolving, so the exact builder methods above may differ between Beam releases; check the connector documentation for your SDK version.


Transforming Data with Beam

Use Beam’s ParDo, Window, and GroupByKey transforms to apply ETL logic:

static class TransformFn extends DoFn<String, KV<String, Integer>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String line = c.element();
        // Example transformation: derive a key from the record and emit a count of 1.
        // extractKey is a user-defined helper; a sketch follows below.
        String key = extractKey(line);
        c.output(KV.of(key, 1));
    }
}
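
The extractKey helper above is left to the user. A minimal sketch, assuming comma-delimited records whose first field is the key (purely illustrative):

static String extractKey(String line) {
    // Assumes comma-delimited input; falls back to the whole line if no delimiter is present.
    int comma = line.indexOf(',');
    return comma >= 0 ? line.substring(0, comma) : line;
}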

For time-based aggregation:

.apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5))))
.apply(Sum.integersPerKey())
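
If late data matters, the same aggregation can state its watermark and lateness policy explicitly. A sketch using Beam's standard triggering API; the one-minute lateness bound is an illustrative choice, not a recommendation:

.apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(5)))
    // emit a pane once the watermark passes the end of each window
    .triggering(AfterWatermark.pastEndOfWindow())
    // tolerate records arriving up to one minute late (illustrative value)
    .withAllowedLateness(Duration.standardMinutes(1))
    .discardingFiredPanes())
.apply(Sum.integersPerKey())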

Writing Output to Cloud Sinks

Beam supports writing to:

  • BigQuery
  • Google Cloud Storage (GCS) / S3
  • Apache Hudi or Iceberg
  • JDBC / NoSQL stores

Example: Writing to BigQuery

.apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
.via((KV<String, Integer> kv) -> new TableRow().set("user", kv.getKey()).set("count", kv.getValue())))
.apply("WriteToBQ", BigQueryIO.writeTableRows()
.to("project:dataset.table")
.withSchema(schema)
.withWriteDisposition(WRITE_APPEND));
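
The schema variable above has to be built ahead of time. A minimal sketch matching the two fields in the example, using the BigQuery model classes that BigQueryIO works with:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;

// Schema matching the "user" and "count" fields set on the TableRow above
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("user").setType("STRING"),
    new TableFieldSchema().setName("count").setType("INTEGER")));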

Deployment on Cloud

You can run Beam pipelines on:

  • Google Cloud Dataflow (managed, serverless)
  • Apache Flink on Kubernetes
  • EMR with SparkRunner

Version and ship pipelines through CI/CD, and use Pulsar Functions or managed IO connectors where lightweight, broker-side processing makes more sense than a full Beam job.
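
As one concrete example, the Flink runner can be configured programmatically. A sketch assuming the beam-runners-flink artifact is on the classpath; the parallelism value is illustrative:

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Programmatic Flink runner configuration (values are illustrative)
FlinkPipelineOptions flinkOptions = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
flinkOptions.setRunner(FlinkRunner.class);
flinkOptions.setStreaming(true);
flinkOptions.setParallelism(4);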


Monitoring and Error Handling

Monitor:

  • Pulsar message lag with Prometheus/Grafana
  • Beam job metrics on your runner’s UI (Dataflow/Flink dashboard); see the custom-counter sketch after this list
  • Throughput and retries
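
Custom counters declared inside your DoFns surface in those runner UIs alongside the built-in metrics. A minimal sketch using Beam's Metrics API; the counter name and the reuse of the earlier extractKey helper are illustrative:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;

static class CountingTransformFn extends DoFn<String, KV<String, Integer>> {
    private final Counter processed = Metrics.counter(CountingTransformFn.class, "records-processed");

    @ProcessElement
    public void processElement(ProcessContext c) {
        processed.inc(); // increments appear in the runner's metrics UI
        c.output(KV.of(extractKey(c.element()), 1));
    }
}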

Use:

  • Dead-letter queues (DLQs) in Pulsar
  • Retry mechanisms in Beam (DoFn with try/catch or side outputs; see the sketch after this list)
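
A minimal sketch of the side-output pattern: failures are caught inside the DoFn and routed to a dead-letter PCollection, which you could then publish to a Pulsar DLQ topic. Here input is assumed to be a PCollection<String>, the tag names are illustrative, and TupleTag/TupleTagList/PCollectionTuple come from org.apache.beam.sdk.values:

// Tags for the main output and the dead-letter side output (names are illustrative)
final TupleTag<KV<String, Integer>> mainTag = new TupleTag<KV<String, Integer>>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = input.apply("SafeTransform",
    ParDo.of(new DoFn<String, KV<String, Integer>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            try {
                c.output(KV.of(extractKey(c.element()), 1));
            } catch (Exception e) {
                // Route the raw element to the dead-letter output instead of failing the bundle
                c.output(deadLetterTag, c.element());
            }
        }
    }).withOutputTags(mainTag, TupleTagList.of(deadLetterTag)));

PCollection<KV<String, Integer>> good = results.get(mainTag);
PCollection<String> failed = results.get(deadLetterTag); // e.g. publish back to a Pulsar DLQ topic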

Best Practices

✅ Use schema validation (Avro/Protobuf) in Pulsar
✅ Deploy Beam pipelines as containers for portability
✅ Apply windowing and watermarks for accurate aggregation
✅ Monitor Pulsar topic lag and Beam job health
✅ Use Stateful DoFns for session-based processing (see the sketch after this list)
✅ Scale Beam pipelines dynamically with runner configs
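
To illustrate the stateful point: a minimal per-key running count using Beam's state API. Real session-based processing would typically combine this with Sessions windowing; the counter logic here is illustrative and assumes keyed input:

import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;

// Minimal stateful DoFn: maintains a running count per key (illustrative)
static class RunningCountFn extends DoFn<KV<String, Integer>, KV<String, Integer>> {

    @StateId("count")
    private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value(VarIntCoder.of());

    @ProcessElement
    public void processElement(ProcessContext c,
            @StateId("count") ValueState<Integer> count) {
        Integer stored = count.read();
        int updated = (stored == null ? 0 : stored) + c.element().getValue();
        count.write(updated);
        c.output(KV.of(c.element().getKey(), updated));
    }
}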


Conclusion

By combining Apache Pulsar and Apache Beam, you can build flexible, cloud-native ETL pipelines that process and deliver data in real time. Pulsar ensures durable, scalable messaging, while Beam provides a powerful programming model for transformation and enrichment.

Together, they enable modern streaming architectures that are resilient, scalable, and future-proof — perfect for cloud-native, real-time analytics platforms.