Building Fault Tolerant Data Pipelines with Pulsar and Apache Hudi
Design resilient streaming data pipelines using Apache Pulsar and Apache Hudi with exactly-once semantics and time-travel capabilities
Modern data platforms demand pipelines that are resilient, reliable, and real-time. Failures in ingestion, network, or processing logic shouldn’t compromise the consistency or completeness of data. By combining Apache Pulsar, a distributed event streaming platform, with Apache Hudi, a transactional data lake framework, you can build fault-tolerant streaming pipelines that scale.
In this post, we explore how to use Pulsar and Hudi to design end-to-end resilient pipelines, covering ingestion architecture, deduplication, schema handling, and recovery from failures.
Why Combine Pulsar and Hudi?
Feature | Apache Pulsar | Apache Hudi
---|---|---
Messaging | Distributed publish-subscribe | —
Streaming Ingestion | Low-latency, high-throughput | Streaming writes via Spark/Flink
Storage | Durable but retention-bound (BookKeeper) | Durable lakehouse format (on S3/HDFS)
Fault Tolerance | Acknowledgments, persistence | Time-travel, rollback, schema evolution
Deduplication | At message level (Key_Shared) | At record level via precombine keys
Together, they form a streaming + storage pattern with durability and consistency guarantees.
Pipeline Architecture
[Producers / IoT Devices]
↓
[Apache Pulsar Topics]
↓
[Stream Processing (Flink / Spark Structured Streaming)]
↓
[Apache Hudi Tables on S3 / HDFS]
↓
[Query Layer: Presto / Trino / Athena / Hive]
Step 1: Ingesting Data into Pulsar
Produce messages to Pulsar using a schema-encoded format like Avro or JSON:
{
  "device_id": "sensor-001",
  "event_time": "2024-11-16T10:00:00Z",
  "temperature": 25.3,
  "status": "ok"
}
Use persistent topics. Key_Shared subscriptions give parallelism while preserving per-key ordering; Shared subscriptions give parallelism but no ordering guarantees.
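A minimal producer sketch with the Pulsar Java client, assuming an IotEvent POJO that mirrors the JSON payload above (broker URL, tenant, and namespace are illustrative):

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://broker:6650")
        .build();
// JSON-encoded producer; Schema.AVRO(IotEvent.class) works the same way
Producer<IotEvent> producer = client.newProducer(Schema.JSON(IotEvent.class))
        .topic("persistent://tenant/ns/iot-topic")
        .create();
IotEvent event = new IotEvent("sensor-001", "2024-11-16T10:00:00Z", 25.3, "ok"); // assumed POJO
// Key by device_id so Key_Shared consumers get per-device ordering
producer.newMessage().key("sensor-001").value(event).send();
producer.close();
client.close();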
Step 2: Streaming Into Hudi Using Apache Flink
The Pulsar connector for Flink provides a streaming source; pair it with Hudi's Flink sink to write Pulsar events directly into Hudi tables.
Sample Flink job logic (a simplified sketch; the JSON-to-RowData mapping step is elided):
// Pulsar source (Flink 1.17+ connector; earlier versions wrap the schema differently)
PulsarSource<String> source = PulsarSource.builder()
        .setServiceUrl("pulsar://broker:6650")
        .setAdminUrl("http://broker:8080")
        .setTopics("iot-topic")
        .setSubscriptionName("hudi-ingest")
        .setDeserializationSchema(new SimpleStringSchema())
        .build();
DataStream<String> pulsarStream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "pulsar-source");
// Hudi sink options (org.apache.hudi.configuration.FlinkOptions)
Map<String, String> hudiConf = new HashMap<>();
hudiConf.put(FlinkOptions.PATH.key(), "s3://hudi/iot-events");
hudiConf.put(FlinkOptions.TABLE_TYPE.key(), "MERGE_ON_READ");
hudiConf.put(FlinkOptions.PRECOMBINE_FIELD.key(), "event_time");
// HoodiePipeline is Hudi's DataStream-level sink builder
HoodiePipeline.builder("iot_events")
        .column("device_id STRING").column("event_time STRING")
        .column("temperature DOUBLE").column("status STRING")
        .pk("device_id") // record key field
        .options(hudiConf)
        .sink(toRowData(pulsarStream), false); // toRowData: user-supplied JSON-to-RowData mapper
With device_id as the record key and event_time as the precombine field, retried writes resolve as upserts with record-level deduplication.
Step 3: Ensuring Fault Tolerance
- Pulsar ensures message durability via acknowledgments and BookKeeper-based persistence
- Flink and Spark provide exactly-once processing via checkpointing (Flink sketch below)
- Hudi writes are atomic: failed writes are rolled back using the commit timeline
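Flink checkpointing is opt-in; a minimal sketch (interval values are illustrative):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpoint every 60 s with exactly-once guarantees
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
// Leave headroom between checkpoints so processing is not starved
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);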
Use the precombine field to keep the latest version of each event:
hoodie.datasource.write.precombine.field = event_time
This avoids duplicates during retries or replays.
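In Spark, the same record-level guarantee comes from pairing the record key with the precombine field; a minimal sketch, assuming df holds the IoT events:

// df: Dataset<Row> of incoming events (assumed)
df.write().format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "device_id")
  .option("hoodie.datasource.write.precombine.field", "event_time")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.table.name", "iot_events")
  .mode(SaveMode.Append)
  .save("s3://hudi/iot-events");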
Step 4: Recovery and Reprocessing
To reprocess or recover data:
- Replay Pulsar messages from the earliest position or from a specific timestamp (see the sketch after this list)
- Use Hudi’s incremental query to resume processing:
spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20241116093000")
  .load("s3://hudi/iot-events")
- Use the Hudi CLI to roll back failed commits:
hudi-cli
> connect --path s3://hudi/iot-events
> commit rollback --commit 20241116103000
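The replay step might look like this with the Pulsar Java client (a sketch; the subscription name is illustrative, and seeking rewinds the whole subscription):

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://broker:6650")
        .build();
Consumer<byte[]> consumer = client.newConsumer()
        .topic("persistent://tenant/ns/iot-topic")
        .subscriptionName("hudi-ingest") // illustrative subscription
        .subscribe();
// Rewind to just before the failed commit instant (20241116103000)
consumer.seek(Instant.parse("2024-11-16T10:30:00Z").toEpochMilli());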
Step 5: Schema Evolution and Compatibility
Enable schema evolution in both Pulsar and Hudi:
- Pulsar:
bin/pulsar-admin schemas upload --filename event-schema.avsc persistent://tenant/ns/topic
- Hudi:
hoodie.schema.on.read.enable = true
When adding new fields, keep them nullable (with a default value) so records written under the old schema remain readable.
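For example, adding a hypothetical humidity field as a nullable union with a default keeps old records compatible:

{
  "type": "record",
  "name": "IotEvent",
  "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "event_time", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "status", "type": "string"},
    {"name": "humidity", "type": ["null", "double"], "default": null}
  ]
}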
Monitoring and Observability
Monitor your pipeline with:
- Pulsar metrics: Broker throughput, consumer lag
- Hudi timeline: Commit stats, compaction status
- Processing framework: Flink / Spark UI, checkpoints
- Dashboards: Prometheus + Grafana, CloudWatch, or Datadog
Best Practices
- Use Key_Shared subscriptions for parallelism with per-key ordering
- Use event_time as the precombine field so the latest version of each record wins
- Enable the Hudi metadata table for faster file listing
- Use inline compaction and clustering for MOR tables (starter options after this list)
- Monitor commit latency and record size for tuning
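A starting set of Hudi options for the metadata table, compaction, and clustering items above (values are illustrative, not recommendations):

hoodie.metadata.enable = true
hoodie.compact.inline = true
hoodie.compact.inline.max.delta.commits = 5
hoodie.clustering.inline = true
hoodie.clustering.inline.max.commits = 4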
Conclusion
By combining Apache Pulsar with Apache Hudi, you can build end-to-end, fault-tolerant data pipelines that are real-time, consistent, and scalable. Pulsar ensures reliable ingestion, while Hudi guarantees ACID semantics and time-travel — making this duo a powerful foundation for building modern data lakes and analytics platforms.
Embrace this architecture to build pipelines that are resilient by design and ready for scale.