Modern data platforms demand pipelines that are resilient, reliable, and real-time. Failures in ingestion, network, or processing logic shouldn’t compromise the consistency or completeness of data. By combining Apache Pulsar, a distributed event streaming platform, with Apache Hudi, a transactional data lake framework, you can build fault-tolerant streaming pipelines that scale.

In this post, we explore how to use Pulsar and Hudi to design end-to-end resilient pipelines, covering ingestion architecture, deduplication, schema handling, and recovery from failures.


Why Combine Pulsar and Hudi?

Feature             | Apache Pulsar                  | Apache Hudi
--------------------|--------------------------------|------------------------------------------
Messaging           | Distributed publish-subscribe  | n/a
Streaming Ingestion | Low-latency, high-throughput   | Streaming writes via Spark/Flink
Storage             | Retention-based (BookKeeper)   | Durable lakehouse format (on S3/HDFS)
Fault Tolerance     | Acknowledgments, persistence   | Time-travel, rollback, schema evolution
Deduplication       | At message level (Key_Shared)  | At record level via precombine keys

Together, they form a streaming + storage pattern with durability and consistency guarantees.


Pipeline Architecture

[Producers / IoT Devices]  
↓  
[Apache Pulsar Topics]  
↓  
[Stream Processing (Flink / Spark Structured Streaming)]  
↓  
[Apache Hudi Tables on S3 / HDFS]  
↓  
[Query Layer: Presto / Trino / Athena / Hive]  

Step 1: Ingesting Data into Pulsar

Produce messages to Pulsar using a schema-encoded format like Avro or JSON:

{
  "device_id": "sensor-001",
  "event_time": "2024-11-16T10:00:00Z",
  "temperature": 25.3,
  "status": "ok"
}

Use persistent topics. A Key_Shared subscription gives parallel consumers with per-key ordering; a plain Shared subscription gives parallelism but no ordering guarantees.
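To make this step concrete, here is a minimal producer sketch using the Pulsar Java client. The broker URL, topic name, and IoTEvent POJO are illustrative assumptions rather than part of the original pipeline:

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class IotProducer {
    // Hypothetical POJO mirroring the JSON payload above.
    public static class IoTEvent {
        public String device_id;
        public String event_time;
        public double temperature;
        public String status;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://broker:6650") // assumed broker address
                .build();

        // Schema.JSON derives a schema from the POJO; Schema.AVRO(IoTEvent.class)
        // works the same way if you prefer Avro encoding.
        Producer<IoTEvent> producer = client.newProducer(Schema.JSON(IoTEvent.class))
                .topic("persistent://public/default/iot-topic")
                .create();

        IoTEvent event = new IoTEvent();
        event.device_id = "sensor-001";
        event.event_time = "2024-11-16T10:00:00Z";
        event.temperature = 25.3;
        event.status = "ok";

        // Keying by device_id preserves per-device ordering under Key_Shared.
        producer.newMessage().key(event.device_id).value(event).send();

        producer.close();
        client.close();
    }
}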


Step 2: Writing from Pulsar to Hudi with Flink

Flink's Pulsar connector reads from Pulsar topics, and Hudi's Flink integration writes the resulting stream into Hudi tables.

Sample Flink job logic (a sketch; exact connector APIs vary across Flink and Hudi versions):

// Source: Flink's Pulsar connector. The JSON-to-RowData deserialization
// schema (rowDataSchema) is application-specific and elided here.
PulsarSource<RowData> source = PulsarSource.<RowData>builder()
    .setServiceUrl("pulsar://broker:6650")
    .setAdminUrl("http://broker:8080")
    .setTopics("iot-topic")
    .setSubscriptionName("hudi-ingest")
    .setDeserializationSchema(rowDataSchema)
    .build();

DataStream<RowData> pulsarStream =
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "pulsar-source");

// Sink: Hudi's Flink integration (HoodiePipeline), configured via FlinkOptions keys.
Map<String, String> options = new HashMap<>();
options.put(FlinkOptions.PATH.key(), "s3://hudi/iot-events");
options.put(FlinkOptions.TABLE_TYPE.key(), "MERGE_ON_READ");
options.put(FlinkOptions.RECORD_KEY_FIELD.key(), "device_id");
options.put(FlinkOptions.PRECOMBINE_FIELD.key(), "event_time");

HoodiePipeline.builder("iot_events")
    .column("device_id STRING")
    .column("event_time STRING")
    .column("temperature DOUBLE")
    .column("status STRING")
    .pk("device_id")
    .options(options)
    .sink(pulsarStream, false); // false = unbounded streaming write

With device_id as the record key and event_time as the precombine field, writes become idempotent upserts with record-level deduplication.


Step 3: Ensuring Fault Tolerance

  • Pulsar ensures message durability via acknowledgments and BookKeeper-based persistence
  • Flink/Spark provide checkpointing and exactly-once semantics (a minimal setup is sketched after this list)
  • Hudi writes are atomic; failed writes are rolled back using the commit timeline
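A minimal sketch of the Flink side, assuming the streaming environment from Step 2 (the 60-second interval is an illustrative choice, and CheckpointConfig method names vary slightly by Flink version):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Checkpoint every 60 s with exactly-once semantics, so Pulsar source cursors
// and Hudi commits stay consistent across restarts.
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

// Retain completed checkpoints on cancellation so the job can be restored later.
env.getCheckpointConfig().setExternalizedCheckpointCleanup(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);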

Use the precombine field to keep the latest version of each event:

hoodie.datasource.write.precombine.field = event_time

This avoids duplicates during retries or replays.
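The same settings apply to Spark-based writers as Hudi datasource options. A minimal sketch, reusing the table layout from Step 2 (the staging path and table name are hypothetical):

// Re-writing the same device_id is safe: Hudi keeps the row whose
// event_time (the precombine field) is greatest.
Dataset<Row> events = spark.read().json("s3://staging/iot-events/");

events.write().format("hudi")
    .option("hoodie.table.name", "iot_events")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "device_id")
    .option("hoodie.datasource.write.precombine.field", "event_time")
    .mode(SaveMode.Append)
    .save("s3://hudi/iot-events");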


Step 4: Recovery and Reprocessing

To reprocess or recover data:

  • Replay Pulsar messages from the earliest position or from a timestamp (see the consumer-seek sketch after this list)
  • Use Hudi's incremental query to resume processing:

    spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20241116093000")
      .load("s3://hudi/iot-events")

  • Use the Hudi CLI to roll back a failed commit:

    hudi-cli
    > connect --path s3://hudi/iot-events
    > commit rollback --commit 20241116103000
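A replay sketch with the Pulsar Java client, reusing the client and topic assumed earlier; Consumer.seek repositions the subscription cursor to a message ID or a publish timestamp:

// Replay is simplest on a fresh exclusive subscription dedicated to recovery.
Consumer<byte[]> consumer = client.newConsumer()
    .topic("persistent://public/default/iot-topic")
    .subscriptionName("recovery-replay")
    .subscriptionType(SubscriptionType.Exclusive)
    .subscribe();

// Rewind to the beginning of the topic...
consumer.seek(MessageId.earliest);

// ...or to a publish time in epoch millis, e.g. 2024-11-16T09:30:00Z.
consumer.seek(Instant.parse("2024-11-16T09:30:00Z").toEpochMilli());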

Step 5: Schema Evolution and Compatibility

Enable schema evolution in both Pulsar and Hudi:

  • Pulsar:
    bin/pulsar-admin schemas upload --filename event-schema.avsc persistent://tenant/ns/topic
    
  • Hudi:
    hoodie.schema.on.read.enable = true
    

When adding new fields, always make them nullable with a default so existing readers and writers remain compatible.
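For example, a backward-compatible addition to the Avro schema might look like this (the humidity field is purely illustrative):

{
  "name": "humidity",
  "type": ["null", "double"],
  "default": null
}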


Monitoring and Observability

Monitor your pipeline with:

  • Pulsar metrics: broker throughput, consumer lag
  • Hudi timeline: commit stats, compaction status (sample inspection commands follow this list)
  • Processing framework: Flink / Spark UI, checkpoints
  • Dashboards: Prometheus + Grafana, CloudWatch, or Datadog
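Both systems expose these basics from the command line; the topic and table paths below reuse the earlier assumptions:

# Pulsar: per-topic throughput, backlog, and per-subscription consumer lag
bin/pulsar-admin topics stats persistent://public/default/iot-topic

# Hudi: commit history and write stats from the timeline
hudi-cli
> connect --path s3://hudi/iot-events
> commits show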

Best Practices

  • Use Key_Shared subscriptions for parallel consumption with per-key ordering
  • Use event_time as the precombine field for ordering
  • Enable the Hudi metadata table for faster file listing
  • Use inline compaction and clustering for MOR tables (sample settings below)
  • Monitor commit latency and record size for tuning
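A hedged starting point for those table settings, expressed as Hudi write options (the values are illustrative; tune them against your workload):

# Metadata table for faster file listing
hoodie.metadata.enable = true

# Inline compaction for MERGE_ON_READ tables: compact every 5 delta commits
hoodie.compact.inline = true
hoodie.compact.inline.max.delta.commits = 5

# Inline clustering to co-locate records and control file sizes
hoodie.clustering.inline = true
hoodie.clustering.inline.max.commits = 4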

Conclusion

By combining Apache Pulsar with Apache Hudi, you can build end-to-end, fault-tolerant data pipelines that are real-time, consistent, and scalable. Pulsar ensures reliable ingestion, while Hudi guarantees ACID semantics and time-travel — making this duo a powerful foundation for building modern data lakes and analytics platforms.

Embrace this architecture to build pipelines that are resilient by design and ready for scale.