Using Hudi DeltaStreamer with Kafka for Low Latency Pipelines
Build near real-time lakehouse ingestion using Hudi DeltaStreamer and Kafka
As modern applications generate massive streams of data in real time, engineering teams are challenged with ingesting this data efficiently into data lakes for downstream analytics and machine learning.
Apache Hudi’s DeltaStreamer, combined with Apache Kafka, offers a powerful solution to build low-latency pipelines that continuously ingest and upsert streaming data into a transactional data lake — all without writing custom Spark code.
In this post, we’ll explore how to configure Hudi DeltaStreamer to consume from Kafka, write to S3 or HDFS, and tune it for scalable, real-time ingestion.
What is Hudi DeltaStreamer?
DeltaStreamer is a Spark-based utility provided by Hudi that supports:
- Source-to-lake ingestion from Kafka, files, JDBC, and Hive
- Upserts, inserts, and deletes using Hudi table formats
- Schema evolution, transformations, and checkpointing
- Both batch and continuous modes
It removes the need to write and maintain custom ingestion code for structured streaming jobs.
Use Case: Kafka to Lakehouse with DeltaStreamer
Architecture overview:
Kafka → DeltaStreamer (Spark Job) → Hudi Table (on S3/HDFS) → Athena / Presto / Hive
DeltaStreamer reads JSON/Avro events from a Kafka topic and writes them to a Hudi Copy-on-Write (COW) or Merge-on-Read (MOR) table in a transactional and incremental manner.
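Throughout this post, assume the topic carries order events shaped like the record below; order_id, event_ts, and region are the fields wired into the Hudi configuration, while the remaining fields are purely illustrative:
{"order_id": "o-1001", "event_ts": "2024-11-16T08:15:00Z", "region": "us-east", "amount": 42.50, "status": "CREATED"}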
Prerequisites
- Apache Hudi 0.10.0 or later
- Apache Kafka running with topics populated
- Apache Spark 3.x cluster (YARN, EMR, or standalone)
- Access to S3 or HDFS
- Schema registry (optional for Avro)
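DeltaStreamer ships inside the hudi-utilities-bundle jar (the /path/to/hudi-utilities-bundle.jar used later). A sketch of fetching it from Maven Central, assuming Hudi 0.10.0 built for Scala 2.12; match the version and Scala suffix to your Spark build:
wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar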
Step 1: Define Properties File
Create a config file: kafka-source.properties
# Kafka source configs
hoodie.deltastreamer.source.kafka.topic=kafka_orders
# Raw Kafka consumer properties are passed through to the consumer as-is
bootstrap.servers=broker1:9092,broker2:9092
group.id=hudi_ingest_group
auto.offset.reset=earliest
# Cap the number of records pulled from Kafka per ingest round
hoodie.deltastreamer.kafka.source.maxEvents=1000
# Schema provider (file-based Avro schema)
hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/schema.avsc
# Target Hudi table configs
hoodie.table.name=hudi_orders
hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.precombine.field=event_ts
hoodie.datasource.write.partitionpath.field=region
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
# Table type and write operation are supplied on the DeltaStreamer command line (--table-type, --op)
# Storage location
hoodie.base.path=s3a://my-lake/hudi_orders
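The schema provider above points at an Avro schema file. A minimal sketch of schema.avsc for the hypothetical order events used in this post (only order_id, event_ts, and region are required by the configs above; the other fields are illustrative):
{
  "type": "record",
  "name": "Order",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "event_ts", "type": "string"},
    {"name": "region", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "status", "type": "string"}
  ]
}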
Step 2: Run DeltaStreamer in Continuous Mode
Use the following command to launch DeltaStreamer in continuous ingestion mode. The Kafka source class, schema provider class, ordering field, table type, and write operation are supplied as command-line options alongside the properties file:
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn \
--deploy-mode cluster \
/path/to/hudi-utilities-bundle.jar \
--props kafka-source.properties \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-ordering-field event_ts \
--target-base-path s3a://my-lake/hudi_orders \
--target-table hudi_orders \
--table-type COPY_ON_WRITE \
--op UPSERT \
--continuous
This runs indefinitely, continuously pulling new records from Kafka and committing them to the Hudi table in successive ingest rounds; the --min-sync-interval-seconds option controls the minimum gap between rounds. Dropping --continuous instead performs a single ingest round and exits, which is useful for backfills.
Step 3: Tune for Low Latency
- Shorten the ingest cycle by capping the records pulled per round and lowering the minimum interval between rounds:
hoodie.deltastreamer.kafka.source.maxEvents=5000
--min-sync-interval-seconds 30
- Keep compaction asynchronous (for MOR): in continuous mode DeltaStreamer compacts MOR tables asynchronously by default (turn it off only with --disable-compaction), and inline compaction should stay disabled so it never blocks ingestion:
hoodie.compact.inline=false
- Rely on automatic checkpointing: DeltaStreamer records the last consumed Kafka offsets in each commit's metadata on the Hudi timeline, so a restarted job resumes from where it left off (a starting offset can be forced with --checkpoint).
- Use Avro with a Schema Registry for schema evolution by switching the source and schema provider to the Avro/registry variants, as sketched below.
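A sketch of that Schema Registry variant, assuming a Confluent-compatible registry at an illustrative address; the source and schema provider classes change on the command line, and the registry URL goes into the properties file:
# Command-line options (replace the JSON source and file-based schema provider)
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider
# kafka-source.properties additions (registry host is illustrative)
hoodie.deltastreamer.schemaprovider.registry.url=http://schema-registry:8081/subjects/kafka_orders-value/versions/latest
schema.registry.url=http://schema-registry:8081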
Step 4: Query the Data
If Hive sync is enabled (via --enable-hive-sync and the hive_sync properties sketched below), each commit registers the table and its new partitions in the metastore, making the data queryable with standard SQL:
SELECT * FROM default.hudi_orders
WHERE region = 'us-east' AND event_ts > '2024-11-16T00:00:00';
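A minimal sketch of the Hive sync properties, assuming a HiveServer2 endpoint at an illustrative JDBC URL; these go into kafka-source.properties:
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=hudi_orders
hoodie.datasource.hive_sync.partition_fields=region
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000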
You can use:
- Presto / Trino
- Athena
- Hive
- Spark SQL
Monitoring and Observability
- Use Spark UI to track job progress and memory usage
- Enable Hudi metrics (the Graphite connection details are sketched after this list):
hoodie.metrics.on=true
hoodie.metrics.reporter.type=GRAPHITE
- Inspect the commit timeline using hudi-cli:
hudi-cli> connect --path s3a://my-lake/hudi_orders
hudi-cli> commits show
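The Graphite reporter also needs to know where to ship metrics; a sketch with illustrative host and prefix values:
hoodie.metrics.graphite.host=graphite.internal
hoodie.metrics.graphite.port=4756
hoodie.metrics.graphite.metric.prefix=hudi_orders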
Best Practices
- Use COPY_ON_WRITE when query latency matters most and MERGE_ON_READ when ingestion latency matters most (compaction then absorbs the merge cost)
- Use partitioning (e.g., by region, date) to boost query speed
- Tune Spark executors and write parallelism (Hudi's own shuffle parallelism can be tuned as well, as sketched after this list):
--conf spark.executor.memory=4G
--conf spark.sql.shuffle.partitions=100
- Use Kafka topic compaction and proper retention settings
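On the Hudi side, write parallelism is governed by the shuffle-parallelism properties; a sketch with illustrative values for the properties file:
hoodie.upsert.shuffle.parallelism=100
hoodie.insert.shuffle.parallelism=100
hoodie.bulkinsert.shuffle.parallelism=100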
Conclusion
Using Hudi DeltaStreamer with Kafka provides a robust, production-grade framework for building low-latency streaming pipelines into your data lake. It simplifies ingestion, supports updates and deletes, and enables fast downstream analytics.
With the right configuration and tuning, this architecture supports real-time use cases such as CDC ingestion, event stream processing, and operational analytics on top of cloud-native lakehouse storage.