Modern data-driven applications demand low-latency data availability for analytics and decision-making. Traditional batch-based data lakes struggle to meet this need. Enter Apache Hudi and Apache Kafka — a powerful combination for building real-time ingestion pipelines into your data lake or lakehouse architecture.

In this post, we explore how to use Apache Hudi with Kafka to capture and store streaming data in real time. We’ll walk through the architecture, setup options, and best practices for reliable, scalable ingestion.


Why Use Apache Hudi and Kafka?

Feature                     | Apache Kafka             | Apache Hudi
--------------------------- | ------------------------ | --------------------------------
Message Queue               | Yes                      | No
Storage Layer               | No                       | Yes
Real-Time Ingestion         | Yes (producer/consumer)  | Yes (streaming and batch writes)
Incremental Queries         | No                       | Yes
Upserts/Deletes             | No                       | Yes
Integration with Hive/Spark | Limited                  | Native

Together, Kafka and Hudi allow:

  • Real-time data ingestion
  • ACID-compliant writes to the data lake
  • Near real-time analytical query support

Architecture Overview

[Kafka Producers] → [Kafka Topic] → [Hudi Streamer or Spark Job] → [Hudi Table in HDFS/S3] → [Query Engines: Presto, Hive, Spark]

You can use:

  • Hudi DeltaStreamer for low-code ingestion
  • Custom Spark Streaming jobs for full flexibility (see the sketch below)
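
For the custom route, here is a minimal PySpark Structured Streaming sketch that reads the Kafka topic and upserts into a Hudi table. The topic name, paths, and event schema are illustrative assumptions, and the Hudi Spark bundle plus the Spark-Kafka connector must be on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
    .appName("kafka-to-hudi")
    # Kryo serialization is recommended for Hudi Spark jobs
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())

# Assumed shape of the JSON events on the topic
schema = StructType([
    StructField("id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "hudi-events")
    .option("startingOffsets", "earliest")
    .load())

events = (raw
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

hudi_options = {
    "hoodie.table.name": "hudi_events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

(events.writeStream
    .format("hudi")
    .options(**hudi_options)
    .option("checkpointLocation", "hdfs:///datalake/checkpoints/hudi_events")
    .outputMode("append")
    .start("hdfs:///datalake/hudi/events")
    .awaitTermination())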

Setting Up Apache Hudi with Kafka

1. Kafka Topic with Streaming Data

Use Kafka Connect or your application to produce JSON or Avro messages to a Kafka topic:

kafka-topics.sh --create --topic hudi-events --bootstrap-server localhost:9092 --partitions 3
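
To see what lands on the topic, here is a minimal producer sketch using the confluent-kafka Python client (the client choice and event fields are illustrative assumptions; any Kafka producer, or Kafka Connect, works the same way):

import json
import time
import uuid

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {
    "id": str(uuid.uuid4()),        # becomes the Hudi record key
    "event_type": "click",
    "ts": int(time.time() * 1000),  # becomes the precombine/ordering field
}

# Keying by the record key keeps updates for the same id on one partition
producer.produce("hudi-events", key=event["id"], value=json.dumps(event))
producer.flush()
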
2. Configure DeltaStreamer

DeltaStreamer reads from Kafka and writes to Hudi tables using checkpointed streaming.

Example kafka-source.properties:

# Kafka source settings consumed by the JsonKafkaSource
hoodie.deltastreamer.source.kafka.topic=hudi-events
bootstrap.servers=localhost:9092
auto.offset.reset=earliest

# Hudi write settings
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.precombine.field=ts

The source class, target path, table name, and table type are passed as spark-submit arguments rather than properties, as shown next.

Run DeltaStreamer:

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
hudi-utilities-bundle.jar \
--props kafka-source.properties \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--table-type MERGE_ON_READ \
--target-base-path hdfs:///datalake/hudi/events \
--target-table hudi_events \
--continuous

Write Modes: COPY_ON_WRITE vs MERGE_ON_READ

Mode          | Characteristics
------------- | -------------------------------------------------------------------------------
COPY_ON_WRITE | Fast reads, slower writes; every update rewrites the affected Parquet files
MERGE_ON_READ | Faster writes; updates go to row-based delta logs merged with Parquet base files at read or compaction time

Use MERGE_ON_READ for real-time data with upserts. Choose COPY_ON_WRITE if you prioritize query speed.
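
If you write with the Spark datasource instead of DeltaStreamer, the table type is picked through a write option. A minimal batch-write sketch, assuming df is a DataFrame with the event fields used above:

(df.write
    .format("hudi")
    .option("hoodie.table.name", "hudi_events")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  # or "COPY_ON_WRITE"
    .mode("append")
    .save("hdfs:///datalake/hudi/events"))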


Querying Hudi Data in Hive/Spark

Register the table in the Hive Metastore so SQL engines can query it. Hudi's Hive sync tool can do this automatically; the DDL below is the manual equivalent using Hudi's Hive input format:

CREATE EXTERNAL TABLE hudi_events (
  id STRING,
  event_type STRING,
  ts TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///datalake/hudi/events';

Query in Spark:

df = spark.read.format("hudi").load("hdfs:///datalake/hudi/events")
df.createOrReplaceTempView("events")
spark.sql("SELECT * FROM events WHERE ts > current_timestamp() - interval 10 minutes").show()

Enabling Incremental Queries

Hudi allows incremental reads based on commit time:

incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240411000000") \
    .load("hdfs:///datalake/hudi/events")

This lets downstream pipelines consume only the records committed after the specified begin instant, rather than rescanning the whole table.
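
To chain incremental pulls, a downstream job can persist the highest commit time it has processed and pass it as the next begin instant. A small sketch using the _hoodie_commit_time meta column that Hudi adds to every record (where you store the bookmark is up to you):

from pyspark.sql import functions as F

# Highest commit time seen in this incremental batch; persist it and reuse it as
# hoodie.datasource.read.begin.instanttime on the next run.
next_begin = incremental_df.agg(F.max("_hoodie_commit_time").alias("c")).collect()[0]["c"]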


Error Handling and Checkpointing

DeltaStreamer stores its Kafka offsets in the Hudi commit metadata, and custom Spark Structured Streaming jobs checkpoint offsets to a durable location such as HDFS or S3. This checkpointing:

  • Prevents message loss
  • Ensures at-least-once delivery
  • Supports restart on failure

Monitor checkpoint progress, and keep checkpoint data on durable storage (HDFS/S3) rather than local disk.


Best Practices

  • Use JSON or Avro for Kafka payloads (schema evolution supported)
  • Enable Hive sync for SQL-based access
  • Use partitioning on ingestion timestamp or logical key
  • Set hoodie.cleaner.policy to manage old versions
  • Use MERGE_ON_READ with compaction for low-latency ingestion
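
As a rough illustration, the practices above map to Hudi options like the following Python dict, which can be passed to a Spark write via .options(**tuning_options) or copied into the DeltaStreamer properties file. The specific values are assumptions to adjust for your workload:

tuning_options = {
    "hoodie.datasource.write.partitionpath.field": "event_type",  # partition on a logical key
    "hoodie.datasource.hive_sync.enable": "true",                 # keep the Hive Metastore in sync
    "hoodie.datasource.hive_sync.table": "hudi_events",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",               # bound how many old file versions are kept
    "hoodie.cleaner.commits.retained": "10",
    "hoodie.compact.inline": "true",                              # compact MERGE_ON_READ delta logs
    "hoodie.compact.inline.max.delta.commits": "5",
}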

Conclusion

Apache Hudi and Kafka together form a real-time ingestion powerhouse, giving you the flexibility of Kafka with the durability and ACID semantics of Hudi. With the right configuration, you can deliver real-time analytics, incremental ETL, and streaming lakehouse architectures without compromising performance or consistency.

This architecture empowers teams to build low-latency, reliable pipelines that deliver insights from fresh data in seconds.