Using Hudi with DeltaStreamer for Simplified Data Ingestion
Streamline big data ingestion using Apache Hudi’s DeltaStreamer for scalable, incremental pipelines
Apache Hudi is a powerful data lakehouse framework that enables upserts, time-travel, and incremental querying on distributed storage systems like HDFS or S3. To simplify the process of ingesting data into Hudi tables, Hudi provides a built-in tool called DeltaStreamer.
Hudi DeltaStreamer offers a low-code, scalable way to ingest batch and streaming data from sources such as Kafka, HDFS, and other DFS-compatible storage. It handles checkpointing, schema management through pluggable schema providers, compaction, and metadata syncing with the Hive Metastore or the AWS Glue Data Catalog.
In this guide, we’ll explore how to use DeltaStreamer to build production-ready ingestion pipelines with minimal code and maximum flexibility.
What is Hudi DeltaStreamer?
DeltaStreamer is a Spark-based command-line utility shipped in Hudi's utilities bundle and launched via spark-submit. It allows users to:
- Ingest data from DFS sources or Kafka
- Perform upserts, inserts, and bulk inserts
- Support Copy-on-Write (COW) and Merge-on-Read (MOR) tables
- Sync metadata with Hive or AWS Glue
- Apply basic transformations using SQL or custom transformer classes (see the sketch below)
It is ideal for users who don’t want to write Spark jobs manually.
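The transformation hook mentioned above is wired in with the --transformer-class option plus a SQL statement passed through the Hudi properties. A minimal sketch using the built-in SqlQueryBasedTransformer, where the WHERE clause and the amount column are purely illustrative:
--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--hoodie-conf "hoodie.deltastreamer.transformer.sql=SELECT * FROM <SRC> WHERE amount > 0"
The query references the incoming batch as the <SRC> table, and its result is what gets written to the Hudi table.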
Supported Source Types
DeltaStreamer supports multiple source types via its Source classes:
| Source Type | Description |
| --- | --- |
| DFS | Reads CSV, JSON, or Parquet files |
| Kafka | Supports Avro/JSON messages |
| Hive | SQL-based ingestion from Hive tables |
| JDBC | Relational database sources |
For streaming ingestion, Kafka is the most common source. For batch, use DFS.
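On the command line, the source is selected with --source-class, passing a fully qualified class name from the org.apache.hudi.utilities.sources package. A few commonly used classes (the exact set depends on your Hudi version):
org.apache.hudi.utilities.sources.ParquetDFSSource   # Parquet files on DFS
org.apache.hudi.utilities.sources.JsonDFSSource      # JSON files on DFS
org.apache.hudi.utilities.sources.CsvDFSSource       # CSV files on DFS
org.apache.hudi.utilities.sources.JsonKafkaSource    # JSON messages from Kafka
org.apache.hudi.utilities.sources.AvroKafkaSource    # Avro messages from Kafka
org.apache.hudi.utilities.sources.JdbcSource         # Relational databases via JDBC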
Setting Up Hudi DeltaStreamer
Prerequisites:
- The Hudi utilities bundle JAR (hudi-utilities-bundle) available to spark-submit (see the download example below)
- Spark 3.x compatible environment
- Source data in S3, HDFS, or Kafka
- Hive Metastore or AWS Glue configured (optional)
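If you don't already have the bundle JAR, it can be downloaded from Maven Central; the Scala and Hudi versions below are only examples and should match your cluster:
wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.14.1/hudi-utilities-bundle_2.12-0.14.1.jar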
Basic directory layout:
/hudi
├── deltastreamer-config/
├── input-data/
└── scripts/
Example: Batch Ingestion from HDFS (DFS Source)
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn \
/path/to/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path hdfs:///user/hudi/orders_cow \
--target-table orders_cow \
--props /hudi/deltastreamer-config/orders.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-ordering-field ts \
--hoodie-conf hoodie.datasource.write.recordkey.field=order_id \
--hoodie-conf hoodie.datasource.write.partitionpath.field=order_date \
--op UPSERT
Sample Properties File (orders.properties)
hoodie.deltastreamer.source.dfs.root=hdfs:///user/data/orders_raw
hoodie.datasource.write.table.name=orders_cow
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=orders_cow
hoodie.datasource.hive_sync.partition_fields=order_date
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.mode=hms
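Because the command above uses FilebasedSchemaProvider, the properties file must also point at an Avro schema file; the paths below are examples matching the layout shown earlier (the target schema commonly reuses the source schema):
hoodie.deltastreamer.schemaprovider.source.schema.file=/hudi/deltastreamer-config/orders.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/hudi/deltastreamer-config/orders.avsc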
Example: Streaming Ingestion from Kafka
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn \
/path/to/hudi-utilities-bundle.jar \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--target-base-path s3a://datalake/transactions_mor \
--target-table transactions_mor \
--props /hudi/deltastreamer-config/transactions.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--continuous
This runs a long-running streaming job that continuously ingests and merges updates from Kafka.
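The transactions.properties file carries the Kafka connection and Hudi key settings. A sketch, where the topic name, broker address, and field names (txn_id, txn_date, ts) are placeholders for your own data:
# Kafka source (substitute your topic and brokers)
hoodie.deltastreamer.source.kafka.topic=transactions
bootstrap.servers=kafka-broker:9092
auto.offset.reset=earliest
# Hudi write keys (substitute your own record key, partition, and ordering fields)
hoodie.datasource.write.recordkey.field=txn_id
hoodie.datasource.write.partitionpath.field=txn_date
hoodie.datasource.write.precombine.field=ts
# Schema file consumed by FilebasedSchemaProvider
hoodie.deltastreamer.schemaprovider.source.schema.file=/hudi/deltastreamer-config/transactions.avsc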
Schema Provider Options
- FilebasedSchemaProvider: Reads an Avro schema from a file on local disk or DFS
- SchemaRegistryProvider: Pulls schemas from a Confluent Schema Registry (common with Kafka sources)
- HiveSchemaProvider: Pulls the schema from an existing Hive table
Use the schema provider that best fits your environment and governance model.
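For the file-based option, the schema file is a plain Avro schema. An illustrative orders.avsc matching the fields used in the batch example (the amount column is only an example):
{
  "type": "record",
  "name": "Order",
  "namespace": "example.orders",
  "fields": [
    {"name": "order_id",   "type": "string"},
    {"name": "order_date", "type": "string"},
    {"name": "ts",         "type": "long"},
    {"name": "amount",     "type": "double"}
  ]
}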
Running Compaction for MOR Tables
For MOR tables, compaction merges the delta log files into base files. When DeltaStreamer runs in --continuous mode, it schedules and executes compaction asynchronously by default (this can be turned off with --disable-compaction). Compaction can also be run as a separate offline job with the standalone HoodieCompactor utility:
spark-submit \
--class org.apache.hudi.utilities.HoodieCompactor \
--master yarn \
/path/to/hudi-utilities-bundle.jar \
--base-path s3a://datalake/transactions_mor \
--table-name transactions_mor \
--schema-file /path/to/transactions.avsc \
--mode scheduleAndExecute
Alternatively, enable inline compaction, which compacts synchronously as part of each write, with:
hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=5
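These settings can live in the properties file or be passed directly on the DeltaStreamer command line:
--hoodie-conf hoodie.compact.inline=true \
--hoodie-conf hoodie.compact.inline.max.delta.commits=5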
Best Practices
- Use dedicated schema providers for governance and stability
- Partition on low-cardinality fields (e.g., date, region)
- Use inline compaction for low-latency use cases
- Enable Hive sync for downstream querying (Athena, Hive, Presto)
- Monitor ingestion jobs via logs and checkpoints
- For production, integrate with Airflow, Dagster, or Glue Workflows
Conclusion
DeltaStreamer simplifies building streaming and batch ingestion pipelines into Hudi without writing Spark code. Whether you’re ingesting from Kafka, S3, or HDFS, DeltaStreamer provides a unified and configurable interface to build incremental, versioned, and queryable data lakes with minimal setup.
By adopting Hudi and DeltaStreamer, you can accelerate the path to a real-time, scalable, and cloud-native lakehouse architecture.