Apache Hudi is a powerful data lakehouse framework that enables upserts, time-travel, and incremental querying on distributed storage systems like HDFS or S3. To simplify the process of ingesting data into Hudi tables, Hudi provides a built-in tool called DeltaStreamer.

Hudi DeltaStreamer offers a low-code, scalable way to ingest batch and streaming data from sources such as Kafka, HDFS, and other DFS-compatible storage (for example, S3). It handles checkpointing, schema management, compaction scheduling, and metadata syncing with the Hive Metastore or the AWS Glue Data Catalog.

In this guide, we’ll explore how to use DeltaStreamer to build production-ready ingestion pipelines with minimal code and maximum flexibility.


What is Hudi DeltaStreamer?

DeltaStreamer is a command-line utility bundled with Hudi that allows users to:

  • Ingest data from DFS sources or Kafka
  • Perform upserts, inserts, and bulk inserts
  • Support Copy-on-Write (COW) and Merge-on-Read (MOR) tables
  • Sync metadata with Hive or AWS Glue
  • Apply transformations using SQL queries or custom Transformer classes (see the snippet below)

It is ideal for users who don’t want to write Spark jobs manually.
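
For example, the SQL-based transformer mentioned above is wired in through an extra flag and a property on the spark-submit command. This is a minimal sketch: the filter and field name are placeholders, and <SRC> is DeltaStreamer's alias for the incoming batch.

--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--hoodie-conf "hoodie.deltastreamer.transformer.sql=SELECT * FROM <SRC> WHERE amount > 0"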


Supported Source Types

DeltaStreamer supports multiple source types via its Source classes:

Source Type   Description
DFS           Reads from CSV, JSON, or Parquet files
Kafka         Supports Avro or JSON messages
Hive          For SQL-based ingestion
JDBC          For relational database sources

For streaming ingestion, Kafka is the most common source. For batch, use DFS.
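
These map to fully qualified Source implementations passed via --source-class. The common ones are listed below (class names as of recent 0.x releases, so verify against your bundle version):

org.apache.hudi.utilities.sources.CsvDFSSource
org.apache.hudi.utilities.sources.JsonDFSSource
org.apache.hudi.utilities.sources.ParquetDFSSource
org.apache.hudi.utilities.sources.JsonKafkaSource
org.apache.hudi.utilities.sources.AvroKafkaSource
org.apache.hudi.utilities.sources.HiveIncrPullSource
org.apache.hudi.utilities.sources.JdbcSource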


Setting Up Hudi DeltaStreamer

Prerequisites:

  • Apache Hudi available as the hudi-utilities-bundle JAR (matching your Spark and Scala versions)
  • Spark 3.x compatible environment
  • Source data in S3, HDFS, or Kafka
  • Hive Metastore or AWS Glue configured (optional)

Basic directory layout:

/hudi
├── deltastreamer-config/
├── input-data/
└── scripts/

Example: Batch Ingestion from HDFS (DFS Source)

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn \
/path/to/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path hdfs:///user/hudi/orders_cow \
--target-table orders_cow \
--props /hudi/deltastreamer-config/orders.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-ordering-field ts \
--hoodie-conf hoodie.datasource.write.recordkey.field=order_id \
--hoodie-conf hoodie.datasource.write.partitionpath.field=order_date \
--op UPSERT
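
Run this way (without --continuous), DeltaStreamer performs a single round of ingestion and exits. The checkpoint it stores in the commit metadata lets the next run pick up only newly arrived files, so the command can simply be re-run on a schedule.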

Sample Properties File (orders.properties)

hoodie.deltastreamer.source.dfs.root=hdfs:///user/data/orders_raw
# Avro schema consumed by FilebasedSchemaProvider (path is illustrative)
hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs:///user/hudi/schemas/orders.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=hdfs:///user/hudi/schemas/orders.avsc
hoodie.datasource.write.table.name=orders_cow
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=orders_cow
hoodie.datasource.hive_sync.partition_fields=order_date
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.mode=hms

Example: Streaming Ingestion from Kafka

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn \
/path/to/hudi-utilities-bundle.jar \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--target-base-path s3a://datalake/transactions_mor \
--target-table transactions_mor \
--props /hudi/deltastreamer-config/transactions.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--continuous

This runs a long-running streaming job that continuously ingests and merges updates from Kafka.
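
A minimal transactions.properties for this job could look like the sketch below; the topic name, broker address, key fields, and schema paths are placeholders to adapt to your environment:

# Kafka source
hoodie.deltastreamer.source.kafka.topic=transactions
bootstrap.servers=kafka-broker:9092
auto.offset.reset=earliest

# Record keys and schema files
hoodie.datasource.write.recordkey.field=txn_id
hoodie.datasource.write.partitionpath.field=txn_date
hoodie.deltastreamer.schemaprovider.source.schema.file=s3a://datalake/schemas/transactions.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=s3a://datalake/schemas/transactions.avsc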


Schema Provider Options

  • FilebasedSchemaProvider: Reads an Avro schema from a file on DFS (used in the examples above)
  • SchemaRegistryProvider: Fetches Avro schemas from a Confluent Schema Registry
  • HiveSchemaProvider: Pulls the schema from an existing Hive table

Use the schema provider that best fits your environment and governance model.
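
For Kafka pipelines that publish Avro through Confluent Schema Registry, for instance, the registry-backed provider needs only the subject URL on the spark-submit command (the host and subject below are placeholders):

--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://schema-registry:8081/subjects/transactions-value/versions/latest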


Running Compaction for MOR Tables

For MOR tables, compaction periodically merges the row-based log files into the columnar base files. When DeltaStreamer runs in --continuous mode, it schedules and executes compaction asynchronously by default. To run compaction as a separate Spark job instead, use the standalone HoodieCompactor utility (the schema file path below is a placeholder):

spark-submit \
--class org.apache.hudi.utilities.HoodieCompactor \
--master yarn \
/path/to/hudi-utilities-bundle.jar \
--base-path s3a://datalake/transactions_mor \
--table-name transactions_mor \
--schema-file /hudi/deltastreamer-config/transactions.avsc

Alternatively, use inline compaction with:

hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=5
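
Inline compaction runs synchronously as part of each ingestion commit, which keeps the table compact without a separate job but adds time to every write. Continuous streaming pipelines typically rely on the default asynchronous compaction instead, which can be turned off with the --disable-compaction flag if you prefer to manage compaction with the standalone utility.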

Best Practices

  • Use dedicated schema providers for governance and stability
  • Partition on low-cardinality fields (e.g., date, region)
  • Use inline compaction for simpler batch pipelines and async compaction for latency-sensitive streaming
  • Enable Hive sync for downstream querying (Athena, Hive, Presto)
  • Monitor ingestion jobs via logs, checkpoints, and the Hudi CLI (see the snippet after this list)
  • For production, integrate with Airflow, Dagster, or Glue Workflows
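
For quick monitoring outside the Spark UI, the bundled Hudi CLI can inspect a table's timeline directly. A short sketch using standard hudi-cli commands (the table path is a placeholder):

hudi-cli
connect --path s3a://datalake/transactions_mor
commits show
compactions show all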

Conclusion

DeltaStreamer simplifies building streaming and batch ingestion pipelines into Hudi without writing Spark code. Whether you’re ingesting from Kafka, S3, or HDFS, DeltaStreamer provides a unified and configurable interface to build incremental, versioned, and queryable data lakes with minimal setup.

By adopting Hudi and DeltaStreamer, you can accelerate the path to a real-time, scalable, and cloud-native lakehouse architecture.