Modern data lakes need to support real-time ingestion, incremental processing, and efficient querying — all while scaling to handle petabytes of data. Apache Hudi (Hadoop Upserts Deletes and Incrementals) addresses these challenges by bringing transactional capabilities and streaming semantics to data lakes on Hadoop-compatible storage.

In this blog, we’ll explore the architecture and core components of Apache Hudi, how it enables ACID-like guarantees on object stores, and the moving parts behind its fast, flexible data ingestion framework.


What is Apache Hudi?

Apache Hudi is an open-source data lake platform that enables:

  • Upserts and deletes in data lakes
  • Incremental pull of data changes
  • Time travel and data versioning
  • Streaming + batch ingestion
  • Efficient querying via Hive, Spark, Presto, and Trino

It supports both copy-on-write (COW) and merge-on-read (MOR) storage strategies, optimized for read-heavy or write-heavy use cases, respectively.


High-Level Hudi Architecture

At a high level, Hudi consists of the following layers:

  1. Ingestion Layer (e.g., DeltaStreamer, Spark Jobs)
  2. Storage Layer (COW/MOR files on HDFS, S3, GCS)
  3. Metadata Layer (commit timeline, metadata table)
  4. Query Layer (Hive, Spark SQL, Presto, Trino, Flink)

These layers work together to bring transactional data management to large-scale object storage systems.


Core Components of Hudi

Let’s dive into the major components that make Hudi work:


1. Write Client

The Hudi Write Client is responsible for ingesting and writing data into Hudi tables. It supports operations like:

  • UPSERT: Insert new records and update existing ones
  • INSERT: Insert-only workload
  • BULK_INSERT: High-throughput batch ingestion
  • DELETE: Remove records by key
A typical upsert through the Spark DataSource API looks like this:

import org.apache.spark.sql.SaveMode

// Core write configuration: table name, record key, and write operation
val hudiOptions = Map(
  "hoodie.table.name" -> "user_events",
  "hoodie.datasource.write.recordkey.field" -> "user_id",
  "hoodie.datasource.write.operation" -> "upsert"
)

data.write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save("/data/hudi/user_events")

This client coordinates with the Timeline Server and metadata table to manage writes and commit operations.
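
For example, a delete can be issued through the same DataSource API by switching the write operation. A minimal sketch, assuming a hypothetical keysToDelete DataFrame that carries the user_id values to remove:

// keysToDelete is a hypothetical DataFrame containing only the record keys to remove
keysToDelete.write
  .format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "user_id")
  .option("hoodie.datasource.write.operation", "delete")
  .mode(SaveMode.Append)
  .save("/data/hudi/user_events")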


2. Timeline and Commit Protocol

Hudi tracks every operation using a commit timeline. Each write produces an instant (a timestamped action on the timeline), which moves through states such as:

  • .commit: Successfully completed commit
  • .inflight: In-progress operation
  • .requested: Operation request logged

These instants are visible as files in the .hoodie metadata directory:

/data/hudi/.hoodie/
├── 20240410091523.commit
├── 20240410092000.inflight
└── 20240410093000.requested

The timeline enables rollback, clustering, and incremental reads.
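
Every record written by Hudi also carries a _hoodie_commit_time metadata column, so you can see which instants produced the rows you are reading, which is a convenient starting point for incremental pulls. A minimal sketch against the earlier table:

// List the distinct commit instants present in the table's current snapshot
val commits = spark.read.format("hudi")
  .load("/data/hudi/user_events")
  .select("_hoodie_commit_time")
  .distinct()
  .orderBy("_hoodie_commit_time")

commits.show(false)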


3. Metadata Table

The Hudi Metadata Table stores auxiliary information like:

  • File listings
  • Column statistics
  • Record locations

It improves query planning speed by eliminating costly file listing operations — especially helpful on object stores like S3 or GCS.

Enable it via:

hoodie.metadata.enable=true
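
If you write through the Spark DataSource, the same flag can also be passed as a write option. A small sketch reusing the hudiOptions map from earlier:

// Enable the metadata table from the writer side as well
val optionsWithMetadata = hudiOptions + ("hoodie.metadata.enable" -> "true")

data.write
  .format("hudi")
  .options(optionsWithMetadata)
  .mode(SaveMode.Append)
  .save("/data/hudi/user_events")
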

4. DeltaStreamer

DeltaStreamer is a built-in tool for ingesting data from sources like Kafka, Hive, RDBMS (via Sqoop), or files. It supports:

  • Continuous or batch ingestion
  • Source-to-target schema evolution
  • ETL with transformation logic

Run it via spark-submit with the Hudi utilities bundle:

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --target-base-path s3://data/hudi/events \
  --target-table events

DeltaStreamer helps avoid writing custom Spark jobs for ingestion.
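
In practice the source itself is configured through a properties file passed with --props. The exact keys depend on the chosen source; for a Kafka source they typically look like the sketch below (the topic name and broker address are placeholders):

# kafka-source.properties (illustrative values)
hoodie.deltastreamer.source.kafka.topic=user_events
bootstrap.servers=localhost:9092
auto.offset.reset=earliest
hoodie.datasource.write.recordkey.field=user_id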


5. Storage Types: Copy-on-Write (COW) vs Merge-on-Read (MOR)

Feature              Copy-on-Write (COW)        Merge-on-Read (MOR)
Write Cost           Higher (rewrites files)    Lower (writes delta logs)
Read Performance     Fast                       Slower (requires merge)
Use Case             Read-heavy workloads       Write-heavy + streaming
Compaction Required  No                         Yes

Choose COW for frequent reads and MOR for frequent updates or streaming ingestion.
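
The table type is fixed when a table is first created and is selected through a write option. A minimal sketch for creating a MOR table, reusing the options and a path in the style of the earlier example:

// Write to a Merge-on-Read table instead of the default Copy-on-Write
data.write
  .format("hudi")
  .options(hudiOptions)
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .mode(SaveMode.Append)
  .save("/data/hudi/user_events_mor")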


6. Query Engines and Access Methods

Hudi integrates with multiple engines:

  • Apache Hive (via Hive InputFormat)
  • Spark SQL (spark.read.format("hudi"))
  • Presto/Trino (hive catalog with Hudi tables)
  • Flink for real-time processing

It supports snapshot queries, incremental queries, and point-in-time queries:

Incremental read example:

val df = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20240401000000")
  .load("/data/hudi/user_events")

Bonus: Key Hudi Table Types

  1. Copy-on-Write Table: Stores only base files
  2. Merge-on-Read Table: Stores base files + delta logs
  3. Bootstrap Table: Helps migrate existing data into Hudi format with minimal rewrite

Best Practices

  • Enable metadata table for large datasets
  • Use clustering to optimize file sizes for downstream engines
  • Schedule compaction periodically for MOR tables (see the configuration sketch after this list)
  • Choose UPSERT vs BULK_INSERT depending on ingestion frequency
  • Monitor commits via timeline for observability and rollback
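
For MOR tables, compaction and clustering can also be triggered inline from the writer via configuration. A hedged sketch of the relevant knobs (the threshold values are illustrative, not recommendations):

// Illustrative maintenance options: inline compaction (MOR) plus inline clustering
val maintenanceOptions = Map(
  "hoodie.compact.inline" -> "true",                // compact delta logs after writes on MOR tables
  "hoodie.compact.inline.max.delta.commits" -> "5", // example trigger: every 5 delta commits
  "hoodie.clustering.inline" -> "true",             // rewrite small files into larger ones
  "hoodie.clustering.inline.max.commits" -> "4"     // example trigger: every 4 commits
)

data.write
  .format("hudi")
  .options(hudiOptions ++ maintenanceOptions)
  .mode(SaveMode.Append)
  .save("/data/hudi/user_events")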

Conclusion

Apache Hudi revolutionizes the way we build modern data lakes by supporting real-time ingestion, ACID operations, and incremental analytics at scale. Understanding its architecture — from the Write Client and Timeline to the Metadata Table and DeltaStreamer — equips you to build faster, more reliable, and cost-efficient big data pipelines.

Whether you’re ingesting millions of records per hour or supporting time-travel analytics, Hudi offers the flexibility and power to keep your lake fresh and query-ready.