Apache Hudi (Hadoop Upserts Deletes and Incrementals) has emerged as a powerful transactional data lake framework that supports streaming and batch processing. It brings ACID semantics and efficient data management to distributed storage systems like HDFS, Amazon S3, or Azure Data Lake.

One of the key design choices in Hudi is the table type — specifically, choosing between Copy-on-Write (COW) and Merge-on-Read (MOR). This post explores both table types in depth, highlighting the trade-offs, performance implications, and ideal use cases for each.


Overview of Hudi Table Types

Apache Hudi supports two table types:

Table Type           | Storage Behavior                     | Query Performance      | Write Performance
---------------------|--------------------------------------|------------------------|--------------------------
Copy-on-Write (COW)  | Rewrites data files on every update  | Fast (columnar files)  | Slower (due to rewrites)
Merge-on-Read (MOR)  | Appends delta logs, compacts later   | Slower (needs merge)   | Faster (write-optimized)

Choosing the right type depends on your write/read workload, query latency requirements, and storage cost considerations.


Copy-on-Write (COW)

In COW mode:

  • Data is stored in columnar formats like Parquet
  • On every insert/update, the affected files are rewritten
  • Offers excellent read performance, especially for OLAP workloads

Use cases:

  • Batch ETL pipelines
  • Read-heavy analytics (e.g., BI tools, dashboards)
  • Datasets with few or infrequent updates

Example config:

hoodie.table.type = COPY_ON_WRITE
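
When writing through the Spark DataSource, the same choice is expressed via hoodie.datasource.write.table.type. Below is a minimal PySpark sketch of a COW upsert, assuming the Hudi Spark bundle is on the classpath; the table name, record key, precombine and partition columns, and S3 path are illustrative assumptions, not anything prescribed by Hudi.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-cow-example").getOrCreate()

# Hypothetical input: any DataFrame with id, ts, dt, and amount columns would do
df = spark.createDataFrame(
    [(1, "2024-01-01 00:00:00", "2024-01-01", 42.0)],
    ["id", "ts", "dt", "amount"],
)

(df.write.format("hudi")
    .option("hoodie.table.name", "orders_cow")                      # assumed table name
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.datasource.write.recordkey.field", "id")        # assumed record key
    .option("hoodie.datasource.write.precombine.field", "ts")       # assumed ordering field
    .option("hoodie.datasource.write.partitionpath.field", "dt")    # assumed partition column
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://datalake/hudi/orders_cow/"))                        # assumed base path

Every upsert against this table rewrites the affected Parquet files, which is exactly the trade-off listed under Cons below.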

Pros:

  • Fast read queries (columnar access)
  • Simplified storage layout
  • Ideal for immutable or append-only data

Cons:

  • Slower write times
  • Resource-intensive during updates

Merge-on-Read (MOR)

In MOR mode:

  • Inserts go to base files (e.g., Parquet), while updates are appended to delta logs (e.g., Avro)
  • Query engines merge base files + logs at read time
  • Periodic compaction rewrites base files for better performance

Use cases:

  • Real-time ingestion (e.g., CDC streams)
  • Write-heavy workloads
  • Low-latency update pipelines

Example config:

hoodie.table.type = MERGE_ON_READ
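
The write path looks almost identical to the COW sketch above; only the table type option changes (same assumed DataFrame, columns, and naming). Updates now land in Avro log files alongside the Parquet base files instead of rewriting them.

# Same assumed DataFrame as the COW sketch; only the table type differs
(df.write.format("hudi")
    .option("hoodie.table.name", "orders_mor")                      # assumed table name
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://datalake/hudi/orders_mor/"))                        # matches the read example later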

Pros:

  • Faster ingestion and updates
  • Supports near real-time processing
  • Efficient for frequent mutations (upserts/deletes)

Cons:

  • Slower reads due to on-the-fly merge
  • Requires compaction for long-term performance

Compaction in MOR Tables

Compaction is the process of merging delta logs with base files to:

  • Improve query speed
  • Reduce the number of small files
  • Maintain columnar format

Scheduled compaction:

hoodie.compact.inline = true
hoodie.compact.inline.max.delta.commits = 5
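
These properties can live in the table's configuration or be passed straight through the writer. A hedged sketch, reusing the assumed orders_mor table and columns from the previous section:

# Inline compaction for the assumed MOR table: a compaction is triggered as part
# of the write once 5 delta commits have accumulated.
hudi_options = {
    "hoodie.table.name": "orders_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://datalake/hudi/orders_mor/"   # assumed base path
)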

You can also run compaction manually via the Hudi CLI or a standalone Spark job.


Query Modes: Snapshot vs. Read-Optimized

Hudi supports two primary read modes for these tables (a third, incremental queries, returns only the records changed since a given commit):

  1. Snapshot: Returns the most recent state (base + log for MOR)
  2. Read-optimized: Only reads base files (faster, may skip recent updates in MOR)

Choose based on your freshness and latency requirements.
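
In the Spark DataSource these modes map to the hoodie.datasource.query.type option. A snapshot read appears in the Spark example further down; for contrast, a read-optimized scan of the same assumed MOR path looks like this:

# Read-optimized query: scans only compacted base files, trading freshness for speed
ro_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load("s3://datalake/hudi/orders_mor/"))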


When to Use COW vs MOR

Scenario                          | Recommended Table Type
----------------------------------|------------------------
Daily batch ETL and BI queries    | COW
Real-time ingestion with upserts  | MOR
Append-only datasets              | COW
High write frequency (e.g., IoT)  | MOR
Fast read latency needed          | COW
Low update-to-read ratio          | COW

For hybrid use cases, MOR offers more flexibility, while COW simplifies storage and query patterns.


Integrating with Query Engines

Both COW and MOR tables are accessible via:

  • Apache Hive
  • Apache Spark SQL
  • Presto/Trino
  • Apache Flink (via Hudi connector)

Enable a Hudi-aware catalog or use the Hudi Hive Sync Tool to keep the table's schema registered in your metastore.
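
Hive Sync can also run as part of the Hudi write itself. A rough sketch of the relevant options, where the database, table, and partition names are assumptions carried over from the earlier sketches:

# Hive metastore sync options layered on top of the earlier writer options
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",             # sync via the Hive Metastore API
    "hoodie.datasource.hive_sync.database": "analytics",   # assumed database
    "hoodie.datasource.hive_sync.table": "orders_mor",     # assumed table name
    "hoodie.datasource.hive_sync.partition_fields": "dt",  # assumed partition column
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

(df.write.format("hudi")
    .options(**hudi_options)        # writer options from the compaction sketch above
    .options(**hive_sync_options)
    .mode("append")
    .save("s3://datalake/hudi/orders_mor/"))

Note that for MOR tables, Hive Sync typically registers two views of the same data, a read-optimized one and a real-time (snapshot) one, which matters when pointing BI tools or Presto/Trino at the table.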

Example Spark SQL read:

df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://datalake/hudi/orders_mor/"))

Best Practices

  • Use COW when query performance is the top priority
  • Use MOR for streaming pipelines or CDC ingestion
  • Schedule compaction jobs in MOR to avoid performance degradation
  • Monitor small file growth and manage commit frequency
  • Use partitioning wisely to optimize both write and read paths

Conclusion

Choosing between COW and MOR in Apache Hudi is crucial for building efficient, scalable, and reliable data lakes. While COW offers simplicity and read speed, MOR brings flexibility and performance for high-velocity, real-time data ingestion.

By understanding the trade-offs and applying them to your data workloads, you can design Hudi tables that meet both your operational and analytical requirements in the modern data lakehouse architecture.