Exploring Hudi Table Types: COW vs. MOR
Understand the difference between Hudi's Copy-on-Write and Merge-on-Read tables for efficient data lake management
Apache Hudi (Hadoop Upserts Deletes and Incrementals) has emerged as a powerful transactional data lake framework that supports streaming and batch processing. It brings ACID semantics and efficient data management to distributed storage systems like HDFS, Amazon S3, or Azure Data Lake.
One of the key design choices in Hudi is the table type — specifically, choosing between Copy-on-Write (COW) and Merge-on-Read (MOR). This post explores both table types in depth, highlighting the trade-offs, performance implications, and ideal use cases for each.
Overview of Hudi Table Types
Apache Hudi supports two table types:
| Table Type | Storage Behavior | Query Performance | Write Performance |
| --- | --- | --- | --- |
| Copy-on-Write (COW) | Rewrites data files on every update | Fast (columnar files) | Slower (due to rewrites) |
| Merge-on-Read (MOR) | Appends delta logs, compacts later | Slower (needs merge) | Faster (write-optimized) |
Choosing the right type depends on your write/read workload, query latency requirements, and storage cost considerations.
Copy-on-Write (COW)
In COW mode:
- Data is stored in columnar formats like Parquet
- On every insert/update, the affected files are rewritten
- Offers excellent read performance, especially for OLAP workloads
Use cases:
- Batch ETL pipelines
- Read-heavy analytics (e.g., BI tools, dashboards)
- Data sets with fewer updates or slow update frequency
Example config:

```properties
hoodie.table.type = COPY_ON_WRITE
```
Pros:
- Fast read queries (columnar access)
- Simplified storage layout
- Ideal for immutable or append-only data
Cons:
- Slower write times
- Resource-intensive during updates
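The rewrite cost is easiest to see with a toy model (plain Python, not Hudi's API): each "file" is an immutable group of records, and an upsert to any record forces a rewrite of the whole file that contains it.

```python
# Toy model of Copy-on-Write semantics (illustrative only, not Hudi's API).
# Each "file" is an immutable tuple of (key, value) records; an upsert
# rewrites the entire file containing the affected key.

def cow_upsert(files, key, value):
    """Return a new list of files with `key` upserted.

    Any file containing `key` is fully rewritten; untouched files are
    reused as-is, mirroring how COW rewrites Parquet files on update.
    """
    new_files = []
    found = False
    for f in files:
        if any(k == key for k, _ in f):
            # Rewrite the whole file with the updated record.
            new_files.append(tuple((k, value if k == key else v) for k, v in f))
            found = True
        else:
            new_files.append(f)  # unchanged file, no rewrite cost
    if not found:
        new_files.append(((key, value),))  # an insert lands in a new file
    return new_files

files = [(("a", 1), ("b", 2)), (("c", 3),)]
files = cow_upsert(files, "b", 20)  # rewrites the first file only
```

Updating a single record in a two-record file still rewrites both records, which is exactly why frequent small updates make COW writes expensive.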
Merge-on-Read (MOR)
In MOR mode:
- Inserts go to base files (e.g., Parquet), while updates are appended to delta logs (e.g., Avro)
- Query engines merge base files + logs at read time
- Periodic compaction rewrites base files for better performance
Use cases:
- Real-time ingestion (e.g., CDC streams)
- Write-heavy workloads
- Low-latency update pipelines
Example config:

```properties
hoodie.table.type = MERGE_ON_READ
```
Pros:
- Faster ingestion and updates
- Supports near real-time processing
- Efficient for frequent mutations (upserts/deletes)
Cons:
- Slower reads due to on-the-fly merge
- Requires compaction for long-term performance
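The read-time merge can be sketched in plain Python (a toy model, not Hudi internals): writes cheaply append to a delta log, and a query replays that log over the base records, with the latest entry per key winning.

```python
# Toy model of Merge-on-Read semantics (illustrative only, not Hudi's API).
# Writes append to a delta log; reads merge base records with the log.

def mor_write(log, key, value):
    """Appending to the log is cheap: no base file is rewritten."""
    log.append((key, value))

def mor_read(base, log):
    """Merge base records with the delta log at query time."""
    merged = dict(base)
    for key, value in log:          # replay log entries in commit order
        if value is None:
            merged.pop(key, None)   # a None value models a delete
        else:
            merged[key] = value
    return merged

base = {"a": 1, "b": 2}
log = []
mor_write(log, "b", 20)    # update: appended, base untouched
mor_write(log, "c", 3)     # insert
mor_write(log, "a", None)  # delete
```

Note that every read pays the replay cost until the log is folded back into the base, which is the degradation that compaction (next section) addresses.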
Compaction in MOR Tables
Compaction is the process of merging delta logs with base files to:
- Improve query speed
- Reduce number of small files
- Maintain columnar format
Scheduled (inline) compaction:

```properties
hoodie.compact.inline = true
hoodie.compact.inline.max.delta.commits = 5
```
You can also trigger compaction manually using the Hudi CLI or a Spark job.
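In a toy model (plain Python, not Hudi's API), compaction amounts to folding the delta log into the base records and starting a fresh, empty log, so subsequent reads no longer pay the merge cost.

```python
# Toy model of MOR compaction (illustrative only, not Hudi's API): merge
# the delta log into the base records and reset the log.

def compact(base, log):
    """Return (new_base, new_log) after folding the log into the base."""
    new_base = dict(base)
    for key, value in log:           # replay entries in commit order
        if value is None:
            new_base.pop(key, None)  # a None value models a delete
        else:
            new_base[key] = value
    return new_base, []              # compaction leaves an empty log

base, log = {"a": 1}, [("a", 10), ("b", 2)]
base, log = compact(base, log)
```

After compaction the base holds the merged state and the log is empty, mirroring how Hudi rewrites columnar base files from accumulated log blocks.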
Query Modes: Snapshot vs. Read-Optimized
Hudi supports two read modes:
- Snapshot: Returns the most recent state (base + log for MOR)
- Read-optimized: Only reads base files (faster, may skip recent updates in MOR)
Choose based on your freshness and latency requirements.
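The contrast between the two modes can be illustrated with a small sketch (plain Python, not Hudi's API): snapshot reads replay the delta log, while read-optimized reads skip it entirely.

```python
# Toy contrast between the two MOR read modes (illustrative only).

def snapshot_read(base, log):
    """Most recent state: base records plus replayed log entries."""
    merged = dict(base)
    merged.update(log)  # later log entries win for the same key
    return merged

def read_optimized_read(base, log):
    """Base files only: faster, but recent log entries are not visible."""
    return dict(base)

base = {"a": 1}
log = [("a", 10), ("b", 2)]
```

Here a snapshot read sees the updated `a` and the new `b`, while a read-optimized read still returns the stale base value of `a`, which is the freshness trade-off described above.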
When to Use COW vs MOR
| Scenario | Recommended Table Type |
| --- | --- |
| Daily batch ETL and BI queries | COW |
| Real-time ingestion with upserts | MOR |
| Append-only datasets | COW |
| High write frequency (e.g., IoT) | MOR |
| Fast read latency needed | COW |
| Low update-to-read ratio | COW |
For hybrid use cases, MOR offers more flexibility, while COW simplifies storage and query patterns.
Integrating with Query Engines
Both COW and MOR tables are accessible via:
- Apache Hive
- Apache Spark SQL
- Presto/Trino
- Apache Flink (via Hudi connector)
Enable the Hudi catalog or use the Hudi Hive Sync Tool to keep table schemas registered in the metastore.
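For Spark, the table is typically read through the Hudi bundle on the classpath; a launch might look like the following (the bundle coordinates are illustrative and must match your Spark, Scala, and Hudi versions):

```shell
# Launch a Spark shell with the Hudi bundle (coordinates are illustrative;
# adjust them to your Spark/Scala/Hudi versions).
spark-shell \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  --conf 'spark.serializer=org.apache.spark.sql.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```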
Example Spark read:

```scala
val df = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load("s3://datalake/hudi/orders_mor/")
```
Best Practices
- Use COW when query performance is the top priority
- Use MOR for streaming pipelines or CDC ingestion
- Schedule compaction jobs in MOR to avoid performance degradation
- Monitor small file growth and manage commit frequency
- Use partitioning wisely to optimize both write and read paths
Conclusion
Choosing between COW and MOR in Apache Hudi is crucial for building efficient, scalable, and reliable data lakes. While COW offers simplicity and read speed, MOR brings flexibility and performance for high-velocity, real-time data ingestion.
By understanding the trade-offs and applying them to your data workloads, you can design Hudi tables that meet both your operational and analytical requirements in the modern data lakehouse architecture.