Understanding Hudi Commit Timeline and Versioning
Explore how Apache Hudi manages commit timelines to enable data versioning, rollback, and time-travel
Apache Hudi brings data versioning, incremental processing, and time-travel queries to your data lake. These powerful features are made possible by Hudi’s commit timeline, a metadata structure that tracks the history of operations performed on a dataset.
In this guide, we’ll explore the Hudi commit timeline, how it manages file versions, what different commit states mean, and how to leverage it for debugging, rollback, auditing, and incremental ETL pipelines.
What is the Hudi Commit Timeline?
The commit timeline is a sequence of instant files stored in the `.hoodie` metadata directory of a Hudi table. Each instant file represents a specific action taken on the dataset, such as:
- Inserts
- Upserts
- Deletes
- Compactions
- Cleanups
- Rollbacks
Each instant is named using a timestamp, which acts as a unique version identifier.
Example timeline files:
.hoodie/
├── 20240412103015.commit
├── 20240412104522.commit
├── 20240412120033.inflight
├── 20240412130000.clean
├── 20240412140000.rollback
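For context, here is a minimal sketch of a PySpark upsert that would add a new commit instant like the ones above; the table path, table name, and key/precombine/partition field names are illustrative, not taken from a real table.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Hudi bundle on the classpath.
spark = SparkSession.builder.appName("hudi-timeline-demo").getOrCreate()

# Hypothetical batch of updates; column names are assumptions for this sketch.
updates = spark.createDataFrame(
    [("id-1", "2024-04-12 10:30:00", "2024-04-12", 42)],
    ["record_id", "event_ts", "event_date", "value"],
)

hudi_options = {
    "hoodie.table.name": "hudi_table",
    "hoodie.datasource.write.recordkey.field": "record_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Each successful write adds a new <timestamp>.commit (or .delta_commit) instant
# under the table's .hoodie/ directory.
updates.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-table")
```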
Instant Types and Lifecycle
Every action creates an instant, and each instant moves through a small set of states:

State | Description |
---|---|
Requested | Operation is scheduled but not yet started |
Inflight | Operation is currently running |
Completed | Operation finished successfully |

A failed or partially completed operation is not a separate state; instead, Hudi reverts it by adding a rollback instant to the timeline.
Common types of instants:
- `.commit`, `.compaction`, `.delta_commit` → successful writes
- `.inflight` → ongoing write
- `.rollback` → reverted operations
- `.clean`, `.savepoint`, `.replacecommit` → maintenance events
Commit Timeline for Copy-on-Write vs Merge-on-Read
Table Type | Instant Types Used | Description |
---|---|---|
Copy-on-Write | `.commit` | Writes produce base files directly |
Merge-on-Read | `.delta_commit`, `.compaction` | Log files + base files (merged via compaction) |
Merge-on-Read stores changes in log files first (via delta commits), and periodically merges them into base files through compaction.
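The table type is fixed when the table is created and determines which instant types appear on the timeline; a minimal sketch of the relevant Spark datasource write option:

```python
# Sketch: choosing the table type at write time.
# COPY_ON_WRITE (the default) produces .commit instants; MERGE_ON_READ produces
# .delta_commit instants plus periodic .compaction instants.
table_type_option = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # or "COPY_ON_WRITE"
}
```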
Use Cases for the Commit Timeline
- Data Versioning
Each commit represents a consistent snapshot of the dataset, identified by its commit timestamp. The `_hoodie_commit_time` metadata column records which commit wrote the current version of each record, so you can query the records whose latest version came from a specific commit:
SELECT * FROM hudi_table
WHERE _hoodie_commit_time = '20240412103015';
- Time-Travel Queries
Filtering on `_hoodie_commit_time` approximates an as-of view by keeping only records whose current version was committed on or before a given instant:
SELECT * FROM hudi_table
WHERE _hoodie_commit_time <= '20240412120000';
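For a true snapshot as of an earlier instant (including records that were updated afterwards), recent Hudi releases also support time travel through the Spark datasource; a sketch, assuming a version that provides the as.of.instant read option:

```python
# Sketch: time-travel read via the Spark datasource (assumes a Hudi release
# with "as.of.instant" support).
df_asof = (
    spark.read.format("hudi")
    .option("as.of.instant", "20240412120000")  # read the table as it was at this instant
    .load("s3://my-table")
)
```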
- Incremental ETL Pipelines
Query only records changed since the last commit:
df = spark.read.format("hudi") \
.option("hoodie.datasource.query.type", "incremental") \
.option("hoodie.datasource.read.begin.instanttime", "20240412104522") \
.load("s3://my-table")
- Rollback and Recovery
Automatically or manually roll back failed writes:
hoodie-cli
> commit rollback --commit 20240412120033
- Auditing and Debugging
Check file histories, modified partitions, and commit logs for tracing changes.
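One simple way to do this is to aggregate Hudi's per-record metadata columns; a small sketch counting records per commit and partition on the current snapshot:

```python
# Sketch: audit how many records each commit wrote, per partition, using the
# standard Hudi metadata columns present in every Hudi table.
snapshot = spark.read.format("hudi").load("s3://my-table")
(
    snapshot
    .groupBy("_hoodie_commit_time", "_hoodie_partition_path")
    .count()
    .orderBy("_hoodie_commit_time")
    .show(truncate=False)
)
```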
Viewing the Timeline with Hudi CLI
Use the CLI to inspect commit history:
hoodie-cli
> connect --path s3://your-hudi-table
> commits show
> commit showfiles --commit 20240412103015
> show fsview all
This shows:
- Commits and operations performed
- Affected partitions and files
- Metadata and statistics per commit
File Layout and Metadata
Each commit file contains:
- commit metadata: JSON with operation details
- partition and file listings
- record count, write stats, errors
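In many Hudi releases the completed commit files are plain JSON, so they can be inspected directly; a rough sketch (the path is illustrative, and the exact field names vary across Hudi versions):

```python
import json

# Sketch: peek inside a completed commit file. The path is illustrative and
# assumes the table is locally accessible; field names vary by Hudi version.
with open("/data/hudi_table/.hoodie/20240412103015.commit") as f:
    commit_meta = json.load(f)

print(commit_meta.get("operationType"))                            # e.g. UPSERT
print(list(commit_meta.get("partitionToWriteStats", {}).keys()))   # partitions touched
```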
Log file layout (MOR):
partition/
├── file1.parquet
├── .file1_1.log
├── .file1_2.log
├── .file1_3.log
These logs are merged during compaction.
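Compaction scheduling is configurable; a sketch of commonly used inline-compaction write options for MOR tables (the values shown are illustrative, not recommendations):

```python
# Sketch: inline compaction settings for a Merge-on-Read table.
mor_compaction_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",                 # run compaction as part of the write
    "hoodie.compact.inline.max.delta.commits": "5",  # compact after this many delta commits
}
```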
Best Practices
- Use incremental queries for efficient CDC and streaming ETL
- Always monitor for inflight or rollback states — these may indicate failure
- Enable savepoints before large changes for safe recovery
- Compact MOR tables regularly to reduce delta file size
- Track timeline size and consider archiving old commits to avoid metadata bloat
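For the last two points, the cleaner and timeline-archival settings control how much history stays on the active timeline; a sketch of commonly used options (values are illustrative, not recommendations):

```python
# Sketch: retention and archival knobs that keep the active timeline small.
timeline_housekeeping_options = {
    "hoodie.cleaner.commits.retained": "10",  # commits whose file versions the cleaner keeps
    "hoodie.keep.min.commits": "20",          # minimum commits kept on the active timeline
    "hoodie.keep.max.commits": "30",          # archive instants beyond this many commits
}
```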
Conclusion
Apache Hudi’s commit timeline is the backbone of its powerful lakehouse capabilities — from data versioning and time-travel to incremental ingestion and rollback support.
Understanding how the timeline works is essential for managing production-grade Hudi datasets with confidence. With proper use, it transforms your data lake into a fully versioned, queryable, and auditable platform — ready for real-time and historical analytics.