Using Hive with Hudi for Incremental Data Management
Enable real-time data updates and incremental ETL in Hive using Apache Hudi
Modern data architectures demand the ability to handle incremental data ingestion, real-time updates, and low-latency analytics. While Hive excels at querying large datasets in batch mode, it falls short when it comes to updates, deletes, and record-level operations.
Enter Apache Hudi — a data lake platform built for incremental data management on top of HDFS or cloud object stores. By integrating Hive with Hudi, you can enable upserts, deletes, record-level indexing, and query optimizations — without abandoning your existing Hive infrastructure.
This post will guide you through how to use Hive with Hudi for efficient and real-time data management at scale.
What is Apache Hudi?
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a transactional data lake framework that supports:
- Insert, update, and delete operations
- Record-level ACID transactions on HDFS/S3
- Data versioning and rollback
- Incremental data pulls for ETL pipelines
- Real-time querying through Hive, Presto, and Spark
It stores data as Copy-on-Write (CoW) or Merge-on-Read (MoR) tables: CoW rewrites base files on each update for the fastest reads, while MoR appends changes to log files for cheaper writes, letting you trade off read and write performance per table.
Why Use Hudi with Hive?
Traditional (non-ACID) Hive tables are effectively append-only, which makes record-level updates and deletes inefficient and costly. By using Hudi, you get:
- Efficient upserts and deletes in Hive-managed tables
- ACID guarantees on top of HDFS
- Ability to perform incremental queries for real-time pipelines
- Seamless integration with Hive metastore and SQL queries
This allows Hive users to modernize their batch processing with near-real-time capabilities.
Setting Up Hive with Hudi
To query Hudi tables from Hive:
- Use a compatible Hive version (Hudi works with Hive 2.x and 3.x; Hive 3.1.2+ is a safe choice)
- Add the Hudi Hive bundle (the hudi-hadoop-mr-bundle jar) to your Hive CLI or HiveServer2 classpath, for example (a per-session alternative follows this list):
export HIVE_AUX_JARS_PATH=/path/to/hudi-hadoop-mr-bundle.jar
- Use Spark or Hudi DeltaStreamer to write data into Hudi tables.
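If you would rather not change the server-wide classpath, the bundle can also be added per session from the Hive CLI or Beeline; the path below is a placeholder for wherever the jar lives on your cluster:
ADD JAR /path/to/hudi-hadoop-mr-bundle.jar;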
Creating a Hudi Table
Create a Hudi table, for example with Spark SQL:
CREATE TABLE hudi_orders (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE,
  order_date STRING
)
USING hudi
OPTIONS (
  type = 'cow',
  primaryKey = 'order_id',
  preCombineField = 'order_date'
)
PARTITIONED BY (order_date);
Alternatively, use Hudi DeltaStreamer for ingestion pipelines.
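Once the table exists, record-level writes go through Spark SQL as well. The sketch below uses made-up order values; MERGE INTO is shown because plain INSERT INTO may behave as insert or upsert depending on your Hudi version's SQL insert-mode settings:
-- Upsert a single order (values are illustrative)
MERGE INTO hudi_orders AS t
USING (
  SELECT 'o-1001' AS order_id, 'c-42' AS customer_id,
         110.00 AS amount, '2024-11-16' AS order_date
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Record-level delete
DELETE FROM hudi_orders WHERE order_id = 'o-1001';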
Querying Hudi Tables from Hive
Once registered with the Hive metastore, Hudi tables can be queried like regular Hive tables:
SELECT * FROM hudi_orders WHERE order_date = '2024-11-16';
For Merge-on-Read (MoR) tables, you can choose between snapshot or read-optimized views:
-- Snapshot view (latest data with merge)
SELECT * FROM hudi_orders_rt;
-- Read-optimized view (best read performance)
SELECT * FROM hudi_orders_ro;
Make sure to set the correct SerDe and input formats when manually creating tables in Hive.
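If you do register a table manually instead of relying on Hudi's Hive sync tool, the DDL has to point Hive at Hudi's input format and the Parquet SerDe. A minimal sketch for a read-optimized view follows; the location is a placeholder, the sync tool would also add Hudi's _hoodie_* metadata columns, and a real-time view would use org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat instead:
CREATE EXTERNAL TABLE hudi_orders_ro (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE
)
PARTITIONED BY (order_date STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/path/to/hudi_orders';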
Performing Incremental Queries
One of Hudi’s most powerful features is incremental pull — retrieving only new or updated records since a given commit time.
In Hive, incremental reads are driven by per-table consume properties set on the session (replace hudi_orders with your own table name):
SET hoodie.hudi_orders.consume.mode=INCREMENTAL;
SET hoodie.hudi_orders.consume.start.timestamp=20240401000000;
SET hoodie.hudi_orders.consume.max.commits=10;
SELECT * FROM hudi_orders WHERE `_hoodie_commit_time` > '20240401000000';
(When reading through Spark instead, the equivalent options are hoodie.datasource.query.type=incremental and hoodie.datasource.read.begin.instanttime.)
Commit instants use Hudi's timeline timestamp format (for example, 20240401000000) and define the starting point for incremental processing.
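On Hudi 0.14.0 and newer, Spark SQL also exposes incremental reads through the hudi_table_changes table-valued function; the call below is a sketch against the documented signature (table name, query mode, start instant) and is worth verifying against your release:
-- Latest state of every record changed after the given commit instant
SELECT * FROM hudi_table_changes('hudi_orders', 'latest_state', '20240401000000');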
Compaction and Cleaning
For Merge-on-Read tables, Hudi writes incoming changes to row-based log files alongside the columnar base files; these logs need to be compacted periodically to keep query performance healthy.
Schedule compaction via:
spark-submit ... --class org.apache.hudi.utilities.HoodieCompactor ...
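On Hudi 0.11 and newer, compaction can also be scheduled and executed from Spark SQL call procedures; a short sketch, assuming the hudi_orders table from earlier:
-- Schedule pending compactions, then execute them
CALL run_compaction(op => 'schedule', table => 'hudi_orders');
CALL run_compaction(op => 'run', table => 'hudi_orders');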
Also configure cleaner policies to retain only necessary versions:
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=10
This ensures storage efficiency and prevents metadata bloat.
Hive ACID vs Hudi
Feature | Hive ACID (ORC) | Apache Hudi |
---|---|---|
Update/Delete Support | Yes (slow for large data) | Yes (efficient) |
Real-time Queries | No | Yes (Snapshot/MoR) |
Incremental Ingestion | No | Yes |
Integration with Spark | Limited | Native |
Object Store Support | No | Yes (S3, GCS) |
Hudi fills the gap where Hive ACID falls short in performance and scalability.
Best Practices
- Use Copy-on-Write for read-heavy workloads
- Use Merge-on-Read for write-heavy or streaming use cases
- Compact MoR tables regularly to avoid read overhead
- Use preCombineField to deduplicate records on upserts
- Schedule cleaning to remove obsolete files and manage storage
- Tune hoodie.parquet.max.file.size and commit configs for performance (a sketch follows this list)
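As an illustration of the tuning point above, Hudi configs can be supplied as table properties when the table is created in Spark SQL; the table name and the 128 MB / 10-commit values below are examples rather than recommendations:
CREATE TABLE hudi_orders_mor (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE,
  order_date STRING
)
USING hudi
PARTITIONED BY (order_date)
TBLPROPERTIES (
  'type' = 'mor',
  'primaryKey' = 'order_id',
  'preCombineField' = 'order_date',
  'hoodie.parquet.max.file.size' = '134217728',
  'hoodie.cleaner.commits.retained' = '10'
);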
Conclusion
Integrating Hive with Hudi empowers your data lake with incremental data capabilities, enabling record-level updates, time travel, and real-time analytics. Hudi brings transactional support and schema evolution to Hive environments without sacrificing Hadoop compatibility.
Whether you’re building ETL pipelines, CDC ingestion flows, or modern data lakes, Hudi and Hive together offer a compelling solution for big data with real-time agility.