Optimizing Hudi Queries with Spark SQL
Speed up analytics on Hudi datasets using Spark SQL with partition pruning, predicate pushdown, and index tuning
Apache Hudi is a powerful lakehouse platform that supports efficient upserts, time-travel, and incremental processing. But to truly unlock the potential of Hudi for fast analytics, you need to understand how to optimize queries — especially when using Spark SQL.
In this guide, we’ll explore strategies to optimize Hudi queries using Spark SQL, including tips for partition pruning, file indexing, metadata use, and incremental reads.
1. Enable Hudi-Specific Optimizations
Before running any queries, make sure Spark is configured to work well with Hudi. These are static settings, so apply them when building the SparkSession (or via spark-submit --conf) rather than with spark.conf.set on a running session:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .getOrCreate()
These settings ensure that Spark interacts properly with Hudi tables: the session extension enables Hudi's SQL syntax, and disabling convertMetastoreParquet stops Spark from treating Hudi tables like plain Parquet files.
2. Leverage Partition Pruning
Hudi supports Hive-style partitioning, which Spark SQL can utilize to skip unnecessary file scans.
Example partitioned table:
CREATE TABLE hudi_sales (
  id STRING,
  product STRING,
  amount DOUBLE,
  sale_ts TIMESTAMP,  -- preCombineField must reference an existing column
  sale_date STRING
)
USING hudi
PARTITIONED BY (sale_date)
OPTIONS (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'sale_ts'
);
Optimized query:
SELECT * FROM hudi_sales WHERE sale_date = '2024-11-16';
Avoid using functions on partition columns (e.g., DATE(sale_date)) in WHERE clauses, as this disables pruning.
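To confirm that pruning actually happens, inspect the physical plan. The sale_date predicate should show up as a partition filter on the scan rather than a row-level filter (the exact plan text varies by Spark and Hudi version):

EXPLAIN SELECT * FROM hudi_sales WHERE sale_date = '2024-11-16';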
3. Enable Predicate Pushdown and Column Pruning
Hudi stores its base files in a columnar format, Parquet by default (ORC is also supported), typically with Snappy compression. These formats support:
- Predicate pushdown: filters rows early during scan
- Column pruning: reads only required columns
SELECT amount FROM hudi_sales WHERE product = 'Laptop';
This query only scans the product and amount columns, reducing I/O.
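You can check the same thing from the DataFrame API. In the explain() output, the filter should appear under PushedFilters in the scan node (wording differs slightly across Spark versions):

spark.sql("SELECT amount FROM hudi_sales WHERE product = 'Laptop'").explain()
// Expect something like: PushedFilters: [IsNotNull(product), EqualTo(product,Laptop)]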
4. Use Incremental Queries
Instead of scanning the entire dataset, Spark can read only new or updated records since the last commit.
// Grab the two most recent commit times; use the older one as the starting point
val commits = spark.sql(
  "SELECT DISTINCT _hoodie_commit_time FROM hudi_sales ORDER BY _hoodie_commit_time DESC LIMIT 2")
val beginTime = commits.collect().last.getString(0)

// Incremental read: only records committed after beginTime
val df = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginTime)
  .load("s3://my-hudi/sales")

df.show()
This reduces the cost of repeat analytics jobs and supports CDC pipelines.
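Incremental reads can also be bounded on both ends, which is handy for replaying a fixed window. A short sketch reusing beginTime from above; endTime is a hypothetical commit instant you would pick from the table's timeline:

// endTime is hypothetical; use any commit instant from the timeline
val endTime = "20241116153000"
val windowed = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginTime)
  .option("hoodie.datasource.read.end.instanttime", endTime)
  .load("s3://my-hudi/sales")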
5. Use Hudi Metadata Table
Enable the metadata table for faster file listing:
hoodie.metadata.enable=true
hoodie.metadata.index.column.stats.enable=true
hoodie.metadata.partition.stats.enable=true
The metadata table eliminates the need for expensive file listing on large datasets, a key boost in cloud object stores like S3 or GCS.
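These are table-level configs, so they are usually set on the write side. A minimal sketch, assuming a DataFrame df of new sales rows and the table path used earlier:

df.write.format("hudi")
  .option("hoodie.table.name", "hudi_sales")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "sale_ts")
  .option("hoodie.datasource.write.partitionpath.field", "sale_date")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.metadata.index.column.stats.enable", "true")
  .mode("append")
  .save("s3://my-hudi/sales")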
6. Optimize with Column Statistics and Bloom Index
Enable column stats index to accelerate filter queries:
hoodie.metadata.index.column.stats.enable=true
Enable the bloom index to reduce file lookup time for key-based lookups (Hudi uses it primarily on the write path to tag incoming records against existing files):
hoodie.index.type=BLOOM
hoodie.bloom.index.filter.type=DYNAMIC_V0
hoodie.bloom.index.update.partition.path=true
This is especially useful for workloads with selective filters on primary keys.
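Column stats only pay off at read time if data skipping is enabled on the query. A minimal sketch, assuming the same table path as earlier:

val skipped = spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.enable.data.skipping", "true") // prune files using the column stats index
  .load("s3://my-hudi/sales")
  .where("product = 'Laptop'")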
7. Compact and Clean Regularly
For Merge-on-Read (MOR) tables, compaction improves query performance:
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  --master yarn \
  $HUDI_UTILITIES_JAR \
  --base-path s3://my-hudi/sales \
  --table-name hudi_sales
Also, use clean operations to remove obsolete files and reduce storage bloat.
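If your session has the Hudi SQL extensions enabled (see Section 1), recent Hudi releases also expose these maintenance jobs as SQL call procedures; names and parameters can vary by Hudi version:

-- Schedule and run compaction on a MOR table
CALL run_compaction(op => 'run', table => 'hudi_sales');

-- Remove file versions the cleaner policy no longer needs
CALL run_clean(table => 'hudi_sales');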
8. Tune Spark and Hudi Parameters
For large-scale reads:
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") // 128MB
spark.conf.set("spark.sql.broadcastTimeout", "600")
Set Hudi-specific read options:
.option("hoodie.datasource.read.as.of.instant", "20241116153000") // time-travel
.option("hoodie.datasource.read.streaming.skip_compaction", "true")
9. Query via Spark Catalog or Hive
If you registered your Hudi table in the Hive Metastore or AWS Glue Data Catalog:
SELECT * FROM spark_catalog.default.hudi_sales WHERE sale_date = '2024-11-16';
Ensure Hudi tables are synced using HiveSync:
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=hudi_sales
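You can also pass the sync settings as write options so every commit keeps the catalog current; a sketch using the df and path from earlier sections:

df.write.format("hudi")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .option("hoodie.datasource.hive_sync.database", "default")
  .option("hoodie.datasource.hive_sync.table", "hudi_sales")
  .mode("append")
  .save("s3://my-hudi/sales")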
Conclusion
Optimizing Apache Hudi queries using Spark SQL is crucial for maintaining performance in large-scale data lakes. By leveraging partition pruning, predicate pushdown, metadata acceleration, and incremental reads, you can significantly reduce query latency and processing time.
These strategies help build scalable, cost-efficient lakehouse pipelines ready for modern analytics and machine learning workloads.