Databricks is a powerful unified analytics platform built on Apache Spark that supports enterprise-grade data engineering and AI workloads. While Delta Lake is the native table format for Databricks, some organizations choose Apache Hudi for its incremental processing, CDC support, and multi-engine compatibility across on-premises and cloud platforms.

In this post, you’ll learn how to use Hudi with Databricks for enterprise-scale data processing, including setup, write operations, catalog integration, and best practices for implementing robust and real-time data pipelines.


Why Use Hudi with Databricks?

Though Databricks favors Delta Lake, using Hudi in Databricks may be beneficial when:

  • You need open-source format flexibility for hybrid or multi-cloud environments
  • You’re already using Hudi with AWS Glue, EMR, or Presto
  • You want streaming upserts and incremental queries for data lakehouse ingestion
  • You aim to interoperate with non-Databricks compute engines

Hudi benefits:

  • Near real-time ingestion via DeltaStreamer
  • Time-travel and point-in-time querying
  • Built-in CDC (Change Data Capture) support
  • Support for MOR (Merge-on-Read) and COW (Copy-on-Write) tables

Setting Up Hudi in Databricks

  1. Use a Databricks Runtime that supports Apache Spark 3.2+

  2. Attach Hudi JARs to your cluster:

Install the Hudi Spark bundle as a cluster library using Maven coordinates (Libraries > Install new > Maven), or upload the JAR to DBFS and attach it to the cluster:

org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1

Note that the bundle is a JAR, not a Python package, so %pip will not install it. Also make sure the bundle's Spark and Scala versions (here Spark 3.3, Scala 2.12) match your Databricks Runtime.
  3. Set the Spark configuration Hudi requires. spark.serializer is a static setting, so add it to the cluster's Spark config (Advanced options > Spark) rather than calling spark.conf.set from a notebook after the session has started:

spark.serializer org.apache.spark.serializer.KryoSerializer
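Hudi's Spark quick start also recommends registering its SQL session extension so that Hudi DML (for example MERGE INTO and CALL procedures) works from SQL cells. This is a sketch of the extra cluster Spark config line; it assumes Hudi 0.14.x class names, and since Databricks configures its own Delta extensions by default, verify it does not conflict in your workspace:

spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension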

Writing Data to Hudi from Databricks

Here’s a basic example to write a Copy-on-Write (COW) table:

# spark is already provided in a Databricks notebook, so no SparkSession setup is needed
df = spark.read.json("dbfs:/mnt/raw/events/")

hudi_options = {
    'hoodie.table.name': 'enterprise_events',
    'hoodie.datasource.write.recordkey.field': 'event_id',       # unique key for each record
    'hoodie.datasource.write.partitionpath.field': 'event_type', # partition column
    'hoodie.datasource.write.table.name': 'enterprise_events',
    'hoodie.datasource.write.operation': 'upsert',                # insert new keys, update existing ones
    'hoodie.datasource.write.precombine.field': 'event_ts',       # on duplicate keys, the highest event_ts wins
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.insert.shuffle.parallelism': 200
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("dbfs:/mnt/bronze/hudi/enterprise_events")

This creates a Hudi table on DBFS (or S3/Azure Blob) that supports ACID transactions and partitioned writes.
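Because the write operation is upsert, re-running the write with changed records updates existing rows instead of appending duplicates. A minimal sketch of that flow, where the corrections path is an illustrative placeholder:

# Hypothetical late-arriving corrections carrying existing event_id values;
# Hudi matches on event_id and keeps the row with the highest event_ts.
updates = spark.read.json("dbfs:/mnt/raw/events_corrections/")

updates.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("dbfs:/mnt/bronze/hudi/enterprise_events")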


Enabling Hive Sync with Unity Catalog

To make Hudi tables visible for SQL analytics:

  1. Register the table with a CREATE TABLE statement that points at the storage location:

CREATE TABLE IF NOT EXISTS default.enterprise_events
USING hudi
LOCATION 'dbfs:/mnt/bronze/hudi/enterprise_events';

  2. The table is then exposed through the workspace's Hive Metastore for SQL analytics. Unity Catalog does not currently support Hudi as a table format, so Hudi external tables are generally registered in the legacy Hive Metastore. Alternatively, Hudi can sync the table into the metastore for you at write time, as sketched below.
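Here is a sketch of that write-time sync using Hudi's Hive-sync options. The option keys are from Hudi 0.14.x, hms mode assumes the cluster can reach the workspace Hive Metastore, and the database and table names mirror the example above:

hive_sync_options = {
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',                # sync through the Hive Metastore client
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'enterprise_events',
    'hoodie.datasource.hive_sync.partition_fields': 'event_type'
}

df.write.format("hudi") \
    .options(**hudi_options, **hive_sync_options) \
    .mode("append") \
    .save("dbfs:/mnt/bronze/hudi/enterprise_events")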

Reading Hudi Tables in Databricks

Read the table directly using DataFrame API:

hudi_df = spark.read.format("hudi").load("dbfs:/mnt/bronze/hudi/enterprise_events")
hudi_df.createOrReplaceTempView("events")

spark.sql("SELECT * FROM events WHERE event_type = 'login'")

To use incremental reads:

incremental_options = {
    'hoodie.datasource.query.type': 'incremental',
    # only records committed after this instant (yyyyMMddHHmmss) are returned
    'hoodie.datasource.read.begin.instanttime': '20240417000000'
}

spark.read.format("hudi") \
    .options(**incremental_options) \
    .load("dbfs:/mnt/bronze/hudi/enterprise_events") \
    .show()
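The time-travel benefit mentioned earlier works through the same read path: pass a commit instant and Hudi reconstructs the table as of that point. A sketch, assuming a Hudi version (0.9+) that supports the as.of.instant option and a commit instant that actually exists in the table's timeline (listed under the .hoodie/ directory):

snapshot_df = spark.read.format("hudi") \
    .option("as.of.instant", "20240417103000") \
    .load("dbfs:/mnt/bronze/hudi/enterprise_events")

snapshot_df.count()   # row count as of the chosen commit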

Use Cases in Enterprise Workloads

  • CDC Pipelines: Real-time ingestion of change data from Kafka or Debezium into Hudi (see the streaming sketch after this list)
  • Data Quality Checks: Merge fresh records while retaining historical data
  • Multi-Cloud Portability: Use the same Hudi table with Presto, Hive, or Flink
  • Archival + Time-Travel: Perform rollback or recovery using commit instants
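For the CDC case, Hudi can be used as a Spark Structured Streaming sink, so change events landing on a Kafka topic are upserted continuously. A minimal sketch in which the broker address, topic, event schema, and checkpoint path are all illustrative placeholders:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema for the change events on the topic
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events_stream = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # placeholder broker
    .option("subscribe", "cdc.events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Continuous upserts into the same Hudi table as above
(events_stream.writeStream.format("hudi")
    .options(**hudi_options)
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/enterprise_events")  # placeholder path
    .outputMode("append")
    .start("dbfs:/mnt/bronze/hudi/enterprise_events"))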

Best Practices

  • Use COW tables for frequent reads and low-latency analytics
  • Use MOR tables for high-frequency upserts and streaming ingestion
  • Tune compaction for MOR tables (these are regular write options; see the sketch after this list):
    hoodie.compact.inline=true
    hoodie.compact.inline.max.delta.commits=5
  • Enable the metadata table to avoid expensive file listings:
    hoodie.metadata.enable=true
  • Partition by business keys like event_date, region, or category
  • Monitor table size and write amplification due to small files
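Pulling a few of these together, below is a sketch of a Merge-on-Read variant of the earlier write. The table name and path are illustrative, and the compaction values are starting points to tune rather than recommendations:

mor_options = {
    **hudi_options,
    'hoodie.table.name': 'enterprise_events_mor',
    'hoodie.datasource.write.table.name': 'enterprise_events_mor',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',   # default is COPY_ON_WRITE
    'hoodie.compact.inline': 'true',                         # compact log files as part of the write
    'hoodie.compact.inline.max.delta.commits': '5',          # compact after every 5 delta commits
    'hoodie.metadata.enable': 'true'                         # maintain the metadata table for file listings
}

df.write.format("hudi") \
    .options(**mor_options) \
    .mode("append") \
    .save("dbfs:/mnt/bronze/hudi/enterprise_events_mor")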

Conclusion

Integrating Apache Hudi with Databricks offers a powerful and flexible alternative to Delta Lake — especially when you need open-format compatibility or want to migrate workloads from other environments. With built-in support for incremental ingestion, ACID transactions, and time-travel, Hudi enables real-time, reliable data pipelines directly within your enterprise Spark workflows.

By following the setup and best practices shared here, you can unlock scalable lakehouse architecture in Databricks — without vendor lock-in.