As Apache Hudi becomes a cornerstone for data lakes and lakehouse architectures, observability is essential for operating it reliably at scale. While Hudi provides logs and command-line tools, deep insights into its internal performance — such as write throughput, compaction health, and metadata operations — require a proper metrics system.

In this post, you’ll learn how to integrate Apache Hudi with Prometheus and Grafana to collect, scrape, and visualize advanced metrics. We’ll explore how Hudi exposes JVM and application-level metrics and how to build rich dashboards for monitoring and alerting.


Why Monitor Hudi?

Key reasons to monitor Hudi:

  • Track write and upsert throughput
  • Analyze compaction lag and frequency
  • Understand metadata table behavior
  • Detect pipeline slowdowns or failures
  • Observe query performance over time

Metrics provide real-time observability, proactive alerting, and historical analysis for capacity planning and SLA enforcement.


Metrics Architecture Overview

[Hudi Write Job (Spark)] → [Dropwizard Metrics] → [Prometheus Exporter] → [Prometheus] → [Grafana]

Hudi uses Dropwizard Metrics under the hood and can expose them through several reporters:

  • Console
  • File
  • JMX
  • Prometheus HTTP endpoint (recommended)

Step 1: Enable Prometheus Metrics in Hudi

You can expose metrics via HTTP using the PrometheusReporter.

Set these configs in your Hudi job (e.g., in DeltaStreamer, Spark, or PySpark):

hoodie.metrics.on=true
hoodie.metrics.reporter.type=PROMETHEUS
hoodie.metrics.prometheus.port.num=9090
hoodie.metrics.prometheus.prefix=hudi_
hoodie.metrics.prometheus.host=0.0.0.0
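
If you write with the DataFrame API, the same keys go in as writer options. A minimal PySpark sketch, assuming an existing DataFrame df; the table name, record key field, and path are placeholder assumptions:

```python
# Minimal sketch: pass the metrics configs as Hudi writer options.
# Table name, record key field, and path are placeholder assumptions.
(df.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.metrics.on", "true")
    .option("hoodie.metrics.reporter.type", "PROMETHEUS")
    .option("hoodie.metrics.prometheus.port.num", "9090")
    .mode("append")
    .save("s3://my-bucket/hudi/orders"))
```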

If you’re using HoodieDeltaStreamer, note that spark-submit ignores --conf properties that don’t start with spark., so pass the Hudi configs through --hoodie-conf instead:

--hoodie-conf hoodie.metrics.on=true \
--hoodie-conf hoodie.metrics.reporter.type=PROMETHEUS \
--hoodie-conf hoodie.metrics.prometheus.port.num=9090

This exposes metrics on http://<host>:9090/metrics from each process running a Hudi write client (typically the Spark driver).
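
To confirm the endpoint is serving, curl it from a host that can reach the process (the hostname below is a placeholder):

```bash
curl -s http://hudi-worker-node:9090/metrics | grep hudi
```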


Step 2: Set Up Prometheus to Scrape Hudi

In prometheus.yml, add a job to scrape the Hudi metrics endpoint:

```yml
scrape_configs:
  - job_name: 'hudi'
    static_configs:
      - targets: ['hudi-worker-node:9090']
```

Make sure Prometheus can reach every host that exposes metrics, including the Spark driver and any executors configured to report.


Step 3: Visualize Metrics in Grafana

  1. Launch Grafana and add Prometheus as a data source
  2. Create dashboards using Hudi metrics (prefixed by hudi_)
  3. Example metrics to track:
    • hudi_write_commit_duration
    • hudi_upsert_count
    • hudi_compaction_duration
    • hudi_metadata_table_size
    • hudi_delta_commits_since_last_compaction
  4. Create alert rules in Grafana (or directly in Prometheus, as sketched after this list) to trigger when:
    • Write latency exceeds thresholds
    • Compaction backlog grows
    • Job success rate drops
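
If you alert from Prometheus and route through Alertmanager, here is a sketch of an alerting rule for a growing compaction backlog. The metric name is the assumed hudi_ name from above, and the threshold of 10 delta commits is an example value to tune per table:

```yml
groups:
  - name: hudi-alerts
    rules:
      - alert: HudiCompactionLagging
        # Assumed metric name; the threshold of 10 is an example value.
        expr: hudi_delta_commits_since_last_compaction > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Compaction is lagging on a Hudi table"
```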

Sample Grafana Dashboard Panels

  • Write Throughput
    • Metric: rate(hudi_upsert_count[1m])
    • Unit: records/second
  • Compaction Health
    • Metric: hudi_delta_commits_since_last_compaction
    • Use thresholds to warn if compaction is lagging
  • Commit Latency
    • Metric: hudi_write_commit_duration
    • Visualization: heatmap or time series
  • Metadata Size Trend
    • Metric: hudi_metadata_table_size
    • Correlate with number of partitions/files
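
Expressed as PromQL starting points (metric names are the assumed hudi_ names used throughout this post; widen or narrow the windows to match your commit cadence):

```
# Write throughput in records/second over the last 5 minutes
rate(hudi_upsert_count[5m])

# Smoothed commit latency over a 10-minute window
avg_over_time(hudi_write_commit_duration[10m])

# Current compaction backlog
hudi_delta_commits_since_last_compaction
```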

Advanced Use Cases

  • Integrate with Alertmanager to send alerts to Slack, PagerDuty, or email
  • Use labels for different Hudi tables or environments (env=prod, table=orders); a scrape config sketch follows this list
  • Store long-term metrics in Thanos or Cortex for year-over-year performance analysis
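
Attaching those labels can be as simple as static labels in the scrape config. A sketch, reusing the earlier target; the label values are example assumptions:

```yml
scrape_configs:
  - job_name: 'hudi'
    static_configs:
      - targets: ['hudi-worker-node:9090']
        # Static labels are attached to every series from this target;
        # the values here are example assumptions.
        labels:
          env: 'prod'
          table: 'orders'
```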

Best Practices

  • Always expose metrics from both driver and executor when possible
  • Use dedicated ports to avoid conflicts in multi-job clusters
  • Avoid exposing metrics over public networks — secure using firewalls or reverse proxies
  • Use templated dashboards in Grafana to filter by Hudi table or job ID
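
For those templated dashboards, a Grafana dashboard variable can be populated straight from Prometheus labels. A sketch of a variable query, assuming a table label was attached at scrape time as shown earlier:

```
label_values(hudi_upsert_count, table)
```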

Conclusion

Monitoring Hudi with Prometheus and Grafana brings transparency and control to your data pipelines. Whether you’re debugging slow writes, scaling your compaction strategy, or enforcing SLAs, having rich metrics at your fingertips is key to running a production-grade lakehouse.

By following the setup and best practices outlined here, you can ensure end-to-end observability across your Hudi-powered data lake.