Building Real-Time Fraud Detection Pipelines with Hudi
Learn how to power real-time fraud detection with Apache Hudi, streaming data, and incremental processing
With the rapid rise of digital transactions, real-time fraud detection has become a top priority for banks, fintech platforms, and e-commerce companies. Detecting fraud early can prevent millions in losses and protect user trust.
To support this, data teams need low-latency, scalable pipelines capable of ingesting streaming data, applying detection logic, and updating dashboards in near real-time.
Apache Hudi enables real-time fraud detection by offering:
- Incremental ingestion
- Record-level updates (upserts)
- Time travel queries
- Streaming support on data lake storage (e.g., S3, HDFS)
In this post, we’ll walk through how to build a fraud detection pipeline using Apache Hudi, Apache Kafka, and Spark Structured Streaming.
Why Apache Hudi for Fraud Detection?
Apache Hudi offers several key advantages for fraud detection pipelines:
- Upserts & Deletes: Update suspicious transactions as fraud is confirmed
- Incremental Queries: Process only new records for real-time alerting
- ACID Transactions: Prevent data duplication or partial writes
- Time-Travel: Analyze historical data for forensic investigations
Architecture Overview
[Transaction Stream (Kafka)]
↓
[Spark Structured Streaming]
↓
[Hudi Table on S3 or HDFS]
↓
[Analytics + Alerts (Athena, Presto, Dashboards)]
You’ll use:
- Kafka to stream real-time transactions
- Spark to apply detection logic and write to Hudi
- Hudi tables to manage mutable, real-time data
- Query engines to power dashboards and alerts
Step 1: Ingest Streaming Data
Use Spark Structured Streaming to consume from Kafka:

from pyspark.sql.functions import from_json, col

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "transactions") \
    .load()

# Kafka values arrive as bytes: cast to string, then parse the JSON payload
json_df = df.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")
Define the schema referenced above for the transaction payload (e.g., user_id, amount, device_id, location, timestamp).
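A minimal sketch of such a schema, assuming field names and types based on the examples in this post (adjust to your actual payload):

from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# Assumed transaction payload; adjust fields and types to match your events
schema = StructType() \
    .add("transaction_id", StringType()) \
    .add("user_id", StringType()) \
    .add("amount", DoubleType()) \
    .add("device_id", StringType()) \
    .add("location", StringType()) \
    .add("event_time", TimestampType()) \
    .add("dt", StringType())  # partition date used by the Hudi table in Step 3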
Step 2: Apply Fraud Detection Rules
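The rule below compares each transaction's location with the user's home_location, which is not part of the raw payload above. One way to supply it, sketched here with an illustrative user_profiles table path, is a stream-static join before the rule runs:

# Hypothetical profile table mapping each user to a home location
profiles = spark.read.format("hudi") \
    .load("s3://data-lake/user_profiles/") \
    .select("user_id", "home_location")

# Stream-static join: enrich every incoming transaction with home_location
json_df = json_df.join(profiles, on="user_id", how="left")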
Use Spark expressions or ML models to flag suspicious activity:

from pyspark.sql.functions import when, lit

# Flag large transactions that occur away from the user's home location
fraud_df = json_df.withColumn("is_fraud",
    when((col("amount") > 10000) & (col("location") != col("home_location")), lit(True)).otherwise(lit(False))
)
You can enhance this with the following (a scoring sketch follows the list):
- Historical profiling
- ML-based anomaly detection
- Rule-based scoring
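As a sketch of the rule-based scoring idea, several weak signals can be combined into a weighted score; the weights and thresholds below are illustrative assumptions, not tuned values:

from pyspark.sql.functions import when, col, lit

# Sum weighted rule hits into one score, then threshold it
scored_df = json_df.withColumn("fraud_score",
    when(col("amount") > 10000, lit(0.5)).otherwise(lit(0.0))
    + when(col("location") != col("home_location"), lit(0.3)).otherwise(lit(0.0))
    + when(col("device_id").isNull(), lit(0.2)).otherwise(lit(0.0))
).withColumn("is_fraud", col("fraud_score") >= lit(0.7))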
Step 3: Write to Hudi with Upserts
Configure Hudi to upsert on transaction_id, so a transaction record is updated in place if additional fraud signals are found later:
hudi_options = {
    'hoodie.table.name': 'fraud_txn_table',
    'hoodie.datasource.write.recordkey.field': 'transaction_id',
    'hoodie.datasource.write.partitionpath.field': 'dt',
    'hoodie.datasource.write.precombine.field': 'event_time',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'fraud_txn_table',
    # Valid sync modes are hms, jdbc, and hiveql; on AWS, 'hms' works with
    # the Glue Data Catalog acting as the Hive metastore
    'hoodie.datasource.hive_sync.mode': 'hms'
}
fraud_df.writeStream \
    .format("hudi") \
    .options(**hudi_options) \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/fraud/") \
    .start("s3://data-lake/fraud_txn_table/")
Step 4: Query in Real-Time
Use Athena, Presto, or Spark SQL to monitor flagged transactions. For a MERGE_ON_READ table, Hive sync typically registers read-optimized and real-time views (suffixed _ro and _rt); query the real-time view for the freshest data:
SELECT * FROM fraud_txn_table
WHERE is_fraud = true
AND event_time > current_timestamp - interval '10' minute;
You can connect BI tools or dashboards (e.g., Superset, Grafana) to monitor alerts live.
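Hudi's time travel also helps with the forensic investigations mentioned earlier. A minimal sketch of a point-in-time read (the instant is illustrative):

# Read the table as it existed at a past instant
snapshot_df = spark.read.format("hudi") \
    .option("as.of.instant", "2024-04-01 00:00:00") \
    .load("s3://data-lake/fraud_txn_table/")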
Step 5: Use Incremental Queries for Downstream Jobs
Power alerting engines or rule-based engines using only changed data:
incr_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240401000000") \
    .load("s3://data-lake/fraud_txn_table/")
Use this for:
- Streaming enrichment jobs
- ML feature engineering pipelines
- Notifying fraud teams (see the Kafka sketch after this list)
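As one illustration of the alerting use case, the incremental batch loaded above can be filtered and published to a Kafka topic for the fraud team; the fraud-alerts topic and bootstrap servers are assumptions:

# Publish newly flagged transactions to an alerts topic
incr_df.filter("is_fraud = true") \
    .selectExpr("transaction_id AS key", "to_json(struct(*)) AS value") \
    .write.format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("topic", "fraud-alerts") \
    .save()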
Best Practices
- Use Merge-on-Read for fast ingestion and delayed compaction
- Set a proper precombine field (e.g., event_time) for deduplication
- Enable inline or async compaction to manage read performance (see the sketch after this list)
- Partition by date or region to reduce scan size
- Use Hive sync for seamless integration with Athena or Presto
- Secure your table with fine-grained access via AWS Lake Formation or Ranger
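As a sketch of the compaction bullet above, inline compaction for a MERGE_ON_READ table can be enabled with two write options; the commit threshold is an illustrative assumption:

# Merge delta logs into base files every 5 delta commits
hudi_options.update({
    'hoodie.compact.inline': 'true',
    'hoodie.compact.inline.max.delta.commits': '5'
})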
Conclusion
Apache Hudi is a powerful platform for building real-time fraud detection systems at scale. With upsert support, incremental queries, and streaming capabilities, Hudi enables fast, accurate, and secure pipelines — making it a perfect fit for modern, ACID-compliant lakehouses.
By combining Hudi with Spark, Kafka, and query tools like Athena or Presto, you can turn raw transactional streams into intelligent, real-time fraud insights.