Apache Flink is a powerful open-source framework for real-time stream processing and stateful computations. While it excels at low-latency data processing, integrating it with persistent storage systems like HDFS is essential for durability, checkpointing, and data archiving.

In this blog, we’ll explore how to use HDFS with Apache Flink, enabling fault-tolerant streaming pipelines with persistent outputs and recoverable state. This integration is especially useful in production environments that require exactly-once processing guarantees and long-term storage of stream data.


Why Use HDFS with Flink?

Combining HDFS and Flink offers several key advantages:

  • Durable state backend for checkpoints and savepoints
  • Long-term storage of processed data
  • Integration with existing Hadoop-based data lakes
  • Reliable recovery during failures
  • Batch or downstream analytics using Hive/Spark

Writing Streaming Output to HDFS

To write streaming output to HDFS, use the StreamingFileSink from Flink's filesystem connector (on recent releases, the newer FileSink plays the same role).

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;
import java.time.Duration;

StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path("hdfs://namenode:8020/flink/output"),
                  new SimpleStringEncoder<String>("UTF-8"))
    .withRollingPolicy(
        DefaultRollingPolicy.builder()
            .withRolloverInterval(Duration.ofMinutes(15)) // roll at least every 15 minutes
            .withInactivityInterval(Duration.ofMinutes(5)) // roll after 5 minutes without new records
            .withMaxPartSize(128 * 1024 * 1024) // roll once a part file reaches 128 MB
            .build())
    .build();

dataStream.addSink(sink);

The rolling policy controls when part files are rolled and finalized, making them visible to downstream consumers like Hive. Note that with row formats, pending files are committed only when a checkpoint completes, so checkpointing must be enabled for output to be finalized; in-progress and pending files are hidden (dot-prefixed by default) and ignored by most readers.


HDFS as a State Backend for Checkpointing

Flink supports multiple state backends. For large state or long-running applications, store checkpoints on HDFS via the FsStateBackend or RocksDBStateBackend (the legacy API; a newer equivalent is sketched below).

env.enableCheckpointing(60000); // checkpoint every 60 seconds
env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

Or with RocksDB:

env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints", true)); // true = incremental checkpoints

This allows Flink to recover from failures using persisted state on HDFS.
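
Note that in Flink 1.13 and later, the legacy state backend classes were split into a state backend (where working state lives) and checkpoint storage (where snapshots go). As a minimal sketch, the roughly equivalent setup on recent releases, assuming the same checkpoint path, looks like this:

// Flink 1.13+: EmbeddedRocksDBStateBackend holds working state;
// checkpoint storage points at HDFS (true enables incremental checkpoints)
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
env.getCheckpointConfig().setCheckpointStorage("hdfs://namenode:8020/flink/checkpoints");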


Configuring Flink for HDFS

  1. Ensure hadoop-common and hadoop-hdfs dependencies are on Flink's classpath.
  2. Add core-site.xml and hdfs-site.xml to $FLINK_HOME/conf/ or ship them with your application JAR.
  3. Confirm HDFS connectivity:
hdfs dfs -ls hdfs://namenode:8020/

Set HDFS permissions to allow the Flink job access, especially for the checkpoint and sink directories (see the sketch below).
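
As an illustration, assuming the job runs as a (hypothetical) flink user and a typical Hadoop client configuration directory, setup might look like:

# Point Flink at the Hadoop client configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Create and own the directories Flink will write to
hdfs dfs -mkdir -p /flink/checkpoints /flink/output
hdfs dfs -chown -R flink:flink /flink/checkpoints /flink/output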


Example: Kafka to HDFS Streaming Pipeline

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30000); // checkpoint every 30 seconds

// Kafka connection settings -- adjust the bootstrap servers and group id to your cluster
Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
kafkaProps.setProperty("group.id", "flink-hdfs-pipeline");

// Requires the flink-connector-kafka dependency
FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<>(
    "events",                 // topic to consume
    new SimpleStringSchema(), // deserialize each record as a UTF-8 string
    kafkaProps
);

DataStream<String> stream = env.addSource(kafkaSource);

StreamingFileSink<String> hdfsSink = StreamingFileSink
    .forRowFormat(new Path("hdfs://namenode:8020/streams/events"),
                  new SimpleStringEncoder<String>("UTF-8"))
    .withRollingPolicy(DefaultRollingPolicy.builder()
        .withRolloverInterval(Duration.ofMinutes(10))
        .build())
    .build();

stream.addSink(hdfsSink);

env.execute("Kafka to HDFS Pipeline");

This pipeline reads messages from Kafka and writes them to HDFS in rolling files, with checkpointing for fault tolerance.


Integration with Hive and Downstream Analytics

Once Flink writes data to HDFS, Hive or Spark can pick it up for batch processing or analytics.

Example Hive external table:

CREATE EXTERNAL TABLE stream_events (
  event STRING
)
STORED AS TEXTFILE
LOCATION 'hdfs://namenode:8020/streams/events';

For columnar ORC/Parquet output, use Flink's bulk formats or the Table API filesystem connector; a minimal Parquet sketch follows.
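
As a sketch, assuming a (hypothetical) Event POJO and the flink-parquet dependency on the classpath, a Parquet bulk sink could look like this. Note that bulk formats roll part files on every checkpoint rather than by size or time:

StreamingFileSink<Event> parquetSink = StreamingFileSink
    .forBulkFormat(new Path("hdfs://namenode:8020/streams/events_parquet"), // hypothetical path
                   ParquetAvroWriters.forReflectRecord(Event.class))
    .build();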


Monitoring and Fault Tolerance

  • Enable checkpointing metrics in Flink dashboard
  • Store checkpoints on dedicated HDFS directories with proper ACLs
  • Tune minPauseBetweenCheckpoints, the checkpoint timeout, and maxConcurrentCheckpoints for stability (see the sketch after this list)
  • Use HA JobManager with ZooKeeper for production-grade recovery
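
These settings live on the CheckpointConfig; the values below are illustrative starting points rather than recommendations:

CheckpointConfig config = env.getCheckpointConfig();
config.setMinPauseBetweenCheckpoints(10000); // wait at least 10 s after a checkpoint completes
config.setCheckpointTimeout(120000);         // abort a checkpoint that takes longer than 2 min
config.setMaxConcurrentCheckpoints(1);       // never run overlapping checkpoints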

Best Practices

  • Use RocksDB with HDFS backend for large stateful jobs
  • Avoid small file problems — use appropriate rolling policy
  • Secure HDFS with Kerberos if running in secure environments
  • Set up retention policies on checkpoint directories to avoid disk bloat
  • Use withBucketAssigner() to write partitioned output by time, user, etc. (see the sketch below)
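
For time-partitioned output, the built-in DateTimeBucketAssigner can be plugged into the sink builder; the hourly path format below is just one choice:

StreamingFileSink<String> partitionedSink = StreamingFileSink
    .forRowFormat(new Path("hdfs://namenode:8020/streams/events"),
                  new SimpleStringEncoder<String>("UTF-8"))
    .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH")) // one bucket directory per hour
    .build();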

Conclusion

Integrating Apache Flink with HDFS gives you the power of real-time stream processing with the durability and scalability of Hadoop. Whether you’re building streaming ETL, IoT pipelines, or log processors, this combination provides a strong foundation for fault-tolerant, stateful streaming applications.

By following the right architecture and tuning, you can deliver reliable low-latency pipelines that scale across distributed environments — while ensuring data durability and downstream compatibility with batch systems.