As data pipelines grow in complexity, the need for flexible, scalable, real-time data ingestion becomes critical. Apache NiFi, a powerful dataflow automation tool, integrates natively with HDFS, making it an ideal solution for ingesting structured and unstructured data into Hadoop ecosystems.

This post explores how to use Apache NiFi with HDFS to build scalable data ingestion workflows. We’ll cover core processors, architecture considerations, performance tuning, and best practices for ensuring durable and traceable data movement.


Why Use Apache NiFi for HDFS Ingestion?

Apache NiFi provides:

  • Visual flow design with a drag-and-drop interface
  • Native HDFS support via processors
  • Support for batch and real-time ingestion
  • Fine-grained data provenance tracking
  • Built-in retry and backpressure mechanisms

Together with HDFS, it allows you to create robust ingestion workflows that scale with your data lake.


Key HDFS Processors in NiFi

Apache NiFi includes several processors designed for interacting with HDFS:

  1. PutHDFS: Writes FlowFiles to HDFS
  2. FetchHDFS: Retrieves files from HDFS
  3. ListHDFS: Lists files in a directory (for polling)
  4. PutParquet / PutORC: Write data in columnar formats (via NARs)
  5. PutDistributedMapCache: Caches state for deduplication
  6. MergeContent: Combines small FlowFiles before writing (not HDFS-specific, but essential for avoiding the small-files problem)

These processors use Hadoop’s client APIs and require correct configuration of core-site.xml and hdfs-site.xml. A sample MergeContent configuration is sketched below.
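A minimal sketch of a MergeContent step sized for HDFS, with illustrative values (tune the entry counts and sizes to your flow’s volume):

Merge Strategy: Bin-Packing Algorithm
Merge Format: Binary Concatenation
Minimum Number of Entries: 1000
Maximum Number of Entries: 10000
Minimum Group Size: 128 MB
Max Bin Age: 5 min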


Setting Up NiFi for HDFS

  1. Place core-site.xml and hdfs-site.xml into NiFi’s conf/ directory or set the Hadoop classpath via the HADOOP_CONF_DIR environment variable.

  2. Configure the PutHDFS processor:

Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml  
Directory: /user/data/raw  
Conflict Resolution Strategy: replace / ignore / fail  
Compression Codec: none / gzip / snappy  
  3. Assign a valid Kerberos principal and keytab if using secure Hadoop, for example:
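Depending on your NiFi version, these are set directly on the processor or through a Kerberos Credentials Service; the principal and keytab path below are hypothetical:

Kerberos Principal: nifi@EXAMPLE.COM
Kerberos Keytab: /etc/security/keytabs/nifi.keytab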

Example Workflow: Ingesting Logs from API to HDFS

Step-by-step flow:

  1. InvokeHTTP – Pulls logs from an external REST API
  2. ExtractText (or EvaluateJsonPath / EvaluateXPath for JSON/XML) – Extracts fields from the response
  3. RouteOnAttribute – Routes data by type or region (see the sketch after this list)
  4. UpdateAttribute – Renames files or tags metadata
  5. PutHDFS – Writes data into the appropriate HDFS partition

Optional: use MergeContent to group small files into larger ones (ideally near the HDFS block size) before writing to HDFS.
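For step 3, RouteOnAttribute routes FlowFiles via dynamic properties: each property name becomes a relationship, and its value is a boolean Expression Language statement. A minimal sketch, assuming an upstream step set a region attribute (attribute and relationship names are hypothetical):

eu_logs: ${region:equals('eu')}
us_logs: ${region:equals('us')}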


Organizing Output in HDFS by Time

Use Expression Language in PutHDFS to dynamically route data:

Directory: /user/logs/${now():format("yyyy/MM/dd/HH")}

This creates hour-based partitioning for efficient querying and downstream Hive table consumption.
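Attributes can be combined with time-based partitioning as well; for instance, assuming an upstream UpdateAttribute step set a log.type attribute (hypothetical name):

Directory: /user/logs/${log.type}/${now():format("yyyy/MM/dd/HH")}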


Ensuring Fault Tolerance

  • Backpressure: Set thresholds on queues to avoid memory overrun (defaults shown below)
  • Retries: Configure retry attempts and penalty durations in processors
  • Provenance: Use data lineage view for traceability
  • Connection failure handling: Route failures to dead-letter queues for inspection

NiFi persists FlowFile state in its repositories and can retry failed operations, making it resilient for ingestion tasks.
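Backpressure is configured per connection in the queue’s settings; NiFi’s defaults are shown below, and lowering them causes upstream processors to pause sooner:

Back Pressure Object Threshold: 10000
Back Pressure Data Size Threshold: 1 GB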


Monitoring and Performance Tuning

  • Use Site-to-Site for cluster-to-cluster ingestion
  • Watch the bulletin board to catch processor errors
  • Monitor JVM heap usage and garbage collection (heap sizing sketch below)
  • Use NiFi Registry for versioning flows
  • Tune nifi.content.repository.archive.max.retention.period and other buffer settings for high-throughput workloads

Use the PrometheusReportingTask or NiFi’s built-in metrics for observability.
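For the JVM heap item above, heap sizes live in conf/bootstrap.conf; a minimal sketch for a higher-throughput node (the 4g values are illustrative, size them to your hardware):

# conf/bootstrap.conf
java.arg.2=-Xms4g
java.arg.3=-Xmx4g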


Best Practices

  • Minimize the number of small files by using MergeContent
  • Prefer PutHDFS to custom scripts for built-in fault tolerance and provenance
  • Tag FlowFiles with attributes for downstream processing
  • Secure flows with HTTPS and Kerberos
  • Use parameter contexts for managing environments (dev, prod), as sketched below
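Parameters are referenced in processor properties with the #{...} syntax, so one flow definition can move between environments unchanged (the parameter name below is hypothetical):

Directory: #{hdfs.output.dir}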

Integration with Hive and Downstream Tools

Once data is in HDFS, define Hive external tables to query ingested data:

CREATE EXTERNAL TABLE api_logs (
  id STRING,
  message STRING,
  ts TIMESTAMP
)
STORED AS TEXTFILE
LOCATION 'hdfs://namenode:8020/user/logs/';
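If you adopted the hour-based layout from the earlier PutHDFS example, a partitioned table usually queries more efficiently. A sketch with illustrative partition columns and dates; because the directories are not in Hive’s key=value format, partitions are registered explicitly rather than with MSCK REPAIR TABLE:

CREATE EXTERNAL TABLE api_logs_partitioned (
  id STRING,
  message STRING,
  ts TIMESTAMP
)
PARTITIONED BY (yr STRING, mo STRING, dy STRING, hr STRING)
STORED AS TEXTFILE
LOCATION 'hdfs://namenode:8020/user/logs/';

ALTER TABLE api_logs_partitioned
  ADD PARTITION (yr='2024', mo='05', dy='01', hr='10')
  LOCATION 'hdfs://namenode:8020/user/logs/2024/05/01/10';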

You can also trigger Apache Spark, Hive ETL, or Kafka ingestion from within NiFi using the respective processors.


Conclusion

Apache NiFi and HDFS together form a powerful ingestion stack that’s both easy to manage and highly scalable. Whether you’re onboarding data from APIs, filesystems, databases, or message queues, NiFi’s built-in processors, combined with HDFS’s durability, make it an ideal choice for modern data lake ingestion workflows.

With proper configuration, monitoring, and best practices, you can build resilient, secure, and traceable pipelines that ingest and organize data for analytics at scale.