As data pipelines grow in complexity, the need for flexible, scalable, real-time data ingestion becomes critical. Apache NiFi, a powerful dataflow automation tool, integrates natively with HDFS, making it an ideal solution for ingesting structured and unstructured data into Hadoop ecosystems.

This post explores how to use Apache NiFi with HDFS to build scalable data ingestion workflows. We’ll cover core processors, architecture considerations, performance tuning, and best practices for ensuring durable and traceable data movement.


Why Use Apache NiFi for HDFS Ingestion?

Apache NiFi provides:

  • Visual flow design with a drag-and-drop interface
  • Native HDFS support via processors
  • Support for batch and real-time ingestion
  • Fine-grained data provenance tracking
  • Built-in retry and backpressure mechanisms

Together with HDFS, it allows you to create robust ingestion workflows that scale with your data lake.


Key HDFS Processors in NiFi

Apache NiFi includes several processors designed for interacting with HDFS:

  1. PutHDFS: Writes FlowFiles to HDFS
  2. FetchHDFS: Retrieves files from HDFS
  3. ListHDFS: Lists files in a directory (for polling)
  4. PutParquet / PutORC: Write data in columnar formats (via NARs)
  5. PutDistributedMapCache: Caches state for deduplication
  6. MergeContent: Combines small FlowFiles before writing (not HDFS-specific, but essential for avoiding the small-files problem)

These processors use Hadoop’s client APIs and require correct configuration of core-site.xml and hdfs-site.xml. A sample MergeContent configuration is sketched below.
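A minimal sketch of a MergeContent step sized for HDFS, with illustrative values (tune the entry counts and sizes to your flow’s volume):

Merge Strategy: Bin-Packing Algorithm
Merge Format: Binary Concatenation
Minimum Number of Entries: 1000
Maximum Number of Entries: 10000
Minimum Group Size: 128 MB
Max Bin Age: 5 min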


Setting Up NiFi for HDFS

  1. Place core-site.xml and hdfs-site.xml into NiFi’s conf/ directory or set the Hadoop classpath via the HADOOP_CONF_DIR environment variable.

  2. Configure the PutHDFS processor:

Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml  
Directory: /user/data/raw  
Conflict Resolution Strategy: replace / ignore / fail  
Compression Codec: none / gzip / snappy  
  3. Assign a valid Kerberos principal and keytab if using secure Hadoop, for example:
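Depending on your NiFi version, these are set directly on the processor or through a Kerberos Credentials Service; the principal and keytab path below are hypothetical:

Kerberos Principal: nifi@EXAMPLE.COM
Kerberos Keytab: /etc/security/keytabs/nifi.keytab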

Example Workflow: Ingesting Logs from API to HDFS

Step-by-step flow:

  1. InvokeHTTP – Pulls logs from an external REST API
  2. ExtractText (or EvaluateJsonPath / EvaluateXPath for JSON/XML) – Extracts fields from the response
  3. RouteOnAttribute – Routes data by type or region (see the sketch after this list)
  4. UpdateAttribute – Renames files or tags metadata
  5. PutHDFS – Writes data into the appropriate HDFS partition

Optional: use MergeContent to group small files into larger ones (ideally near the HDFS block size) before writing to HDFS.
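For step 3, RouteOnAttribute routes FlowFiles via dynamic properties: each property name becomes a relationship, and its value is a boolean Expression Language statement. A minimal sketch, assuming an upstream step set a region attribute (attribute and relationship names are hypothetical):

eu_logs: ${region:equals('eu')}
us_logs: ${region:equals('us')}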


Organizing Output in HDFS by Time

Use Expression Language in PutHDFS to dynamically route data:

Directory: /user/logs/${now():format("yyyy/MM/dd/HH")}

This creates hour-based partitioning for efficient querying and downstream Hive table consumption.
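Attributes can be combined with time-based partitioning as well; for instance, assuming an upstream UpdateAttribute step set a log.type attribute (hypothetical name):

Directory: /user/logs/${log.type}/${now():format("yyyy/MM/dd/HH")}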


Ensuring Fault Tolerance

  • Backpressure: Set thresholds on queues to avoid memory overrun (defaults shown below)
  • Retries: Configure retry attempts and penalty durations in processors
  • Provenance: Use data lineage view for traceability
  • Connection failure handling: Route failures to dead-letter queues for inspection

NiFi persists FlowFile state in its repositories and can retry failed operations, making it resilient for ingestion tasks.
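Backpressure is configured per connection in the queue’s settings; NiFi’s defaults are shown below, and lowering them causes upstream processors to pause sooner:

Back Pressure Object Threshold: 10000
Back Pressure Data Size Threshold: 1 GB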


Monitoring and Performance Tuning

  • Use Site-to-Site for cluster-to-cluster ingestion
  • Watch the bulletin board to catch processor errors
  • Monitor JVM heap usage and garbage collection (heap sizing sketch below)
  • Use NiFi Registry for versioning flows
  • Tune nifi.content.repository.archive.max.retention.period and other buffer settings for high-throughput workloads

Use the PrometheusReportingTask or NiFi’s built-in metrics for observability.
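For the JVM heap item above, heap sizes live in conf/bootstrap.conf; a minimal sketch for a higher-throughput node (the 4g values are illustrative, size them to your hardware):

# conf/bootstrap.conf
java.arg.2=-Xms4g
java.arg.3=-Xmx4g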


Best Practices

  • Minimize the number of small files by using MergeContent
  • Prefer PutHDFS to custom scripts for built-in fault tolerance and provenance
  • Tag FlowFiles with attributes for downstream processing
  • Secure flows with HTTPS and Kerberos
  • Use parameter contexts for managing environments (dev, prod), as sketched below
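Parameters are referenced in processor properties with the #{...} syntax, so one flow definition can move between environments unchanged (the parameter name below is hypothetical):

Directory: #{hdfs.output.dir}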

Integration with Hive and Downstream Tools

Once data is in HDFS, define Hive external tables to query ingested data:

CREATE EXTERNAL TABLE api_logs (
  id STRING,
  message STRING,
  ts TIMESTAMP
)
STORED AS TEXTFILE
LOCATION 'hdfs://namenode:8020/user/logs/';
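If you adopted the hour-based layout from the earlier PutHDFS example, a partitioned table usually queries more efficiently. A sketch with illustrative partition columns and dates; because the directories are not in Hive’s key=value format, partitions are registered explicitly rather than with MSCK REPAIR TABLE:

CREATE EXTERNAL TABLE api_logs_partitioned (
  id STRING,
  message STRING,
  ts TIMESTAMP
)
PARTITIONED BY (yr STRING, mo STRING, dy STRING, hr STRING)
STORED AS TEXTFILE
LOCATION 'hdfs://namenode:8020/user/logs/';

ALTER TABLE api_logs_partitioned
  ADD PARTITION (yr='2024', mo='05', dy='01', hr='10')
  LOCATION 'hdfs://namenode:8020/user/logs/2024/05/01/10';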

You can also trigger Apache Spark, Hive ETL, or Kafka ingestion from within NiFi using the respective processors.


Conclusion

Apache NiFi and HDFS together form a powerful ingestion stack that’s both easy to manage and highly scalable. Whether you’re onboarding data from APIs, filesystems, databases, or message queues, NiFi’s built-in processors, combined with HDFS’s durability, make it an ideal choice for modern data lake ingestion workflows.

With proper configuration, monitoring, and best practices, you can build resilient, secure, and traceable pipelines that ingest and organize data for analytics at scale.