Leveraging HDFS for Machine Learning Data Storage
Use Hadoop Distributed File System (HDFS) to manage scalable, reliable storage for ML pipelines
Training and deploying machine learning (ML) models at scale requires a robust data storage layer that can handle large volumes of structured and unstructured data. The Hadoop Distributed File System (HDFS) is a natural fit for machine learning pipelines due to its scalability, fault tolerance, and high throughput.
This blog post explores how to leverage HDFS for machine learning data storage, integrate it into your ML workflows, and apply best practices for storing training data, features, models, and more.
Why Use HDFS for ML Data Storage?
HDFS offers several benefits for ML workloads:
- Scalable storage for terabytes to petabytes of data
- High throughput I/O for training large models
- Fault tolerance via block replication
- Compatibility with ML tools like Spark, TensorFlow, PyTorch, and Scikit-learn
- Cost-effective storage across commodity hardware or cloud-backed Hadoop clusters
Common ML Data Types Stored in HDFS
- Raw data (CSV, JSON, Avro, images, audio)
- Preprocessed training data (normalized or cleaned)
- Feature vectors (NumPy arrays, sparse matrices)
- Labels and targets
- Model checkpoints and artifacts
- Logs and metrics from experiments
Organizing this data efficiently in HDFS ensures performance and reproducibility.
Directory Structure for ML Projects
Create a modular structure for your datasets:
/ml-projects/
├── images/
├── audio/
├── text/
├── features/
├── labels/
├── checkpoints/
├── logs/
└── predictions/
Example commands:
hdfs dfs -mkdir -p /ml-projects/image-classification/features
hdfs dfs -put features_batch_1.csv /ml-projects/image-classification/features/
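If you prefer to script this setup, here is a minimal sketch using the hdfs Python package (a WebHDFS client); the NameNode URL, port, and user name are assumptions about your cluster:
from hdfs import InsecureClient  # pip install hdfs
# Assumed WebHDFS endpoint and user -- adjust to your cluster (9870 is the Hadoop 3 default port)
client = InsecureClient("http://namenode:9870", user="mlengineer")
# Create the project layout in one pass
for subdir in ["images", "audio", "text", "features", "labels", "checkpoints", "logs", "predictions"]:
    client.makedirs(f"/ml-projects/{subdir}")
# Upload a local feature batch into the features directory
client.upload("/ml-projects/features/features_batch_1.csv", "features_batch_1.csv", overwrite=True)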
Integrating HDFS with Spark for ML
Apache Spark MLlib and PySpark can read and write data from HDFS directly:
Loading data from HDFS:
from pyspark.sql import SparkSession
# Start a Spark session and read the training CSV directly from HDFS
spark = SparkSession.builder.appName("ML").getOrCreate()
df = spark.read.csv("hdfs:///ml-projects/train_data.csv", header=True, inferSchema=True)
Saving preprocessed data:
df_cleaned.write.parquet("hdfs:///ml-projects/cleaned_data.parquet")
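As a hedged sketch of a typical preprocessing step, the example below loads the raw CSV, assembles numeric columns into an MLlib feature vector, and writes the result back to HDFS as Parquet; the column names f1, f2, and label are assumptions about the input schema:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
spark = SparkSession.builder.appName("feature-prep").getOrCreate()
# Load the raw training data from HDFS (columns f1, f2, label are assumed)
raw_df = spark.read.csv("hdfs:///ml-projects/train_data.csv", header=True, inferSchema=True)
# Combine the numeric columns into a single MLlib feature vector
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
features_df = assembler.transform(raw_df).select("features", "label")
# Persist the feature table to HDFS in Parquet for downstream training
features_df.write.mode("overwrite").parquet("hdfs:///ml-projects/features/train_features.parquet")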
Using HDFS with TensorFlow and PyTorch
TensorFlow:
Use the tf.data API to load data directly from HDFS:
filenames = ["hdfs:///ml-projects/features/train.tfrecord"]
raw_dataset = tf.data.TFRecordDataset(filenames)
Enable HDFS file access via the Hadoop native client libraries (libhdfs); in recent TensorFlow releases, the HDFS filesystem plugin is provided by the tensorflow-io package.
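As a hedged sketch of a full input pipeline, the example below parses the TFRecord file referenced above into (image, label) pairs and batches it for training; the feature names, dtypes, and shapes are assumptions about how the records were written:
import tensorflow as tf
# Assumed record schema: a raw-byte image tensor plus an integer label
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_raw(parsed["image"], tf.float32)  # assumes float32 pixel data
    return image, parsed["label"]
dataset = (
    tf.data.TFRecordDataset(["hdfs:///ml-projects/features/train.tfrecord"])
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)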
PyTorch:
Use torch.utils.data.Dataset with custom HDFS loaders (or mount HDFS as a FUSE filesystem if needed).
Alternatively, use Petastorm or Arrow for PyTorch + HDFS integration.
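One possible Arrow-based approach, sketched below, reads a Parquet feature table from HDFS with PyArrow's HadoopFileSystem and wraps it in a torch Dataset; the NameNode host/port, file path, and column layout are assumptions:
import torch
import pyarrow.fs as pafs
import pyarrow.parquet as pq
from torch.utils.data import Dataset, DataLoader
class HdfsParquetDataset(Dataset):
    """Loads a Parquet feature table from HDFS into memory as tensors."""
    def __init__(self, path, host="namenode", port=8020):
        fs = pafs.HadoopFileSystem(host, port)      # requires libhdfs on the client machine
        table = pq.read_table(path, filesystem=fs)  # assumes a list<float> "features" column and an int "label" column
        self.features = torch.tensor(table.column("features").to_pylist(), dtype=torch.float32)
        self.labels = torch.tensor(table.column("label").to_pylist(), dtype=torch.long)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
# Illustrative usage -- the path, host, and port above are placeholders
loader = DataLoader(HdfsParquetDataset("/ml-projects/features/train.parquet"), batch_size=64, shuffle=True)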
Storing and Versioning ML Models in HDFS
Store model checkpoints and serialized models (e.g., .pkl, .pt, .pb) in a dedicated directory:
hdfs dfs -mkdir -p /ml-projects/models/image-classifier/v1
hdfs dfs -put model_checkpoint.pt /ml-projects/models/image-classifier/v1/
Track versions using directory naming conventions or metadata files.
Integrate with MLflow, DVC, or custom metadata registries that support HDFS.
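If you are not using a dedicated registry, a lightweight version of this pattern is to write each run into a new versioned directory together with a small metadata file. The sketch below shells out to the hdfs CLI; the version scheme, file names, and metadata fields are assumptions:
import json
import subprocess
import torch
def save_versioned_model(model, version, base="/ml-projects/models/image-classifier"):
    """Write a checkpoint and a metadata file locally, then push both to a versioned HDFS directory."""
    torch.save(model.state_dict(), "model_checkpoint.pt")
    with open("metadata.json", "w") as f:
        json.dump({"version": version, "framework": "pytorch"}, f)
    target = f"{base}/v{version}"  # e.g. .../image-classifier/v1
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", "model_checkpoint.pt", "metadata.json", target], check=True)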
Access Control and Security
Use Kerberos, Ranger, or HDFS ACLs to restrict access to sensitive datasets:
hdfs dfs -chmod 700 /ml-projects/confidential-data
hdfs dfs -setfacl -m user:mlengineer:rwx /ml-projects/confidential-data
This helps ensure that sensitive training data and model outputs are accessible only to authorized users and services.
Best Practices
- Use columnar formats like Parquet for feature storage (see the sketch after this list)
- Compress large files with Snappy or Gzip to reduce I/O
- Partition data by label/class or date for fast access
- Store metadata files (JSON/YAML) alongside datasets
- Clean up intermediate data to save space
- Monitor HDFS usage with hdfs dfsadmin -report
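To illustrate a few of these practices together, here is a minimal PySpark sketch that writes a feature table as Snappy-compressed Parquet partitioned by label and ingestion date; the DataFrame name, columns, and target path are assumptions:
# Assumes a Spark DataFrame features_df with columns: features, label, ingest_date
(features_df.write
    .mode("overwrite")
    .partitionBy("label", "ingest_date")   # readers can prune partitions by label or date
    .option("compression", "snappy")       # lightweight compression to reduce I/O
    .parquet("hdfs:///ml-projects/features/partitioned"))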
Conclusion
HDFS is a powerful backend for scalable machine learning data storage, especially when working with big data and distributed training. Its integration with the Hadoop ecosystem and compatibility with Spark, TensorFlow, and other ML tools make it ideal for handling the full lifecycle of ML workflows, from data ingestion to model storage.
By following the practices outlined here, you can build robust, scalable, and secure ML pipelines that efficiently manage data at every stage of the machine learning lifecycle.