Leveraging HDFS for Machine Learning Data Storage
Use Hadoop Distributed File System (HDFS) to manage scalable, reliable storage for ML pipelines
Training and deploying machine learning (ML) models at scale requires a robust data storage layer that can handle large volumes of structured and unstructured data. The Hadoop Distributed File System (HDFS) is a natural fit for machine learning pipelines due to its scalability, fault tolerance, and high throughput.
This blog post explores how to leverage HDFS for machine learning data storage, integrate it into your ML workflows, and apply best practices for storing training data, features, models, and more.
Why Use HDFS for ML Data Storage?
HDFS offers several benefits for ML workloads:
- Scalable storage for terabytes to petabytes of data
- High throughput I/O for training large models
- Fault tolerance via block replication
- Compatibility with ML tools like Spark, TensorFlow, PyTorch, and Scikit-learn
- Cost-effective storage across commodity hardware or cloud-backed Hadoop clusters
Common ML Data Types Stored in HDFS
- Raw data (CSV, JSON, Avro, images, audio)
- Preprocessed training data (normalized or cleaned)
- Feature vectors (NumPy arrays, sparse matrices)
- Labels and targets
- Model checkpoints and artifacts
- Logs and metrics from experiments
Organizing this data efficiently in HDFS ensures performance and reproducibility.
Directory Structure for ML Projects
Create a modular structure for your datasets:
/ml-projects/
├── images/
├── audio/
├── text/
├── features/
├── labels/
├── checkpoints/
├── logs/
└── predictions/
Example commands:
hdfs dfs -mkdir -p /ml-projects/image-classification/features
hdfs dfs -put features_batch_1.csv /ml-projects/image-classification/features/
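If you prefer to script this setup, here is a minimal sketch using the hdfs Python package (a WebHDFS client); the NameNode URL, port, and user name are assumptions about your cluster:
from hdfs import InsecureClient  # pip install hdfs
# Assumed WebHDFS endpoint and user -- adjust to your cluster (9870 is the Hadoop 3 default port)
client = InsecureClient("http://namenode:9870", user="mlengineer")
# Create the project layout in one pass
for subdir in ["images", "audio", "text", "features", "labels", "checkpoints", "logs", "predictions"]:
    client.makedirs(f"/ml-projects/{subdir}")
# Upload a local feature batch into the features directory
client.upload("/ml-projects/features/features_batch_1.csv", "features_batch_1.csv", overwrite=True)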
Integrating HDFS with Spark for ML
Apache Spark MLlib and PySpark can read and write data from HDFS directly:
Loading data from HDFS:
from pyspark.sql import SparkSession
# Start a Spark session and read the training CSV directly from HDFS
spark = SparkSession.builder.appName("ML").getOrCreate()
df = spark.read.csv("hdfs:///ml-projects/train_data.csv", header=True, inferSchema=True)
Saving preprocessed data:
df_cleaned.write.parquet("hdfs:///ml-projects/cleaned_data.parquet")
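As a hedged sketch of a typical preprocessing step, the example below loads the raw CSV, assembles numeric columns into an MLlib feature vector, and writes the result back to HDFS as Parquet; the column names f1, f2, and label are assumptions about the input schema:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
spark = SparkSession.builder.appName("feature-prep").getOrCreate()
# Load the raw training data from HDFS (columns f1, f2, label are assumed)
raw_df = spark.read.csv("hdfs:///ml-projects/train_data.csv", header=True, inferSchema=True)
# Combine the numeric columns into a single MLlib feature vector
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
features_df = assembler.transform(raw_df).select("features", "label")
# Persist the feature table to HDFS in Parquet for downstream training
features_df.write.mode("overwrite").parquet("hdfs:///ml-projects/features/train_features.parquet")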
Using HDFS with TensorFlow and PyTorch
TensorFlow:
Use the tf.data API to load data directly from HDFS:
filenames = ["hdfs:///ml-projects/features/train.tfrecord"]
raw_dataset = tf.data.TFRecordDataset(filenames)
Enable HDFS file access via the Hadoop native client libraries (libhdfs); in recent TensorFlow releases, the HDFS filesystem plugin is provided by the tensorflow-io package.
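As a hedged sketch of a full input pipeline, the example below parses the TFRecord file referenced above into (image, label) pairs and batches it for training; the feature names, dtypes, and shapes are assumptions about how the records were written:
import tensorflow as tf
# Assumed record schema: a raw-byte image tensor plus an integer label
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_raw(parsed["image"], tf.float32)  # assumes float32 pixel data
    return image, parsed["label"]
dataset = (
    tf.data.TFRecordDataset(["hdfs:///ml-projects/features/train.tfrecord"])
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)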
PyTorch:
Use torch.utils.data.Dataset with custom HDFS loaders (or mount HDFS as a FUSE filesystem if needed).
Alternatively, use Petastorm or Arrow for PyTorch + HDFS integration.
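One possible Arrow-based approach, sketched below, reads a Parquet feature table from HDFS with PyArrow's HadoopFileSystem and wraps it in a torch Dataset; the NameNode host/port, file path, and column layout are assumptions:
import torch
import pyarrow.fs as pafs
import pyarrow.parquet as pq
from torch.utils.data import Dataset, DataLoader
class HdfsParquetDataset(Dataset):
    """Loads a Parquet feature table from HDFS into memory as tensors."""
    def __init__(self, path, host="namenode", port=8020):
        fs = pafs.HadoopFileSystem(host, port)      # requires libhdfs on the client machine
        table = pq.read_table(path, filesystem=fs)  # assumes a list<float> "features" column and an int "label" column
        self.features = torch.tensor(table.column("features").to_pylist(), dtype=torch.float32)
        self.labels = torch.tensor(table.column("label").to_pylist(), dtype=torch.long)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
# Illustrative usage -- the path, host, and port above are placeholders
loader = DataLoader(HdfsParquetDataset("/ml-projects/features/train.parquet"), batch_size=64, shuffle=True)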
Storing and Versioning ML Models in HDFS
Store model checkpoints and serialized models (e.g., .pkl, .pt, .pb) in a dedicated directory:
hdfs dfs -mkdir -p /ml-projects/models/image-classifier/v1
hdfs dfs -put model_checkpoint.pt /ml-projects/models/image-classifier/v1/
Track versions using directory naming conventions or metadata files.
Integrate with MLflow, DVC, or custom metadata registries that support HDFS.
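If you are not using a dedicated registry, a lightweight version of this pattern is to write each run into a new versioned directory together with a small metadata file. The sketch below shells out to the hdfs CLI; the version scheme, file names, and metadata fields are assumptions:
import json
import subprocess
import torch
def save_versioned_model(model, version, base="/ml-projects/models/image-classifier"):
    """Write a checkpoint and a metadata file locally, then push both to a versioned HDFS directory."""
    torch.save(model.state_dict(), "model_checkpoint.pt")
    with open("metadata.json", "w") as f:
        json.dump({"version": version, "framework": "pytorch"}, f)
    target = f"{base}/v{version}"  # e.g. .../image-classifier/v1
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", "model_checkpoint.pt", "metadata.json", target], check=True)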
Access Control and Security
Use Kerberos, Ranger, or HDFS ACLs to restrict access to sensitive datasets:
hdfs dfs -chmod 700 /ml-projects/confidential-data
hdfs dfs -setfacl -m user:mlengineer:rwx /ml-projects/confidential-data
This helps ensure that sensitive training data and model outputs are accessible only to authorized users and services.
Best Practices
- Use columnar formats like Parquet for feature storage (see the sketch after this list)
- Compress large files with Snappy or Gzip to reduce I/O
- Partition data by label/class or date for fast access
- Store metadata files (JSON/YAML) alongside datasets
- Clean up intermediate data to save space
- Monitor HDFS usage with hdfs dfsadmin -report
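To illustrate a few of these practices together, here is a minimal PySpark sketch that writes a feature table as Snappy-compressed Parquet partitioned by label and ingestion date; the DataFrame name, columns, and target path are assumptions:
# Assumes a Spark DataFrame features_df with columns: features, label, ingest_date
(features_df.write
    .mode("overwrite")
    .partitionBy("label", "ingest_date")   # readers can prune partitions by label or date
    .option("compression", "snappy")       # lightweight compression to reduce I/O
    .parquet("hdfs:///ml-projects/features/partitioned"))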
Conclusion
HDFS is a powerful backend for scalable machine learning data storage, especially when working with big data and distributed training. Its integration with the Hadoop ecosystem and compatibility with Spark, TensorFlow, and other ML tools make it ideal for handling the full lifecycle of ML workflows, from data ingestion to model storage.
By following the practices outlined here, you can build robust, scalable, and secure ML pipelines that efficiently manage data at every stage of the machine learning lifecycle.