Exploring Hive Data Compression Techniques for Efficient Storage and Performance
Learn how to use compression in Hive to reduce storage costs and boost query speed
Big data comes with big storage requirements. As data warehouses scale to handle petabytes of information, efficient storage and fast retrieval become critical. Hive offers a variety of data compression techniques to address these challenges — reducing storage costs and speeding up query performance.
In this guide, we’ll explore how to use compression in Apache Hive, understand supported formats and codecs, and learn best practices for applying compression in real-world analytics workloads.
Why Use Compression in Hive?
Benefits of data compression:
- Reduced HDFS storage footprint
- Lower I/O costs during query execution
- Faster data transfer over the network
- Better performance when combined with vectorized execution
Compression is especially useful for large columnar datasets stored in ORC or Parquet formats.
File Formats and Compression Compatibility
Hive supports various file formats — and each supports different compression codecs.
Format | Compression Support | Recommended Codec |
---|---|---|
ORC | Built-in (Zlib, Snappy) | Zlib or Snappy |
Parquet | Built-in (Snappy, Gzip) | Snappy |
Text | Gzip, Bzip2 | Gzip |
Avro | Deflate, Snappy | Snappy |
RCFile | Gzip, Bzip2, Snappy | Snappy |
For analytics, ORC with Zlib or Snappy is generally the best combination for Hive performance.
Compressing ORC Tables
ORC supports built-in lightweight compression. Use Snappy for speed or Zlib for better compression ratios.
CREATE TABLE logs_orc (
user_id STRING,
action STRING,
event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
Alternatively, set compression globally:
SET hive.exec.compress.output=true;
SET hive.exec.orc.default.compress=SNAPPY;
Other valid values: ZLIB, NONE
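To confirm which codec a table is actually using, inspect its properties. The statement below refers to the logs_orc table created earlier:
-- Lists the table parameters, including orc.compress, under "Table Parameters"
DESCRIBE FORMATTED logs_orc;
If you have a file path handy, the ORC file dump utility (hive --orcfiledump <hdfs-file-path>) can also report the compression codec of individual ORC files.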
Compressing Parquet Tables
Parquet also supports built-in compression. To use Snappy:
CREATE TABLE user_parquet (
user_id STRING,
name STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");
Global config:
SET parquet.compression=SNAPPY;
Parquet with Snappy is ideal for fast analytics and BI workloads using Presto or Hive.
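As a sketch, you can also convert an existing table to Snappy-compressed Parquet in a single step with CREATE TABLE ... AS SELECT; the source table users_raw below is hypothetical:
-- Copies users_raw into a new Snappy-compressed Parquet table (users_raw is illustrative)
CREATE TABLE user_parquet_snappy
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY")
AS
SELECT user_id, name
FROM users_raw;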
Compressing Text-Based Tables
For legacy systems using text files:
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
Then create the table:
CREATE TABLE raw_logs (
line STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Use compressed inserts during ETL.
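For example, with the Gzip settings above active in the session, an insert from a staging table writes Gzip-compressed text files into raw_logs; the staging_logs table here is an assumption for illustration:
-- Produces .gz part files under the raw_logs HDFS location
-- (staging_logs is an assumed source table, not defined in this guide)
INSERT OVERWRITE TABLE raw_logs
SELECT line
FROM staging_logs;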
Compressing Output from Hive Queries
Enable compression for Hive query results (e.g., when writing to HDFS or exporting):
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
To write compressed results into another table:
INSERT OVERWRITE TABLE compressed_sales
SELECT * FROM sales;
Ensure that the target table's file format supports the chosen compression codec.
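The compressed_sales table is not defined above; one possible definition that supports compressed output is a Snappy-compressed ORC table whose columns mirror sales. The column list below is purely illustrative:
-- Illustrative target table; the columns must match the SELECT list from sales
CREATE TABLE compressed_sales (
  sale_id    STRING,
  product_id STRING,
  amount     DECIMAL(10,2),
  sale_ts    TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");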
Compression and Performance Trade-offs
Codec | Compression Ratio | Compression Speed | Decompression Speed | Use Case |
---|---|---|---|---|
Snappy | Medium | Very Fast | Very Fast | Interactive queries, ETL |
Zlib | High | Slow | Medium | Archival, batch workloads |
Gzip | High | Slow | Slow | Legacy tools, export |
Bzip2 | Very High | Very Slow | Slow | Rarely used |
Use Snappy for performance, Zlib for space efficiency, and avoid Bzip2 for most scenarios.
Best Practices for Hive Compression
- Use ORC or Parquet formats instead of Text or CSV
- Prefer Snappy for performance and Zlib for compact storage
- Set compression settings globally in hive-site.xml or dynamically per table
- Use vectorized reads for ORC and Parquet tables
- Combine compression with partitioning and bucketing for optimal performance (see the sketch after this list)
- Periodically run compaction to avoid small files in transactional tables
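As a minimal sketch of the partitioning and bucketing recommendation above, here is a Snappy-compressed ORC table that is partitioned by date and bucketed by customer; the table and column names are assumptions for illustration:
-- Illustrative only: partition pruning and bucketed joins then operate on
-- already-compressed ORC data, so the benefits stack
CREATE TABLE sales_optimized (
  order_id    STRING,
  customer_id STRING,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");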
Conclusion
Data compression in Hive is a powerful tool for managing large datasets efficiently. Whether you’re storing logs, transactions, or aggregated metrics, using the right file format and compression codec can greatly reduce your storage costs and speed up analytics.
By following best practices and understanding the trade-offs, you can design a Hive-based data warehouse that is both cost-effective and high-performing.