Big data comes with big storage requirements. As data warehouses scale to handle petabytes of information, efficient storage and fast retrieval become critical. Hive offers a variety of data compression techniques to address these challenges — reducing storage costs and speeding up query performance.

In this guide, we’ll explore how to use compression in Apache Hive, understand supported formats and codecs, and learn best practices for applying compression in real-world analytics workloads.


Why Use Compression in Hive?

Benefits of data compression:

  • Reduced HDFS storage footprint
  • Lower I/O costs during query execution
  • Faster data transfer over the network
  • Better performance when combined with vectorized execution

Compression is especially useful for large columnar datasets stored in ORC or Parquet formats.


File Formats and Compression Compatibility

Hive supports several file formats, and each works with a different set of compression codecs.

Format   | Compression Support     | Recommended Codec
ORC      | Native (Zlib, Snappy)   | Zlib or Snappy
Parquet  | Native (Snappy, Gzip)   | Snappy
Text     | Gzip, Bzip2             | Gzip
Avro     | Deflate, Snappy         | Snappy
RCFile   | Gzip, Bzip2, Snappy     | Snappy

For analytics, ORC with Zlib or Snappy is generally the best combination for Hive performance.
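
As a quick illustration, an existing text-format table can be rewritten into ORC with Snappy through a CREATE TABLE ... AS SELECT; sales_text and sales_orc below are hypothetical names:

CREATE TABLE sales_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS
SELECT * FROM sales_text;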


Compressing ORC Tables

ORC supports built-in lightweight compression. Use Snappy for speed or Zlib for better compression ratios.

CREATE TABLE logs_orc (
  user_id STRING,
  action STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

Alternatively, set compression defaults for the whole session (or permanently in hive-site.xml):

SET hive.exec.compress.output=true;
SET hive.exec.orc.default.compress=SNAPPY;

Other valid values: ZLIB, NONE
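
Tables created without an explicit "orc.compress" property then pick up this session default. A minimal sketch (archive_logs is a hypothetical name):

CREATE TABLE archive_logs (
  user_id STRING,
  action STRING,
  event_time TIMESTAMP
)
STORED AS ORC;
-- Data written to this table uses the session default codec (SNAPPY above)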


Compressing Parquet Tables

Parquet also supports built-in compression. To use Snappy:

CREATE TABLE user_parquet (
  user_id STRING,
  name STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");

Or set it for the session (or globally in hive-site.xml):

SET parquet.compression=SNAPPY;

Parquet with Snappy is ideal for fast analytics and BI workloads using Presto or Hive.
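
For example, loading the table above from a hypothetical staging table (users_staging) writes Snappy-compressed Parquet files:

INSERT INTO TABLE user_parquet
SELECT user_id, name
FROM users_staging;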


Compressing Text-Based Tables

For legacy systems using text files:

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

Then create the table and insert:

CREATE TABLE raw_logs (
  line STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

With these settings in place, inserts performed during ETL write Gzip-compressed files, as in the sketch below.
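
A minimal sketch, assuming a hypothetical staging table raw_logs_staging with the same single-column layout:

INSERT OVERWRITE TABLE raw_logs
SELECT line FROM raw_logs_staging;

With the settings above, the resulting part files in HDFS typically carry a .gz extension.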


Compressing Output from Hive Queries

Enable compression for Hive query results (e.g., when writing to HDFS or exporting):

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

To write compressed results into another table:

INSERT OVERWRITE TABLE compressed_sales
SELECT * FROM sales;

Make sure the target table is defined with a format and codec that support compression, for example:
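
A sketch of such a target table; the columns are assumed for illustration, and only the storage format and codec matter here:

CREATE TABLE compressed_sales (
  order_id STRING,
  amount DOUBLE,
  sale_date DATE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

With this definition, the INSERT OVERWRITE above writes Snappy-compressed ORC files regardless of how the source table is stored.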


Compression and Performance Trade-offs

Codec    | Compression Ratio | Compression Speed | Decompression Speed | Use Case
Snappy   | Medium            | Very fast         | Very fast           | Interactive queries, ETL
Zlib     | High              | Slow              | Medium              | Archival, batch workloads
Gzip     | High              | Slow              | Slow                | Legacy tools, export
Bzip2    | Very high         | Very slow         | Slow                | Rarely used

Use Snappy for performance, Zlib for space efficiency, and avoid Bzip2 for most scenarios.
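
In practice this means choosing the codec per table based on access patterns; the two tables below are hypothetical examples:

-- Frequently queried table: favor speed
CREATE TABLE events_hot (id STRING, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Cold, rarely queried archive: favor compression ratio
CREATE TABLE events_archive (id STRING, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");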


Best Practices for Hive Compression

  • Use ORC or Parquet formats instead of Text or CSV
  • Prefer Snappy for performance and Zlib for compact storage
  • Set compression settings globally in hive-site.xml or dynamically per table
  • Use vectorized reads for ORC and Parquet tables
  • Combine compression with partitioning and bucketing for optimal performance
  • Periodically run compaction on transactional tables to avoid an accumulation of small files (see the sketch after this list)
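
A combined sketch of the last two points, assuming a Hive 3.x ACID setup; the table name, columns, and partition value are hypothetical:

CREATE TABLE logs_partitioned (
  user_id STRING,
  action STRING
)
PARTITIONED BY (event_date DATE)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "transactional" = "true"
);

-- Merge the small delta files produced by ACID writes into larger base files
ALTER TABLE logs_partitioned PARTITION (event_date = '2024-01-01') COMPACT 'major';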

Conclusion

Data compression in Hive is a powerful tool for managing large datasets efficiently. Whether you’re storing logs, transactions, or aggregated metrics, using the right file format and compression codec can greatly reduce your storage costs and speed up analytics.

By following best practices and understanding the trade-offs, you can design a Hive-based data warehouse that is both cost-effective and high-performing.