Online Analytical Processing (OLAP) systems are essential for powering business intelligence dashboards, drill-down analytics, and reporting workflows. Traditionally, OLAP cubes were built on dedicated systems like SSAS or Oracle OLAP. Today, Apache Hive makes it possible to build scalable OLAP cubes on big data using distributed storage and compute.

This blog post explains how to design and implement OLAP cubes using Hive, focusing on data modeling, cube materialization, aggregation techniques, and query performance optimization for large-scale analytical workloads.


What Is an OLAP Cube?

An OLAP cube is a multidimensional data structure optimized for complex analytical queries, such as:

  • Total sales by region, product, and month
  • Average session time by user segment and device
  • Drill-down analysis on time and geography dimensions

OLAP cubes typically involve:

  • Dimensions: Entities for slicing and dicing (e.g., date, product, region)
  • Facts: Metrics being analyzed (e.g., revenue, clicks)
  • Hierarchies: Nested levels for aggregation (e.g., year > quarter > month)

Why Use Hive for OLAP?

Apache Hive offers key advantages for OLAP on big data:

  • SQL interface for ad hoc queries
  • Scalable storage with HDFS or Amazon S3
  • Support for complex joins and aggregations
  • Integration with BI tools via JDBC
  • Compatibility with materialized views and pre-aggregated tables

With proper optimization, Hive can handle OLAP workloads over billions of rows efficiently.


Designing a Star Schema in Hive

Start with a star schema: a central fact table linked to multiple dimension tables.

-- Fact table
CREATE TABLE fact_sales (
product_id INT,
store_id INT,
time_id DATE,
units_sold INT,
revenue DOUBLE
)
STORED AS ORC;

-- Dimension tables
CREATE TABLE dim_product (
product_id INT,
product_name STRING,
category STRING
);

CREATE TABLE dim_store (
store_id INT,
region STRING,
store_type STRING
);

CREATE TABLE dim_time (
time_id DATE,
year INT,
month INT,
day INT
);

Store all tables in ORC or Parquet format with compression for performance.


Creating Pre-Aggregated OLAP Cubes

You can manually materialize cube aggregations:

CREATE TABLE cube_sales_region_month AS
SELECT
d.region,
t.year,
t.month,
SUM(f.revenue) AS total_revenue,
SUM(f.units_sold) AS total_units
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id
JOIN dim_time t ON f.time_id = t.time_id
GROUP BY d.region, t.year, t.month;

Partition by time for better pruning:

PARTITIONED BY (year INT, month INT)
STORED AS ORC;

This cube can now be queried directly for fast dashboard loads.


Using Hive Materialized Views

Hive supports incrementally maintained materialized views (Hive 3.0+):

CREATE MATERIALIZED VIEW cube_product_region
PARTITIONED ON (year, month)
AS
SELECT
p.category,
d.region,
t.year,
t.month,
SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_store d ON f.store_id = d.store_id
JOIN dim_time t ON f.time_id = t.time_id
GROUP BY p.category, d.region, t.year, t.month;

Rebuild or refresh as needed:

ALTER MATERIALIZED VIEW cube_product_region REBUILD;

Materialized views reduce query complexity and improve performance for repetitive queries.


Enabling Cube Queries with ROLLUP and GROUPING SETS

Hive supports GROUPING SETS and ROLLUP for cube-like aggregations:

SELECT
region,
product_name,
SUM(revenue) AS total_revenue
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY GROUPING SETS (
(region),
(product_name),
(region, product_name)
);

This generates multiple aggregations in a single pass.


Query Optimization for OLAP Workloads

Apply the following settings and strategies:

SET hive.vectorized.execution.enabled = true;
SET hive.cbo.enable = true;
SET hive.exec.parallel = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Other optimizations:

  • Use partitioning on time dimensions (year, month)
  • Use bucketing on high-cardinality dimensions (e.g., product_id)
  • Avoid small files — use compaction or combine outputs
  • Leverage LLAP or Tez for interactive query performance

Integrating OLAP Cubes with BI Tools

Expose OLAP cubes to dashboards by:

  • Connecting BI tools (Superset, Tableau, Power BI) via Hive JDBC/ODBC
  • Creating semantic layers in tools with cube hierarchies
  • Pre-aggregating data at multiple levels (daily, weekly, monthly)
  • Setting up refresh schedules using Apache Oozie, Airflow, or cron

Ensure data freshness aligns with business needs (e.g., hourly or daily cube refreshes).


Best Practices

  • Use star schema with dimension normalization
  • Avoid unnecessary joins in cubes — pre-flatten if needed
  • Store fact tables in ORC format with Snappy compression
  • Partition cubes by time for pruning
  • Enable materialized views for frequent aggregations
  • Cache hot queries using LLAP

Conclusion

Apache Hive provides a powerful foundation for building scalable OLAP cubes on large datasets. With a combination of proper schema design, cube materialization, and query optimization, you can deliver low-latency, multidimensional analytics for reporting and dashboarding.

As cloud-native and open-source data warehouses continue to evolve, Hive remains a reliable choice for organizations seeking robust, SQL-driven analytics over massive data volumes.