Building Scalable OLAP Cubes with Hive

Online Analytical Processing (OLAP) systems are essential for powering business intelligence dashboards, drill-down analytics, and reporting workflows. Traditionally, OLAP cubes were built on dedicated systems like SSAS or Oracle OLAP. Today, Apache Hive makes it possible to build scalable OLAP cubes on big data using distributed storage and compute.

This blog post explains how to design and implement OLAP cubes using Hive, focusing on data modeling, cube materialization, aggregation techniques, and query performance optimization for large-scale analytical workloads.

What Is an OLAP Cube?

An OLAP cube is a multidimensional data structure optimized for complex analytical queries, such as:

Total sales by region, product, and month
Average session time by user segment and device
Drill-down analysis on time and geography dimensions

OLAP cubes typically involve:

Dimensions: Entities for slicing and dicing (e.g., date, product, region)
Facts: Metrics being analyzed (e.g., revenue, clicks)
Hierarchies: Nested levels for aggregation (e.g., year > quarter > month)

Why Use Hive for OLAP?

Apache Hive offers key advantages for OLAP on big data:

SQL interface for ad hoc queries
Scalable storage with HDFS or Amazon S3
Support for complex joins and aggregations
Integration with BI tools via JDBC
Compatibility with materialized views and pre-aggregated tables

With proper optimization, Hive can handle OLAP workloads over billions of rows efficiently.

Designing a Star Schema in Hive

Start with a star schema: a central fact table linked to multiple dimension tables.

-- Fact table
CREATE TABLE fact_sales (
product_id INT,
store_id INT,
time_id DATE,
units_sold INT,
revenue DOUBLE
)
STORED AS ORC;

-- Dimension tables
CREATE TABLE dim_product (
product_id INT,
product_name STRING,
category STRING
);

CREATE TABLE dim_store (
store_id INT,
region STRING,
store_type STRING
);

CREATE TABLE dim_time (
time_id DATE,
year INT,
month INT,
day INT
);

Store all tables in ORC or Parquet format with compression for performance.

Creating Pre-Aggregated OLAP Cubes

You can manually materialize cube aggregations:

CREATE TABLE cube_sales_region_month AS
SELECT
d.region,
t.year,
t.month,
SUM(f.revenue) AS total_revenue,
SUM(f.units_sold) AS total_units
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id
JOIN dim_time t ON f.time_id = t.time_id
GROUP BY d.region, t.year, t.month;

Partition by time for better pruning:

PARTITIONED BY (year INT, month INT)
STORED AS ORC;

This cube can now be queried directly for fast dashboard loads.

Using Hive Materialized Views

Hive supports incrementally maintained materialized views (Hive 3.0+):

CREATE MATERIALIZED VIEW cube_product_region
PARTITIONED ON (year, month)
AS
SELECT
p.category,
d.region,
t.year,
t.month,
SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_store d ON f.store_id = d.store_id
JOIN dim_time t ON f.time_id = t.time_id
GROUP BY p.category, d.region, t.year, t.month;

Rebuild or refresh as needed:

ALTER MATERIALIZED VIEW cube_product_region REBUILD;

Materialized views reduce query complexity and improve performance for repetitive queries.

Enabling Cube Queries with ROLLUP and GROUPING SETS

Hive supports GROUPING SETS and ROLLUP for cube-like aggregations:

SELECT
region,
product_name,
SUM(revenue) AS total_revenue
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY GROUPING SETS (
(region),
(product_name),
(region, product_name)
);

This generates multiple aggregations in a single pass.

Query Optimization for OLAP Workloads

Apply the following settings and strategies:

SET hive.vectorized.execution.enabled = true;
SET hive.cbo.enable = true;
SET hive.exec.parallel = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Other optimizations:

Use partitioning on time dimensions (year, month)
Use bucketing on high-cardinality dimensions (e.g., product_id)
Avoid small files — use compaction or combine outputs
Leverage LLAP or Tez for interactive query performance

Integrating OLAP Cubes with BI Tools

Expose OLAP cubes to dashboards by:

Connecting BI tools (Superset, Tableau, Power BI) via Hive JDBC/ODBC
Creating semantic layers in tools with cube hierarchies
Pre-aggregating data at multiple levels (daily, weekly, monthly)
Setting up refresh schedules using Apache Oozie, Airflow, or cron

Ensure data freshness aligns with business needs (e.g., hourly or daily cube refreshes).

Best Practices

Use star schema with dimension normalization
Avoid unnecessary joins in cubes — pre-flatten if needed
Store fact tables in ORC format with Snappy compression
Partition cubes by time for pruning
Enable materialized views for frequent aggregations
Cache hot queries using LLAP

Conclusion

Apache Hive provides a powerful foundation for building scalable OLAP cubes on large datasets. With a combination of proper schema design, cube materialization, and query optimization, you can deliver low-latency, multidimensional analytics for reporting and dashboarding.

As cloud-native and open-source data warehouses continue to evolve, Hive remains a reliable choice for organizations seeking robust, SQL-driven analytics over massive data volumes.