Every interaction users make on a website or application — clicks, page views, purchases — leaves behind a trail of digital breadcrumbs known as clickstream data. Analyzing this data can uncover user behavior patterns, conversion paths, drop-off points, and more.

Apache Hive, with its SQL-like interface and compatibility with Hadoop-scale storage, is ideal for ingesting and analyzing large volumes of clickstream logs.

In this post, you’ll learn how to process raw clickstream data with Hive, design efficient schemas, perform sessionization, and build analytics pipelines to gain actionable insights.


What is Clickstream Data?

Clickstream data is a chronological series of user actions captured from websites, mobile apps, or digital products. Each record typically includes:

  • Timestamp
  • IP address
  • URL visited
  • Referrer
  • User agent
  • User/session ID
  • Action type (click, scroll, form submit, etc.)

This data is commonly stored in semi-structured log files or JSON/CSV formats, often in HDFS or cloud object storage.


Sample Clickstream Log Format

Here’s an example of raw JSON clickstream data:

{
"user_id": "abc123",
"session_id": "sess456",
"timestamp": "2024-11-16T08:45:00Z",
"url": "/product/987",
"referrer": "/category/shoes",
"event_type": "click",
"device": "mobile"
}

These logs are usually batched daily/hourly for analysis and stored in partitioned folders by date.


Creating Hive Tables for Clickstream Data

Create an external table pointing to the raw logs:

CREATE EXTERNAL TABLE raw_clickstream (
user_id STRING,
session_id STRING,
timestamp STRING,
url STRING,
referrer STRING,
event_type STRING,
device STRING
)
PARTITIONED BY (event_date STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/clickstream/raw/';

Then repair the table to load partitions:

MSCK REPAIR TABLE raw_clickstream;

Converting Timestamp and Creating a Session Table

Transform timestamps and create a structured session table in Parquet or ORC:

CREATE TABLE sessions (
user_id STRING,
session_id STRING,
event_time TIMESTAMP,
url STRING,
referrer STRING,
event_type STRING,
device STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

Ingest data:

INSERT INTO TABLE sessions PARTITION (event_date)
SELECT
user_id,
session_id,
from_utc_timestamp(timestamp, 'UTC') AS event_time,
url,
referrer,
event_type,
device,
substr(timestamp, 1, 10) AS event_date
FROM raw_clickstream;

Sessionization and Event Ordering

Analyze user behavior by grouping and ordering events by session:

SELECT
session_id,
user_id,
collect_list(url) AS pages_visited,
min(event_time) AS session_start,
max(event_time) AS session_end,
unix_timestamp(max(event_time)) - unix_timestamp(min(event_time)) AS session_duration
FROM sessions
WHERE event_date = '2024-11-16'
GROUP BY session_id, user_id;

This reveals how long users stayed and what they viewed.


Funnel Analysis Example

Want to know how many users moved from a product page to checkout?

WITH events AS (
SELECT user_id, session_id, url, event_time
FROM sessions
WHERE event_date = '2024-11-16'
)

SELECT COUNT(DISTINCT user_id) AS funnel_users
FROM (
SELECT user_id,
max(case when url like '%/product/%' then 1 else 0 end) AS viewed_product,
max(case when url like '%/checkout%' then 1 else 0 end) AS reached_checkout
FROM events
GROUP BY user_id
) t
WHERE viewed_product = 1 AND reached_checkout = 1;

Device and Traffic Source Breakdown

Break down clickstream events by device type or referrer:

SELECT device, count(*) AS events
FROM sessions
WHERE event_date = '2024-11-16'
GROUP BY device
ORDER BY events DESC;

Or analyze traffic sources:

SELECT referrer, count(DISTINCT user_id) AS unique_visitors
FROM sessions
WHERE event_date = '2024-11-16'
GROUP BY referrer
ORDER BY unique_visitors DESC;

Creating Aggregated Daily Summary Table

Pre-aggregate clickstream metrics for BI or dashboard use:

CREATE TABLE daily_click_summary (
event_date STRING,
total_users INT,
total_sessions INT,
avg_session_duration DOUBLE
)
STORED AS PARQUET;

INSERT OVERWRITE TABLE daily_click_summary
SELECT
event_date,
count(DISTINCT user_id) AS total_users,
count(DISTINCT session_id) AS total_sessions,
avg(unix_timestamp(max(event_time)) - unix_timestamp(min(event_time))) AS avg_session_duration
FROM sessions
GROUP BY event_date;

Best Practices

  • Use Parquet or ORC formats for storage efficiency and fast queries
  • Partition tables by event_date for pruning and performance
  • Use collect_list() for path analysis and funnel visualization
  • Anonymize or mask PII for privacy compliance
  • Automate ingestion and transformation using Airflow, Oozie, or Hive scripts

Conclusion

Clickstream data is a goldmine for understanding user journeys, optimizing conversion rates, and driving personalization. Hive makes it possible to analyze vast volumes of raw click logs using familiar SQL, enabling batch and near real-time analytics at scale.

By mastering clickstream analytics in Hive, you can empower your organization with data-driven decision-making powered by web and app behavior insights.