Modern big data pipelines often ingest nested, semi-structured data serialized as JSON, Avro, or Parquet. Hive, a powerful data warehousing tool on Hadoop, supports complex and nested data types, including ARRAY, MAP, and STRUCT.

Understanding how to define, load, and query these types effectively can simplify schema design and enhance performance when working with log data, event streams, and hierarchical records.

In this post, we’ll explore how to handle nested and complex data types in Hive, including practical examples of querying and flattening deeply nested structures.


Overview of Complex Data Types in Hive

Hive supports the following complex data types:

  • ARRAY – A collection of ordered elements of the same type
  • MAP – A key-value pair collection with unique keys
  • STRUCT – A group of fields, each with its own name and type

These types can be nested recursively, allowing you to model real-world data structures.
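
You can also build complex values inline with Hive's array(), map(), and named_struct() constructors, a quick way to experiment before defining any tables (a minimal sketch; the aliases are arbitrary, and Hive 0.13+ allows SELECT without a FROM clause):

SELECT
  array(1, 2, 3) AS an_array,
  map('k1', 'v1', 'k2', 'v2') AS a_map,
  named_struct('id', 42, 'name', 'ada') AS a_struct;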


Defining Complex Data Types in Hive Tables

Here’s how to create a Hive table with complex columns:

CREATE TABLE user_activity (
  user_id STRING,
  sessions ARRAY<STRUCT<session_id:STRING, duration:INT>>,
  metadata MAP<STRING, STRING>
)
STORED AS PARQUET;

You can also embed nested structs:

CREATE TABLE orders (
  order_id STRING,
  customer STRUCT<id:STRING, name:STRING, contact:STRUCT<email:STRING, phone:STRING>>,
  items ARRAY<STRUCT<sku:STRING, quantity:INT>>
);

Loading Nested Data

Nested data is typically stored in formats like Parquet, ORC, or Avro, which support complex types natively. LOAD DATA moves files into the table's storage location as-is, so the file's schema must match the table definition:

LOAD DATA INPATH '/path/to/data.parquet' INTO TABLE user_activity;

For plain text files, Hive's default SerDe can parse nested structures when you declare delimiters with ROW FORMAT DELIMITED; for formats such as JSON, use a dedicated SerDe (covered below).
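
As a minimal sketch (hypothetical table name and file layout), a text-backed table declares a delimiter for each nesting level:

CREATE TABLE user_tags (
  user_id STRING,
  tags ARRAY<STRING>,
  metadata MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

With this layout, a line like u1<TAB>news,sports<TAB>browser:chrome,os:linux loads as a two-element array and a two-entry map.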


Querying Structs

Access fields in a struct using dot notation:

SELECT customer.name, customer.contact.email FROM orders;

You can alias struct fields during selection:

SELECT customer.id AS customer_id, customer.contact.phone AS phone FROM orders;
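
Struct fields behave like ordinary columns elsewhere in the query, too; for example, filtering on a nested field (a sketch against the orders table above):

SELECT order_id
FROM orders
WHERE customer.contact.email IS NOT NULL;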

Working with Arrays

Use LATERAL VIEW explode() to flatten arrays:

SELECT user_id, s.session_id, s.duration
FROM user_activity
LATERAL VIEW explode(sessions) t AS s;

This transforms each element of the array into a separate row, which is useful for detailed analysis and aggregation. (Note that LATERAL VIEW requires a table alias, t above, before the column alias.)
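
For example, a sketch of per-user session statistics computed over the exploded rows (table and aliases as above):

SELECT user_id,
  COUNT(*) AS session_count,
  AVG(s.duration) AS avg_duration
FROM user_activity
LATERAL VIEW explode(sessions) t AS s
GROUP BY user_id;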


Handling Maps

Access values from a map using bracket notation:

SELECT metadata['browser'], metadata['location'] FROM user_activity;

To iterate over a map:

SELECT user_id, map_key, map_value
FROM user_activity
LATERAL VIEW explode(metadata) m AS map_key, map_value;
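
When you only need the keys or values, Hive's built-in map functions avoid the explode entirely; map_keys(), map_values(), and size() all accept a MAP column (a quick sketch):

SELECT user_id,
  size(metadata) AS entry_count,
  map_keys(metadata) AS meta_keys,
  map_values(metadata) AS meta_values
FROM user_activity;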

Nesting Arrays and Structs

Complex types can be deeply nested:

SELECT items[0].sku, items[0].quantity FROM orders;

Or flatten nested arrays of structs:

SELECT o.order_id, i.sku, i.quantity
FROM orders o
LATERAL VIEW explode(o.items) t AS i;

Be cautious with NULL values and empty arrays: a plain explode() skips those rows entirely. Use LATERAL VIEW OUTER when you need to keep them.
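
A sketch of the OUTER variant: an order with an empty or NULL items array is kept as one row with NULL struct fields instead of disappearing from the result.

SELECT o.order_id, i.sku, i.quantity
FROM orders o
LATERAL VIEW OUTER explode(o.items) t AS i;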


Using JSON Data with Complex Types

When dealing with JSON input, use a JSON SerDe to map raw JSON onto Hive complex types. A popular choice is the third-party org.openx.data.jsonserde.JsonSerDe (add its JAR to the session with ADD JAR first):

CREATE TABLE json_logs (
  id STRING,
  payload STRUCT<event:STRING, data:MAP<STRING, STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
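
Assuming one JSON document per line in the underlying files (the 'click' event below is a hypothetical example), nested fields are then queryable with the same dot and bracket notation:

-- Example input line: {"id":"1","payload":{"event":"click","data":{"page":"/home"}}}
SELECT id, payload.event, payload.data['page']
FROM json_logs
WHERE payload.event = 'click';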

This makes Hive a powerful tool for log analytics and event processing directly from raw JSON.


Best Practices

  • Use Parquet or ORC for storing complex types due to better performance and compression
  • Avoid deeply nested structures that are difficult to flatten or query
  • Normalize data if queries are complex and involve multiple nested levels
  • Use LATERAL VIEW judiciously; each explode() multiplies the row count of the scan and can become a performance bottleneck
  • Validate schema compatibility when ingesting Avro or JSON data

Conclusion

Working with complex and nested data types in Hive unlocks the ability to model rich, semi-structured data directly in your warehouse. Whether you’re processing logs, IoT events, or transactional metadata, Hive’s support for ARRAY, MAP, and STRUCT gives you the flexibility to store and analyze this data efficiently.

With the right storage formats and querying patterns, you can flatten, filter, and join nested structures at scale, turning raw data into actionable insights.