JSON is everywhere in modern data infrastructure. It comes from APIs, event tracking systems, product analytics tools, webhooks, and application logs. It is flexible, human readable, and handles nested and variable structure elegantly. It is also, from the perspective of a SQL analyst trying to run aggregations on it, occasionally a nightmare.
The problem is not that JSON is bad. It is that SQL was designed for tabular data where every row has the same columns and every column has a consistent type. JSON data is often deeply nested, inconsistently structured across records, and contains arrays of objects that do not map cleanly to rows and columns without some transformation work.
Modern SQL engines have responded to this challenge with increasingly powerful JSON functions that let you extract values, flatten arrays, filter based on nested properties, and aggregate across JSON fields without leaving SQL. BigQuery, PostgreSQL, Snowflake, and DuckDB all have mature JSON support with slightly different syntax and capabilities.
This guide walks through how to analyze JSON data in SQL from the basics of extracting a single field all the way through flattening nested arrays, handling variable structure, and building useful aggregations from semi-structured data. Examples are provided in multiple dialects so you can apply them in whichever warehouse you use.
Understanding the JSON Data Model in SQL
Before getting into specific functions it helps to understand how SQL engines represent JSON data internally and what operations they support.
Most SQL engines store JSON as either a native JSON or JSONB type, as in PostgreSQL, or as a variant or string column that is understood to contain JSON, as in Snowflake which uses the VARIANT type and BigQuery which uses STRING or JSON columns. DuckDB has a native JSON type and also supports reading JSON files directly as tables.
Regardless of how the engine stores JSON, the core operations you need are the same. You need to extract scalar values from specific paths within a JSON document. You need to extract nested objects. You need to flatten arrays of values or objects into multiple rows. You need to handle cases where a key is missing or null. And you need to aggregate across JSON fields the same way you would aggregate across regular columns.
The path syntax for navigating JSON structure is consistent across most engines with minor variations. A path like $.user.address.city means start at the root of the JSON document, navigate to the user key, then to address, then to city. Array elements are accessed with bracket notation like $.items[0] for the first element of an array.
Sample Data Setup
The examples throughout this guide use a table called events with a JSON column called properties. This represents a common pattern in product analytics where events are logged with a flexible properties payload.
sql
-- PostgreSQL / DuckDB
CREATE TABLE events (
event_id BIGINT,
event_type VARCHAR,
user_id BIGINT,
occurred_at TIMESTAMP,
properties JSON
);
INSERT INTO events VALUES
(1, 'page_view', 101, '2024-01-15 09:00:00',
'{"page": "/home", "referrer": "google", "session_id": "abc123", "device": {"type": "mobile", "os": "iOS"}}'),
(2, 'purchase', 101, '2024-01-15 09:45:00',
'{"order_id": "ORD-001", "amount": 149.99, "currency": "USD", "items": [{"sku": "SHOE-42", "qty": 1, "price": 149.99}]}'),
(3, 'page_view', 102, '2024-01-15 10:00:00',
'{"page": "/products", "referrer": "direct", "session_id": "def456", "device": {"type": "desktop", "os": "Windows"}}'),
(4, 'purchase', 103, '2024-01-15 10:30:00',
'{"order_id": "ORD-002", "amount": 89.50, "currency": "USD", "items": [{"sku": "HAT-01", "qty": 2, "price": 29.99}, {"sku": "BELT-05", "qty": 1, "price": 29.52}]}');
Extracting Scalar Values From JSON
The most basic JSON operation is extracting a single value at a known path. Every engine has functions for this though the syntax differs.
PostgreSQL
PostgreSQL offers two main operators for JSON extraction. The -> operator extracts a JSON object field and returns it as JSON. The ->> operator extracts a field and returns it as text. Use -> when you need to navigate deeper and ->> when you want the final scalar value.
sql
-- Extract page as text
SELECT
event_id,
properties->>'page' AS page,
properties->>'referrer' AS referrer
FROM events
WHERE event_type = 'page_view';
-- Navigate nested JSON: get device type
SELECT
event_id,
properties->'device'->>'type' AS device_type,
properties->'device'->>'os' AS operating_system
FROM events
WHERE event_type = 'page_view';
PostgreSQL also supports the #>> operator for extracting values at a path specified as an array of keys, which is cleaner for deeply nested structures:
sql
-- Equivalent to properties->'device'->>'type'
SELECT
event_id,
properties #>> '{device,type}' AS device_type
FROM events
WHERE event_type = 'page_view';
BigQuery
BigQuery uses JSON_VALUE to extract a scalar value as a string and JSON_QUERY to extract a JSON fragment. Path syntax uses the $ root notation.
sql
-- Extract scalar values
SELECT
event_id,
JSON_VALUE(properties, '$.page') AS page,
JSON_VALUE(properties, '$.referrer') AS referrer,
JSON_VALUE(properties, '$.device.type') AS device_type
FROM events
WHERE event_type = 'page_view';
-- Extract and cast to a specific type
SELECT
event_id,
CAST(JSON_VALUE(properties, '$.amount') AS FLOAT64) AS order_amount,
JSON_VALUE(properties, '$.currency') AS currency
FROM events
WHERE event_type = 'purchase';
Snowflake
Snowflake stores semi-structured data in VARIANT columns and uses colon notation to navigate JSON paths. The :: operator casts to a specific type.
sql
-- Extract values using colon notation
SELECT
event_id,
properties:page::VARCHAR AS page,
properties:referrer::VARCHAR AS referrer,
properties:device:type::VARCHAR AS device_type
FROM events
WHERE event_type = 'page_view';
-- Extract numeric values
SELECT
event_id,
properties:order_id::VARCHAR AS order_id,
properties:amount::FLOAT AS order_amount,
properties:currency::VARCHAR AS currency
FROM events
WHERE event_type = 'purchase';
DuckDB
DuckDB supports both PostgreSQL-style operators and a json_extract function family. It also supports reading JSON files directly.
sql
-- PostgreSQL-style operators work in DuckDB
SELECT
event_id,
properties->>'$.page' AS page,
properties->>'$.device.type' AS device_type
FROM events
WHERE event_type = 'page_view';
-- Or using json_extract functions
SELECT
event_id,
json_extract_string(properties, '$.page') AS page,
json_extract_string(properties, '$.device.type') AS device_type
FROM events
WHERE event_type = 'page_view';
Filtering Rows Based on JSON Values
Extracting values for display is useful but the more powerful capability is filtering, grouping, and aggregating based on JSON field values. Once you can extract a value from JSON you can use it anywhere a regular column value would appear.
sql
-- PostgreSQL: filter by a nested JSON value
SELECT
event_id,
user_id,
occurred_at
FROM events
WHERE event_type = 'page_view'
AND properties->>'referrer' = 'google';
-- BigQuery: filter and aggregate
SELECT
JSON_VALUE(properties, '$.device.type') AS device_type,
COUNT(*) AS page_views
FROM events
WHERE event_type = 'page_view'
GROUP BY 1
ORDER BY 2 DESC;
-- Snowflake: filter on nested value
SELECT
properties:device:type::VARCHAR AS device_type,
COUNT(*) AS page_views
FROM events
WHERE event_type = 'page_view'
GROUP BY 1
ORDER BY 2 DESC;
Numeric comparisons work the same way once you cast the extracted value to the appropriate type:
sql
-- PostgreSQL: find high-value purchases
SELECT
event_id,
user_id,
(properties->>'amount')::NUMERIC AS order_amount
FROM events
WHERE event_type = 'purchase'
AND (properties->>'amount')::NUMERIC > 100;
-- BigQuery equivalent
SELECT
event_id,
user_id,
CAST(JSON_VALUE(properties, '$.amount') AS FLOAT64) AS order_amount
FROM events
WHERE event_type = 'purchase'
AND CAST(JSON_VALUE(properties, '$.amount') AS FLOAT64) > 100;
Aggregating JSON Fields
Once JSON values are extracted and cast to the right types, aggregation functions work exactly as they do on regular columns.
sql
-- PostgreSQL: revenue analysis from JSON
SELECT
DATE_TRUNC('day', occurred_at) AS event_date,
COUNT(*) AS total_purchases,
SUM((properties->>'amount')::NUMERIC) AS total_revenue,
AVG((properties->>'amount')::NUMERIC) AS avg_order_value,
MAX((properties->>'amount')::NUMERIC) AS max_order_value
FROM events
WHERE event_type = 'purchase'
GROUP BY 1
ORDER BY 1;
-- BigQuery: device breakdown for page views
SELECT
JSON_VALUE(properties, '$.device.type') AS device_type,
JSON_VALUE(properties, '$.device.os') AS operating_system,
COUNT(*) AS sessions,
COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE event_type = 'page_view'
GROUP BY 1, 2
ORDER BY 3 DESC;
-- Snowflake: revenue by currency
SELECT
properties:currency::VARCHAR AS currency,
COUNT(*) AS order_count,
SUM(properties:amount::FLOAT) AS total_revenue
FROM events
WHERE event_type = 'purchase'
GROUP BY 1;
Flattening JSON Arrays Into Rows
The most complex common JSON operation is flattening arrays. When a JSON field contains an array, like the items array in purchase events, each element of the array needs to become its own row for item-level analysis.
PostgreSQL: json_array_elements
sql
-- Flatten the items array in purchase events
SELECT
e.event_id,
e.user_id,
e.occurred_at,
item->>'sku' AS sku,
(item->>'qty')::INTEGER AS quantity,
(item->>'price')::NUMERIC AS unit_price
FROM events e,
json_array_elements(e.properties->'items') AS item
WHERE e.event_type = 'purchase';
The json_array_elements function takes a JSON array and returns one row per element. The lateral join (the comma syntax is implicit lateral in PostgreSQL) connects each element back to the parent row.
Result:
event_id | user_id | occurred_at | sku | quantity | unit_price
---------|---------|---------------------|----------|----------|----------
2 | 101 | 2024-01-15 09:45:00 | SHOE-42 | 1 | 149.99
4 | 103 | 2024-01-15 10:30:00 | HAT-01 | 2 | 29.99
4 | 103 | 2024-01-15 10:30:00 | BELT-05 | 1 | 29.52
Once flattened, item-level aggregations become straightforward:
sql
-- Most purchased SKUs
SELECT
item->>'sku' AS sku,
SUM((item->>'qty')::INTEGER) AS total_units_sold,
SUM((item->>'qty')::INTEGER * (item->>'price')::NUMERIC) AS total_revenue
FROM events e,
json_array_elements(e.properties->'items') AS item
WHERE e.event_type = 'purchase'
GROUP BY 1
ORDER BY 2 DESC;
BigQuery: JSON_QUERY_ARRAY and UNNEST
sql
-- Flatten items array in BigQuery
SELECT
e.event_id,
e.user_id,
JSON_VALUE(item, '$.sku') AS sku,
CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity,
CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS unit_price
FROM events e,
UNNEST(JSON_QUERY_ARRAY(e.properties, '$.items')) AS item
WHERE e.event_type = 'purchase';
Snowflake: FLATTEN
Snowflake uses the FLATTEN table function to explode arrays into rows. It is more explicit than PostgreSQL’s lateral join syntax but equally powerful.
sql
-- Flatten items array in Snowflake
SELECT
e.event_id,
e.user_id,
e.occurred_at,
item.value:sku::VARCHAR AS sku,
item.value:qty::INTEGER AS quantity,
item.value:price::FLOAT AS unit_price
FROM events e,
LATERAL FLATTEN(INPUT => e.properties:items) AS item
WHERE e.event_type = 'purchase';
DuckDB: UNNEST
sql
-- Flatten items array in DuckDB
SELECT
e.event_id,
e.user_id,
json_extract_string(item, '$.sku') AS sku,
CAST(json_extract_string(item, '$.qty') AS INTEGER) AS quantity,
CAST(json_extract_string(item, '$.price') AS DECIMAL) AS unit_price
FROM events e,
UNNEST(json_extract(e.properties, '$.items')) AS t(item)
WHERE e.event_type = 'purchase';
Handling Missing and Null JSON Fields
Real JSON data is inconsistently structured. Some records have fields that others do not. A robust JSON query handles missing fields gracefully rather than failing or silently dropping rows.
sql
-- PostgreSQL: handle missing fields with COALESCE
SELECT
event_id,
user_id,
COALESCE(properties->>'referrer', 'unknown') AS referrer,
COALESCE(properties->>'session_id', 'no_session') AS session_id
FROM events
WHERE event_type = 'page_view';
-- Check whether a key exists before extracting
SELECT
event_id,
CASE
WHEN properties ? 'referrer'
THEN properties->>'referrer'
ELSE 'direct'
END AS referrer_source
FROM events
WHERE event_type = 'page_view';
-- BigQuery: JSON_VALUE returns NULL for missing keys, use COALESCE
SELECT
event_id,
COALESCE(JSON_VALUE(properties, '$.referrer'), 'direct') AS referrer_source
FROM events
WHERE event_type = 'page_view';
-- Snowflake: missing keys return NULL, use NVL
SELECT
event_id,
NVL(properties:referrer::VARCHAR, 'direct') AS referrer_source
FROM events
WHERE event_type = 'page_view';
The PostgreSQL ? operator checks whether a key exists in a JSON object, which is useful when you want to distinguish between a key that is present with a null value and a key that is absent entirely.
Converting JSON to Relational Columns With CTEs
For complex JSON analysis, it is often cleaner to extract all the fields you need in a CTE and then query the CTE as if it were a regular table. This avoids repeating long JSON extraction expressions throughout the query.
sql
-- PostgreSQL: extract purchase fields in a CTE
WITH purchase_details AS (
SELECT
event_id,
user_id,
occurred_at,
properties->>'order_id' AS order_id,
(properties->>'amount')::NUMERIC AS order_amount,
properties->>'currency' AS currency
FROM events
WHERE event_type = 'purchase'
),
page_view_details AS (
SELECT
event_id,
user_id,
occurred_at,
properties->>'page' AS page,
properties->>'referrer' AS referrer,
properties->'device'->>'type' AS device_type,
properties->'device'->>'os' AS operating_system
FROM events
WHERE event_type = 'page_view'
)
-- Now join and analyze as regular tables
SELECT
pv.referrer,
pv.device_type,
COUNT(DISTINCT pv.user_id) AS users_who_viewed,
COUNT(DISTINCT pd.user_id) AS users_who_purchased,
SUM(pd.order_amount) AS revenue_from_cohort
FROM page_view_details pv
LEFT JOIN purchase_details pd
ON pv.user_id = pd.user_id
AND pd.occurred_at > pv.occurred_at
GROUP BY 1, 2
ORDER BY 5 DESC NULLS LAST;
This pattern makes JSON-heavy queries significantly more readable and maintainable. The CTE layer handles the messy JSON extraction and the outer query reads cleanly against named, typed columns.
Reading JSON Files Directly in DuckDB
DuckDB has a particularly useful capability for analysts working with JSON files locally or in object storage. It can read JSON files directly as tables without any schema definition.
sql
-- Read a JSON file directly
SELECT * FROM read_json_auto('events.json');
-- Read multiple JSON files with a glob pattern
SELECT * FROM read_json_auto('events/*.json');
-- Read JSON from S3
SELECT * FROM read_json_auto('s3://my-bucket/events/2024-01-*.json');
-- DuckDB infers the schema automatically
-- You can inspect what it inferred
DESCRIBE SELECT * FROM read_json_auto('events.json');
DuckDB’s read_json_auto function infers the schema from the file contents and automatically handles nested objects and arrays. For rapid exploratory analysis of JSON data witout setting up a database, this is extraordinarily convenient.
sql
-- Full analysis pipeline on a JSON file in one query
WITH raw AS (
SELECT * FROM read_json_auto('events.json')
),
purchases AS (
SELECT
user_id,
occurred_at,
properties.order_id AS order_id,
properties.amount AS order_amount
FROM raw
WHERE event_type = 'purchase'
)
SELECT
DATE_TRUNC('day', occurred_at) AS event_date,
COUNT(*) AS orders,
SUM(order_amount) AS revenue
FROM purchases
GROUP BY 1
ORDER BY 1;
JSON Analysis Cheat Sheet
| Operation | PostgreSQL | BigQuery | Snowflake | DuckDB |
|---|---|---|---|---|
| Extract as text | col->>'key' | JSON_VALUE(col, '$.key') | col:key::VARCHAR | json_extract_string(col, '$.key') |
| Extract as JSON | col->'key' | JSON_QUERY(col, '$.key') | col:key | json_extract(col, '$.key') |
| Nested path | col->'a'->>'b' | JSON_VALUE(col, '$.a.b') | col:a:b::VARCHAR | json_extract_string(col, '$.a.b') |
| Array element | col->'arr'->0 | JSON_VALUE(col, '$.arr[0]') | col:arr[0] | json_extract(col, '$.arr[0]') |
| Flatten array | json_array_elements(col->'arr') | UNNEST(JSON_QUERY_ARRAY(col, '$.arr')) | LATERAL FLATTEN(col:arr) | UNNEST(json_extract(col, '$.arr')) |
| Key exists check | col ? 'key' | JSON_VALUE IS NOT NULL | col:key IS NOT NULL | json_extract(col, '$.key') IS NOT NULL |
| Cast to number | (col->>'amt')::NUMERIC | CAST(JSON_VALUE(col, '$.amt') AS FLOAT64) | col:amt::FLOAT | CAST(json_extract_string(col, '$.amt') AS DECIMAL) |
Common Mistakes When Querying JSON in SQL
Forgetting to cast extracted values before using them in calculations is the most common error. JSON extraction functions return text or JSON objects. Arithmetic on text columns either throws an error or produces unexpected results depending on the engine. Always cast to the appropriate numeric type before summing, averaging, or comparing numerically.
Using -> instead of ->> in PostgreSQL is a frequent source of confusion for beginners. The single arrow returns a JSON object wrapped in quotes. The double arrow returns the raw text value. Aggregating or comparing JSON objects instead of their text values produces wrong results silently in some cases.
Not handling missing keys produces queries that silently drop rows or return nulls where values are expected. Always consider whether a JSON field might be absent from some records and handle that case explicitly with COALESCE, CASE, or the key existence check appropriate for your engine.
Flattening arrays in a WHERE clause instead of a FROM clause is a structural mistake. The array flattening function needs to be in the FROM clause as a lateral join. Putting it in the WHERE clause either errors or does not produce the row explosion you intended.
Extracting the same JSON field multiple times in a complex query is an inefficiency and a readability problem. Use a CTE to extract fields once and reference the CTE throughout rather than repeating extraction expressions in every SELECT and WHERE clause.
FAQs
How do I extract a value from a JSON column in SQL?
The syntax depends on your database. In PostgreSQL use the ->> operator for text extraction, for example properties->>'page'. In BigQuery use JSON_VALUE(properties, '$.page'). In Snowflake use colon notation, properties:page::VARCHAR. In DuckDB use json_extract_string(properties, '$.page'). All of these navigate to the specified key and return its value as a string that you can then cast to the appropriate type.
How do I flatten a JSON array into rows in SQL?
Use the array expansion function for your database in a lateral join. In PostgreSQL use json_array_elements(col->'array_key') in the FROM clause. In BigQuery use UNNEST(JSON_QUERY_ARRAY(col, '$.array_key')). In Snowflake use LATERAL FLATTEN(INPUT => col:array_key). In DuckDB use UNNEST(json_extract(col, '$.array_key')). Each element of the array becomes a separate row that you can then extract fields from individually.
How do I handle missing JSON keys in SQL?
All major SQL engines return NULL when a JSON extraction path does not exist in a document. Wrap your extraction in COALESCE to replace nulls with a default value, or use a CASE expression to handle the missing case explicitly. In PostgreSQL you can also use the ? operator to check whether a key exists before extracting it.
Which database is best for querying JSON data in SQL?
All four databases covered in this guide have mature JSON support. PostgreSQL has the most expressive operator syntax and is excellent for complex nested queries. Snowflake’s VARIANT type and FLATTEN function are well-optimized for semi-structured data at scale. BigQuery’s JSON functions are clean and consistent. DuckDB’s ability to read JSON files directly without schema definition makes it the fastest option for exploratory analysis of local or object storage JSON files. The best choice depends on where your data already lives rather than JSON capability alone.
Should I store data as JSON in SQL or convert it to columns?
SON storage makes sense when the structure is genuinely variable across records or when the schema evolves frequently enough that managing migrations is painful. For fields that are present in every record and have a consistent type, storing them as regular columns is almost always better for query performance, type safety, and query readability. A common pattern is to store a JSON properties column for variable attributes while promoting consistently present fields to dedicated columns.