Recently, we’ve seen an uptick in the number of Snowplow users who opt for using Snowflake as an alternative storage target to Redshift. However, the way Snowplow data is structured in Snowflake differs from Redshift. This post is meant as a primer on what the key differences are and how to deal with them.
In Redshift, you have the ‘main’ atomic.events
table and a number of additional child tables in the the atomic
schema (such as atomic.com_snowplowanalytics_snowplow_link_click_1
) which you have to join back to atomic.events
.
By contrast, in Snowflake everything is in the atomic.events table
; with ‘shredded’ data populating its own columns in that table: one for each context or unstruct
event. New columns get added to atomic.events
automatically when you start tracking (and loading) a new self-describing (unstruct
) event or context.
Columns that contain data for contexts begin with contexts_
, followed by the name the shredded table would normally have in Redshift, eg contexts_com_snowplowanalytics_snowplow_web_page_1
. Columns that contain data for unstruct
events begin with unstruct_event_
, followed by the name the shredded table would normally have in Redshift, eg unstruct_event_com_snowplowanalytics_snowplow_submit_form_1
.
The data in these columns is stored as an object or array. It’s easy enough to query those columns:
SELECT
contexts_com_snowplowanalytics_snowplow_web_page_1
FROM
atomic.events
WHERE
collector_tstamp > DATEADD('day', -1, CURRENT_DATE())::DATE
LIMIT 5;
row# | CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1 |
---|---|
1 | [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ] |
2 | [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ] |
3 | [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ] |
4 | [ { “id”: “7bddb40b-10e0-4d2b-86af-95a63347361e” } ] |
5 | [ { “id”: “51a95cde-a42e-4ca4-890c-d7b1bfd24e70” } ] |
As you can see, each cell contains the full array of contexts for the event. With the web_page
context, we’re only sending one JSON, but for other contexts you may have multiple JSONs in the array.
Here is an example of what an unstruct
event looks like:
SELECT
unstruct_event_com_snowplowanalytics_snowplow_link_click_1
FROM
atomic.events
WHERE
unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
LIMIT 1;
row# | UNSTRUCT_EVENT_COM_SNOWPLOWANALYTICS_SNOWPLOW_LINK_CLICK_1 |
---|---|
1 | { “elementClasses”: [ “icon”, “btn-flat” ], “elementId”: “search-button”, “elementTarget”: “”, “targetUrl”: “https://discourse.snowplow.io/search” } |
In this case the value of the cell is a single JSON rather than an array.
In analysing the data, you will want to extract specific elements from the JSON and ‘join’ them to the data in the rest of the atomic.events
table. To do that, you can call each element of the JSON by name.
Extracting the individual elements of the JSON works differently when the value is a JSON and when it is an array.
Let’s first take a look at the simpler case, in which we have a JSON. If we want to get each one of the fields from the example above, we can do it like this:
SELECT
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses AS element_classes,
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId AS element_id,
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementTarget AS element_target,
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:targetUrl AS target_url
FROM
atomic.events
WHERE
unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
LIMIT 1;
row# | ELEMENT_CLASSES | ELEMENT_ID | ELEMENT_TARGET | TARGET_URL |
---|---|---|---|---|
1 | [ “icon”, “btn-flat” ] | “search-button” | “” | “https://discourse.snowplow.io/search” |
To get the individual elements of the element_classes
array, you would run a query like this one:
SELECT
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses[0] AS element_classes_1,
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementClasses[1] AS element_classes_2,
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId AS element_id,
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementTarget AS element_target,
unstruct_event_com_snowplowanalytics_snowplow_link_click_1:targetUrl AS target_url
FROM
atomic.events
WHERE
unstruct_event_com_snowplowanalytics_snowplow_link_click_1 IS NOT NULL
AND unstruct_event_com_snowplowanalytics_snowplow_link_click_1:elementId = 'search-button'
LIMIT 1;
row# | ELEMENT_CLASSES_1 | ELEMENT_CLASSES_2 | ELEMENT_ID | ELEMENT_TARGET | TARGET_URL |
---|---|---|---|---|---|
1 | “icon” | “btn-flat” | “search-button” | “” | “https://discourse.snowplow.io/search” |
For arrays of JSONs, you can use Snowflake’s FLATTEN
function, which works similarly to JOIN
but within the same table (no join key required):
SELECT
e.event_id,
wp.value:id AS web_page_id
FROM
atomic.events e, -- Note the comma
LATERAL FLATTEN(INPUT => contexts_com_snowplowanalytics_snowplow_web_page_1) wp
LIMIT 5;
row# | EVENT_ID | WEB_PAGE_ID |
---|---|---|
1 | 13fc90f6-a7a1-4b2e-b966-616e175d8c73 | “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a” |
2 | db565e88-0979-49a0-a2a1-4487ef2af13a | “1982ff91-4078-49ec-be92-8b032faaf774” |
3 | 21852e22-d02a-430a-b65b-da302b43b053 | “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a” |
4 | 35f21fca-86e3-40d4-8594-9a1a47d78cad | “9750934c-f3b4-4049-bf21-4ab7a3b5cf8a” |
5 | a3b97738-0008-4e8b-8694-e10bd5004d71 | “1982ff91-4078-49ec-be92-8b032faaf774” |
You can find the full documentation for the FLATTEN
function here.