Missing context sources when setting up the Snowplow web dbt package on Snowflake with Snowpipe

Hello People,
I am trying to set up the following snowplow web dbt package with Snowflake.
https://hub.getdbt.com/snowplow/snowplow_web/latest/

From what I have understood so far, this package works out of the box only if the data in Snowflake was loaded using the Snowflake loader.
I am loading the data using Snowpipe, so all I have is the "events" table as a source, and the following sources are not directly available:

com_snowplowanalytics_snowplow_web_page_1
com_snowplowanalytics_snowplow_ua_parser_context_1
nl_basjes_yauaa_context_1

However, when we get enriched events, I see that we have columns like "contexts" and "derived_contexts" that actually contain the context information needed to build these sources.

Example:
I have the following information in the "contexts" column that I could parse to build com_snowplowanalytics_snowplow_web_page_1:

{
  "data": {
    "id": "aed6ee27-9a84-40d0-b60a-8ce83711e500"
  },
  "schema": "iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0"
}

Similarly, I have the following in the "derived_contexts" column that I could parse to build nl_basjes_yauaa_context_1:
{
  "data": {
    "agentClass": "Browser Webview",
    "agentName": "Chrome Webview",
    "agentNameVersion": "Chrome Webview 99.0.4844.73",
    "agentNameVersionMajor": "Chrome Webview 99",
    "agentVersion": "99.0.4844.73",
    "agentVersionMajor": "99",
    "deviceBrand": "Oppo",
    "deviceClass": "Phone",
    "deviceName": "Oppo CPH1819",
    "layoutEngineClass": "Browser",
    "layoutEngineName": "Blink",
    "layoutEngineNameVersion": "Blink 99.0",
    "layoutEngineNameVersionMajor": "Blink 99",
    "layoutEngineVersion": "99.0",
    "layoutEngineVersionMajor": "99",
    "operatingSystemClass": "Mobile",
    "operatingSystemName": "Android",
    "operatingSystemNameVersion": "Android 10",
    "operatingSystemNameVersionMajor": "Android 10",
    "operatingSystemVersion": "10",
    "operatingSystemVersionBuild": "QP1A.190711.020",
    "operatingSystemVersionMajor": "10",
    "webviewAppName": "Facebook App for Android",
    "webviewAppNameVersionMajor": "Facebook App for Android 358",
    "webviewAppVersion": "358.0.0.34.117",
    "webviewAppVersionMajor": "358"
  },
  "schema": "iglu:nl.basjes/yauaa_context/jsonschema/1-0-2"
}

Similarly, I have the following information to build com_snowplowanalytics_snowplow_ua_parser_context_1 as well:
{
  "data": [
    {
      "data": {
        "deviceFamily": "Generic Smartphone",
        "osFamily": "Android",
        "osMajor": "10",
        "osMinor": null,
        "osPatch": null,
        "osPatchMinor": null,
        "osVersion": "Android 10",
        "useragentFamily": "Facebook",
        "useragentMajor": "358",
        "useragentMinor": "0",
        "useragentPatch": "0",
        "useragentVersion": "Facebook 358.0.0"
      },
      "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0"
    }
  ]
}
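
To illustrate the approach I have in mind, here is a rough sketch of the kind of query I would write to build the web_page source from the "contexts" column (the "derived_contexts" column would be handled the same way). The table and column names are just my assumptions:

-- Rough sketch only: parse the web_page context out of the raw "contexts" column.
-- Assumes the events table is atomic.events and that "contexts" holds the
-- self-describing JSON wrapper: {"schema": ..., "data": [{"schema": ..., "data": ...}, ...]}
select
    e.event_id,
    ctx.value:data:id::varchar as web_page_id
from atomic.events e,
    lateral flatten(input => parse_json(e.contexts):data) ctx
where ctx.value:schema::varchar like 'iglu:com.snowplowanalytics.snowplow/web_page/%'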

2 questions:

  1. Does it make sense to take this approach of building these sources myself from these context columns to make the package work?
  2. The only thing I don't understand is that all of these sources have the following columns, and I have no idea where they come from:
    root_id
    root_tstamp
    ref_root
    ref_tree
    ref_parent

root_id seems to be the actual event_id of the corresponding event for that row. What are the other columns, and where can I fetch them from?

Hi there, welcome to the community!

this package works out of the box only if the data in Snowflake was loaded using the Snowflake loader.

This will be quite a big challenge. The loader does a lot of work in formatting the data to be more usable, and the models are built for that input.

  1. Does it make sense to take this approach of building these sources myself from these context columns to make the package work?

Honestly, it doesn’t really make sense to me.

It is possible, but it would require manually implementing in SQL what the open-source Snowflake loader already does in Scala.

It's definitely less efficient to do it in SQL, and Snowflake charges by compute, so you're paying extra for it too.

Seems like a lot of work to be honest.

Regarding your actual plan for working with the data - I think this ticket outlines the task and the issues you might encounter. We also have a blog post due to be released on this topic soon (in the next couple of days, I think, so good timing!).

  2. The only thing I don't understand is that all of these sources have the following columns, and I have no idea where they come from:
    root_id
    root_tstamp
    ref_root
    ref_tree
    ref_parent

These are all fields that are set by the Redshift loader. If they're in your data then you've got yet more challenges - the formats of data in Redshift and Snowflake are very different, because those databases are very different.
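
For reference, in the Redshift model those columns are what let each shredded context table join back to its parent event, roughly like this (schema and table names here are illustrative):

-- Illustrative only: in the Redshift shredded model, child context tables join back
-- to atomic.events on root_id = event_id and root_tstamp = collector_tstamp.
select
    e.event_id,
    wp.id as web_page_id
from atomic.events e
join atomic.com_snowplowanalytics_snowplow_web_page_1 wp
    on wp.root_id = e.event_id
   and wp.root_tstamp = e.collector_tstamp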

This makes the request quite confusing to be honest… It seems like the proposed architecture is:

Enriched stream → Redshift Loader → Redshift → Snowpipe → Snowflake → Heavy SQL processing to transform the data → Data model

Whereas the architecture only needs to be:

Enriched stream → Snowflake Loader → Snowflake → Data model


I know this is mostly echoing the sentiment from @Colm above, but I'll include it anyway.

No - I wouldn't do this unless there is a very specific requirement that prevents you from using the loader. The dbt package will work out of the box on Snowflake as long as the loader has been used. If you use Snowpipe, you will lose a sizeable amount of functionality that the Snowflake loader provides, including the formatting and column additions.

The 'default' models in the repo apply to Redshift, so these columns should only be in that model. For Snowflake, the models are in a separate folder per model, e.g. for base events.


The blog is out:

https://snowplowanalytics.com/blog/2022/04/27/snowplows-rdb-loader-or-snowflakes-snowpipe/