Cloud Storage Loader Output Scheme

mariah_rogers · July 30, 2021, 12:12am

Hello there!

I am trying to ingest our Snowplow events into Snowflake, with the full pipeline implemented in GCP (Scala Stream Collector → Stream Enrich PubSub → Cloud Storage Loader). Since the Snowflake Loader doesn’t currently support GCP, we are going to sink our stream into a bucket via the Cloud Storage Loader, and then load it into Snowflake via Snowpipe.

Can anyone explain what the output format will be when the data gets sunk into the bucket? Is it a super wide file? Shredded into atomic, context and custom event tables? CSV? TSV? Any clarification would be much appreciated!

Thanks so much!

mike · July 30, 2021, 12:32am

Yep, the data that comes out of enrichment is in a wide TSV format (one event per line, one tab-separated column per field). This format is consumable by any of the analytics SDKs as well as by the shredder process.
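If it helps to picture it, here is a minimal sketch of reading such a file by hand. It is only an illustration: the `COLUMNS` list is a hypothetical, heavily abbreviated stand-in (the real enriched format has many more columns, and their names and order should be taken from the Snowplow documentation), and `parse_enriched_lines` is a helper name made up for this example, not part of any SDK.

```python
# Hypothetical, abbreviated column list for illustration only.
# The real enriched TSV is much wider; take the actual column order
# from the Snowplow documentation rather than from this sketch.
COLUMNS = ["app_id", "platform", "collector_tstamp", "event", "event_id"]


def parse_enriched_lines(text, columns=COLUMNS):
    """Turn tab-separated lines into a list of dicts keyed by column name."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        # Split on tabs only; do not strip whitespace, because empty
        # trailing fields are represented by consecutive tabs.
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                f"line {line_number}: expected {len(columns)} fields, "
                f"got {len(fields)}"
            )
        # Empty strings stand for missing values in this sketch.
        rows.append(
            {name: (value if value != "" else None)
             for name, value in zip(columns, fields)}
        )
    return rows


if __name__ == "__main__":
    sample = "my-app\tweb\t2021-07-30 00:12:00\tpage_view\tabc-123\n"
    for row in parse_enriched_lines(sample):
        print(row["event"], row["event_id"])
```

In practice the analytics SDKs do this mapping for you (including the JSON columns), so a hand-rolled parser like this is mostly useful for a quick sanity check of what landed in the bucket.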

The Snowflake model has its own dedicated shredder that runs on Spark, but I imagine it’s probably portable to GCP with some changes. There’s also a stream shredder (not dependent on Spark), but that’s not in a production-ready state yet.
