Does Snowplow support real-time streaming into Snowflake on AWS?

Hi guys,

How are you? Just wondering, does Snowplow support real-time streaming into Snowflake on AWS? I am not certain whether Transformer Kinesis (Snowplow Documentation) is the relevant feature.

If it does, is it OK to ingest data into the same Kinesis streams from other places, like Lambda or anything else that can wire up with Kinesis (such as SNS via Kinesis Data Firehose subscriptions), to stitch data together before it finally goes into Snowflake?

Please correct me if I have misunderstood anything, and thanks in advance!

Hi @Yolanda_Ou, we support partial streaming, in that we “transform” data for loading into Snowflake and save it to S3 in near real time, but the actual loading operation into Snowflake is still done in essentially micro-batches.

With this architecture you will see some latency depending on the window you define (by default it is 10 minutes). So what happens is that over 10 minutes data is transformed, and then the loader is informed to load that window → if you have tighter requirements for loading latency you can reduce this window.
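For reference, the window is set in the transformer's HOCON config. Here is a minimal sketch, assuming the Transformer Kinesis config layout; the stream name and bucket path are illustrative, so check the config reference for your version:

```hocon
{
  # Kinesis stream holding enriched events (name is illustrative)
  "input": {
    "streamName": "enriched-events"
  }

  # S3 location for transformed output (bucket is illustrative)
  "output": {
    "path": "s3://my-transformed-bucket/transformed/"
  }

  # How often a window of transformed data is closed and handed to the
  # loader; the default is 10 minutes, lower it for tighter latency
  "windowing": "5 minutes"
}
```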


So the rough overall flow is: Collector → (raw stream) → Enrich → (enriched stream) → Transformer → (S3 batches) → Loader → Snowflake


Could you elaborate on what you mean by ingesting data into the same Kinesis streams from other places? The Snowplow pipeline expects Snowplow data to be in the streams and to be formatted in a particular way, so normally everything should enter via the Collector following the tracker protocol.
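For illustration, this is roughly what entering via the Collector looks like with the Python tracker. A minimal sketch, assuming the classic Emitter/Tracker API; the collector hostname and event fields are made up, and constructor arguments vary a bit between tracker versions:

```python
from snowplow_tracker import Emitter, Tracker

# Point the emitter at your Snowplow Collector (hostname is illustrative)
emitter = Emitter("collector.example.com")
tracker = Tracker(emitter)

# The tracker handles the tracker-protocol encoding, so the event arrives
# in the raw stream in the format the rest of the pipeline expects
tracker.track_struct_event(
    category="checkout",
    action="purchase",
    label="order-123",
    value=42.0,
)
```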

Hope this helps!

Hi @josh ,

Thank you very much for the explanation! Highly appreciated!

As for micro-batches: I am more familiar with Snowplow on GCP (Pub/Sub + Dataflow). Is loading into Snowflake a similar pattern from a general perspective, i.e. close-to-real-time micro-batches?

Could the whole journey of Collector → (raw stream) → Enrich → (enriched stream) → Transformer → (S3 batches) → Loader → Snowflake be around 10 minutes? If the window can be tightened, to what extent can it be? Is the bottleneck of the journey (S3 batches) → Loader?

I am thinking of:

sending data (other than via the Snowplow tracker) into the Kinesis stream where the Snowplow data sits → diverging Kinesis into 2 consumers: one specific to Snowplow, while the other takes Firehose, does a transformation of the Snowplow event data into some schema in Lambda (since it can be triggered when a 3 MB buffer is full, which saves some Lambda runs and is more economical), and then delivers that to other destinations, like an SNS topic, if it's not too expensive (a rough sketch of that second consumer follows below),

rather than: sending the data as an event via the tracker → waiting for the data to arrive in Snowflake → using some other process to read the data from Snowflake and send it to SNS to broadcast.
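To make the first option concrete, a rough sketch of that second consumer as a Lambda handler, assuming boto3 and a made-up topic ARN; the Snowplow-specific transformation is left as a pass-through since it depends on the schema, and the 3 MB buffering would be configured on the Lambda event source mapping rather than in code:

```python
import base64

import boto3

sns = boto3.client("sns")

# Topic ARN is illustrative
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:stitched-events"


def handler(event, context):
    """Second Kinesis consumer: decode each record and fan it out to SNS."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")

        # Schema-specific transformation would go here; pass through for now
        sns.publish(TopicArn=TOPIC_ARN, Message=payload)
```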

Something I am uncertain about is whether the Kinesis quotas allow 2 consumers, since there are limits on the number of requests/records/data volume consumed per shard (see Quotas and Limits - Amazon Kinesis Data Streams).

Please let me know if anything is not clear, and thank you so much for responding again!!

Yes - dropping the window will result in faster loads into Snowflake, if that is what you are looking to achieve! Finding the right balance between speed and cost-optimal settings is important: the more frequently loading occurs, the more the Snowflake warehouse is active, so you will end up spending more credits.


I am not sure I entirely follow what you are looking to achieve, but it is possible to read directly from the enriched stream, so you do not need to wait for data to land in Snowflake to leverage it. We recently released an open-source tool called Snowbridge which you can plug in directly on top of the enriched stream to access the data in real time.
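Snowbridge has its own config format, but just to illustrate that the enriched stream is directly consumable: a minimal polling sketch with boto3 and the Snowplow Python Analytics SDK. The stream name and shard ID are illustrative, and a real consumer would use KCL, Lambda, or Snowbridge rather than polling a single shard like this:

```python
import boto3
from snowplow_analytics_sdk.event_transformer import transform

kinesis = boto3.client("kinesis")

# Stream name and shard ID are illustrative
shard_iterator = kinesis.get_shard_iterator(
    StreamName="enriched-events",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=shard_iterator)["Records"]:
    # Enriched events are TSV lines; the analytics SDK parses them into a dict
    event = transform(record["Data"].decode("utf-8"))
    print(event["event_name"], event["collector_tstamp"])
```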

Something I am uncertain about is whether the Kinesis quotas allow 2 consumers, since there are limits on the number of requests/records/data volume consumed per shard (see Quotas and Limits - Amazon Kinesis Data Streams).

If you fully saturate the ingress of a stream you will max out at two consumers (each shard accepts up to 1 MB/s in but only serves 2 MB/s of shared read throughput) → to get around this you can either scale up the stream to increase the read bandwidth or use Enhanced Fan-Out for your extra consumers.
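For reference, registering an Enhanced Fan-Out consumer is a one-off API call, after which that consumer gets its own dedicated 2 MB/s per shard. A sketch with boto3; the stream ARN and consumer name are illustrative:

```python
import boto3

kinesis = boto3.client("kinesis")

# Stream ARN and consumer name are illustrative
response = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/enriched-events",
    ConsumerName="my-extra-consumer",
)

# The consumer ARN is what the subscribing application uses afterwards
print(response["Consumer"]["ConsumerARN"])
```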

Please let me know if anything is not clear, and thank you so much for responding again!!

It might be easier to help, however, if you explained the problem you are trying to solve rather than the solution you want to implement!