How are you? Just wondering, does Snowplow support real-time streaming into Snowflake on AWS? I am not certain whether Transformer Kinesis (Snowplow Documentation) is the relevant feature.
Hi @Yolanda_Ou, we support partial streaming, in that we can "transform" data for loading into Snowflake and save it to S3 in near real time, but the actual loading operation into Snowflake is still done in what are essentially micro-batches.
With this architecture you will see some latency depending on the window you define (by default it is 10 minutes). Over each window, data is transformed, and at the end of the window the loader is told to load it → if you have tighter loading-latency requirements you can reduce this window.
So the rough overall flow is: Collector → (raw stream) → Enrich → (enriched stream) → Transformer → (S3 batches) → Loader → Snowflake
Could you elaborate on what you mean by ingesting data into the same Kinesis streams from other places? The Snowplow pipeline expects the streams to contain Snowplow data formatted in a particular way, so normally everything should enter via the Collector following the tracker protocol.
Thank you very much for the explanation! Highly appreciated!
As for micro-batches: I am more familiar with Snowplow on GCP (Pub/Sub + Dataflow). Is loading into Snowflake on AWS a similar pattern, i.e. closer-to-real-time micro-batches, generally speaking?
Could the whole journey of Collector → (raw stream) → Enrich → (enriched stream) → Transformer → (S3 batches) → Loader → Snowflake take around 10 minutes? If the window can be tightened, to what extent can it be? Is the bottleneck of the journey (S3 batches) → Loader?
I am thinking of:
send data (other than via a Snowplow tracker) into the Kinesis stream where the Snowplow data sits → fan the stream out to 2 consumers: one specific to Snowplow, and another going to Firehose, which transforms the Snowplow event data into some schema in a Lambda (Firehose can buffer up to 3 MB before invoking the Lambda, which saves Lambda invocations and is more economical) and then delivers it to other destinations, like an SNS topic, if that is not too expensive (roughly the sketch after this comparison),
rather than: send the data as an event via a tracker → wait for the data to arrive in Snowflake → use some other process to read the data from Snowflake and send it to SNS to broadcast.
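For illustration, a minimal sketch of what that transformation Lambda might look like, assuming the standard Firehose data-transformation event shape; the topic ARN, environment variable, and envelope schema are all made up for this example:

```python
import base64
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["TOPIC_ARN"]  # hypothetical topic, set in the Lambda env


def handler(event, context):
    """Firehose data-transformation handler: decode each buffered record,
    reshape it, publish it to SNS, and hand it back to Firehose."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")

        # Placeholder transformation: wrap the raw enriched event in a JSON
        # envelope. A real version would map the Snowplow enriched TSV into
        # whatever schema the downstream consumers expect.
        message = json.dumps({"source": "snowplow", "payload": payload})

        sns.publish(TopicArn=TOPIC_ARN, Message=message)

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(message.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```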
Something I am uncertain about is whether the Kinesis quotas allow 2 consumers, since there are limits on the number of requests / records / data volume consumed per shard (Quotas and Limits - Amazon Kinesis Data Streams).
Please let me know if anything is not clear, and thank you so much for responding again!!
Yes - dropping the window will result in faster loads into Snowflake, if that is what you are looking to achieve! Finding the right balance between speed and cost-optimal settings is important: the more frequently loading occurs, the more the Snowflake warehouse is active, so you will end up spending more credits.
I am not sure I entirely follow what you are looking to achieve, but it is possible to read directly from the enriched stream, so you do not need to wait for data to land in Snowflake to leverage it. We recently released an open-source tool called Snowbridge which you can plug in directly on top of the enriched stream to access the data in real time.
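If you do want to roll your own consumer instead of Snowbridge, a rough sketch of reading the enriched stream directly is below. It polls a single shard with boto3 and uses the Snowplow Python Analytics SDK to turn each enriched TSV record into a dict; the stream name is a placeholder, and a production consumer would use KCL (or Snowbridge) rather than a single-shard loop:

```python
import time

import boto3
from snowplow_analytics_sdk.event_transformer import transform

kinesis = boto3.client("kinesis")
STREAM = "enriched-good"  # placeholder: your enriched stream's name

# Read from the first shard only, for illustration.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        # Enriched events are tab-separated; transform() maps the 130-odd
        # columns into a JSON-friendly dict.
        event = transform(record["Data"].decode("utf-8"))
        print(event.get("event_name"), event.get("collector_tstamp"))
    iterator = resp["NextShardIterator"]
    time.sleep(1)  # stay under the 5 GetRecords calls/second/shard quota
```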
> Something I am uncertain about is whether the Kinesis quotas allow 2 consumers, since there are limits on the number of requests / records / data volume consumed per shard.
If you fully saturate the write throughput of a stream you will max out at two consumers (each shard supports 1 MB/s in and a shared 2 MB/s out) → to get around this you can either scale up the stream to increase the READ bandwidth, or use Enhanced Fan-Out for your extra consumers, which gives each registered consumer its own dedicated 2 MB/s per shard.
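For reference, registering an Enhanced Fan-Out consumer is a single API call; here is a sketch with boto3 and a placeholder stream ARN and consumer name:

```python
import boto3

kinesis = boto3.client("kinesis")

# Placeholder ARN: substitute the ARN of your enriched stream.
STREAM_ARN = "arn:aws:kinesis:eu-west-1:123456789012:stream/enriched-good"

# Each registered consumer gets its own dedicated 2 MB/s of read
# throughput per shard, independent of the shared GetRecords limit.
consumer = kinesis.register_stream_consumer(
    StreamARN=STREAM_ARN,
    ConsumerName="sns-fanout-consumer",  # hypothetical consumer name
)["Consumer"]

print(consumer["ConsumerARN"], consumer["ConsumerStatus"])
# Once ACTIVE, the ConsumerARN is used with SubscribeToShard (HTTP/2
# push) instead of polling GetRecords.
```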
> Please let me know if anything is not clear, and thank you so much for responding again!!
It might be easier to help, however, if you explained the problem you are trying to solve rather than the solution you want to implement!