Here at GetNinjas we currently use Snowplow to track client-side events and some transactional events, e.g. when a user pays an invoice, our app generates an event through Snowplow to help our BI team.
Nowadays we do lots of things when a user pays an invoice, e.g. giving them credits on our platform, sending them a receipt, etc. All of this is triggered by an event that comes through AWS SQS: our app sends an event to SQS, and a daemon consumes these messages and carries out the work described above.
We are using SQS because it’s reliable and we trust that the event will land in the right destination, basically because AWS guarantees that for us.
The events we currently send through SQS could just as well be sent using Snowplow; in other words, we are planning to use Snowplow as our event hub for transactional events. Perhaps that is not the best decision, given Snowplow characteristics such as buffering.
Moreover, we are planning to run Snowplow on Docker, and given the nature of containers (one can be shut down and another started at any time), we are afraid we may lose a few events if we suffer some kind of outage in our clusters.
Is our worry reasonable? Can we safely use Snowplow as an event hub solution? Is there anything wrong with our architecture? Do you have any tips?
Your concerns aren’t specific to Docker; every kind of Snowplow deployment carries a risk of event loss when a collector fails. If you are using Snowplow as an event bus rather than solely for analytics, the cost of that risk is higher, but the principles are the same.
The solution is to make sure you don’t have any outages. That requires the same approach as for any high-availability, fault-tolerant system: detailed failure mode analysis and planning, redundant deployments, automated failover, intelligent client retry strategies (i.e. in your trackers), and so on.
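The client-retry piece of that list can be sketched in a few lines. A minimal, hedged example of retrying with exponential backoff and jitter; the `send` transport and the limits here are assumptions, not any real Snowplow tracker API:

```python
import random
import time

def send_with_retry(send, payload, max_attempts=5, base_delay=0.5):
    """Try to deliver `payload` via `send`, retrying with exponential
    backoff plus jitter until it succeeds or attempts run out.
    Returns True on success, False if the event should be persisted
    locally and retried later rather than dropped."""
    for attempt in range(max_attempts):
        try:
            send(payload)  # any transport that raises on failure
            return True
        except Exception:
            # back off: 0.5s, 1s, 2s, ... plus up to 100 ms of jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return False
```

In a real tracker the `False` branch should cache the event (e.g. in local storage) instead of discarding it, so it can be replayed once the collector is reachable again.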
In Snowplow’s case the critical process is the ingestion step; everything downstream can be retried, but if you don’t capture the events in the first place, they are lost forever.
A couple of Snowplow-specific tips:
Run the CloudFront collector as a last resort; you can use Route53 DNS health checking to failover to CloudFront if none of your primary collectors are healthy.
Applications in different regions can post to the same Kinesis stream, so you can run collectors in multiple regions with the same sink. Again you can use Route53 DNS to failover to another region when the primary deployment is unhealthy.
Make sure your trackers retry when a request fails. Some trackers (e.g. the JavaScript tracker) do this already; others don’t yet, so depending on which trackers you use this may require some custom engineering.
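The Route53 failover setup from the tips above can be expressed as a change batch for `change-resource-record-sets`. A hedged sketch only; the hostnames, health-check ID, and CloudFront domain are placeholders:

```json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "collector.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary-collectors",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "<primary-health-check-id>",
        "ResourceRecords": [{"Value": "collector-primary.example.com"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "collector.example.com",
        "Type": "CNAME",
        "SetIdentifier": "cloudfront-fallback",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "dxxxxxxxxxxxxx.cloudfront.net"}]
      }
    }
  ]
}
```

With a record pair like this, Route53 serves the primary collectors while their health check passes and falls back to the CloudFront collector endpoint when it fails.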
We have a similar use case so I’d be interested to hear additions to this list from other Snowplowers too.
Is it possible to have the collector emit some external confirmation after the cache flush? I can imagine a flow where messages are marked as “processed” by the collector, so that later I can detect the lost ones and queue them back again.
@spatialy - I think it’s important to stress that high availability for your Kinesis streams / Kafka topics is an essential part of running a robust Snowplow pipeline; you can’t rely on these retry mechanisms for sustained outages.
This kind of monitoring and scaling is an important part of what we deliver with Snowplow Insights.
In our experience, events that don’t reach the collector are stored in the client browser’s local storage and are sent in bulk once the client successfully connects again on a future event, such as revisiting our pages (the intended behavior per the docs).
For events sitting in the collector buffer, the data is lost. A mitigation approach is to find a balance among the parameters that control the buffer; the right trade-off depends heavily on the impact that data loss has on your business.
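For reference, that buffer trade-off is tuned in the Scala Stream Collector’s HOCON configuration. A hedged sketch; the exact key names vary by collector version and the values below are purely illustrative, so check the reference config for your release:

```hocon
collector {
  sink {
    buffer {
      # Smaller limits mean less data at risk in memory when an
      # instance dies, at the cost of more (smaller) writes to the sink.
      byte-limit = 16384    # flush after ~16 KB buffered...
      record-limit = 100    # ...or after 100 events...
      time-limit = 1000     # ...or after 1 second, whichever comes first
    }
  }
}
```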
Just thinking out loud here, but it could be interesting to have a two-track system, where:
Low-materiality events (e.g. page pings, ad impressions) are handled as they are today (no-ack, cached)
High-materiality events (e.g. ecommerce transactions) are handled in a cacheless, transactional way, meaning that the connection from the tracker stays open until the event has been confirmed as dispatched, and is acked back to the tracker with a 200
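To make the two-track idea concrete, here is a minimal in-memory sketch of such a collector. The materiality flag, class, and method names are assumptions for illustration, not Snowplow APIs:

```python
from queue import Queue

class TwoTrackCollector:
    """Route events by materiality: low-materiality events are buffered
    and acked immediately; high-materiality events are written to the
    sink synchronously before the 200 is returned."""

    def __init__(self, sink):
        self.sink = sink         # anything with an append-like write
        self.buffer = Queue()    # holds low-materiality events

    def handle(self, event, high_materiality=False):
        if high_materiality:
            self.sink.append(event)  # blocking write, then ack
            return 200
        self.buffer.put(event)       # fire-and-forget; flushed later
        return 200

    def flush(self):
        """Periodic buffer flush for the low-materiality track."""
        while not self.buffer.empty():
            self.sink.append(self.buffer.get())
```

The cost of the high-materiality track is latency: the tracker’s request blocks on the sink write, which is exactly the trade-off being discussed here.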
Sounds a bit complicated. Remember you can have thousands of tracker instances live, maybe 2-10 collector instances, and no server affinity. I don’t see how you get the right collector acks back to the right trackers, asynchronously/post-cache flush, in a performant way.
Interesting suggestion. Maybe we could add a special route in the collector where events can be handled sync/async; we’d need to see how much refactoring that requires. (I’m not much help here for now, as I only started Scala programming recently.)
We have something similar, but more for data-security constraints on certain data points: those are loaded into a separate collector server from a second tracker instance.
@alex I plan to put a queue between the client and the collector. RabbitMQ, for example, only discards a message after an ACK by default, not on fetch. But I would need to make the collector do receive → flush → ack.
There could be some dealer passing messages from the queue to the collector, no problem. But the collector would later need to inform the dealer (or perhaps a webhook) when it is safe to ACK on the queue and discard the message.
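A toy version of that receive → flush → ack contract, with an in-memory stand-in for the broker (RabbitMQ keeps unacked deliveries in flight in a similar way, but all names here are hypothetical):

```python
import uuid

class AckQueue:
    """Broker stand-in: delivered messages stay 'in flight' until acked,
    so an unacked message can be re-queued after a collector crash."""

    def __init__(self):
        self.pending = []      # not yet delivered
        self.in_flight = {}    # delivery tag -> message

    def publish(self, msg):
        self.pending.append(msg)

    def get(self):
        tag = str(uuid.uuid4())
        self.in_flight[tag] = msg = self.pending.pop(0)
        return tag, msg

    def ack(self, tag):
        del self.in_flight[tag]  # now safe to trash the message

    def requeue_unacked(self):
        """Run after detecting a dead consumer: redeliver in-flight messages."""
        self.pending.extend(self.in_flight.values())
        self.in_flight.clear()

def dealer_loop(queue, collector_flush):
    """Deliver one message to the collector; ack only after the
    collector confirms its buffer was flushed downstream."""
    tag, msg = queue.get()
    if collector_flush(msg):  # collector: receive -> flush -> confirm
        queue.ack(tag)
```

The key property is that `ack` happens strictly after the collector confirms the flush, so a crash between receive and flush leaves the message in flight and eligible for redelivery rather than lost.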