I would like to deactivate the bad streams for the collector and the enrichment process.
Is there a way to do that? I have to specify a stream in the configuration files, otherwise
I run into an error.
We would like to reduce our AWS costs.
We log two quite simple events, which are implemented and tested and work fine.
We would like to activate the bad streams whenever we need to debug but turn them off the rest of the time.
I’ve been thinking about this quite a bit lately, especially for low to mid-volume AWS-based pipelines. At the lowest end of the scale (all streams are one shard), operational costs could almost be cut in half if bad events went to the filesystem (or equivalent) instead of to streams.
It’s good practice to sink bad events to S3, and I have yet to consume bad events directly from Kinesis. It’s easy to alert on bad event volume via Kinesis, but that is also fairly easy, and more cost-effective, to instrument elsewhere.
I think there’d be some reluctance around having this turned off entirely, because that introduces the possibility of data loss (via either the collector bad or enriched bad streams), with no real way of retrieving this data if it goes into bad. Unfortunately there aren’t many options for ensuring this data is persisted (as the filesystem is unreliable), but it might be more cost-effective to consider sinking to Kafka or another queue instead.
Under normal circumstances not many events are directed into the bad streams at all - which is ultimately the reasoning behind wanting to slim them down and reduce cost.
For context: if a company runs pipelines in two primary regions (which ours does) and two environments (which ours does), runs the S3 sink process (which adds another stream), and has all non-“good” streams sharded to 1 (the minimum):
4 underutilized streams per pipeline (collector bad, enricher bad, pii, sink bad) * 2 environments * 2 regions * $11/mo * 12 mo = >$2000 per year.
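The arithmetic above, spelled out as a quick sketch. The function name and parameters are made up for illustration, and the $11/month per single-shard stream is simply the figure quoted above, not a price taken from AWS documentation.

```python
def annual_stream_cost(streams_per_pipeline, environments, regions,
                       monthly_cost_per_stream=11, months=12):
    """Yearly cost in dollars of underutilized single-shard streams.

    monthly_cost_per_stream defaults to the $11/mo estimate from the
    post above (an assumption, not an official AWS price).
    """
    return (streams_per_pipeline * environments * regions
            * monthly_cost_per_stream * months)


# collector bad, enricher bad, pii, sink bad -> 4 streams per pipeline
total = annual_stream_cost(streams_per_pipeline=4, environments=2, regions=2)
print(total)  # 2112
```

Changing the stream, environment, or region counts gives the equivalent figure for a smaller setup.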
I do agree that this figure is completely trivial for larger production pipelines. Our larger pipelines are over 100 shards each, and the cost of one shard there is laughable. For staging/development environments, however, this is an unnecessary cost. For smaller companies, cutting monthly operational cost in half is almost mandatory. Unnecessarily spending >$250/year (prod env only) or >$500/year (prod and stage envs) probably won’t happen.
Regarding the filesystem: I already have Fluent Bit sitting on all machines forwarding logs to a centralized stream, and then on to other tooling (Elasticsearch, Graylog). If “bad” was logged, it would be easy to point the forwarder at the log and not worry too much about filesystems being potentially unreliable.
Thanks for the input and perspective on your use cases!
I think the nub of it is that Snowplow is first and foremost designed to be loss-averse and run at scale.
I can’t speak to how likely it is that this kind of change is added to the roadmap, since I’m not really close enough to this to offer an informed opinion. However, if you’d like to create an issue on snowplow/snowplow, that team will assess it based on demand vs complexity to deliver.
It’s definitely possible to reduce cost by logging to the stdout sink or NSQ sink for collector bad and enriched bad, with the caveat that you wouldn’t want to do so in production systems, as this is really designed for debugging only. Doing so increases the risk of data loss. For example: what happens when EBS writes fail or increase dramatically in latency? How do you handle expected or unexpected instance termination events? What happens if the centralised stream is unavailable? What happens during an AZ failure?