We’ve set up the streaming Snowplow components (enricher, shredder and RDB loader) and we’re loading events into Redshift. However, we observe something weird when no events are going through the pipeline. For example, during periods of very low traffic, or when we’re redeploying our website or briefly putting it in maintenance mode, the Shredder component still sends new messages to the SQS queue. The RDB loader consumes those messages and then tries to load into Redshift from S3 folders which don’t exist. As a result, the RDB loader container fails immediately.
Let me illustrate it with some logs. We recently had a prolonged downtime of the service which sends events to Snowplow, and as a result no events were passing through the pipeline.
We’ve currently configured the shredder component with a 2-minute “windowing”. But then this is what happened in the same minute when the RDB loader received the message:
The weird thing is that there is no such folder in S3. It looks like when there are no events, this folder is either not created, or it’s empty and S3 is automatically hiding/removing it as a result. For example, during this specific period we had a continuous flow of events between 2021-06-29-06-00-00 and 2021-06-29-06-16-00, then a small batch of events at 2021-06-29-06-26-00, and no new events after that for the remainder of that hour. And this is what we have as folders in S3:
Sorry, I missed the fact that you are using the streaming version of the shredder. This version is still in alpha and is not production-ready yet. It’s possible that the state used by the app during a window does not get reinitialized as it should when there is no data. We will work on the next phase of development this quarter and will check that.
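In the meantime, it may help to confirm which windows actually produced output. S3 has no real folders: a "folder" is only visible while at least one object exists under that key prefix, which would explain why a window with no events leaves nothing behind. Below is a minimal sketch of such a check, not part of the RDB loader itself. `prefix_has_objects` is a hypothetical helper, and the bucket and prefix in the usage comment are placeholders; it assumes a boto3-style S3 client exposing `list_objects_v2`.

```python
from typing import Any


def prefix_has_objects(s3_client: Any, bucket: str, prefix: str) -> bool:
    """Return True if at least one object exists under the given S3 prefix.

    `s3_client` is assumed to be a boto3 S3 client (or anything exposing a
    compatible `list_objects_v2` method).
    """
    # Add a trailing slash so "run=A" does not also match "run=AB/...".
    if not prefix.endswith("/"):
        prefix += "/"
    # MaxKeys=1 is enough: we only need to know whether anything exists.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


# Hypothetical usage (bucket and prefix are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   if not prefix_has_objects(s3, "my-shredded-bucket", "shredded/good/run=..."):
#       print("No objects under this prefix; nothing to load")
```

Running this against the prefix named in the SQS message would show whether the shredder announced a window for which it wrote no data.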