A major Amazon S3 outage began at approximately 17:43 UTC today and is still ongoing. The AWS status page reports that only us-east-1 is affected; however, we believe the outage is affecting all regions.
Hi @dean - the impact of 1 is that events will fail validation and be stored in the bad bucket.
We have paused all of our batch pipelines for Managed Service customers to prevent this from occurring. We recommend open source users of Snowplow also pause their batch pipelines until the underlying issue is fixed by Amazon.
Hi @travisdevitt - if the hourly rotation by the Clojure Collector fails, it will try again on the next hour. As long as you provisioned your Clojure Collector instances with sufficient hard disk headroom (so you don’t max out the local disks), you should see the events finally being rotated once the underlying issue is fixed by Amazon.
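To get a rough idea of how long a collector instance can keep buffering un-rotated logs locally, something like the following sketch may help. It is only an estimate under stated assumptions: the `log_dir` path and the hourly log volume are values you would supply yourself (neither comes from the Clojure Collector), and `hours_until_full` is a hypothetical helper name.

```python
import math
import shutil


def hours_until_full(free_bytes: int, bytes_per_hour: int) -> float:
    """Estimate how many hours of logs fit in the remaining disk space.

    bytes_per_hour is an assumption: your own measurement of how much
    log data the collector writes locally in one hour.
    """
    if bytes_per_hour <= 0:
        # No measurable log growth, so the disk never fills on this estimate.
        return math.inf
    return free_bytes / bytes_per_hour


def free_bytes_at(log_dir: str = "/") -> int:
    """Free space on the filesystem holding log_dir (path is an assumption)."""
    return shutil.disk_usage(log_dir).free


if __name__ == "__main__":
    # Hypothetical figure: 500 MB of collector logs per hour.
    estimate = hours_until_full(free_bytes_at("/"), 500 * 1024 * 1024)
    print(f"Approximate hours of headroom: {estimate:.1f}")
```

If the estimate is shorter than you expect the outage to last, adding disk or reducing local log retention are the obvious options to look at.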
As of 20:42 UTC, Iglu Central is back online (although the AWS service dashboard continues to report S3 and CloudFront issues in us-east-1). We are continuing to investigate which service outages are still ongoing.
We believe that there are still issues with writing files to S3.
When this has recovered, you will want to resume any Snowplow pipelines which failed partway through. We strongly recommend deleting the enriched/shredded data belonging to any such partial pipeline run and resuming that run from the start of the EMR stage; this is to recover any events which incorrectly failed validation during that partial run, due to the Iglu Central outage.
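As a minimal sketch of the clean-up step described above, the snippet below picks out the S3 keys that belong to one partial run, so they can be reviewed before deletion. It assumes a layout in which enriched and shredded output lives under `enriched/` and `shredded/` prefixes, with each run in a `run=<timestamp>/` folder; check the prefixes against your own configuration. The function name and the sample keys are hypothetical, and the sketch deliberately only selects keys and does not delete anything.

```python
from typing import Iterable, List, Tuple

# Assumed prefixes; substitute the enriched/shredded locations from your own config.
PARTIAL_RUN_PREFIXES: Tuple[str, ...] = ("enriched/", "shredded/")


def keys_for_partial_run(
    keys: Iterable[str],
    run_id: str,
    prefixes: Tuple[str, ...] = PARTIAL_RUN_PREFIXES,
) -> List[str]:
    """Return the keys under the given prefixes that belong to run_id.

    A key matches when it starts with one of the prefixes and contains
    the folder segment "/run=<run_id>/".
    """
    marker = f"/run={run_id}/"
    return [
        key
        for key in keys
        if key.startswith(prefixes) and marker in key
    ]


if __name__ == "__main__":
    # Hypothetical listing of keys from the pipeline's bucket.
    listing = [
        "enriched/good/run=2017-02-28-18-00-00/part-00000",
        "shredded/good/run=2017-02-28-18-00-00/part-00000",
        "enriched/good/run=2017-02-27-18-00-00/part-00000",
        "raw/processing/example.lzo",
    ]
    for key in keys_for_partial_run(listing, "2017-02-28-18-00-00"):
        print(key)
```

Once you have confirmed the selected keys are the ones from the failed run, delete them with whichever S3 tool you normally use and then resume the run from the start of the EMR stage.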