There is a major Amazon S3 outage, which started at approximately 17:43 UTC today and is still ongoing. The AWS status page claims that this only affects us-east-1; however, we believe that it is affecting all regions.
Iglu Central is served by CloudFront and backed by an S3 bucket in us-east-1. Iglu Central is not currently available, which means that all Snowplow events will fail validation.
Snowplow AWS batch pipelines are failing as they attempt to read from or write to Amazon S3; Snowplow AWS real-time pipelines are failing to sink data from Kinesis to Amazon S3.
We are on the batch pipeline with the Clojure Collector… it will be interesting to see whether Elastic Beanstalk log rotations are retried or whether our raw logs are lost for good. Anyone know?
Hi @dean - the impact of the Iglu Central outage is that events will fail validation and be stored in the bad bucket.
We have paused all of our batch pipelines for Managed Service customers to prevent this from occurring. We recommend open source users of Snowplow also pause their batch pipelines until the underlying issue is fixed by Amazon.
Hi @travisdevitt - if the hourly rotation by the Clojure Collector fails, it will try again on the next hour. As long as you provisioned your Clojure Collector instances with sufficient hard disk headroom (so you don’t max out the local disks), you should see the events finally being rotated once the underlying issue is fixed by Amazon.
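To get a rough feel for how much headroom you have, here is a minimal back-of-the-envelope sketch. The function names and the example figures (20 GiB free, 512 MiB of raw logs per hour) are purely illustrative assumptions, not measurements from any real collector; substitute your own numbers.

```python
def hours_of_headroom(free_disk_bytes: int, log_bytes_per_hour: int) -> int:
    """Whole hours of un-rotated raw logs that fit in the free local disk space."""
    if log_bytes_per_hour <= 0:
        raise ValueError("log_bytes_per_hour must be positive")
    return free_disk_bytes // log_bytes_per_hour


def can_ride_out(outage_hours: int, free_disk_bytes: int, log_bytes_per_hour: int) -> bool:
    """True if the instance can buffer logs locally for the whole outage."""
    return hours_of_headroom(free_disk_bytes, log_bytes_per_hour) >= outage_hours


# Illustrative figures only (assumptions): 20 GiB free, 512 MiB of logs per hour.
GIB = 1024 ** 3
MIB = 1024 ** 2
print(hours_of_headroom(20 * GIB, 512 * MIB))
print(can_ride_out(12, 20 * GIB, 512 * MIB))
```

This ignores anything else writing to the same disk, so treat the result as an upper bound.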
As of 20:42 UTC, Iglu Central is back online (although the AWS service dashboard continues to report S3 and CloudFront issues in us-east-1). We are continuing to investigate what service outages are ongoing.
We believe that there are still issues with writing files to S3.
When this has recovered, you will want to resume any Snowplow pipelines which failed partway through. We strongly recommend deleting the enriched/shredded data belonging to any such partial pipeline run and resuming that run from the start of the EMR stage; this is to recover any events which incorrectly failed validation during that partial run, due to the Iglu Central outage.
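As a rough illustration of the clean-up step, the sketch below picks out the S3 keys that belong to one partial run so that they can be reviewed before deletion. It assumes a layout where each run's output sits under a `run=<timestamp>` folder beneath `enriched/` and `shredded/` prefixes; those prefixes, the example keys and the function name `partial_run_keys` are hypothetical, so adjust them to match your own bucket configuration. It only selects keys and does not call S3 or delete anything.

```python
from typing import Iterable, List, Sequence


def partial_run_keys(
    keys: Iterable[str],
    run_id: str,
    stage_prefixes: Sequence[str] = ("enriched/", "shredded/"),  # assumed layout
) -> List[str]:
    """Return the keys under the given stage prefixes that belong to run_id."""
    run_folder = f"run={run_id}"
    selected = []
    for key in keys:
        in_stage = key.startswith(tuple(stage_prefixes))
        # Match the run folder as a whole path segment, not as a substring.
        in_run = run_folder in key.split("/")[:-1]
        if in_stage and in_run:
            selected.append(key)
    return selected


# Hypothetical listing of a bucket after a run that failed partway through.
listing = [
    "enriched/good/run=2017-02-28-18-00-00/part-00000",
    "enriched/bad/run=2017-02-28-18-00-00/part-00000",
    "shredded/good/run=2017-02-28-18-00-00/atomic-events/part-00000",
    "enriched/good/run=2017-02-27-18-00-00/part-00000",
    "raw/processing/example.gz",
]
for key in partial_run_keys(listing, "2017-02-28-18-00-00"):
    print(key)
```

Once you are happy with the selection, delete those keys with whatever S3 tooling you normally use, then resume the run from the start of the EMR stage.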
02:11 PM PST As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.
It seems the missing log files from yesterday have now appeared in the associated S3 buckets. We noticed a delay of between 8 and 12 hours. Hopefully no data will be lost.