Iglu Central is down & associated S3 issues

alex · February 28, 2017, 6:20pm

A major Amazon S3 outage started at approximately 17:43 UTC today and is still ongoing. The AWS status page claims that this only affects us-east-1; however, we believe that it is affecting all regions.

You can track the issue here:

https://status.aws.amazon.com/

There are two major impacts on Snowplow:

1. Iglu Central is served by CloudFront and backed by an S3 bucket in us-east-1. Iglu Central is not currently available, which means that all Snowplow events will fail validation.
2. Snowplow AWS batch pipelines are failing as they attempt to read from or write to Amazon S3; Snowplow AWS real-time pipelines are failing to sink data from Kinesis to Amazon S3.
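
If you want to check for yourself whether Iglu Central is reachable, one option is to request a well-known schema over HTTP. The sketch below is a minimal, unofficial check rather than anything shipped with Snowplow: it assumes the base URL `http://iglucentral.com` and the static repository layout `/schemas/<vendor>/<name>/<format>/<version>`, and the helper names `schema_url` and `is_available` are made up for illustration.

```python
import urllib.error
import urllib.request

# Assumption: Iglu Central is served as a static repo at this base URL.
IGLU_CENTRAL = "http://iglucentral.com"


def schema_url(iglu_uri, base=IGLU_CENTRAL):
    """Map 'iglu:vendor/name/format/version' to an HTTP URL on a static repo."""
    prefix = "iglu:"
    if not iglu_uri.startswith(prefix):
        raise ValueError(f"not an Iglu URI: {iglu_uri!r}")
    path = iglu_uri[len(prefix):]
    parts = path.split("/")
    # Expect exactly vendor/name/format/version, none of them empty.
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected vendor/name/format/version, got: {path!r}")
    return f"{base.rstrip('/')}/schemas/{path}"


def is_available(iglu_uri, timeout=5.0):
    """Return True if the schema can be fetched with an HTTP 200, else False."""
    try:
        with urllib.request.urlopen(schema_url(iglu_uri), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False


if __name__ == "__main__":
    uri = "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-0"
    print(uri, "available:", is_available(uri))
```

A `False` result only tells you that the request from your machine failed; it does not distinguish an S3 outage from a local network problem.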

We will update this thread as we learn more.

dean · February 28, 2017, 7:15pm

What’s the downstream impact of 1? Will data be lost?

travisdevitt · February 28, 2017, 7:42pm

We are on the batch pipeline with the Clojure Collector… it will be interesting to see if Elastic Beanstalk log rotations are retried or if our raw logs are lost for good. Anyone know?

alex · February 28, 2017, 7:44pm

Hi @dean - the impact of 1 is that events will fail validation and be stored in the bad bucket.

We have paused all of our batch pipelines for Managed Service customers to prevent this from occurring. We recommend open source users of Snowplow also pause their batch pipelines until the underlying issue is fixed by Amazon.

alex · February 28, 2017, 7:46pm

Hi @travisdevitt - if the hourly rotation by the Clojure Collector fails, it will try again on the next hour. As long as you provisioned your Clojure Collector instances with sufficient hard disk headroom (so you don’t max out the local disks), you should see the events finally being rotated once the underlying issue is fixed by Amazon.
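
To get a rough feel for how much headroom "sufficient" means, you can estimate how many hours of un-rotated logs a collector instance can hold. This is a back-of-the-envelope sketch only: the function name and all of the numbers are hypothetical, and it ignores log compression and anything else writing to the same disk.

```python
def hours_of_headroom(free_bytes, events_per_hour, avg_event_bytes):
    """Estimate how many hours of raw logs fit in the remaining local disk."""
    bytes_per_hour = events_per_hour * avg_event_bytes
    if bytes_per_hour <= 0:
        raise ValueError("events_per_hour and avg_event_bytes must be positive")
    return free_bytes / bytes_per_hour


if __name__ == "__main__":
    # Hypothetical instance: 20 GiB free, 2 million events/hour, ~1,500 bytes per logged event.
    hours = hours_of_headroom(20 * 1024**3, 2_000_000, 1_500)
    print(f"approx. {hours:.1f} hours before the local disk fills up")
```

With these made-up figures the instance could buffer roughly seven hours of failed rotations; plug in your own traffic and disk numbers to see where you stand.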

mike · February 28, 2017, 7:46pm

It looks like some other services have been severely impacted in us-east-1 as well, including EFS, EC2, Auto Scaling and RDS.

If you have a login to the Amazon console you can see this information here with respective updates.

mike · February 28, 2017, 7:59pm

From the AWS Twitter account

https://twitter.com/awscloud

For S3, we believe we understand root cause and are working hard at repairing. Future updates across all services will be on dashboard.

alex · February 28, 2017, 8:44pm

As of 20:42 UTC, Iglu Central is back online (although the AWS service dashboard continues to report S3 and CloudFront issues in us-east-1). We are continuing to investigate what service outages are ongoing.

alex · February 28, 2017, 9:08pm

We believe that there are still issues in writing files to S3.

When this has recovered, you will want to resume any Snowplow pipelines which failed partway through. We strongly recommend deleting the enriched/shredded data belonging to any such partial pipeline run and resuming that run from the start of the EMR stage; this is to recover any events which incorrectly failed validation during that partial run, due to the Iglu Central outage.
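
As an illustration of the clean-up step, the sketch below picks out the S3 keys that belong to one partial run so that they can be reviewed before deletion. It is an assumption-laden example, not part of Snowplow: it assumes your enriched and shredded output lives under prefixes like `enriched/` and `shredded/` with one `run=<timestamp>/` folder per run, and the function name, prefixes, keys and run ID are all hypothetical. It only selects keys; listing and deleting them is left to whatever S3 tooling you already use.

```python
def keys_for_run(keys, run_id, prefixes=("enriched/", "shredded/")):
    """Return the keys under the given prefixes that sit inside run=<run_id>/."""
    marker = f"/run={run_id}/"
    return [
        key for key in keys
        if key.startswith(tuple(prefixes)) and marker in key
    ]


if __name__ == "__main__":
    # Hypothetical listing of a bucket after a run that failed partway through.
    listing = [
        "enriched/good/run=2017-02-28-18-00-00/part-00000",
        "enriched/bad/run=2017-02-28-18-00-00/part-00000",
        "shredded/good/run=2017-02-28-18-00-00/atomic-events/part-00000",
        "enriched/good/run=2017-02-28-12-00-00/part-00000",  # earlier, complete run
        "raw/processing/2017-02-28-18.log.gz",                # raw input, must be kept
    ]
    for key in keys_for_run(listing, "2017-02-28-18-00-00"):
        print("would delete:", key)
```

The raw files in the processing location are deliberately left alone, since they are what the run will be resumed from.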

mike · February 28, 2017, 10:21pm

Apparently back up now.

02:11 PM PST As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.

ecoron · March 1, 2017, 7:52am

Maybe one note for CloudFront Collector users.

It seems the missing log files from yesterday are now appearing in the associated S3 buckets. We noticed a delay of between 8 and 12 hours. Hopefully no data will be lost.

NirSivan · March 1, 2017, 3:41pm

We are looking at a 24-hour delay here: the Snowplow "in" folder contains files with a 2017-02-28 timestamp.

ecoron · March 1, 2017, 3:57pm

Yes, the delay is much bigger, and there was a far larger number of very small log files than normal, but in the end it seems nothing is lost…

alex · March 1, 2017, 10:51pm

Thanks @NirSivan and @ecoron for the additional information for CloudFront Collector users!

alex · March 2, 2017, 6:36pm

AWS has published a detailed post-mortem on the outage here:

https://aws.amazon.com/message/41926/
