Last night some of our batch pipeline users experienced an issue during EMR cluster bootstrap.
The issue manifested itself as an EmrEtlRunner failure with the following logs:
W, [2020-01-15T16:25:50.618136 #16932] WARN -- : Job failed. 2 tries left...
W, [2020-01-15T16:25:50.624329 #16932] WARN -- : Bootstrap failure detected, retrying in 81 seconds...
The cause of this outage is that Maven Central, a registry for Java assets, turned off access to hosted assets over plain HTTP: https://blog.sonatype.com/central-repository-moving-to-https
Unfortunately, this change went unnoticed by us and we couldn’t address it before it took effect globally. EmrEtlRunner uses Maven Central to download the Apache Commons Codec library, which replaces the legacy one bundled with the EMR AMI. After Maven Central closed HTTP access, bootstrap scripts started to fail, preventing all clusters from starting.
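To illustrate the kind of change involved, here is a minimal Python sketch that upgrades a plain-HTTP Maven Central URL to HTTPS. It is not the actual hotfix: the host list, the function name `force_https`, and the Commons Codec version in the example URL are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Hosts assumed to serve Maven Central content (illustrative, not exhaustive).
MAVEN_CENTRAL_HOSTS = {"repo1.maven.org", "repo.maven.apache.org"}


def force_https(url: str) -> str:
    """Return the URL with its scheme switched to https if it points at
    an assumed Maven Central host over plain HTTP; otherwise return it unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in MAVEN_CENTRAL_HOSTS:
        return parts._replace(scheme="https").geturl()
    return url


# Hypothetical artifact URL; the version number is only an example.
old_url = "http://repo1.maven.org/maven2/commons-codec/commons-codec/1.10/commons-codec-1.10.jar"
new_url = force_https(old_url)
print(new_url)
```

URLs for other hosts are deliberately left untouched, since not every third-party host is guaranteed to serve the same content over HTTPS.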
This issue affected users who use transient (non-persistent) AWS EMR clusters with EmrEtlRunner.
It did not affect GCP pipelines or real-time AWS pipelines loading data to Snowflake.
It’s worth mentioning that this incident impacted legacy pipelines the most. RT pipelines, which are currently the recommended setup, were not affected at all. Releases older than R102 received the fix with a longer delay because we have no observability over older pipelines. We encourage all our OSS users to use the latest versions of the Snowplow pipeline.
Timeline
- 4:00 PM UTC our Support Engineers noticed several failures across batch pipelines
- 5:00 PM UTC we identified the issue and started to prepare a hotfix
- 6:00 PM UTC we prepared and rolled out the snowplow-ami5-bootstrap-0.1.0.sh hotfix for all regions except ap-southeast-2. This script is used by all Snowplow R102+ releases
- 6:50 PM UTC we noticed that ap-southeast-2 was missed and fixed it as well. At this point all our managed pipelines were fixed and recovered
- 9:07 PM UTC we received the first report from OSS users telling us that their pipeline was still failing. This was due to an old EmrEtlRunner, which is no longer in use inside Snowplow
- 11:00 PM UTC we rolled out the snowplow-ami4-bootstrap-0.1.0.sh hotfix, which unfortunately fixed only pre-R82 EmrEtlRunners
- 11:00 AM UTC we rolled out the snowplow-ami4-bootstrap-0.2.0.sh hotfix, which fixed the remaining EmrEtlRunners
What’s next
We’re planning an exhaustive audit of our components in order to find even the slightest dependencies on third-party data and service providers and eliminate as many of them as possible.
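As a rough idea of what one small part of such an audit could look like, the following Python sketch scans script text for plain-HTTP URLs. It is only an illustration under stated assumptions: the function name `find_plain_http_urls`, the regular expression, and the sample script content are all hypothetical and do not reflect our actual tooling or bootstrap scripts.

```python
import re
from typing import List, Tuple

# Matches plain-HTTP URLs up to the next whitespace or quote character.
HTTP_URL = re.compile(r"http://[^\s\"'<>]+")


def find_plain_http_urls(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, url) pairs for every plain-HTTP URL in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HTTP_URL.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical script content used only to demonstrate the scan.
sample_script = (
    "wget http://repo1.maven.org/maven2/example/example.jar\n"
    "curl https://example.com/already-secure\n"
)
findings = find_plain_http_urls(sample_script)
for lineno, url in findings:
    print(f"line {lineno}: {url}")
```

A scan like this only surfaces hard-coded URLs; dependencies resolved indirectly, for example through build tools or base images, would need a separate review.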