We are longtime Snowplow users and have been successfully running the batch enrichment process for a long time. Just over a year ago we moved to Snowflake as our primary data warehouse. We have had the EMR ETL Runner (v0.34.0) and the Dataflow Runner (v0.4.1) running in production successfully since their release.
In the last month we have started getting transient failures provisioning EMR clusters in AWS, with an “Internal Error” as the only message. There are no logs, and none of the Snowplow steps are ever added to the cluster. It happens with both the EMR ETL Runner and the Dataflow Runner. AWS support’s only suggestion was to bump the EMR version to the latest. We are currently on:
I’d do a test run on 5.32.0 first, but given that it’s still Spark 2.4.7 I can’t imagine you running into any issues. That said, AWS support should generally be able to give a more specific reason as to why an API request is failing. In the past I’ve seen this recur with certain node type / region combinations.
What node types and region are you running in at the moment?
us-east-1 is a bit of an ephemeral place for EC2 generations - generations are typically born here first and are the last to die here (or, as AWS prefer to call it, ‘retired’*). Given the age of the m1 / m2 generations, I imagine you are going to get intermittent failures due to a dwindling supply, plus the occasional spot instance job in the region grabbing these instance types.
Switching to a newer generation (m4 or above) should resolve the internal error issues - even if you keep the EMR version the same.
*They call them Blade Runners. They are tasked with finding, identifying and retiring old EC2 instances before they rise up against their creators.
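If it helps, the instance-type change is only a couple of fields in the Dataflow Runner cluster config. A minimal sketch, assuming the standard ClusterConfig layout - the schema version, name, counts, and bid price here are illustrative placeholders, and the rest of your config (logUri, roles, credentials, etc.) stays as-is:

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "snowplow-enrich",
    "region": "us-east-1",
    "ec2": {
      "amiVersion": "5.19.0",
      "instances": {
        "master": { "type": "m4.large" },
        "core": { "type": "m4.large", "count": 1 },
        "task": { "type": "m4.large", "count": 0, "bid": "0.015" }
      }
    }
  }
}
```

The same idea applies on the EmrEtlRunner side: swap the m1 / m2 types in the `aws:emr:jobflow` section of your config for m4-or-newer equivalents, keeping the EMR release label unchanged if you want to isolate the instance-generation variable first.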