Recommended/Supported EMR Versions?

Hi there,

We are longtime Snowplow users and have been running the batch enrichment process successfully for years. Just over a year ago we moved to Snowflake as our primary data warehouse. We have had the EMR ETL Runner (v0.34.0) and the Dataflow Runner (v0.4.1) running in production successfully since their release.

In the last month we have started getting transient failures provisioning EMR clusters in AWS with an “Internal Error” as the only message. There are no logs, and none of the Snowplow steps are ever added to the cluster. It happens with both the EMR ETL Runner and the Dataflow Runner. AWS support’s only suggestion was to bump the EMR version to the latest. We are currently on:

Release label: emr-5.9.0
Hadoop distribution: Amazon 2.7.3
Applications: Spark 2.2.0

Are there any compatibility issues with newer EMR versions or can we safely bump up to 5.32.0 without upgrading our other Snowplow components?

Thanks for reading!


I’d do a test run on 5.32.0 first, but given that it’s still Spark 2.4.7 I can’t imagine you running into any issues. That said, AWS support should generally be able to give a more specific reason as to why an API request is failing. In the past I’ve seen this recur with certain node type / region combinations.

What node types and region are you running in at the moment?

Thanks @mike for the reply. We are in us-east-1 for everything.

For the EMR ETL Runner we are currently using
Master: 1 x m1.medium
Core: 2 x m4.4xlarge

For Dataflow Runner we have:
Master: 1 x m2.xlarge
Core: 2 x m2.xlarge

The failures are maybe 1 in 30 on average, so most of the time it works. It looks like it might be related to availability of those instance types in the region.

Ah ok.

us-east-1 is a bit of an ephemeral plane for EC2 generations - the generations are typically first born here and last to die (or as AWS prefer to call it ‘retired’*). I imagine due to the age of the m1 / m2 generation you are going to get intermittent failures due to a dwindling supply and occasional spot instance jobs in the region that are provisioning these instances.

Switching to a newer generation (m4 or above) should resolve the internal error issues - even if you keep the EMR version the same.

*They call them Blade Runners. They are tasked with finding, identifying and retiring old EC2 instances before they rise up against their creators.
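
As a rough way to spot which nodes are worth switching, here is a minimal Python sketch that flags older-generation instance types in the two cluster layouts quoted above. The helper names and the `MIN_GENERATION` cut-off are hypothetical: the cut-off simply mirrors the "m4 or above" suggestion, and generation numbers are not strictly comparable across instance families, so treat the result as a hint rather than an availability check.

```python
import re

# Assumption: the "m4 or above" advice is used as the cut-off; adjust as needed.
MIN_GENERATION = 4


def instance_generation(instance_type):
    """Return the generation number from a type such as 'm4.4xlarge', or None."""
    match = re.match(r"^[a-z]+(\d+)[a-z-]*\.", instance_type)
    return int(match.group(1)) if match else None


def legacy_types(cluster):
    """Return {role: type} for entries whose generation is below MIN_GENERATION."""
    flagged = {}
    for role, instance_type in cluster.items():
        generation = instance_generation(instance_type)
        if generation is not None and generation < MIN_GENERATION:
            flagged[role] = instance_type
    return flagged


# Instance types as reported earlier in the thread.
emr_etl_runner = {"master": "m1.medium", "core": "m4.4xlarge"}
dataflow_runner = {"master": "m2.xlarge", "core": "m2.xlarge"}

print(legacy_types(emr_etl_runner))
print(legacy_types(dataflow_runner))
```

With the layouts from this thread, the sketch flags the m1.medium master on the EMR ETL Runner cluster and both m2.xlarge nodes on the Dataflow Runner cluster, while the m4.4xlarge core nodes pass.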
