Recommended/Supported EMR Versions?

Brandon_Kane · March 30, 2021, 1:13pm

Hi there,

We are longtime Snowplow users and have been running the batch enrichment process successfully for years. Just over a year ago we moved to Snowflake as our primary data warehouse. We have had the EMR ETL Runner (v0.34.0) and the Dataflow Runner (v0.4.1) running in production successfully since their release.

In the last month we have started getting transient failures provisioning EMR clusters in AWS, with an “Internal Error” as the only message. There are no logs, and none of the Snowplow steps are ever added to the cluster. It happens with both EMR ETL Runner and Dataflow Runner. AWS support’s only suggestion was to bump the EMR version to the latest. We are currently on:

Release label: emr-5.9.0
Hadoop distribution: Amazon 2.7.3
Applications: Spark 2.2.0

Are there any compatibility issues with newer EMR versions or can we safely bump up to 5.32.0 without upgrading our other Snowplow components?

Thanks for reading!

Brandon

mike · March 30, 2021, 11:44pm

I’d do a test run on 5.32.0 first, but given that it’s still Spark 2.4.7 I can’t imagine you running into any issues. That said, AWS support should generally be able to give a more specific reason as to why an API request is failing. In the past I’ve seen this recur with certain node type / region combinations.

What node types and region are you running in at the moment?

Brandon_Kane · March 31, 2021, 2:58pm

Thanks @mike for the reply. We are in us-east-1 for everything.

For the EMR ETL Runner we are currently using:
Master: 1 x m1.medium
Core: 2 x m4.4xlarge

For Dataflow Runner we have:
Master: 1 x m2.xlarge
Core: 2 x m2.xlarge

The failures are maybe 1 in 30 on average, so most of the time it works. It looks like it might be related to availability of those instance types in the region.

mike · March 31, 2021, 7:53pm

Ah ok.

us-east-1 is a bit of an ephemeral plane for EC2 generations - the generations are typically first born here and last to die (or as AWS prefer to call it ‘retired’*). I imagine due to the age of the m1 / m2 generation you are going to get intermittent failures due to a dwindling supply and occasional spot instance jobs in the region that are provisioning these instances.

Switching to a newer generation (m4 or above) should resolve the internal error issues - even if you keep the EMR version the same.
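As a rough illustration of that check, here is a minimal Python sketch that flags which of the instance types listed earlier in the thread belong to an older family. The `OLD_FAMILIES` set and both helper functions are assumptions made up for this example (the set is illustrative and not an exhaustive AWS list); it is not part of EMR ETL Runner or Dataflow Runner.

```python
# Assumption: an illustrative, non-exhaustive set of older EC2 families.
# Check the AWS documentation for the authoritative list.
OLD_FAMILIES = {"m1", "m2", "c1", "t1"}


def instance_family(instance_type):
    """Return the family prefix of an instance type, e.g. 'm1' for 'm1.medium'."""
    return instance_type.split(".", 1)[0].lower()


def flag_old_generation(roles):
    """Return only the role -> instance type pairs that use an older family."""
    return {
        role: itype
        for role, itype in roles.items()
        if instance_family(itype) in OLD_FAMILIES
    }


# Instance types as reported earlier in this thread.
clusters = {
    "emr-etl-runner": {"master": "m1.medium", "core": "m4.4xlarge"},
    "dataflow-runner": {"master": "m2.xlarge", "core": "m2.xlarge"},
}

for name, roles in clusters.items():
    print(name, flag_old_generation(roles))
```

With the node types from this thread, only the m4.4xlarge core nodes pass; the m1 master and both m2 roles would be the candidates to swap for a newer generation.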

*They call them Blade Runners. They are tasked with finding, identifying and retiring old EC2 instances before they rise up against their creators.
