I inherited a Snowplow pipeline from 2017, which includes EMR ETL Runner with the following config (copying only the relevant part):
```yaml
emr:
  jobflow:
    master_instance_type: m4.large
    core_instance_count: 8
    core_instance_type: r4.xlarge
    core_instance_bid: # In USD. Adjust bid, or leave blank for on-demand core instances
    core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
      volume_size: 40 # Gigabytes
      volume_type: "gp2"
      # volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
      ebs_optimized: false # Optional. Will default to true
    task_instance_count: 0 # Increase to use spot instances
    task_instance_type: m1.medium
    task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
  bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
  configuration:
    yarn-site:
      yarn.resourcemanager.am.max-attempts: "1"
      yarn.nodemanager.vmem-check-enabled: "false"
      yarn.nodemanager.resource.memory-mb: "28672"
      yarn.scheduler.maximum-allocation-mb: "28672"
    spark:
      maximizeResourceAllocation: "false"
    spark-defaults:
      spark.dynamicAllocation.enabled: "false"
      spark.executor.instances: "23"
      spark.yarn.executor.memoryOverhead: "2048"
      spark.executor.memory: "8G"
      spark.executor.cores: "1"
      spark.yarn.driver.memoryOverhead: "2048"
      spark.driver.memory: "8G"
      spark.driver.cores: "1"
      spark.default.parallelism: "46"
```
As you can see, it uses a lot of resources, at least compared to the sample config I saw on GitHub. For context, though, we receive several million visits per day.
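As a back-of-the-envelope sanity check on those numbers (just a sketch: I'm assuming each Spark container is sized at executor memory plus the 2048 MB overhead, and that the driver occupies a container as well):

```python
# Back-of-the-envelope check of the Spark sizing against the YARN memory
# advertised per core node in the config above.
nodes = 8                          # core_instance_count
yarn_mem_mb = 28672                # yarn.nodemanager.resource.memory-mb
container_mb = 8192 + 2048         # spark.executor.memory + memoryOverhead

containers_per_node = yarn_mem_mb // container_mb   # 2
total_containers = nodes * containers_per_node      # 16
max_executors = total_containers - 1                # 15, one container for the driver

print(f"{containers_per_node} containers/node -> "
      f"at most {max_executors} executors can actually be scheduled")
```

If I'm reading that right, only ~16 of those 10 GiB containers fit in the advertised YARN memory, i.e. fewer than the 23 executors requested, so the sizing already looks like it has some slack (unless I'm misunderstanding how EMR schedules these).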
Now I've just updated the entire pipeline: installed Stream Enrich Kinesis, upgraded EMR ETL Runner to 1.0.4 and switched it to Stream Enrich mode, updated the config to EMR 6.1, bumped RDB Shredder/Loader to 0.18.1 and the Redshift JSON schema to 4.0.0, etc. And I'm running EMR ETL Runner every 2 hours.
So, since I've removed the enrichment step from EMR ETL Runner, and RDB Shredder/Loader has received a lot of performance improvements over the years, I'm wondering whether we still need all the resources listed in that config.
From the Kinesis monitoring on the enriched stream, I see we receive ~12,000 records/min, so that's ~1,440,000 records to shred and load every 2 hours. With the current config, it takes 17 minutes.
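Spelled out, the arithmetic:

```python
records_per_min = 12_000                 # from the enriched-stream monitoring
run_interval_min = 2 * 60                # EMR ETL Runner runs every 2 hours
records_per_run = records_per_min * run_interval_min
print(records_per_run)                   # 1,440,000 records per run

shred_load_min = 17
print(records_per_run / shred_load_min)  # ~84,700 records/min effective throughput
```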
Is anyone running EMR ETL Runner in a similar context? Any recommendations?
If not, what would be the way to start optimizing? Decreasing the number of instances? Downgrading the instance type? What metrics should I look at in the EMR job monitoring?
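For that last question, my plan was to start with the cluster-level metrics EMR publishes to CloudWatch, e.g. YARNMemoryAvailablePercentage. A minimal boto3 sketch; the region and the j-XXXXXXXXXXXXX cluster id are placeholders:

```python
import boto3
from datetime import datetime, timedelta

# Sketch: fetch YARN memory headroom for one 2-hour EMR run from CloudWatch.
# Region and cluster id below are placeholders, not real values.
cw = boto3.client("cloudwatch", region_name="eu-west-1")

resp = cw.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="YARNMemoryAvailablePercentage",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    StartTime=datetime.utcnow() - timedelta(hours=2),
    EndTime=datetime.utcnow(),
    Period=300,                     # 5-minute datapoints
    Statistics=["Average", "Minimum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Minimum"])
```

If that stays high for the whole run, the cluster is presumably oversized; MemoryAvailableMB and ContainerPending are published in the same namespace and should tell a similar story.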
Thanks in advance.