EmrEtlRunner sizing

@trung, are you using Spark enrichment or Hadoop? We switched to Spark as of the R89 release. Spark differs from Hadoop in that it does most of its processing in memory, whereas Hadoop writes intermediate data to HDFS between jobs. Spark is therefore generally faster, as it avoids costly writes to disk.
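
To make the in-memory point concrete, here is a minimal PySpark sketch (the bucket paths and the field check are hypothetical, not part of the actual enrichment job). The chained transformations run as one in-memory pipeline; an equivalent chain of MapReduce jobs would persist intermediate results to HDFS between each job.

    # Minimal sketch: chained transformations execute in memory,
    # with nothing written out until the final action.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("enrich-sketch").getOrCreate()

    raw = spark.sparkContext.textFile("s3://my-bucket/raw/")   # hypothetical input
    enriched = (raw
                .map(lambda line: line.split("\t"))            # parse
                .filter(lambda fields: len(fields) > 10))      # validate (hypothetical check)
    # Only this final action triggers a write:
    enriched.saveAsTextFile("s3://my-bucket/enriched/")        # hypothetical output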

In other words, for a Spark configuration you would want memory-optimized instances such as the r4 family, which come with more RAM. That alone, however, is not enough for high-volume data processing: Spark requires tuning to ensure its resources are used to the full (not left idle unnecessarily).

It’s hard to know the best configuration from the event volume alone. No two events are the same; a lot depends on the volume and complexity of your self-describing events. We do have a rough correlation between the size of the enriched files and the EMR cluster/Spark configuration, so if you tell us the typical payload size of your compressed/uncompressed files, we can suggest a better-adjusted configuration.
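
If you are not sure about those sizes, one quick way to check is to sum the object sizes of your enriched output on S3. Here is a minimal boto3 sketch, assuming a hypothetical bucket and prefix (point it at your own enriched/good location; for the uncompressed size, gunzip a sample file locally and compare):

    # Sum the compressed size of enriched files on S3.
    # Bucket and prefix below are hypothetical placeholders.
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    total_bytes = 0
    for page in paginator.paginate(Bucket="my-snowplow-bucket", Prefix="enriched/good/"):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]

    print("Compressed size: %.1f MB" % (total_bytes / (1024.0 * 1024.0)))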

For now, if we simply replace your 2x c4.large with 1x r4.xlarge, the Spark configuration would look like this:

  jobflow:
    master_instance_type: "m4.large"
    core_instance_count: 1
    core_instance_type: "r4.xlarge"
    core_instance_ebs:
      volume_size: 40
      volume_type: "gp2"
      ebs_optimized: true
    . . .
  configuration:
    yarn-site:
      yarn.nodemanager.vmem-check-enabled: "false"
      yarn.nodemanager.resource.memory-mb: "27648"
      yarn.scheduler.maximum-allocation-mb: "27648"
    spark:
      maximizeResourceAllocation: "false"
    spark-defaults:
      spark.dynamicAllocation.enabled: "false"
      spark.executor.instances: "2"
      spark.yarn.executor.memoryOverhead: "1024"
      spark.executor.memory: "8G"
      spark.executor.cores: "1"
      spark.yarn.driver.memoryOverhead: "1024"
      spark.driver.memory: "8G"
      spark.driver.cores: "1"
      spark.default.parallelism: "8"
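
As a sanity check on the numbers above: an r4.xlarge has 30.5 GiB of RAM, of which 27648 MB is handed to YARN, and the executor and driver allocations add up to exactly that figure:

    # Sanity check: the Spark memory settings above fill the YARN
    # allocation on the r4.xlarge exactly (all values in MB).
    executors = 2                 # spark.executor.instances
    executor_memory = 8 * 1024    # spark.executor.memory = 8G
    executor_overhead = 1024      # spark.yarn.executor.memoryOverhead
    driver_memory = 8 * 1024      # spark.driver.memory = 8G
    driver_overhead = 1024        # spark.yarn.driver.memoryOverhead

    total = executors * (executor_memory + executor_overhead) \
          + (driver_memory + driver_overhead)
    print(total)  # 27648 == yarn.nodemanager.resource.memory-mb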

You can read more about Spark configuration in this post: Learnings from using the new Spark EMR Jobs. In particular, the following spreadsheet could be used to find the optimal configuration: https://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/.
