Optimal setup for Spark jobs

Hi All,

We are in the process of trying out the Spark workflow and would like some feedback on our setup.

I have read the post from Rick at OneSpot on tuning the Spark settings. With it we do seem to reach full CPU, but I'm not sure about the node usage being reported by Spark.

We have a 6 x r3.8xlarge cluster running, and I have been watching the usage numbers in the YARN ResourceManager UI.

My concern is vCores used vs. vCores total: it seems we aren't using them all, yet we are using all of the memory and I am seeing ~100% CPU usage in EC2 monitoring.
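One thing worth checking (an assumption on my part about your setup): YARN's CapacityScheduler defaults to `DefaultResourceCalculator`, which schedules containers on memory alone, so the ResourceManager UI reports each container as using 1 vCore regardless of `spark.executor.cores`. That would explain low "vCores used" alongside ~100% actual CPU. Switching to the `DominantResourceCalculator` in `capacity-scheduler.xml` makes the vCore numbers reflect what was requested, e.g.:

```xml
<!-- capacity-scheduler.xml: make YARN account for CPU as well as memory -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```

This only changes how YARN counts and schedules vCores; it doesn't change how many threads Spark actually runs per executor.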

The batch is 1 day, approx 100 GB, plus 20 GB of bad records (don't ask ;-( ).


Also, we are using EMR 91 with the following config:

    yarn.resourcemanager.am.max-attempts: "1"
    maximizeResourceAllocation: "true"
    spark.executor.instances: "59"
    spark.yarn.executor.memoryOverhead: "3072"
    spark.executor.memory: "21G"
    spark.yarn.driver.memoryOverhead: "3072"
    spark.driver.memory: "21G"
    spark.executor.cores: "3"
    spark.driver.cores: "3"
    spark.default.parallelism: "354"
    spark.dynamicAllocation.enabled: "false"
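For what it's worth, here is how I sanity-check the arithmetic behind those numbers. The instance specs are my assumption (r3.8xlarge = 32 vCPUs, 244 GiB RAM per node); the rest comes straight from the config above:

```python
# Sanity-check the executor sizing from the config above.
# Assumed instance specs: r3.8xlarge = 32 vCPUs, 244 GiB RAM per node.
nodes = 6
vcpus_per_node = 32
executors = 59                 # spark.executor.instances
cores_per_executor = 3         # spark.executor.cores
executor_mem_gb = 21           # spark.executor.memory
overhead_gb = 3                # spark.yarn.executor.memoryOverhead = 3072 MB

total_vcores = nodes * vcpus_per_node                # 6 * 32 = 192
used_vcores = executors * cores_per_executor + 3     # +3 for the driver
print(used_vcores, "of", total_vcores, "vCores")     # 180 of 192 vCores

# ~10 executors per node, each needing 21 + 3 = 24 GB of YARN memory
per_node_mem_gb = (executors / nodes) * (executor_mem_gb + overhead_gb)
print(round(per_node_mem_gb), "GB per node")         # 236 GB per node

# spark.default.parallelism = 2 tasks per executor core
print(2 * executors * cores_per_executor)            # 354
```

So by these requests the cluster should be nearly saturated on both memory and cores, which matches the EC2 CPU graphs; the low vCores-used figure in the UI is likely a reporting artifact rather than idle capacity.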