Optimizing and reducing shredding/loading costs

@guillaume, the number of events alone is not a clear determining factor for the size of the EMR cluster and the corresponding Spark configuration. Not all events are the same: lots of custom events and contexts require more computing power than standard Snowplow events (as opposed to self-describing events). The shredding step is more demanding than enrichment (when run in batch mode). Removing the enrichment step affects the size of the required EMR cluster only indirectly: if removing enrichment lets you run EmrEtlRunner more frequently, then you would need a smaller cluster after such a migration (as less data is processed per run). If the frequency of the batch job hasn’t changed and the data volume is the same, I would say the EMR cluster size would still be the same.

There is only a rough correlation between the volume of data being processed and the EMR cluster required to process it. The “formula” depends on the size of the enriched files. Let’s say the peak payload at some point during the day is 1 GB of gzipped enriched files per EmrEtlRunner run. This roughly translates into 10 GB of plain files (about a 10x expansion, since the files get ungzipped when processed on the EMR cluster). Knowing the volume of data to process per run and utilizing this guide, you can come up with the corresponding EMR cluster size and its Spark configuration.

Thus, for this 1 GB of gzipped enriched files, I would use the following configuration (the sizing arithmetic behind it is sketched after the block):

    jobflow:
      master_instance_type: m4.large
      core_instance_count: 1
      core_instance_type: r5.8xlarge
      core_instance_ebs:
        volume_size: 60
        volume_type: gp2
        ebs_optimized: true
    configuration:
      yarn-site:
        yarn.nodemanager.vmem-check-enabled: "false"
        yarn.nodemanager.resource.memory-mb: "256000"
        yarn.scheduler.maximum-allocation-mb: "256000"
      spark:
        maximizeResourceAllocation: "false"
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "9"
        spark.yarn.executor.memoryOverhead: "3072"
        spark.executor.memory: "22G"
        spark.executor.cores: "3"
        spark.yarn.driver.memoryOverhead: "3072"
        spark.driver.memory: "22G"
        spark.driver.cores: "3"
        spark.default.parallelism: "108"

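The numbers above follow from the node specs rather than being arbitrary. Here is a rough sketch of that arithmetic, assuming an r5.8xlarge offers 32 vCPUs and 256 GiB of RAM (that spec is my addition, not something stated in the config itself):

    # Rough sizing arithmetic behind the Spark settings above.
    # Assumption: an r5.8xlarge node has 32 vCPUs and 256 GiB of RAM.
    node_vcpus = 32
    node_memory_mb = 256_000          # what yarn.nodemanager.resource.memory-mb hands to YARN

    executors = 9                     # spark.executor.instances
    cores_per_executor = 3            # spark.executor.cores
    driver_cores = 3                  # spark.driver.cores

    # 9 executors x 3 cores + 3 driver cores = 30 of the 32 vCPUs,
    # leaving a couple of cores for the OS and YARN daemons.
    used_cores = executors * cores_per_executor + driver_cores
    assert used_cores <= node_vcpus

    executor_memory_mb = 22 * 1024    # spark.executor.memory = 22G
    overhead_mb = 3072                # spark.yarn.executor.memoryOverhead

    # Each container (executor or driver) claims 22 GiB heap + 3 GiB overhead = 25 GiB.
    # (9 executors + 1 driver) * 25 GiB = 250 GiB, which just fits into the
    # 256000 MB given to YARN.
    total_mb = (executors + 1) * (executor_memory_mb + overhead_mb)
    assert total_mb <= node_memory_mb

    # spark.default.parallelism = total executor cores * 4 = 27 * 4 = 108
    parallelism = executors * cores_per_executor * 4

    print(used_cores, total_mb, parallelism)   # -> 30 256000 108

If your enriched payload per run is noticeably bigger or smaller, you would likely change core_instance_count or the instance type and redo this arithmetic, keeping the containers within what a node can actually offer.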
Note also the newer generation of EC2 nodes, r5. You could upgrade to these instance types if your AMI version is also bumped to 6.1.0:

    emr:
      ami_version: 6.1.0