EMR runnning slow on m5.xlarge AMI 6.5.0 Shredder 2.1.0

mgloel · February 11, 2022, 9:46am

Hi,

we are facing some problems with the shredder again. We recently updated to shredder and rdbloader 2.1.0 and also updated the AMI version to 6.5.0 because that was a requirement from our IT security.

The shredder jobs usually runs for about 20-30 minutes. At some point it starts slowing down considerably (see screenshot).

The influx of data is constant and rather small though less than 1GB per day).

Our current emr-config looks like this:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "com.oneapp",
    "logUri": "LOGURI",
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "roles": {
      "jobflow": "EMR_EC2_DefaultRole",
      "service": "EMR_DefaultRole"
    },
    "ec2": {
      "amiVersion": "6.5.0",
      "instances": {
          "core": {
              "count": 1,
              "type": "r5.xlarge"
          },
          "master": {
              "ebsConfiguration": {
                  "ebsBlockDeviceConfigs": [],
                  "ebsOptimized": true
              },
              "type": "m5.xlarge"
          },
          "task": {
              "bid": "0.015",
              "count": 0,
              "type": "m5.xlarge"
          }
      },
      "keyName": "EMR_ECS_KEY_PAIR",
      "location": {
          "vpc": {
              "subnetId": "AWS_SUBNET_PUBLIC_ID"
          }
      }
    },
    "tags": [
      {
        "key": "client",
        "value": "com.oneapp"
      },
      {
        "key": "job",
        "value": "main"
      },
      {
        "key": "GITC-VulnScanTool",
        "value": "tenable_io"
      }
    ],
    "bootstrapActionConfigs": [],
    "configurations": [
      {
        "classification": "spark",
        "configurations": [],
        "properties": {
            "maximizeResourceAllocation": "false"
        }
      },
      {
        "classification": "spark-defaults",
        "configurations": [],
        "properties": {
            "spark.default.parallelism": "8",
            "spark.driver.maxResultSize": "0",
            "spark.driver.cores": "1",
            "spark.driver.memory": "9G",
            "spark.dynamicAllocation.enabled": "false",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
            "spark.executor.memory": "9G",
            "spark.yarn.driver.memoryOverhead": "1024",
            "spark.yarn.executor.memoryOverhead": "1024"
        }
      },
      {
        "classification": "yarn-site",
        "configurations": [],
        "properties": {
            "yarn.nodemanager.resource.memory-mb": "24576",
            "yarn.nodemanager.vmem-check-enabled": "false",
            "yarn.scheduler.maximum-allocation-mb": "24576"
        }
      }
    ],
    "applications": [ "Hadoop", "Spark" ]
  }
}

What causes the shredder to slow down so much?

mike · February 13, 2022, 8:54pm

Are you able to share the monitoring IO? It may be running our of memory or disk.

mgloel · February 16, 2022, 8:05am

Hi Mike,

here is the IO monitor and the stderr as well as controller output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
22/02/16 05:50:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/02/16 05:50:04 WARN DependencyUtils: Skip remote jar s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar.
22/02/16 05:50:04 INFO RMProxy: Connecting to ResourceManager at ip-11-222-232-22.eu-west-1.compute.internal/11.222.232.22:8032
22/02/16 05:50:04 INFO Client: Requesting a new application from cluster with 1 NodeManagers
22/02/16 05:50:04 INFO Configuration: resource-types.xml not found
22/02/16 05:50:04 INFO ResourceUtils: Unable to find 'resource-types.xml'.
22/02/16 05:50:04 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
22/02/16 05:50:04 INFO Client: Will allocate AM container, with 10240 MB memory including 1024 MB overhead
22/02/16 05:50:04 INFO Client: Setting up container launch context for our AM
22/02/16 05:50:04 INFO Client: Setting up the launch environment for our AM container
22/02/16 05:50:04 INFO Client: Preparing resources for our AM container
22/02/16 05:50:04 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/02/16 05:50:07 INFO Client: Uploading resource file:/mnt/tmp/spark-9520bfdb-47a1-426e-b91d-1e94aee590ac/__spark_libs__1394439949027560815.zip -> hdfs://ip-11-222-232-22.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1644986589982_0002/__spark_libs__1394439949027560815.zip
22/02/16 05:50:09 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
22/02/16 05:50:09 INFO Client: Uploading resource s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar -> hdfs://ip-11-222-232-22.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1644986589982_0002/snowplow-rdb-shredder-2.1.0.jar
22/02/16 05:50:10 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar' for reading
22/02/16 05:50:11 INFO Client: Uploading resource file:/mnt/tmp/spark-9520bfdb-47a1-426e-b91d-1e94aee590ac/__spark_conf__5175405484479654466.zip -> hdfs://ip-11-222-232-22.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1644986589982_0002/__spark_conf__.zip
22/02/16 05:50:12 INFO SecurityManager: Changing view acls to: hadoop
22/02/16 05:50:12 INFO SecurityManager: Changing modify acls to: hadoop
22/02/16 05:50:12 INFO SecurityManager: Changing view acls groups to: 
22/02/16 05:50:12 INFO SecurityManager: Changing modify acls groups to: 
22/02/16 05:50:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
22/02/16 05:50:12 INFO Client: Submitting application application_1644986589982_0002 to ResourceManager
22/02/16 05:50:12 INFO YarnClientImpl: Submitted application application_1644986589982_0002
22/02/16 05:50:13 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:13 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1644990612187
	 final status: UNDEFINED
	 tracking URL: http://ip-11-222-232-22.eu-west-1.compute.internal:20888/proxy/application_1644986589982_0002/
	 user: hadoop
22/02/16 05:50:14 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:15 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:16 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:17 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:18 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:19 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:20 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:21 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:21 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: ip-11-222-232-20.eu-west-1.compute.internal
	 ApplicationMaster RPC port: 43267
	 queue: default
	 start time: 1644990612187
	 final status: UNDEFINED
	 tracking URL: http://ip-11-222-232-22.eu-west-1.compute.internal:20888/proxy/application_1644986589982_0002/
	 user: hadoop
22/02/16 05:50:22 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:23 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:24 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:25 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:26 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:27 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:28 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:29 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:30 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:31 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:32 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:33 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:34 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)

controller


2022-02-16T05:50:01.974Z INFO Ensure step 2 jar file command-runner.jar
2022-02-16T05:50:01.974Z INFO StepRunner: Created Runner for step 2
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.rdbloader.shredder.batch.Main --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRpY3MuaWdsdS9yZXNvbHZlci1jb25maWcvanNvbnNjaGVtYS8xLTAtMCIsCiAgImRhdGEiOiB7CiAgICAiY2FjaGVTaXplIjogNTAwLAogICAgInJlcG9zaXRvcmllcyI6IFsKICAgICAgewogICAgICAgICJuYW1lIjogIlMzLXNjaGVtYXMtcmVnaXN0cnkiLAogICAgICAgICJwcmlvcml0eSI6IDAsCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyJjb20ub25lYXBwIl0sCiAgICAgICAgImNvbm5lY3Rpb24iOiB7CiAgICAgICAgICAiaHR0cCI6IHsKICAgICAgICAgICAgInVyaSI6ICJodHRwczovL2QzMmlwbGFubndzb3pjLmNsb3VkZnJvbnQubmV0IgogICAgICAgICAgfQogICAgICAgIH0KICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsCiAgICAgICAgInByaW9yaXR5IjogMSwKICAgICAgICAidmVuZG9yUHJlZml4ZXMiOiBbICJjb20uc25vd3Bsb3dhbmFseXRpY3MiIF0sCiAgICAgICAgImNvbm5lY3Rpb24iOiB7CiAgICAgICAgICAiaHR0cCI6IHsKICAgICAgICAgICAgInVyaSI6ICJodHRwOi8vaWdsdWNlbnRyYWwuY29tIgogICAgICAgICAgfQogICAgICAgIH0KICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCAtIE1pcnJvciAwMSIsCiAgICAgICAgInByaW9yaXR5IjogMiwKICAgICAgICAidmVuZG9yUHJlZml4ZXMiOiBbImNvbS5zbm93cGxvd2FuYWx5dGljcyJdLAogICAgICAgICJjb25uZWN0aW9uIjogewogICAgICAgICAgImh0dHAiOiB7CiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSIKICAgICAgICAgIH0KICAgICAgICB9CiAgICAgIH0KICAgIF0KICB9Cn0K --config ewogICMgUGF0aCB0byBlbnJpY2hlZCBhcmNoaXZlIChtdXN0IGJlIHBvcHVsYXRlZCBzZXBhcmF0ZWx5IHdpdGggcnVuPVlZWVktTU0tREQtaGgtbW0tc3MgZGlyZWN0b3JpZXMpIGZvciBTMyBpbnB1dAogICJpbnB1dCI6ICJzM246Ly9zcC1lbnJpY2hlZC0yLTEtMC1naXRjLWxpdmUvYXJjaGl2ZS8iLAoKICAjIFBhdGggdG8gc2hyZWRkZWQgYXJjaGl2ZQogICJvdXRwdXQiOiB7CiAgICAjIFBhdGggdG8gc2hyZWRkZWQgb3V0cHV0CiAgICAicGF0aCI6ICJzM246Ly9zcC1zaHJlZGRlZC0yLTEtMC1naXRjLWxpdmUvYXJjaGl2ZS8iLAogICAgIyBTaHJlZGRlciBvdXRwdXQgY29tcHJlc3Npb24sIEdaSVAgb3IgTk9ORQogICAgIyBPcHRpb25hbCwgZGVmYXVsdCB2YWx1ZSBHWklQCiAgICAiY29tcHJlc3Npb24iOiAiR1pJUCIsCiAgICAjIFRoaXMgZmllbGQgaXMgb3B0aW9uYWwgaWYgaXQgY2FuIGJlIHJlc29sdmVkIHdpdGggQVdTIHJlZ2lvbiBwcm92aWRlciBjaGFpbi4KICAgICMgSXQgY2hlY2tzIHBsYWNlcyBsaWtlIGVudiB2YXJpYWJsZXMsIHN5c3RlbSBwcm9wZXJ0aWVzLCBBV1MgcHJvZmlsZSBmaWxlLgogICAgIyBodHRwczovL3Nkay5hbWF6b25hd3MuY29tL2phdmEvYXBpL2xhdGVzdC9zb2Z0d2FyZS9hbWF6b24vYXdzc2RrL3JlZ2lvbnMvcHJvdmlkZXJzL0RlZmF1bHRBd3NSZWdpb25Qcm92aWRlckNoYWluLmh0bWwKICAgICJyZWdpb24iOiAiZXUtd2VzdC0xIgogIH0KCiAgIyBRdWV1ZSB1c2VkIHRvIGNvbW11bmljYXRlIHdpdGggTG9hZGVyCiAgInF1ZXVlIjogewogICAgIyBUeXBlIG9mIHRoZSBxdWV1ZS4gSXQgY2FuIGJlIGVpdGhlciBzcXMgb3Igc25zCiAgICJ0eXBlIjogInNxcyIsCiAgICAjIE5hbWUgb2YgdGhlIHNxcyBxdWV1ZQogICAgInF1ZXVlTmFtZSI6ICJzcC1zcXMtcXVldWUtZ2l0Yy1saXZlLmZpZm8iLAogICAgIyBSZWdpb24gb2YgdGhlIFNRUyBxdWV1ZS4KICAgICMgT3B0aW9uYWwgaWYgaXQgY2FuIGJlIHJlc29sdmVkIHdpdGggQVdTIHJlZ2lvbiBwcm92aWRlciBjaGFpbi4KICAgICJyZWdpb24iOiAiZXUtd2VzdC0xIgogIH0KICAjIFNOUyBleGFtcGxlOgogIyAicXVldWUiOiB7CiAgICAjIFR5cGUgb2YgdGhlIHF1ZXVlLiBJdCBjYW4gYmUgZWl0aGVyIHNxcyBvciBzbnMKICAgICMidHlwZSI6ICJzbnMiLAogICAgIyBBUk4gb2YgU05TIHRvcGljCiAgICAjInRvcGljQXJuIjogImFybjphd3M6c25zOmV1LXdlc3QtMToxNDI4OTIyMjgwNjY6c3Atc25zLXRvcGljLWdpdGMtbGl2ZS5maWZvIiwKICAgICMgUmVnaW9uIG9mIHRoZSBTTlMgdG9waWMKICAgICMicmVnaW9uIjogImV1LXdlc3QtMSIKICAjfQoKICAjIENvbmZpZ3VyZSB0aGUgd2F5IGluLWJhdGNoIGRlZHVwbGljYXRpb24gaXMgcGVyZm9ybWVkCiAgImRlZHVwbGljYXRpb24iOiB7CiAgICAjIFN5bmV0aHRldGljIGRlZHVwbGljYXRpb24gcmVhc3NpZ25zIG5ldyBpZHMgdG8gZXZlbnRzIHdpdGggc2FtZSBpZC1maW5nZXJwcmludHQgcGFpcgogICAgIyBEaWZmZXJlbnQgb3B0aW9ucyBjYW4gYmUgdHJpZWQgaWYgc3ludGhldGljIGRlZHVwbGljYXRpb24gYWZmZWN0cyBwZXJmb3JtYW5jZQogICAgInN5bnRoZXRpYyI6IHsKICAgICAgIyBDYW4gYmUgTk9ORSAoZGlzYWJsZSksIEJST0FEQ0FTVCBhbmQgSk9JTiAoZGlmZmVyZW50IGxvdy1sZXZlbCBpbXBsZW1lbnRhdGlvbnMpCiAgICAgICJ0eXBlIjogIkJST0FEQ0FTVCIKICAgICAgIyBEbyBub3QgZGVkdXBsaWNhdGUgcGFpcnMgd2l0aCBsZXNzLW9yLWVxdWFsIGNhcmRpbmFsaXR5CiAgICAgICJjYXJkaW5hbGl0eSI6IDEKICAgIH0KICB9CgogICMgU2NoZW1hLXNwZWNpZmljIGZvcm1hdCBzZXR0aW5ncyAocmVjb21tZW5kZWQgdG8gbGVhdmUgYWxsIHRocmVlIGdyb3VwcyBlbXB0eSBhbmQgdXNlIFRTViBhcyBkZWZhdWx0KQogICJmb3JtYXRzIjogewogICAgIyBGb3JtYXQgdXNlZCBieSBkZWZhdWx0IChUU1Ygb3IgSlNPTikKICAgICMgT3B0aW9uYWwsIGRlZmF1bHQgdmFsdWUgVFNWCiAgICAiZGVmYXVsdCI6ICJKU09OIiwKICAgICMgU2NoZW1hcyB0byBiZSBzaHJlZGRlZCBhcyBKU09OcywgY29ycmVzcG9uZGluZyBKU09OUGF0aCBmaWxlcyBtdXN0IGJlIHByZXNlbnQuIEF1dG9taWdyYXRpb25zIHdpbGwgYmUgZGlzYWJsZWQKICAgICMgT3B0aW9uYWwsIGRlZmF1bHQgdmFsdWUgW10KICAgICJqc29uIjogWyBdLAogICAgIyBTY2hlbWFzIHRvIGJlIHNocmVkZGVkIGFzIFRTVnMsIHByZXNlbmNlIG9mIHRoZSBzY2hlbWEgb24gSWdsdSBTZXJ2ZXIgaXMgbmVjZXNzYXJ5LiBBdXRvbWlnYXJ0aW9ucyBlbmFibGVkCiAgICAjIE9wdGlvbmFsLCBkZWZhdWx0IHZhbHVlIFtdCiAgICAidHN2IjogWyBdLAogICAgIyBTY2hlbWFzIHRoYXQgd29uJ3QgYmUgbG9hZGVkCiAgICAjIE9wdGlvbmFsLCBkZWZhdWx0IHZhbHVlIFtdCiAgICAic2tpcCI6IFsgXQogIH0sCgogICMgT2JzZXJ2YWJpbGl0eSBhbmQgcmVwb3J0aW5nIG9wdGlvbnMKICAjIm1vbml0b3JpbmciOiB7CiAgICAjIE9wdGlvbmFsLCBmb3IgdHJhY2tpbmcgcnVudGltZSBleGNlcHRpb25zCiAgIyAgInNlbnRyeSI6IHsKICAgIyAgICJkc24iOiAiaHR0cDovL3NlbnRyeS5hY21lLmNvbSIKICAgICN9CiAjIH0KfQo='
INFO Environment:
  PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/aws/puppet/bin/
  SECURITY_PROPERTIES=/emr/instance-controller/lib/security.properties
  HISTCONTROL=ignoredups
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=eu-west-1
  LANG=en_US.UTF-8
  MAIL=/var/spool/mail/hadoop
  LOGNAME=hadoop
  PWD=/
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-2KY1L1IFX4486/tmp
  _=/etc/alternatives/jre/bin/java
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  SHELL=/bin/bash
  QTINC=/usr/lib64/qt-3.3/include
  USER=hadoop
  HADOOP_LOGFILE=syslog
  HOSTNAME=ip-11-222-232-22
  QTDIR=/usr/lib64/qt-3.3
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-2KY1L1IFX4486
  EMR_STEP_ID=s-2KY1L1IFX4486
  QTLIB=/usr/lib64/qt-3.3/lib
  HOME=/home/hadoop
  SHLVL=1
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-2KY1L1IFX4486/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-2KY1L1IFX4486/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-2KY1L1IFX4486
INFO ProcessRunner started child process 4936
2022-02-16T05:50:01.976Z INFO HadoopJarStepRunner.Runner: startRun() called for s-2KY1L1IFX4486 Child Pid: 4936
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO Process 4936 still running
INFO Process 4936 still running
INFO Process 4936 still running
INFO Process 4936 still running

anton · February 16, 2022, 10:00am

Hey @mgloel,

One more very useful source of information in newer AMIs is Spark History Server. You can check it in: Application user interfaces → Spark History UI.

In my experience it’s more reliable than Monitoring tab. But I think you’d have to interpret it on yourself as there’s quite a lot of details. What you’re after is big difference in step duration (between your old and new jobs) and/or failing executors.

mgloel · February 16, 2022, 11:45am

Thanks for the tip @anton
The spark history looks quite consistent. This is just an excerpt but it keeps repeaing in similar fashion.:

{"Event":"SparkListenerTaskEnd","Stage ID":40,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":163683,"Index":64713,"Attempt":0,"Launch Time":1645007530247,"Executor ID":"1","Host":"ip-11-222-232-20.eu-west-1.compute.internal","Locality":"RACK_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1645007530333,"Failed":false,"Killed":false,"Accumulables":[{"ID":899,"Name":"internal.metrics.executorDeserializeTime","Update":1,"Value":55836,"Internal":true,"Count Failed Values":true},{"ID":900,"Name":"internal.metrics.executorDeserializeCpuTime","Update":1031126,"Value":67246190184,"Internal":true,"Count Failed Values":true},{"ID":901,"Name":"internal.metrics.executorRunTime","Update":83,"Value":7644180,"Internal":true,"Count Failed Values":true},{"ID":902,"Name":"internal.metrics.executorCpuTime","Update":11263882,"Value":912464174092,"Internal":true,"Count Failed Values":true},{"ID":903,"Name":"internal.metrics.resultSize","Update":1491,"Value":96290301,"Internal":true,"Count Failed Values":true},{"ID":917,"Name":"internal.metrics.shuffle.write.bytesWritten","Update":2707,"Value":298503562,"Internal":true,"Count Failed Values":true},{"ID":918,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":3,"Value":380812,"Internal":true,"Count Failed Values":true},{"ID":919,"Name":"internal.metrics.shuffle.write.writeTime","Update":433450,"Value":32475066535,"Internal":true,"Count Failed Values":true},{"ID":920,"Name":"internal.metrics.input.bytesRead","Update":1208,"Value":116379921,"Internal":true,"Count Failed Values":true},{"ID":921,"Name":"internal.metrics.input.recordsRead","Update":3,"Value":380812,"Internal":true,"Count Failed Values":true}]},"Task Executor Metrics":{"JVMHeapMemory":0,"JVMOffHeapMemory":0,"OnHeapExecutionMemory":0,"OffHeapExecutionMemory":0,"OnHeapStorageMemory":0,"OffHeapStorageMemory":0,"OnHeapUnifiedMemory":0,"OffHeapUnifiedMemory":0,"DirectPoolMemory":0,"MappedPoolMemory":0,"ProcessTreeJVMVMemory":0,"ProcessTreeJVMRSSMemory":0,"ProcessTreePythonVMemory":0,"ProcessTreePythonRSSMemory":0,"ProcessTreeOtherVMemory":0,"ProcessTreeOtherRSSMemory":0,"MinorGCCount":0,"MinorGCTime":0,"MajorGCCount":0,"MajorGCTime":0},"Task Metrics":{"Executor Deserialize Time":1,"Executor Deserialize CPU Time":1031126,"Executor Run Time":83,"Executor CPU Time":11263882,"Peak Execution Memory":0,"Result Size":1491,"JVM GC Time":0,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Read Metrics":{"Remote Blocks Fetched":0,"Local Blocks Fetched":0,"Fetch Wait Time":0,"Remote Bytes Read":0,"Remote Bytes Read To Disk":0,"Local Bytes Read":0,"Total Records Read":0},"Shuffle Write Metrics":{"Shuffle Bytes Written":2707,"Shuffle Write Time":433450,"Shuffle Records Written":3},"Input Metrics":{"Bytes Read":1208,"Records Read":3},"Output Metrics":{"Bytes Written":0,"Records Written":0},"Updated Blocks":[]}}
{"Event":"SparkListenerTaskStart","Stage ID":40,"Stage Attempt ID":0,"Task Info":{"Task ID":163685,"Index":64715,"Attempt":0,"Launch Time":1645007530468,"Executor ID":"1","Host":"ip-11-222-232-20.eu-west-1.compute.internal","Locality":"RACK_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Killed":false,"Accumulables":[]}}. .. ````

Would it make sense to switch to a bigger instance type? We ran r5.24xlarge to clean things up. However, this will get quite expensive.

mgloel · February 18, 2022, 4:21pm

WE opted for:

Master:
m4.large
Core:
r5.12xlarge

It runs smoothly and takes only 13 minutes. It might get a bit expensive on the long run though.

Topic		Replies	Views
Shredder slow after stopping for a day / Reshredding 12 hours of data Spark	0	986	September 8, 2021
EMR jobflow failing on Hadoop Enrich step after a few seconds AWS batch pipeline (Legacy)	5	2436	April 29, 2016
Dataflow Runner RDBLoader step taking long AWS real-time pipeline	2	1436	November 29, 2018
EmrEtlRunner running for days at Step "Shred Enriched Events" Enrichment	3	1362	May 9, 2018
How long is a reasonable run time for EmrEtlRunner? AWS batch pipeline (Legacy)	4	1442	May 12, 2017

EMR runnning slow on m5.xlarge AMI 6.5.0 Shredder 2.1.0

Related topics