Shred stage failure

I'm not sure what's causing the failure. Here are the logs:

Warning: Skip remote jar s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar.
21/02/14 10:02:32 INFO RMProxy: Connecting to ResourceManager at ip-10-0-1-76.ec2.internal/10.0.1.76:8032
21/02/14 10:02:32 INFO Client: Requesting a new application from cluster with 1 NodeManagers
21/02/14 10:02:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (117760 MB per container)
21/02/14 10:02:32 INFO Client: Will allocate AM container, with 23552 MB memory including 3072 MB overhead
21/02/14 10:02:32 INFO Client: Setting up container launch context for our AM
21/02/14 10:02:32 INFO Client: Setting up the launch environment for our AM container
21/02/14 10:02:33 INFO Client: Preparing resources for our AM container
21/02/14 10:02:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/02/14 10:02:36 INFO Client: Uploading resource file:/mnt/tmp/spark-2cc64c00-166b-4b4e-9ce8-6a276d86dbdc/__spark_libs__8146219896873042640.zip -> hdfs://ip-10-0-1-76.ec2.internal:8020/user/hadoop/.sparkStaging/application_1613296537408_0003/__spark_libs__8146219896873042640.zip
21/02/14 10:02:40 WARN RoleMappings: Found no mappings configured with 'fs.s3.authorization.roleMapping', credentials resolution may not work as expected
21/02/14 10:02:40 INFO Client: Uploading resource s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar -> hdfs://ip-10-0-1-76.ec2.internal:8020/user/hadoop/.sparkStaging/application_1613296537408_0003/snowplow-rdb-shredder-0.13.1.jar
21/02/14 10:02:40 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar' for reading
21/02/14 10:02:44 INFO Client: Uploading resource file:/mnt/tmp/spark-2cc64c00-166b-4b4e-9ce8-6a276d86dbdc/__spark_conf__199173711611612250.zip -> hdfs://ip-10-0-1-76.ec2.internal:8020/user/hadoop/.sparkStaging/application_1613296537408_0003/spark_conf.zip
21/02/14 10:02:44 INFO SecurityManager: Changing view acls to: hadoop
21/02/14 10:02:44 INFO SecurityManager: Changing modify acls to: hadoop
21/02/14 10:02:44 INFO SecurityManager: Changing view acls groups to:
21/02/14 10:02:44 INFO SecurityManager: Changing modify acls groups to:
21/02/14 10:02:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/02/14 10:02:44 INFO Client: Submitting application application_1613296537408_0003 to ResourceManager
21/02/14 10:02:44 INFO YarnClientImpl: Submitted application application_1613296537408_0003
21/02/14 10:02:45 INFO Client: Application report for application_1613296537408_0003 (state: ACCEPTED)
21/02/14 10:02:45 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1613296964561
final status: UNDEFINED
tracking URL: http://ip-10-0-1-76.ec2.internal:20888/proxy/application_1613296537408_0003/
user: hadoop
21/02/14 10:02:46 INFO Client: Application report for application_1613296537408_0003 (state: ACCEPTED)
21/02/14 10:02:47 INFO Client: Application report for application_1613296537408_0003 (state: ACCEPTED)
21/02/14 10:02:48 INFO Client: Application report for application_1613296537408_0003 (state: ACCEPTED)
21/02/14 10:02:49 INFO Client: Application report for application_1613296537408_0003 (state: ACCEPTED)
21/02/14 10:02:50 INFO Client: Application report for application_1613296537408_0003 (state: ACCEPTED)
21/02/14 10:02:51 INFO Client: Application report for application_1613296537408_0003 (state: RUNNING)
21/02/14 10:02:51 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.0.1.216
ApplicationMaster RPC port: 0
queue: default
start time: 1613296964561
final status: UNDEFINED
tracking URL: http://ip-10-0-1-76.ec2.internal:20888/proxy/application_1613296537408_0003/
user: hadoop
21/02/14 10:02:52 INFO Client: Application report for application_1613296537408_0003 (state: RUNNING)
21/02/14 10:03:33 INFO Client: Application report for application_1613296537408_0003 (state: RUNNING)
.
.
.
21/02/14 10:35:11 INFO Client: Application report for application_1613296537408_0003 (state: RUNNING)
21/02/14 10:35:12 INFO Client: Application report for application_1613296537408_0003 (state: FINISHED)
21/02/14 10:35:12 INFO Client:
client token: N/A
diagnostics: Max number of executor failures (8) reached
ApplicationMaster host: 10.0.1.216
ApplicationMaster RPC port: 0
queue: default
start time: 1613296964561
final status: FAILED
tracking URL: http://ip-10-0-1-76.ec2.internal:20888/proxy/application_1613296537408_0003/
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1613296537408_0003 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/02/14 10:35:12 INFO ShutdownHookManager: Shutdown hook called
21/02/14 10:35:12 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-2cc64c00-166b-4b4e-9ce8-6a276d86dbdc
Command exiting with ret '1'

Hey @Tejas_Behra ,

First of all, I’d recommend using the latest available shredder. You can check the documentation here.

Secondly, have you checked YARN logs? They could also help. You could browse this thread or this one to see an example.

Please let us know how it goes and if we can help any further.
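If log aggregation is enabled on the cluster, the YARN logs can usually be pulled in one go with the `yarn` CLI. The application id below is the one from your log output; the grep pattern is only an illustration of how to cut through the INFO noise:

```shell
# Fetch aggregated YARN logs for the failed application
# (requires yarn.log-aggregation-enable=true on the cluster).
yarn logs -applicationId application_1613296537408_0003 > app.log

# Show the first matches that look like an actual failure.
grep -n -m 20 -e "ERROR" -e "Exception" app.log
```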

Thanks @oguzhanunlu
I am seeing multiple directories being created under s3://bucketname/emr_logs/job_id/containers/application_*/container_*/

Here are the logs from one of the stderr files:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/10/__spark_libs__6192384126052002426.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See SLF4J Error Codes for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/15 15:23:33 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 3358@ip-10-0-1-216
21/02/15 15:23:33 INFO SignalUtils: Registered signal handler for TERM
21/02/15 15:23:33 INFO SignalUtils: Registered signal handler for HUP
21/02/15 15:23:33 INFO SignalUtils: Registered signal handler for INT
21/02/15 15:23:34 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/15 15:23:34 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/15 15:23:34 INFO SecurityManager: Changing view acls groups to:
21/02/15 15:23:34 INFO SecurityManager: Changing modify acls groups to:
21/02/15 15:23:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/15 15:23:35 INFO TransportClientFactory: Successfully created connection to /10.0.1.216:45135 after 81 ms (0 ms spent in bootstraps)
21/02/15 15:23:35 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/15 15:23:35 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/15 15:23:35 INFO SecurityManager: Changing view acls groups to:
21/02/15 15:23:35 INFO SecurityManager: Changing modify acls groups to:
21/02/15 15:23:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/15 15:23:35 INFO TransportClientFactory: Successfully created connection to /10.0.1.216:45135 after 2 ms (0 ms spent in bootstraps)
21/02/15 15:23:35 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1613402356381_0002/blockmgr-d480ad78-2cb5-4f69-8c21-b416da495c07
21/02/15 15:23:35 INFO MemoryStore: MemoryStore started with capacity 11.8 GB
21/02/15 15:23:36 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.0.1.216:45135
21/02/15 15:23:36 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/02/15 15:23:36 INFO Executor: Starting executor ID 4 on host ip-10-0-1-216.ec2.internal
21/02/15 15:23:36 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38111.
21/02/15 15:23:36 INFO NettyBlockTransferService: Server created on ip-10-0-1-216.ec2.internal:38111
21/02/15 15:23:36 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/02/15 15:23:36 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(4, ip-10-0-1-216.ec2.internal, 38111, None)
21/02/15 15:23:36 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(4, ip-10-0-1-216.ec2.internal, 38111, None)
21/02/15 15:23:36 INFO BlockManager: external shuffle service port = 7337
21/02/15 15:23:36 INFO BlockManager: Registering executor with local external shuffle service.
21/02/15 15:23:36 INFO TransportClientFactory: Successfully created connection to ip-10-0-1-216.ec2.internal/10.0.1.216:7337 after 25 ms (0 ms spent in bootstraps)
21/02/15 15:23:36 INFO BlockManager: Initialized BlockManager: BlockManagerId(4, ip-10-0-1-216.ec2.internal, 38111, None)
21/02/15 15:23:43 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
21/02/15 15:23:43 INFO MemoryStore: MemoryStore cleared
21/02/15 15:23:43 INFO BlockManager: BlockManager stopped
21/02/15 15:23:43 INFO ShutdownHookManager: Shutdown hook called

Hi @Tejas_Behra,

I don’t think that particular log is relevant — that executor was shut down cleanly by the driver. Is it the last container? We’re looking for an exception traceback.

Also, did you try to restart the job?
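To find the container that actually crashed rather than one that exited cleanly, it may be quicker to copy the EMR container logs down and grep them in one pass. The bucket and prefix below are placeholders matching the layout you described:

```shell
# Copy all container stderr files locally (placeholder bucket/prefix).
aws s3 cp s3://bucketname/emr_logs/job_id/containers/ ./containers \
  --recursive --exclude "*" --include "*stderr*"

# List only the files that contain a traceback or an out-of-memory error.
grep -Ril -e "Exception" -e "OutOfMemoryError" ./containers 2>/dev/null || true
```

"Max number of executor failures (8) reached" in the diagnostics means executors kept dying and being replaced until YARN gave up, so the interesting stderr is usually in one of the earlier containers, not the last one.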