Our Snowflake Transformer has started failing with the following error in the cluster log:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/10/__spark_libs__6496353209541393289.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/07/07 14:40:50 INFO SignalUtils: Registered signal handler for TERM
20/07/07 14:40:50 INFO SignalUtils: Registered signal handler for HUP
20/07/07 14:40:50 INFO SignalUtils: Registered signal handler for INT
20/07/07 14:40:51 INFO ApplicationMaster: Preparing Local resources
20/07/07 14:40:53 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1594132618622_0001_000001
20/07/07 14:40:53 INFO SecurityManager: Changing view acls to: yarn,hadoop
20/07/07 14:40:53 INFO SecurityManager: Changing modify acls to: yarn,hadoop
20/07/07 14:40:53 INFO SecurityManager: Changing view acls groups to:
20/07/07 14:40:53 INFO SecurityManager: Changing modify acls groups to:
20/07/07 14:40:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
20/07/07 14:40:53 INFO ApplicationMaster: Starting the user application in a separate Thread
20/07/07 14:40:53 INFO ApplicationMaster: Waiting for spark context initialization...
20/07/07 14:40:55 WARN JsonMetaSchema: Unknown keyword exclusiveMinimum - you should define your own Meta Schema. If the keyword is irrelevant for validation, just use a NonValidationKeyword
20/07/07 14:42:33 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:401)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
20/07/07 14:42:33 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
20/07/07 14:42:33 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
20/07/07 14:42:33 INFO ApplicationMaster: Deleting staging directory hdfs://ip-10-80-20-232.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1594132618622_0001
20/07/07 14:42:33 INFO ShutdownHookManager: Shutdown hook called
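For what it's worth, the `100000 milliseconds` in the `TimeoutException` looks like it matches Spark's default `spark.yarn.am.waitTime` (100s), which is how long the YARN ApplicationMaster waits for the SparkContext to initialise in cluster mode. I assume (untested) it could be raised via the cluster's Spark configuration, e.g. in `spark-defaults.conf`:

```
# Hypothetical workaround, not verified: give the driver longer to start
spark.yarn.am.waitTime  300s
```

though presumably that would only mask whatever is making context initialisation slow in the first place.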
I’ve tried increasing the instance size to m2.xlarge (it was previously a medium), to no avail.
The only change we’ve made recently is altering the path of the streaming enrich output folders from YYYY-MM-DD-HH to YYYY-MM-DD-HH-mm, to work around the loader missing data when it loaded incomplete buckets. However, the Snowflake Loader has been running successfully for a couple of days since that change. Overall event volume is fairly low (under 5k events per day).
Advice appreciated!