Snowflake Transformer failing on long job

iain · July 2, 2018, 8:23am

I’m running the Snowflake transformer on a large backlog of data, so the job is running for 6+ hours. It’s just failed with the following message:

Failure Message

18/07/02 04:30:09 INFO Client: Application report for application_1530484047344_0001 (state: FINISHED)
18/07/02 04:30:09 INFO Client: 
	 client token: N/A
	 diagnostics: User class threw exception: shadeaws.services.dynamodbv2.model.AmazonDynamoDBException: The security token included in the request is expired (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ExpiredTokenException; Request ID: 68RCJVDDVAOET7N9VGO6GJPRMFVV4KQNSO5AEMVJF66Q9ASUAAJG)
	 ApplicationMaster host: 172.31.40.159
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1530484225027
	 final status: FAILED
	 tracking URL: http://ip-172-31-43-69.eu-west-1.compute.internal:20888/proxy/application_1530484047344_0001/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1530484047344_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/02 04:30:09 INFO ShutdownHookManager: Shutdown hook called
18/07/02 04:30:09 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-18b2a698-11d7-4a93-a965-0d5c38c68f3f
Command exiting with ret '1'

I’m assuming I can just re-run to carry on where it left off? Is there anything I can do to avoid this error in future?

Thanks!

Iain

anton · July 2, 2018, 9:21am

Hi @iain,

The actual useful message should be in the YARN logs, somewhere in the EMR logs:

[jobflow-id]/containers/application_1530484047344_0001/stderr.gz

I assume you’ll find there that your DynamoDB token has expired and the Transformer couldn’t write back to the table. The problem is that the Transformer acquires a token once at the beginning, and that token can no longer be used after several hours.
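For checking the container log, here is a minimal sketch that assumes you have already downloaded the stderr.gz file locally. The function name and the local path are made up for illustration; only the ExpiredTokenException string comes from the failure message above.

```python
import gzip


def find_error_lines(path, needle="ExpiredTokenException"):
    """Return (line_number, text) pairs for lines in a gzipped log that contain needle."""
    hits = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits


# Usage (hypothetical local path):
# for number, text in find_error_lines("stderr.gz"):
#     print(number, text)
```

If this returns nothing, the failure is probably something other than token expiry, and the rest of the log is worth reading.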

You cannot simply restart the pipeline, because in that case the Transformer (and Loader) will simply skip the folder that was not marked as “processed”. So you need to manually fix the manifest table (and probably S3).

If the Transformer did process multiple folders and just got stuck on DynamoDB after the Nth one, you can delete that folder from the Snowflake stageUrl on S3 (not from enriched.archive!) and the matching record from the manifest.

If the Transformer processed only a single folder and it took 6 hours, then it will likely fail again, so you will also have to bump up the EMR cluster.

It is also possible to mark the folder as “processed” manually to avoid processing it again (which can be appealing for a very big folder), but mutating the manifest is dangerous and you can easily end up with an inconsistent state, so we advise just deleting the DynamoDB record and the S3 folder and starting over.
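The delete-and-start-over cleanup could be sketched roughly as below, assuming boto3-style S3 and DynamoDB clients are passed in. Everything specific here is a placeholder and not something stated in this thread: the bucket, the prefix of the half-written folder under stageUrl, the manifest table name, the key attribute name and the run id all have to come from your own config and the table's actual key schema. It defaults to a dry run that only lists what would be removed.

```python
def cleanup_failed_run(s3, dynamodb, bucket, prefix, table, key_attribute, run_id,
                       dry_run=True):
    """List (and optionally delete) staged objects plus the matching manifest record.

    s3 and dynamodb are boto3-style clients; every other argument is a
    hypothetical value that must be taken from your own setup.
    """
    if not prefix:
        # An empty prefix would match the whole bucket, so refuse it.
        raise ValueError("prefix must point at the failed run's folder")

    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        page_keys = [obj["Key"] for obj in page.get("Contents", [])]
        keys.extend(page_keys)
        if page_keys and not dry_run:
            # A list page holds at most 1000 keys, which is also the
            # per-request limit of delete_objects.
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": key} for key in page_keys]},
            )

    if not dry_run:
        # Assumes the manifest key is a single string attribute.
        dynamodb.delete_item(TableName=table, Key={key_attribute: {"S": run_id}})

    return keys
```

Run it once with dry_run=True, check that the returned keys are all under the stageUrl folder and none are in enriched.archive, and only then repeat with dry_run=False.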

iain · July 2, 2018, 11:42am

Thanks Anton, it was a multiple-run job, so I have deleted the DynamoDB record and the staged data and run it again.
