Hey Snowplowers, I expect that once in a while our EmrEtlRunner/StorageLoader applications will fail (for instance, this happened a few weeks ago when our Elasticsearch cluster was down), and I want to make sure I am notified automatically when that happens.
Is there a way to get an email alert when the EMR job fails, using CloudWatch (or something else)?
If you’re doing this without modifying the EMR job itself, there are a few possible options:
Write a script that runs regularly, checks your data sinks (e.g. Elasticsearch) for the maximum event date, and sends an email (either directly or via SNS) if that date is older than a certain threshold (there is a rough sketch of this after the next option).
Run a separate script (using cron or an alternative) to find any EMR clusters that have failed recently; a sketch of this is also included below. To list clusters that have been marked as failed in the last 6 hours you could use the aws-cli to do something similar to: aws emr list-clusters --created-after $(date --date "6 hours ago" +%Y-%m-%dT%H:%M:%S) --failed
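For the first option, here is a minimal sketch of what that freshness check could look like, assuming curl and jq are installed, an Elasticsearch index with a collector_tstamp field, and an SNS topic with an email subscription. The endpoint, index name and topic ARN below are placeholders, not values from your setup:

```bash
#!/bin/bash
# Hypothetical freshness check: alert if the newest event in Elasticsearch
# is older than a threshold. Endpoint, index, field and topic ARN are
# placeholders -- adjust to your own deployment.
set -euo pipefail

ES_URL="http://localhost:9200/snowplow/_search"                 # placeholder
TOPIC_ARN="arn:aws:sns:eu-west-1:123456789012:snowplow-alerts"  # placeholder
MAX_LAG_SECONDS=$(( 6 * 3600 ))

# Ask Elasticsearch only for the most recent collector_tstamp (epoch millis)
LATEST_SECONDS=$(curl -s "$ES_URL" \
  -H 'Content-Type: application/json' \
  -d '{"size":0,"aggs":{"latest":{"max":{"field":"collector_tstamp"}}}}' \
  | jq '.aggregations.latest.value / 1000 | floor')

LAG=$(( $(date +%s) - LATEST_SECONDS ))

if [ "$LAG" -gt "$MAX_LAG_SECONDS" ]; then
  aws sns publish --topic-arn "$TOPIC_ARN" \
    --subject "Snowplow pipeline looks stalled" \
    --message "Newest event in Elasticsearch is ${LAG} seconds old"
fi
```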
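And for the second option, a small script you could run from cron around that same aws-cli call, publishing an alert whenever any failed clusters come back (again, the topic ARN is a placeholder):

```bash
#!/bin/bash
# Hypothetical cron job: alert on EMR clusters that failed in the last 6 hours.
set -euo pipefail

TOPIC_ARN="arn:aws:sns:eu-west-1:123456789012:snowplow-alerts"  # placeholder

# List the Id and Name of any clusters marked as failed in the last 6 hours
FAILED=$(aws emr list-clusters \
  --created-after "$(date --date '6 hours ago' +%Y-%m-%dT%H:%M:%S)" \
  --failed \
  --query 'Clusters[].[Id,Name]' \
  --output text)

if [ -n "$FAILED" ]; then
  aws sns publish --topic-arn "$TOPIC_ARN" \
    --subject "EMR clusters failed in the last 6 hours" \
    --message "$FAILED"
fi
```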
If you want to catch failures in EmrEtlRunner/StorageLoader when they occur, the safest approach is to wrap the execution of both apps in a monitoring script (there is a rough sketch of such a wrapper at the end of this post).
For example, if you are running it in cron, then cronic is a pretty good monitoring wrapper. At the other end of the scale, if you are using something enterprise-y like Chronos on Mesos, that will have failure notification built-in.
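On the cron + cronic route, the crontab entry can be as simple as the following (the script path and email address are hypothetical). cronic stays silent unless the wrapped command fails (non-zero exit or error output), so cron's usual mail-on-output behaviour effectively becomes mail-on-failure:

```
MAILTO=alerts@example.com

0 * * * * cronic /opt/snowplow/run-snowplow.sh
```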
Because a lot of the EmrEtlRunner/StorageLoader workflow doesn’t (currently) take place in EMR, it’s really important to capture the full stdout/stderr from a failed run, so you know precisely where to restart from. Without that output, you often have to do some detective work to figure out where to resume from (“I can see data in Redshift but some data is still in shredded/good, so presumably the archive of shredded events failed partway through?”).
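To tie that together, here is a minimal sketch of such a monitoring wrapper, assuming the aws-cli is installed and an SNS topic exists; the log path, topic ARN and the EmrEtlRunner/StorageLoader arguments are placeholders to adapt to your deployment. It keeps the complete stdout/stderr of each step in a log file and publishes an alert if either step exits non-zero:

```bash
#!/bin/bash
# Minimal monitoring wrapper (sketch). Paths, config files and the SNS
# topic ARN are placeholders -- adjust them to your own deployment.
set -o pipefail

LOG="/var/log/snowplow/run-$(date +%Y%m%d-%H%M%S).log"
TOPIC_ARN="arn:aws:sns:eu-west-1:123456789012:snowplow-alerts"  # placeholder
mkdir -p "$(dirname "$LOG")"

run_step() {
  local name="$1"; shift
  echo "=== ${name} starting at $(date -u) ===" | tee -a "$LOG"
  # Keep the complete stdout/stderr so a failed run can be diagnosed
  # and resumed from the right step later.
  if ! "$@" 2>&1 | tee -a "$LOG"; then
    aws sns publish --topic-arn "$TOPIC_ARN" \
      --subject "Snowplow ${name} failed" \
      --message "See ${LOG} on $(hostname) for the full output"
    exit 1
  fi
}

run_step "EmrEtlRunner"  ./snowplow-emr-etl-runner --config config.yml --resolver resolver.json
run_step "StorageLoader" ./snowplow-storage-loader --config config.yml
```

A script like this is also the natural thing to point cronic (or Chronos) at in the options above.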