There is an upgrade guide in the wiki. However, in your case it might be easier to start from scratch from the last version.
Also, I need help moving the past few days' data from archive to the raw_logs folder.
If the file name format were the same, I could move it myself. But the files in archive and raw_logs have different file name formats, so could you help with this?
The current location of the logs (after the build failed for the past few days): s3://my-bucket/archive/2017-12-29/
Format of the file names in archive: ..raw_logs.gz
Format required in raw_logs: .gz
I wanted to know if just moving the logs from archive to raw_logs will do, or whether anything else is required.
No, your best bet is to move the archived files to the processing location and then run the pipeline skipping staging. All of the filename changes happen in the staging phase, and that renaming has already been done for these files, so staging must be skipped.
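For illustration, a minimal sketch of that procedure with the AWS CLI, assuming the bucket layout above, a processing location of s3://my-bucket/processing/, and a typical EmrEtlRunner invocation (the processing path, config file name, and exact runner flags are assumptions; check them against your own setup and EmrEtlRunner version):

```
# Move one day's archived files back to the processing location
# (source path from above; destination path is an assumption)
aws s3 mv s3://my-bucket/archive/2017-12-29/ s3://my-bucket/processing/ --recursive

# Re-run the pipeline, skipping staging so the already-renamed
# files are not staged (and renamed) a second time
./snowplow-emr-etl-runner --config config.yml --skip staging
```

If several days failed, the same move can be repeated for each dated archive folder before a single run.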
Thanks for the valuable suggestion. By upgrading Snowplow, we could get the latest data into Redshift.
But our job failed for the last 15 days, and all of that data is in the archive folder but not available in Redshift.
It would be great if you could suggest how to get the last 15 days' data (from when the job failed) into Redshift.
We tried --skip staging, but to no effect. Any help here would be appreciated.