Snowplow not staging any logs and is not running the EMR jobs

masterlittle · July 6, 2017, 9:23am

I am trying to run Snowplow's emr-etl-runner with the CloudFront collector. I can see the raw logs in the relevant S3 bucket, but when I try to run the job with:
./snowplow-emr-etl-runner --config config/c.yml --resolver config/iglu_resolver.json --debug

I just get the following output:

[2017-07-06T09:21:34.135000 #2007] DEBUG -- : Staging raw logs...

And nothing else. When I skip the staging process, the EMR job does not start, and I just get this output:

D, [2017-07-06T09:22:44.957000 #2325] DEBUG -- : Initializing EMR jobflow

I managed to run the process the first time without any hiccups, but now it refuses to run. Please help.

mjensen · July 6, 2017, 4:56pm

Are there any logs in the shredded good folder, or in the enriched good folder?

masterlittle · July 6, 2017, 8:09pm

The enriched good folder has only the files from the previous run. There are no files from this month.

mjensen · July 6, 2017, 9:01pm

My last resort, if it gets really stuck, was to do the following. I'm assuming at this point you only have files in the “processing” and “enriched” folders.

Delete all files from enriched good and shredded good.
Then do a full run again with staging skipped, so it only processes the logs already in the processing folder.

Whether rerunning from scratch is worth it depends on how big your log files are.

You can also try running just the shred step at this point, since your enriched files are intact, as long as you know the enriched files are 100% complete.

egor · July 7, 2017, 3:23am

Hi @masterlittle.

This DAG illustrates the different steps of the Snowplow batch pipeline and the recovery process for each step.

If the pipeline fails during the EMR job, you should determine at which step it failed and take the appropriate action.

As @mjensen has already said: if you want to run the pipeline from the top (don’t use --skip staging), all the buckets (processing, enriched good and shredded good) should be empty. If you want to skip staging (the files are already present in processing), the enriched good and shredded good buckets should be empty for the run.
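
The rule above can be written down as a small pre-flight check. This is only a sketch, not part of EmrEtlRunner: the function name, the location names ("processing", "enriched_good", "shredded_good") and the idea of passing in object counts are all assumptions made for illustration. How you obtain the counts (for example by listing each S3 location yourself) is left out.

```python
# Hypothetical helper, not part of EmrEtlRunner: report which locations
# still hold files but must be empty before a run, per the rule above.

def locations_to_empty(object_counts, skip_staging):
    """Return the names of locations that must be emptied before the run.

    object_counts: dict mapping an (assumed) location name to the number
    of objects currently found there. Missing names count as empty.
    skip_staging: True when staging is skipped, in which case files are
    expected to be sitting in "processing" already.
    """
    must_be_empty = ["enriched_good", "shredded_good"]
    if not skip_staging:
        # A run from the top stages raw logs into processing,
        # so processing has to start out empty as well.
        must_be_empty.insert(0, "processing")
    return [name for name in must_be_empty if object_counts.get(name, 0) > 0]


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    counts = {"processing": 12, "enriched_good": 3, "shredded_good": 0}
    print(locations_to_empty(counts, skip_staging=False))
    print(locations_to_empty(counts, skip_staging=True))
```

An empty list means the run can go ahead; anything else names the locations to clear (or archive) first.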

Hope this helps,
Egor

masterlittle · July 8, 2017, 2:47pm

Hi. Thanks for the help. I emptied the buckets and it works fine with one caveat.

The job runs fine from top to bottom, except for the last stage, when the files need to be moved to the enriched and shredded archive buckets. For some reason this step never takes place, so I have to manually copy the files to the archive before I can run the next job. What could be the issue?

I am not using any additional enrichment or shredder storage. I just want my parsed files in an S3 bucket.
