This is a strange one, I think.
I manually ran the ETL script, snowplow-emr-etl-runner, to process the logs and prepare them for loading into Redshift. Since I was using it to process a large backlog and didn't know how long it would take, I decided to run it manually.
After about 8-10 hours the process completed successfully, and I could see the data in etl/processing.
I then manually ran the StorageLoader, and it successfully processed and stored this data.
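For reference, the two manual runs were roughly along these lines (the config and resolver paths are just placeholders from my setup):

```
# EmrEtlRunner: stage the raw logs and run the EMR job (enrich + shred)
./snowplow-emr-etl-runner --config config/config.yml --resolver config/iglu_resolver.json

# StorageLoader: load the prepared data into Redshift
./snowplow-storage-loader --config config/config.yml
```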
But then, as I prepared to start another ETL batch, I noticed that the etl > processing bucket still contained all the files from the previous run (about 10,000 of them). I thought these should have been moved into the archive bucket, and since I didn't see any errors, I don't understand why they are still there.
I checked the data from some of these files against the data stored in Redshift and verified that it matched the last processed data.
What could be the issue? I could manually move these files over to the archive/raw bucket, but it would be great to get some insight into why this may have happened.
I wonder what version of EmrEtlRunner you were running. If I'm not mistaken, some RC (release candidate) versions had this issue. You might need to ensure you are using an official release rather than an RC. The apps can be obtained from http://dl.bintray.com/snowplow/snowplow-generic/.
@kjain, the only thing that comes to my mind is that EmrEtlRunner failed to archive the raw events for some reason and this went unnoticed. Archiving is the last step and is performed by the Sluice application outside of the EMR cluster (in the r83 release you are using). A failure there doesn't stop StorageLoader from running, but it does prevent you from starting EmrEtlRunner again.
I believe it's a one-off issue and can be disregarded. If the files from that run are still in the "processing" bucket, you could run EmrEtlRunner with the --skip staging,emr option to see whether the files get archived. Do make sure you do not clash with any subsequent run you might have already started.
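For reference, that re-run would look something like this (the config and resolver paths are placeholders; adjust them to however you normally invoke it):

```
# Skip staging and the EMR job so that only the remaining archive step runs
./snowplow-emr-etl-runner --config config/config.yml --resolver config/iglu_resolver.json --skip staging,emr
```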
I guess I'll just manually move these files over to the archive/raw bucket (since that seems to be the step that was skipped) and run another ETL batch to check whether it happens again.
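For reference, the manual move would be something along these lines with the AWS CLI (bucket names and the run folder timestamp are just placeholders for my setup):

```
# Move the leftover raw files from processing into the raw archive,
# under a run= folder matching the original run
aws s3 mv s3://my-etl-bucket/processing/ s3://my-archive-bucket/raw/run=2017-01-15-10-30-00/ --recursive
```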