Hi @Germanaz0,
You would typically get the “DirectoryNotEmptyError” under two circumstances:
- The pipeline job was kicked off while the previous run was still in progress
- The previous run failed, leaving the event files unarchived

The former scenario is a legitimate condition: we don’t want any pipeline runs clashing. All that is required is to wait for the currently running job to complete; then you can safely kick off another run from the top.

In the latter scenario, you would have to identify at which step the pipeline failed. There could be quite a few break points to consider. The best approach is to check the logs to identify the failure point. However, you could also determine it by checking whether the following buckets are empty.
Here are a few scenarios:

- Failure at the “staging” step, a problem spinning up the EMR cluster, or a failure while enriching the events:
  - `processing` is not empty
  - `enriched/good` is empty
  - `shredded/good` is empty
- Failure during the EMR job after the “enrichment” step (while copying files to S3) or at the “shredding” step:
  - `processing` is not empty
  - `enriched/good` is not empty
  - `shredded/good` is empty
- Failure during the EMR job after the “shredding” step (while copying files to S3) or while archiving the `raw` files:
  - `processing` is not empty
  - `enriched/good` is not empty
  - `shredded/good` is not empty
- Failure at the data load step:
  - `processing` is empty
  - `enriched/good` is not empty
  - `shredded/good` is not empty
- Failure at the archiving step post data load:
  - `processing` is empty
  - `enriched/good` is either empty or not
  - `shredded/good` is not empty
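The scenario table above can be encoded as a small shell helper. This is just a sketch: the `diagnose` function and its messages are hypothetical, purely to capture the table (in practice you would inspect each bucket, e.g. with `aws s3 ls`, and read off the matching scenario):

```shell
#!/bin/sh
# Map bucket emptiness to the likely failure point, per the scenarios above.
# Arguments: state ("empty" or "nonempty") of the processing,
# enriched/good and shredded/good buckets, in that order.
diagnose() {
  processing=$1; enriched=$2; shredded=$3
  case "$processing/$enriched/$shredded" in
    nonempty/empty/empty)       echo "failed at staging/enrichment" ;;
    nonempty/nonempty/empty)    echo "failed post-enrichment or at shredding" ;;
    nonempty/nonempty/nonempty) echo "failed post-shredding or archiving raw" ;;
    # Note: empty/nonempty/nonempty is ambiguous per the table above --
    # it can also mean the post-load archiving failed. Check the logs.
    empty/nonempty/nonempty)    echo "failed at data load" ;;
    empty/*/nonempty)           echo "failed archiving post data load" ;;
    *)                          echo "no failure signature recognised" ;;
  esac
}

diagnose nonempty empty empty   # -> failed at staging/enrichment
```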
To understand the recovery steps in each scenario, please refer to the Batch Pipeline Steps wiki page.
Answering the second question:

> if I erase the “Good” folder, may I lose all the tracked data since that error?

No, you won’t lose the events as long as you still have the corresponding events in either your `processing` bucket or the archived raw events bucket.
Snowplow was designed to be robust and reliable; safeguarding the events is the primary objective. Again, you can refer to the mentioned wiki page to see how that reliability is achieved. In short, the (event/log) files are moved to the `processing` bucket. Once the events have been enriched (dimension widening), the `processing` (raw) events get archived (as do the `enriched` and `shredded` events post data load).
Going back to your scenario, provided the previous run did fail, you need to determine whether the job failed at the EMR step or during the data load. I believe the `processing` bucket would be empty but `shredded/good` (as well as `enriched/good`) would not be.
Then, if the failure occurred, say, at the EMR step, you can simply delete `enriched/good` (and possibly `shredded/good`) and rerun the EmrEtlRunner with the `--skip staging` option. However, my guess is the EMR job did complete (including the “archive_raw” step; hence the error refers to the `enriched` bucket rather than `processing`). In this case, the failure took place either before the data got loaded or after (during the final archiving step). Therefore, you could rerun the StorageLoader either without skipping any step (if it failed before the data loaded) or with the `--skip download,load` option to complete the archiving.
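For reference, the reruns described above would look roughly like the following. The binary names, bucket paths and config file locations are placeholders for your own deployment; only the `--skip` values come from the discussion above.

```shell
# If the failure was at the EMR step: clear the half-written output,
# then rerun EmrEtlRunner skipping the staging step.
# (Bucket paths and config locations are placeholders.)
aws s3 rm s3://my-snowplow-data/enriched/good --recursive
aws s3 rm s3://my-snowplow-data/shredded/good --recursive
./snowplow-emr-etl-runner --config config/config.yml --skip staging

# If the failure was at (or after) the data load:
./snowplow-storage-loader --config config/config.yml                        # failed before load: rerun in full
./snowplow-storage-loader --config config/config.yml --skip download,load   # load succeeded: finish the archiving
```

Note the `aws s3 rm` calls are destructive; double-check the paths, and only delete `enriched/good`/`shredded/good` once you are sure the raw events are still in `processing` or the raw archive.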
Do check the actual failure reason. You might need to fix the underlying problem before rerunning.
Hopefully, this helps.