Hi @Germanaz0,
You would typically get the “DirectoryNotEmptyError” under two circumstances:
- The pipeline job was kicked off while the previous run was still in progress
- The previous run failed, leaving the event files unarchived

The former scenario is a legitimate condition: we don’t want any pipeline runs clashing. All that is required is to wait for the currently running job to complete; then you can safely kick off another run from the top.

In the latter scenario, you would have to identify at which step the pipeline failed. There could be quite a few break points to consider. The best approach is to check the logs to identify the failure point. However, you could also determine it by checking whether the following buckets are empty.
Here are a few scenarios:

- Failure at the “staging” step, a problem spinning up the EMR cluster, or a failure while enriching the events:
  - `processing` is not empty
  - `enriched/good` is empty
  - `shredded/good` is empty
- Failure during the EMR job after the “enrichment” step (while copying files to S3) or at the “shredding” step:
  - `processing` is not empty
  - `enriched/good` is not empty
  - `shredded/good` is empty
- Failure during the EMR job after the “shredding” step (while copying files to S3) or while archiving the `raw` files:
  - `processing` is not empty
  - `enriched/good` is not empty
  - `shredded/good` is not empty
- Failure at the data load step:
  - `processing` is empty
  - `enriched/good` is not empty
  - `shredded/good` is not empty
- Failure at the archiving step post data load:
  - `processing` is empty
  - `enriched/good` is either empty or not
  - `shredded/good` is not empty
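The scenario table above can be encoded as a small shell helper. This is just a sketch: the `diagnose` function and its messages are hypothetical, purely to capture the table (in practice you would inspect each bucket, e.g. with `aws s3 ls`, and read off the matching scenario):

```shell
#!/bin/sh
# Map bucket emptiness to the likely failure point, per the scenarios above.
# Arguments: state ("empty" or "nonempty") of the processing,
# enriched/good and shredded/good buckets, in that order.
diagnose() {
  processing=$1; enriched=$2; shredded=$3
  case "$processing/$enriched/$shredded" in
    nonempty/empty/empty)       echo "failed at staging/enrichment" ;;
    nonempty/nonempty/empty)    echo "failed post-enrichment or at shredding" ;;
    nonempty/nonempty/nonempty) echo "failed post-shredding or archiving raw" ;;
    # Note: empty/nonempty/nonempty is ambiguous per the table above --
    # it can also mean the post-load archiving failed. Check the logs.
    empty/nonempty/nonempty)    echo "failed at data load" ;;
    empty/*/nonempty)           echo "failed archiving post data load" ;;
    *)                          echo "no failure signature recognised" ;;
  esac
}

diagnose nonempty empty empty   # -> failed at staging/enrichment
```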
To understand the recovery steps in each scenario, please refer to the Batch Pipeline Steps wiki page.
Answering the second question:

> if I erase the “Good” folder, may I lose all the tracked data since that error?

No, you won’t lose the events as long as you still have the corresponding events in either your `processing` bucket or the archived raw events bucket.
Snowplow was designed to be robust and reliable; safeguarding the events is the primary objective. Again, you can refer to the mentioned wiki page to see how that reliability is achieved. In short, the (event/log) files are moved to the `processing` bucket. Once the events have been enriched (dimension widening), the `processing` (raw) events get archived (as do the `enriched` and `shredded` events post data load).
Going back to your scenario, provided the previous run did fail, you need to determine whether the job failed at the EMR step or during the data load. I believe the `processing` bucket would be empty but `shredded/good` (as well as `enriched/good`) would not be.
Then, if the failure occurred, say, at the EMR step, you can simply delete `enriched/good` (and possibly `shredded/good`) and rerun the EmrEtlRunner with the `--skip staging` option. However, my guess is the EMR job did complete (including the “archive_raw” step; hence the error refers to the `enriched` bucket rather than `processing`). In this case, the failure took place either before the data got loaded or after (during the final archiving step). Therefore, you could rerun the StorageLoader either without skipping any step (if it failed before the data loaded) or with the `--skip download,load` option to complete the archiving.
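For reference, the reruns described above would look roughly like the following. The binary names, bucket paths and config file locations are placeholders for your own deployment; only the `--skip` values come from the discussion above.

```shell
# If the failure was at the EMR step: clear the half-written output,
# then rerun EmrEtlRunner skipping the staging step.
# (Bucket paths and config locations are placeholders.)
aws s3 rm s3://my-snowplow-data/enriched/good --recursive
aws s3 rm s3://my-snowplow-data/shredded/good --recursive
./snowplow-emr-etl-runner --config config/config.yml --skip staging

# If the failure was at (or after) the data load:
./snowplow-storage-loader --config config/config.yml                        # failed before load: rerun in full
./snowplow-storage-loader --config config/config.yml --skip download,load   # load succeeded: finish the archiving
```

Note the `aws s3 rm` calls are destructive; double-check the paths, and only delete `enriched/good`/`shredded/good` once you are sure the raw events are still in `processing` or the raw archive.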
Do check the actual failure reason. You might need to fix the underlying problem before rerunning.
Hopefully, this helps.