Enrich/good folder contains empty 'run=[date]_$folder$' files

I have just set up the Snowplow enrich process using R97 Knossos.

The process completes successfully, but the enrich/good folder contains no folders, just empty files with the name pattern

run=[timestamp]_$folder$

In enrich/bad I can see matching folders, with two files (_SUCCESS and part-....-.txt).

In archive/enrich I can see a matching folder with the same file pattern as above, but also an empty file with the run name and _$folder$ appended.

I’ve noticed there has been a similar issue which should now be fixed here: https://github.com/snowplow/snowplow/issues/3139

Hello @hanskohls,

These _$folder$ files are harmless. We had plans to remove them, but the ticket got pushed back.

enrich/good should not contain data after the pipeline has finished; the folders get archived into archive/enrich by the S3DistCp step, which leaves these ghost _$folder$ files behind.

If data is present in archive/enrich then I don’t see any reason for you to worry.
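
For illustration, a listing of enrich/good after a run might look something like this (the bucket name and timestamp here are made up):

$ aws s3 ls s3://example-bucket/enrich/good/
2018-01-15 10:32:01          0 run=2018-01-15-10-30-00_$folder$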

For more information, check out the documentation from AWS:

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/

Is there a way to use the AWS CLI to delete all the _$folder$ files?

What I have done before is:

export BASE_RM_PATH=example-bucket/example-path; for f in $(aws s3 ls --recursive s3://$BASE_RM_PATH/ | grep '_\$folder\$' | perl -nae 'print "$F[3]\n";'); do echo "aws s3 rm s3://$BASE_RM_PATH/$f"; done

Once you’re happy with the result, simply remove the echo and its quotes to execute the commands.
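
In case it helps to read the one-liner: perl’s -a autosplits each ls line on whitespace into @F, and $F[3] is the fourth field, i.e. the object key. For example, on a made-up ls line:

echo "2018-01-15 10:32:01          0 example-path/run=2018-01-15-10-30-00_\$folder\$" \
| perl -nae 'print "$F[3]\n"'
# prints: example-path/run=2018-01-15-10-30-00_$folder$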

Thanks @knservis. I tried using --include with rm but didn’t quite get it working…

Did you run it as-is (replacing BASE_RM_PATH=example-bucket/example-path with the correct path)? If yes, did you get the output you expected (a whole series of aws s3 rm statements with the expected files to be deleted)? @bhavin

Ah… no, I meant I tried running aws s3 rm --recursive --include ".path." to filter and remove only one file from all the folders :slight_smile:

@bhavin Please let us know if you tried my suggestion. Whether it worked, it didn’t, or you decided to do something else or nothing at all - let us know, as this will help others reading this thread.

Hey @knservis… I had to put in a slight modification:
from perl -nae 'print "$F[3]\n";' to perl -nae 'print((split("/", $F[3]))[-1], "\n");'. Since we are using the same $BASE_RM_PATH, we only need the file name (run=date...); aws s3 ls --recursive prints the full key, so without this the script would repeat the prefix twice.

# modified version

export BASE_RM_PATH=<s3bucket>/<prefix>;

for f in $(aws s3 ls --recursive s3://$BASE_RM_PATH/ | grep '_\$folder\$' | perl -nae 'print((split("/", $F[3]))[-1], "\n");');
do
  echo "aws s3 rm s3://$BASE_RM_PATH/$f";
done
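
For illustration, here is the modified filter in isolation on a made-up ls line, showing that only the last path component survives:

echo "2018-01-15 10:32:01          0 example-path/run=2018-01-15-10-30-00_\$folder\$" \
| perl -nae 'print((split("/", $F[3]))[-1], "\n")'
# prints: run=2018-01-15-10-30-00_$folder$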

I ended up using this one-liner:

echo "enter s3path:"; \
read s3path; \
aws s3 ls --recursive s3://$s3path/ \
| awk -F '/' '/_\$folder\$/  { print $3 }' \
| xargs -I {} echo aws s3 rm s3://$s3path/{}
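
One caveat with my own one-liner: awk’s $3 is the third slash-separated field of the whole ls line, so it assumes the ghost files all sit at the same depth. Using $NF instead grabs the last path component regardless of depth, matching the Perl [-1] trick above:

aws s3 ls --recursive s3://$s3path/ \
| awk -F '/' '/_\$folder\$/ { print $NF }' \
| xargs -I {} echo aws s3 rm s3://$s3path/{}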

But what I really wanted to do was use the --recursive, --exclude, and --include flags on rm and let the AWS CLI do the work for me, which is faster and means I don’t have to worry about intermediate errors, clean-up, tracking, etc.
(I finally got it this time…)

read s3path; \
aws s3 rm --dryrun s3://$s3path/ \
--recursive \
--exclude '*' \
--include '*_$folder$'
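
Once the --dryrun output lists exactly the files you expect, drop the flag to delete for real:

aws s3 rm s3://$s3path/ \
--recursive \
--exclude '*' \
--include '*_$folder$'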

Let me know what you think! And thanks for the pointer above…

That exclude/include trick in the last example seems to work well, and it will be faster than listing and then doing an rm for each file. That’s very helpful, @bhavin, thanks.

@bhavin works awesome, thanks!
Once we upgrade to the latest version I won’t need this, but until we do, this is very helpful.