These _$folder$ files are harmless. We had plans to remove them, but pushed back this ticket.
enrich/good should not contain data after pipeline finished, folders got archived into archive/enrich by S3DistCp step that leaves these ghost _$folder$ files.
If data is present in archive/enrich then I don’t see reasons for you to worry.
export BASE_RM_PATH=example-bucket/example-path; for f in $(aws s3 ls --recursive s3://$BASE_RM_PATH/ | grep '_\$folder\$' | perl -nae 'print "$F[3]\n";'); do echo "aws s3 rm s3://$BASE_RM_PATH/$f"; done
once happy with the result, simply remove the “echo” and quotes to execute
Did you run it as is (replacing BASE_RM_PATH=example-bucket/example-path for the correct path)? If yes, did you get the output you expected ( a whole series of aws s3 rm statements with the expected files to be deleted)? @bhavin
@bhavin Please let us know if you tried my suggestion. If it worked or if it didn’t or if you decided to do something else or nothing at all - let us know as this will help others that are reading this thread.
hey @knservis… I had to put in a slight modification.
from perl -nae 'print "$F[3]\n"; to perl -nae 'print ( (split("/" , @F[3] ))[-1] , "\n") ;' since we are using the same $BASE_RM_PATH we only need the file name run=date... if we dont do that the script will repeat the prefix twice.
# modified version
export BASE_RM_PATH=<s3bucket>/<prefix>;
for f in $(aws s3 ls --recursive s3://$BASE_RM_PATH/ | grep '_\$folder\$' | perl -nae 'print ( (split("/" , @F[3] ))[-1] , "\n") ;');
do
echo "aws s3 rm s3://$BASE_RM_PATH/$f";
done
but what I really wanted to do is use the --recursive and --include & --exclude flag for rm and let aws cli do the work for me, which will be faster and I wouldn’t have to worry about intermediate errors and clean up or tracking, etc.
( finally i got it this time … )
That exclude include trick in the last example seems to work well and it will be faster than listing and then doing an rm for each. That’s very helpful @bhavin thanks.