This is a strange one, I think.
I manually ran the ETL script, snowplow-emr-etl-runner, to process the logs and prepare them for loading into Redshift. Since I was using it to process a large backlog and didn't know how long it would take, I decided to run it manually.
After about 8-10 hours the process completed successfully, and I could see the data in etl/processing.
I then manually ran the StorageLoader, and it successfully processed and stored this data.
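For reference, the two manual runs were roughly along these lines (the config and resolver paths are just placeholders from my setup):

```
# EmrEtlRunner: stage the raw logs and run the EMR job (enrich + shred)
./snowplow-emr-etl-runner --config config/config.yml --resolver config/iglu_resolver.json

# StorageLoader: load the prepared data into Redshift
./snowplow-storage-loader --config config/config.yml
```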
But then, as I prepared to start another ETL batch, I noticed that the etl > processing bucket still contained all the files from the previous run (about 10,000 of them). I thought these should have been moved into the archive bucket, and since I didn't see any errors, I don't understand why they are still there.
I checked the data from some of these files against the data stored in Redshift and verified that it matched the last processed data.
What could be the issue? I could manually move these files over to the archive/raw bucket, but it would be great to get some insight into why this may have happened.
I wonder what version of EmrEtlRunner you were running. If I'm not mistaken, some RC (release candidate) versions had this issue. You might need to ensure you are using an official release rather than an RC. The apps can be obtained from http://dl.bintray.com/snowplow/snowplow-generic/.
@kjain, the only thing that comes to my mind is that EmrEtlRunner failed to archive the raw events for some reason and this went unnoticed. Archiving is the last step and is performed by the Sluice application outside of the EMR cluster (in the r83 release you are using). A failure there doesn't stop StorageLoader from running, but it does prevent you from starting EmrEtlRunner again.
I believe it's a one-off issue and can be disregarded. If the files from that run are still in the "processing" bucket, you could run EmrEtlRunner with the --skip staging,emr option to see whether the files get archived. Do make sure you do not clash with any subsequent run you might have already started.
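For reference, that re-run would look something like this (the config and resolver paths are placeholders; adjust them to however you normally invoke it):

```
# Skip staging and the EMR job so that only the remaining archive step runs
./snowplow-emr-etl-runner --config config/config.yml --resolver config/iglu_resolver.json --skip staging,emr
```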
I guess I'll just manually move these files over to the archive/raw bucket (since that seems to be the step that was skipped) and run another ETL batch to check whether it happens again.
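For reference, the manual move would be something along these lines with the AWS CLI (bucket names and the run folder timestamp are just placeholders for my setup):

```
# Move the leftover raw files from processing into the raw archive,
# under a run= folder matching the original run
aws s3 mv s3://my-etl-bucket/processing/ s3://my-archive-bucket/raw/run=2017-01-15-10-30-00/ --recursive
```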