After EmrEtlRunner completes its task (shredding, importing to Postgres), it does not delete the raw-in data. My question is, do I need to schedule a job to delete the raw data? Or, should I just leave the raw data alone?
My worry is that if the raw data remains there, the next time EmrEtlRunner runs it will process the old data again, causing duplicate records. Or maybe it is smart enough to skip the old data?
I'd really appreciate it if anyone could clear up my confusion.
The batch processing diagram here gives you an idea of the steps in the EMR process. As part of this (step 12), the raw data from the EMR cluster is copied back to S3 into an archive bucket.
Raw data isn’t deleted (in case you need to reprocess or re-enrich the data later), but you can delete it (or move it to Glacier or another S3 storage class) if required. EmrEtlRunner uses a ‘staging’ bucket/S3 path (step 1, where data is moved from raw-in to raw-processing) so that each file is only ever processed once. Once the data has been processed from raw-processing, that bucket is cleared out, ready for the next run (which will again move files from raw-in to raw-processing).
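To make the staging idea concrete, here is a minimal sketch of the pattern in Python. It is not EmrEtlRunner's actual code: it uses local directories as stand-ins for the S3 buckets, and the function name `run_batch` and its parameters are hypothetical. It only illustrates why a second run does not pick up the old files.

```python
import shutil
import tempfile
from pathlib import Path


def run_batch(raw_in: Path, raw_processing: Path, archive: Path) -> list[str]:
    """Illustrative staging pattern (hypothetical, not EmrEtlRunner itself).

    Local directories stand in for the raw-in, raw-processing and
    archive buckets. Returns the names of the files handled in this run.
    """
    for d in (raw_in, raw_processing, archive):
        d.mkdir(parents=True, exist_ok=True)

    # Step 1: move (not copy) everything from raw-in to the staging area,
    # so raw-in no longer contains these files after this point.
    for f in sorted(raw_in.iterdir()):
        shutil.move(str(f), str(raw_processing / f.name))

    processed = []
    for f in sorted(raw_processing.iterdir()):
        # Placeholder for the real work (enrich, shred, load).
        processed.append(f.name)
        # Final step: archive the raw file, which also empties staging.
        shutil.move(str(f), str(archive / f.name))
    return processed


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    raw_in = root / "in"
    raw_in.mkdir()
    (raw_in / "a.log").write_text("event 1\n")
    (raw_in / "b.log").write_text("event 2\n")

    print(run_batch(raw_in, root / "processing", root / "archive"))  # ['a.log', 'b.log']
    print(run_batch(raw_in, root / "processing", root / "archive"))  # []
```

Because the files are moved out of raw-in before processing, the second call finds nothing to do. The archived copies stay in the archive directory until you decide to delete them or move them elsewhere.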