I am wondering how Snowplow EmrEtlRunner differentiates between raw data that has already been processed and data that is new. For example, I have data from 13 Sep - 15 Sep and I am running the EmrEtlRunner job daily. When I run it on 15 Sep, how does it know not to process the data from 13 Sep (since I already ran the job on 14 Sep)?
@aditya, the processed files are archived (moved away). You can examine the dataflow diagram for a better understanding of how it works: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps.
The confusion could arise if you do not configure your buckets correctly, which might result in archiving the raw data into the very same processing bucket: https://github.com/snowplow/snowplow/wiki/Common-configuration#s3.
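As a quick sanity check against that misconfiguration, something along these lines could work. This is only a sketch: it assumes the standard config.yml layout with aws.s3.buckets.raw holding in (a list), processing, and archive, and the file path is a placeholder:

```python
import yaml  # PyYAML

# Load the EmrEtlRunner configuration (path is a placeholder).
with open("config.yml") as f:
    config = yaml.safe_load(f)

# Assumed layout: aws.s3.buckets.raw has "in" (a list), "processing", "archive".
raw = config["aws"]["s3"]["buckets"]["raw"]
paths = list(raw["in"]) + [raw["processing"], raw["archive"]]

# If any two of these point at the same S3 location, archived raw files
# would land back in a bucket that the next run treats as input.
if len(set(paths)) != len(paths):
    raise ValueError("raw in/processing/archive buckets must all be distinct")
```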
So every time the EMR ETL process runs, it checks the archive folder? But what exactly does it check?
So if I look into my 3 archive folders (raw, enriched, shredded), I will find folders following a common pattern: run=YYYY-MM-DD-HH-MM-SS (listed in the sketch after the breakdown below). Those timestamps match the times at which I ran my EMR ETL process.
The differences between those 3 archive folders are:
- raw: each run=etl_timestamp folder contains per-collector-instance folders holding the archived raw tomcat log data.
- enriched: each run=etl_timestamp folder contains the tomcat log data in csv format.
- shredded: each run=etl_timestamp folder contains data that has been converted to JSON format and separated into the core events data and the supporting unstructured_event enrichment data such as YAUAA or performance timing.
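To make that layout concrete, here is a minimal boto3 sketch (the bucket name and prefix are placeholders) that lists those run= folders in the raw archive:

```python
import boto3

s3 = boto3.client("s3")

# List the top-level run=YYYY-MM-DD-HH-MM-SS "folders" under the raw archive.
# "my-snowplow-archive" and the "raw/" prefix are placeholders.
resp = s3.list_objects_v2(
    Bucket="my-snowplow-archive",
    Prefix="raw/",
    Delimiter="/",
)
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # e.g. raw/run=2019-09-15-03-00-00/
```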
What exactly does Snowplow look at in the archive to determine the cutoff up to which it should process data? I don’t think it can be the run= timestamp, since that is the timestamp of the ETL run, not the timestamp of the logs.
@aditya, it doesn’t check the archive folder - it doesn’t need to. Whatever is in the raw:in bucket is considered not yet processed, because whatever has been processed has been moved to a different bucket (as per the dataflow diagram I provided). That’s why it is important to configure your buckets correctly (as per the link I provided).
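In other words, the raw:in bucket works like a queue. A rough sketch of the idea (this is not EmrEtlRunner’s actual code, and the bucket names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

IN_BUCKET = "my-raw-in-bucket"           # placeholder
PROCESSING_BUCKET = "my-raw-processing"  # placeholder

# Everything currently in the "in" bucket is, by definition, unprocessed:
# a successful run moves files out, so nothing processed is left behind.
unprocessed = [
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket=IN_BUCKET).get("Contents", [])
]

# Move each file to the processing bucket (S3 has no rename: copy + delete).
for key in unprocessed:
    s3.copy_object(
        Bucket=PROCESSING_BUCKET,
        Key=key,
        CopySource={"Bucket": IN_BUCKET, "Key": key},
    )
    s3.delete_object(Bucket=IN_BUCKET, Key=key)
```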
I get it now. So Snowplow moves these log files, whose filenames follow the pattern _var_log_tomcat8_rotated_localhost_access_log.txt{epoch_second}.gz, from raw:in to raw:archive.
It is sometimes confusing to pick these files out among the heap of tomcat and localhost logs.
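For what it’s worth, a small sketch (the filename and its epoch value here are made up) of decoding the rotation timestamp embedded in those names, which makes them easier to pick out:

```python
import re
from datetime import datetime, timezone

# Example filename following the pattern above (epoch value is made up).
name = "_var_log_tomcat8_rotated_localhost_access_log.txt1568505600.gz"

m = re.match(r"_var_log_tomcat8_rotated_localhost_access_log\.txt(\d+)\.gz$", name)
if m:
    rotated_at = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    print(rotated_at)  # 2019-09-15 00:00:00+00:00
```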
Thanks for the clarification!
That is correct, @aditya. Here’s another post explaining this: How does EmrEtlRunner determine what the latest logs are in the raw "in" bucket?