How does Snowplow EmrEtlRunner determine the time boundary between jobs?

I am wondering how Snowplow EmrEtlRunner differentiates between raw data that has already been processed and data that is new. For example, I have data from 13 Sep to 15 Sep and I am running the EmrEtl job daily. When I run it on 15 Sep, how does it know not to process the data from 13 Sep (since I already ran the job on 14 Sep)?

@aditya, the processed files are archived (moved away). You can examine the dataflow diagram for a better understanding of how it works: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps.

Confusion can arise if you do not configure your buckets correctly, which might result in the raw data being archived into the very same processing bucket: https://github.com/snowplow/snowplow/wiki/Common-configuration#s3.
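For reference, the relevant s3.buckets section of the EmrEtlRunner config.yml looks roughly like this (bucket names below are placeholders, not from your setup); the key point is that raw:in, raw:processing, and the archive locations must all be distinct:

```yaml
aws:
  s3:
    buckets:
      raw:
        in:
          - s3://my-collector-logs               # collector writes new logs here
        processing: s3://my-snowplow/processing  # logs are staged here during a run
        archive: s3://my-snowplow/archive/raw    # processed logs end up here
      enriched:
        good: s3://my-snowplow/enriched/good
        bad: s3://my-snowplow/enriched/bad
        archive: s3://my-snowplow/archive/enriched
      shredded:
        good: s3://my-snowplow/shredded/good
        bad: s3://my-snowplow/shredded/bad
        archive: s3://my-snowplow/archive/shredded
```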

So every time the EMR ETL process runs, it checks the archive folder? But what exactly does it check?

If I look into my three archive folders (raw, enriched, shredded), each contains folders following a common pattern: run=YYYY-MM-DD-HH-MM-SS. Those timestamps correspond to when I ran my EMR ETL process.

The differences between those three archive folders are as follows (see the example layout after the list):

  • raw: each of the run=etl_timestamp folders contains per-collector-instance folders holding the archived raw Tomcat log data.
  • enriched: each of the run=etl_timestamp folders contains the enriched log data in CSV format.
  • shredded: each of the run=etl_timestamp folders contains data converted to JSON format, separated into the core event data and other supporting unstructured_event/enriched data such as YAUAA or performance timing.
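For illustration, the overall archive layout looks something like this (bucket name and run timestamp are made up):

```
s3://my-snowplow/archive/
├── raw/run=2019-09-15-03-00-00/<collector-instance>/...
├── enriched/run=2019-09-15-03-00-00/...
└── shredded/run=2019-09-15-03-00-00/...
```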

What exactly does Snowplow look at in the archive to determine the boundary up to which it should process data? I don’t think it can be the run= timestamp, since that is the timestamp of the ETL run, not of the logs themselves.

@aditya, it doesn’t check the archive folder - it doesn’t need to. Whatever is in the raw:in bucket is considered unprocessed, because whatever has been processed has already been moved to a different bucket (as per the dataflow diagram I linked). That’s why it is important to configure your buckets correctly (as per the link I provided).
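In other words, the state lives in the bucket layout itself. Here’s a minimal sketch of that idea in Python/boto3 (bucket names, the run= timestamp, and the flattened key handling are all hypothetical simplifications; EmrEtlRunner itself is a Ruby application):

```python
# Sketch only: illustrates the move-based bookkeeping, not EmrEtlRunner's code.
import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix=""):
    """Everything still sitting under this location is, by definition,
    not yet processed -- no timestamps or bookkeeping needed."""
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def move(src_bucket, keys, dst_bucket, dst_prefix):
    """Move (copy + delete) objects out of their current location, so a
    later run will never see them there again. Flattens key paths for
    simplicity, which the real pipeline does not do."""
    for key in keys:
        s3.copy_object(
            Bucket=dst_bucket,
            Key=f"{dst_prefix}/{key.rsplit('/', 1)[-1]}",
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        s3.delete_object(Bucket=src_bucket, Key=key)

# 1. raw:in -> processing: stage whatever the collector has written since last run
new_logs = list_keys("my-collector-logs")
move("my-collector-logs", new_logs, "my-snowplow", "processing")

# 2. ... the EMR job enriches/shreds everything under s3://my-snowplow/processing/ ...

# 3. processing -> archive, under a run=<ETL start time> folder
staged = list_keys("my-snowplow", prefix="processing/")
move("my-snowplow", staged, "my-snowplow", "archive/raw/run=2019-09-15-03-00-00")
```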

I get it now. So Snowplow moves the log files matching this filename pattern (_var_log_tomcat8_rotated_localhost_access_log.txt{epoch_second}.gz) from raw:in to raw:archive.

It can sometimes be confusing to pick these files out from the heap of Tomcat and localhost logs.

Thanks for the clarification!

That is correct, @aditya. Here’s another post explaining this: How does EmrEtlRunner determine what the latest logs are in the raw "in" bucket?
