I am wondering how Snowplow EmrEtlRunner differentiates between raw data that has already been processed and data that is new. For example, I have data from 13 Sep - 15 Sep and I am running the EmrEtlRunner job daily. When I run it on 15 Sep, how does it know not to process the data from 13 Sep (since I already ran the job on 14 Sep)?
@aditya, the processed files are archived (moved away). You can examine the dataflow diagram for a better understanding of how it works: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps.
The confusion could arise if you do not configure your buckets correctly, which might result in archiving the raw data into the very same processing bucket: https://github.com/snowplow/snowplow/wiki/Common-configuration#s3.
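As a quick sanity check against that misconfiguration, something along these lines could work. This is only a sketch: it assumes the standard config.yml layout with aws.s3.buckets.raw holding in (a list), processing, and archive, and the file path is a placeholder:

```python
import yaml  # PyYAML

# Load the EmrEtlRunner configuration (path is a placeholder).
with open("config.yml") as f:
    config = yaml.safe_load(f)

# Assumed layout: aws.s3.buckets.raw has "in" (a list), "processing", "archive".
raw = config["aws"]["s3"]["buckets"]["raw"]
paths = list(raw["in"]) + [raw["processing"], raw["archive"]]

# If any two of these point at the same S3 location, archived raw files
# would land back in a bucket that the next run treats as input.
if len(set(paths)) != len(paths):
    raise ValueError("raw in/processing/archive buckets must all be distinct")
```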
So every time the EMR ETL process runs, it checks the archive folder? But what exactly does it check?
So if I look into my 3 archive folders (raw, enriched, shredded), I will find folders following a common pattern: run=YYYY-MM-DD-HH-MM-SS (listed in the sketch after the breakdown below). Those timestamps match the times at which I ran my EMR ETL process.
The differences between those 3 archive folders are:
- raw: each run=etl_timestamp folder contains per-collector-instance folders holding the archived raw tomcat log data.
- enriched: each run=etl_timestamp folder contains the tomcat log data in csv format.
- shredded: each run=etl_timestamp folder contains data that has been converted to JSON format and separated into the core events data and the supporting unstructured_event enrichment data such as YAUAA or performance timing.
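To make that layout concrete, here is a minimal boto3 sketch (the bucket name and prefix are placeholders) that lists those run= folders in the raw archive:

```python
import boto3

s3 = boto3.client("s3")

# List the top-level run=YYYY-MM-DD-HH-MM-SS "folders" under the raw archive.
# "my-snowplow-archive" and the "raw/" prefix are placeholders.
resp = s3.list_objects_v2(
    Bucket="my-snowplow-archive",
    Prefix="raw/",
    Delimiter="/",
)
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # e.g. raw/run=2019-09-15-03-00-00/
```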
What exactly does Snowplow look at in the archive to determine the cutoff up to which it should process data? I don’t think it can be the run= timestamp, since that is the timestamp of the ETL run, not the timestamp of the logs.
@aditya, it doesn’t check the archive folder - it doesn’t need to. Whatever is in the raw:in bucket is considered not yet processed, because whatever has been processed has been moved to a different bucket (as per the dataflow diagram I provided). That’s why it is important to configure your buckets correctly (as per the link I provided).
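In other words, the raw:in bucket works like a queue. A rough sketch of the idea (this is not EmrEtlRunner’s actual code, and the bucket names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

IN_BUCKET = "my-raw-in-bucket"           # placeholder
PROCESSING_BUCKET = "my-raw-processing"  # placeholder

# Everything currently in the "in" bucket is, by definition, unprocessed:
# a successful run moves files out, so nothing processed is left behind.
unprocessed = [
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket=IN_BUCKET).get("Contents", [])
]

# Move each file to the processing bucket (S3 has no rename: copy + delete).
for key in unprocessed:
    s3.copy_object(
        Bucket=PROCESSING_BUCKET,
        Key=key,
        CopySource={"Bucket": IN_BUCKET, "Key": key},
    )
    s3.delete_object(Bucket=IN_BUCKET, Key=key)
```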
I get it now. So Snowplow moves these log files, whose filenames follow the pattern _var_log_tomcat8_rotated_localhost_access_log.txt{epoch_second}.gz, from raw:in to raw:archive.
It is sometimes confusing to pick these files out among the heap of tomcat and localhost logs.
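For what it’s worth, a small sketch (the filename and its epoch value here are made up) of decoding the rotation timestamp embedded in those names, which makes them easier to pick out:

```python
import re
from datetime import datetime, timezone

# Example filename following the pattern above (epoch value is made up).
name = "_var_log_tomcat8_rotated_localhost_access_log.txt1568505600.gz"

m = re.match(r"_var_log_tomcat8_rotated_localhost_access_log\.txt(\d+)\.gz$", name)
if m:
    rotated_at = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    print(rotated_at)  # 2019-09-15 00:00:00+00:00
```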
Thanks for the clarification!
That is correct, @aditya. Here’s another post explaining this: How does EmrEtlRunner determine what the latest logs are in the raw "in" bucket?