How does EmrEtlRunner determine what the latest logs are in the raw "in" bucket?

ihor · September 7, 2016, 8:56pm

Before the raw events are processed, the corresponding log files are moved out of the in bucket and into the processing bucket.

NOTE: Users of the Clojure collector will accumulate lots of log files in their Clojure collector log bucket. That's because the logs used for Snowplow are just one of several different kinds of log file written there, e.g.:

“httpd_rotated_access_log”

“httpd_rotated_elasticbeanstalk-access_log”

“httpd_rotated_error_log”

“rotated_catalina”

Only files with a naming convention similar to “var_log_tomcat8_rotated_localhost_access_log” are moved out of the in bucket into the processing bucket. The rest are ignored and can be safely deleted.
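To make the naming rule above concrete, here is a minimal Python sketch of that kind of filter. The regular expression and the function name split_log_files are my own assumptions based on the example file name, not the actual filter EmrEtlRunner applies.

```python
import re

# Hypothetical pattern: an assumption derived from the example name
# "var_log_tomcat8_rotated_localhost_access_log", NOT EmrEtlRunner's real filter.
SNOWPLOW_LOG_PATTERN = re.compile(r"tomcat\d*_rotated_localhost_access_log")


def split_log_files(names):
    """Partition file names into (picked_up, ignored) lists."""
    picked_up, ignored = [], []
    for name in names:
        if SNOWPLOW_LOG_PATTERN.search(name):
            picked_up.append(name)
        else:
            ignored.append(name)
    return picked_up, ignored


example_names = [
    "httpd_rotated_access_log",
    "httpd_rotated_elasticbeanstalk-access_log",
    "httpd_rotated_error_log",
    "rotated_catalina",
    "var_log_tomcat8_rotated_localhost_access_log",
]
picked_up, ignored = split_log_files(example_names)
print("moved to processing:", picked_up)
print("left in the in bucket:", ignored)
```

With the example names from this post, only the tomcat8 access log ends up in picked_up; the other four stay behind.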

Once processed (enriched and shredded), the files are moved to the archive bucket.

So there is no ambiguity about which logs are the latest: anything still in the in bucket has not been processed yet. If the enrichment process fails at any point, the files/logs will not be archived, and that will block the subsequent job run, because EmrEtlRunner expects all three buckets (processing, enriched, and shredded) to be empty before proceeding.
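The "all three buckets must be empty" rule can be pictured as a simple pre-flight check. The sketch below works on plain dictionaries of bucket name to file list; the function names and the data layout are hypothetical and only illustrate the logic, they are not how EmrEtlRunner is implemented.

```python
def non_empty_buckets(bucket_listings):
    """Return the sorted names of buckets that still contain files."""
    return sorted(name for name, keys in bucket_listings.items() if keys)


def can_start_run(bucket_listings):
    """A new run may start only if every listed bucket is empty."""
    return not non_empty_buckets(bucket_listings)


# Illustrative listings (hypothetical data, not real bucket contents).
clean = {"processing": [], "enriched": [], "shredded": []}
after_failure = {
    "processing": ["var_log_tomcat8_rotated_localhost_access_log"],
    "enriched": [],
    "shredded": [],
}

print(can_start_run(clean))
print(non_empty_buckets(after_failure))
```

In the after_failure case the leftover file in processing is what blocks the next run until it is dealt with.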

Please take a look at the diagram to understand the workflow and the steps to take if a failure occurs.

Hopefully this explains it.

–Ihor
