According to the docs for rolling mode:
EmrEtlRunner processes whatever raw Snowplow event logs it finds in the In Bucket.
However, if you do not archive the raw logs, how does the EmrEtlRunner determine which logs are the “latest” that it has not yet processed?
We’re using the Clojure collector on ElasticBeanstalk and there are lots of older files in the raw “in” bucket.
Prior to processing the raw events, the corresponding log files are moved to the processing bucket (out of the in bucket).
NOTE: Users of the Clojure collector will accumulate lots of log files in their Clojure collector log bucket. That's because the logs used for Snowplow are just one of a number of different log files written there.
Only files with a naming convention similar to “var_log_tomcat8_rotated_localhost_access_log” are moved out of the in bucket into processing. The rest are ignored and can be safely deleted.
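To make the filtering concrete, here is a minimal Python sketch of the idea. It is not EmrEtlRunner's actual code: the regular expression is an assumption modelled on the single example name quoted above, and the function name and sample keys are hypothetical.

```python
import re

# Assumed pattern, modelled on the one example file name quoted above.
# The real matching rule used by EmrEtlRunner may differ.
SNOWPLOW_LOG = re.compile(r"var_log_tomcat\d+_rotated_localhost_access_log")


def split_collector_logs(keys):
    """Split object keys into (to_process, ignored) by file name."""
    to_process = [k for k in keys if SNOWPLOW_LOG.search(k)]
    ignored = [k for k in keys if not SNOWPLOW_LOG.search(k)]
    return to_process, ignored


# Hypothetical keys, for illustration only.
example_keys = [
    "i-0abc/_var_log_tomcat8_rotated_localhost_access_log.txt1500000000.gz",
    "i-0abc/_var_log_tomcat8_rotated_catalina.out1500000000.gz",
    "i-0abc/_var_log_httpd_rotated_error_log1500000000.gz",
]
to_process, ignored = split_collector_logs(example_keys)
```

In this sketch only the first key would be picked up for processing; the other two stay behind in the in bucket, which is why older files pile up there.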
Once processed (enriched and shredded), the files are moved to the archive bucket.
So there is no room for confusion here. If the enrichment process fails at any point, the logs are not archived, which blocks the subsequent job run: EmrEtlRunner expects all three buckets (processing, enriched, shredded) to be empty before proceeding.
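The flow described above can be modelled with a small in-memory Python sketch. This is only an illustration under stated assumptions, not how EmrEtlRunner is implemented: buckets are plain sets of key names, the enriched and shredded outputs are left out of the model, and `run_once`, `StagingNotEmpty`, and `is_snowplow_log` are hypothetical names.

```python
class StagingNotEmpty(Exception):
    """Raised when a previous run left files behind."""


# Buckets that must be empty before a new run may start.
STAGING = ("processing", "enriched", "shredded")


def run_once(buckets, is_snowplow_log, enrich_ok=True):
    """Simulate one run over a dict of bucket name -> set of keys."""
    leftover = [name for name in STAGING if buckets[name]]
    if leftover:
        raise StagingNotEmpty("non-empty buckets: " + ", ".join(leftover))

    # Staging: move only the Snowplow logs out of "in" into "processing".
    batch = {k for k in buckets["in"] if is_snowplow_log(k)}
    buckets["in"] -= batch
    buckets["processing"] |= batch

    if not enrich_ok:
        # Failure: nothing is archived, so "processing" stays populated
        # and the next call raises StagingNotEmpty.
        return False

    # Success: the raw logs are archived and staging is empty again.
    buckets["archive"] |= buckets["processing"]
    buckets["processing"].clear()
    return True
```

The point of the sketch is that "latest" is never decided by timestamps: whatever matching files sit in the in bucket form the next batch, and a failed run is detected by the leftover files rather than by any bookkeeping.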
Please take a look at the diagram to understand the workflow and the steps to take if a failure occurs.
Hopefully this explains it.
Thank you, @ihor !
That makes sense and clears up the confusion of why there are still so many log files left in the ElasticBeanstalk published logs.