Not all events going into Redshift using EmrEtlRunner

Hi all,

I am using EMR ETL + Redshift + Clojure Collector, and when I look at the logs in the Beanstalk S3 log bucket, I’m seeing what I expect: pageviews from our site, which we tagged up.

Yet when I run the EMR ETL + Storage Loader, not everything is getting into Redshift, despite the whole process running without error. Only data from a dev site we tagged is coming in.

So a few questions:

In my config.yml, under monitoring:snowplow:app_id I have “snowplow” but I have not set an app_id in the Javascript tracker - could this be why?

This is how I initialised the tracker (I left the options blank):


window.snowplow('newTracker', 'mycljcoll', '', { 
// Initialise a tracker 
// I left this blank with no options..

I also had some general questions about the Clojure Collector + EMR ETL.

  • How does it keep track of when it was last run?
  • Do logs going all the way back to when the collector started get stored on the Beanstalk S3 bucket? Or does the EMR ETL tool wipe them after it runs?
  • The documentation says to run a daily crontab, but I noticed on the Beanstalk S3 bucket that logs get rotated hourly. The EMR ETL + Storage Loader process takes under 30 mins - could I run this hourly?


Are there currently any records in your “bad” bucket?

Is there a particular reason that you do not want an app_id set in the tracker? Having an app_id will come in handy down the road when you need to migrate a site or add an additional application. Check your bad bucket after EmrEtlRunner has run and parse out the error messages.
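To make "parse out the error message" concrete, here is a minimal sketch that tallies error messages from bad-row files you have downloaded locally. It assumes each bad row is one JSON object per line with an "errors" list, whose items are either plain strings or objects with a "message" field; check a sample file from your own bad bucket first, as the layout may differ between versions. The function name `summarize_bad_rows` is made up for this example.

```python
import json
import sys
from collections import Counter


def summarize_bad_rows(lines):
    """Count error messages across bad-row JSON lines.

    Assumption: each line is a JSON object with an "errors" list whose
    items are plain strings or objects carrying a "message" field.
    """
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            counts["<unparseable bad row>"] += 1
            continue
        if not isinstance(row, dict):
            counts["<unparseable bad row>"] += 1
            continue
        for err in row.get("errors", []):
            if isinstance(err, dict):
                message = str(err.get("message", err))
            else:
                message = str(err)
            counts[message] += 1
    return counts


if __name__ == "__main__":
    # Usage: python summarize_bad_rows.py part-00000 [part-00001 ...]
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            totals.update(summarize_bad_rows(handle))
    for message, n in totals.most_common(10):
        print(f"{n:6d}  {message}")
```

If one message dominates the tally, that is usually the first thing to investigate.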

The Clojure collector is configured to rotate logs to S3 hourly. This usually happens 10 minutes after the hour. More information here. You can enable self-monitoring, but we find that Alex’s article on orchestrating batch processing pipelines can be the basis of a very robust tool for orchestrating processing. For example: writing the time of the last run to a text file and reading it as a step in a DAG. (We use SNS + SQS.)

Snowplow is configured to store all of the raw log files in the archive bucket that is configured here

How long EmrEtlRunner + StorageLoader take varies greatly depending on the provisioned instance types, availability of spot instances, event volumes, and how much unstructured event shredding needs to take place. You can theoretically run Snowplow as frequently as you like. If you run two instances of EmrEtlRunner at the same time against the same staging bucket, you will have failures. This can be handled; you just have to keep it in mind.
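One way to handle overlapping runs from an hourly cron is a lock file, so a new run is skipped while the previous one is still going. This is only a sketch of that idea, not something EmrEtlRunner provides: the lock path, the `run_lock` helper and the `AlreadyRunning` exception are all made up for this example, and the command to run is whatever you already use to launch EmrEtlRunner + StorageLoader.

```python
import os
import subprocess
import sys
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when another run still holds the lock file."""


@contextmanager
def run_lock(path):
    """Hold an exclusive lock file for the duration of one pipeline run."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(path)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Usage: python locked_run.py <your existing pipeline command and args>
    command = sys.argv[1:]
    if not command:
        sys.exit("usage: locked_run.py COMMAND [ARGS...]")
    try:
        with run_lock("/tmp/snowplow-batch.lock"):  # hypothetical path
            sys.exit(subprocess.call(command))
    except AlreadyRunning:
        print("previous run still in progress, skipping", file=sys.stderr)
        sys.exit(0)
```

Note that if the process is killed hard, the lock file is left behind and has to be removed by hand before the next run will start.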


Hi @timgriffin,

The monitoring:snowplow section of config.yml is not relevant here. In fact, you could completely remove snowplow: part from the configuration file. It is meant for monitoring (capturing events of) the EMR process itself. It does not affect the processing of the actual events captured with your collector.

The EmrEtlRunner checks that the processing, enriched:good and shredded:good buckets are empty before kicking off the Enrichment process. It does not track or label the logs it has processed. If any of those buckets is not empty, it means the previous EMR-ETL process has either failed or is still in progress. Also see the answer to the following question.
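If you want to see for yourself which bucket would block the next run, the same emptiness check can be sketched as below. This is not EmrEtlRunner’s own code, just an illustration of the rule described above. `non_empty_buckets` and `list_keys` are hypothetical names; `list_keys` stands in for whatever S3 listing call you use, and the bucket paths are placeholders.

```python
def non_empty_buckets(buckets, list_keys):
    """Return the logical names of pipeline buckets that still hold files.

    buckets:   mapping of logical name -> S3 location
    list_keys: callable returning an iterable of keys under a location
               (hypothetical; back it with your S3 client of choice)
    """
    blocking = []
    for name, location in buckets.items():
        # any() stops at the first key, so large buckets are not fully listed
        if any(True for _ in list_keys(location)):
            blocking.append(name)
    return blocking


if __name__ == "__main__":
    # Placeholder locations; use the values from your own config.yml.
    buckets = {
        "processing": "s3://my-bucket/processing/",
        "enriched:good": "s3://my-bucket/enriched/good/",
        "shredded:good": "s3://my-bucket/shredded/good/",
    }
    # Fake listing for demonstration: enriched:good still has a leftover file.
    fake_listing = {
        "s3://my-bucket/processing/": [],
        "s3://my-bucket/enriched/good/": ["run=example/part-00000"],
        "s3://my-bucket/shredded/good/": [],
    }
    blocking = non_empty_buckets(buckets, lambda loc: fake_listing[loc])
    if blocking:
        print("next run would not start; not empty:", ", ".join(blocking))
    else:
        print("all clear")
```

Passing the lister in as a function keeps the check itself free of any S3 dependency, so it can be tried out with fake data as above.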

Please refer to this diagram to understand the log file workflow. In short, the logs collected by the (Clojure) collector and pushed to the raw:in bucket on S3 are moved to a separate processing bucket as the first step in the whole EMR-ETL process.

On success of the enrichment process, the enriched events are placed in the enriched:good bucket and the shredded events are placed in the shredded:good bucket. After that, the original logs (located in the processing bucket) are moved to the archive bucket.

The StorageLoader orchestrates loading the shredded:good data into Redshift. Once the load completes, the contents of enriched:good and shredded:good are moved to enriched:archive and shredded:archive respectively.

In other words, there should be no loss of data. You always can reprocess/reload the logs/events/data if the need arises, say, in case of a processing failure, corrupt data, etc.

Sure. If the previous process has not completed yet, you will still have files in one or all of the processing, enriched and shredded buckets, which will cause the new run to fail to start. That is OK as long as the whole previous process eventually completes successfully and thus clears those buckets for the next EMR-ETL batch run to start.

Therefore it’s important to set all those buckets correctly (both on S3 and in config.yml) and not mix them up, to ensure the correct flow.

Back to your initial question about missing events. Does my explanation give you any ideas about why the events might be “missing”? How did you verify that they have not been loaded into Redshift?

Note that you might not get any error during the EMR process. However, if any bad data is encountered in an event, that event could end up in one of your enriched:bad / shredded:bad buckets.


Thank you @digitaltouch and @ihor for your detailed and helpful responses.

I got to the bottom of the error and it’s embarrassing. I had updated aws:monitoring:snowplow:collector but I missed the critical one… aws:s3:buckets:raw:in and it was my old test bucket :sob:

The silver lining is you guys explained a lot about how the EMR ETL process works and I did learn something :slight_smile:

Thanks again!