Increasing execution time on batch mode instance


I’ve set up a Snowplow batch pipeline on a 1-hour schedule and everything seems to work. I can collect page view events with my Clojure collector, and I can shred and store them in a Redshift cluster.

However, over the last 2 days I’ve noticed a constant increase in the run time of my EMR jobs. In a batch setup this obviously becomes a problem. I allocated a couple of m1.large instances, but I can’t believe I need more compute resources to process “only” 3M events (265K peak/hour).
The only enrichment I’ve set up is GeoIP lookup.
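For context, my ip_lookups enrichment JSON looks roughly like this — the database filename and hosted-assets URI are from my setup, so treat the exact values as illustrative rather than canonical:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
  "data": {
    "name": "ip_lookups",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "geo": {
        "database": "GeoLiteCity.dat",
        "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
      }
    }
  }
}
```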

Am I missing some point?

Thank you for your time,

Hi @Federico1 - can you share:

  • The EMR configuration part of your config.yml
  • The EMR run times of your last 5 runs
  • The number of events processed in each of your last 5 runs


Thanks for your answer, following what you asked:


      ami_version: 4.5.0 # Don't change this
      region: eu-west-1 # Always set this
      jobflow_role: EMR_EC2_DefaultRole # Created using aws emr create-default-roles
      service_role: EMR_DefaultRole # Created using aws emr create-default-roles
      placement: # Set this if not running in VPC. Leave blank otherwise
      ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
      ec2_key_name: tag
      bootstrap: # Set this to specify custom bootstrap actions. Leave empty otherwise
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
      # Adjust your Hadoop cluster below
      master_instance_type: m1.large
      core_instance_count: 2
      core_instance_type: m1.large
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.large
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
      bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
      additional_info: # Optional JSON string for selecting additional features

Number of events
The last 5 runs happen to have processed an increasing number of events; however, 2 days ago I saw a much more irregular series, such as:

These are only page view events, enriched only with ip_lookup.json.

Thanks in advance

Thanks for sharing @Federico1. I would suggest updating to:

      master_instance_type: m1.medium
      core_instance_count: 3
      core_instance_type: m3.xlarge
      task_instance_count: 0
      task_instance_type: m1.small
      task_instance_bid: 0.25

Give that a go and please share your new job times.

Thanks for your answer. I’ve applied your suggested changes and will keep track of the new execution times.
May I ask which misconfiguration you spotted?

I’ll write back in a while.

Thanks again

Hey @Federico1:

  • Your master instance type was a little overprovisioned
  • Your core instance cluster was a little underprovisioned

Let us know how you get on!


After a few days I’m still encountering the problem.

I’ve noticed that when a run fails, I find already-processed data in the processing folder; the job seems to re-process every event since the pipeline was first set up. I don’t remember ever configuring it that way.

Have you ever encountered this (probable) misconfiguration?

Hi @Federico1, I’m wondering if you are falling foul of this:

Important 2: do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.

From this EmrEtlRunner configuration documentation.
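For reference, a safe buckets layout keeps each stage as a sibling S3 prefix rather than nesting one inside another. A sketch along these lines (bucket names are placeholders, not your actual paths):

```yaml
# Sketch only - bucket names are placeholders. The key point is that
# raw:processing is NOT inside raw:in, and enriched:good is NOT inside
# raw:processing; each stage gets its own sibling prefix.
buckets:
  raw:
    in: s3://my-collector-logs/           # collector writes here
    processing: s3://my-etl/processing/   # sibling of raw:in, not nested
    archive: s3://my-etl/archive/raw/
  enriched:
    good: s3://my-etl/enriched/good/
    bad: s3://my-etl/enriched/bad/
  shredded:
    good: s3://my-etl/shredded/good/
    bad: s3://my-etl/shredded/bad/
```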

Let me know?

Thanks for your suggestion, you were right.

Sorry for wasting your time.

No worries @Federico1 - thanks for letting us know what the problem was!