I’ve set up a Snowplow batch pipeline on a 1-hour schedule and everything seems to work: I can collect page view events with my Clojure Collector, and I can shred and store them in a Redshift cluster.
However, over the last 2 days I’ve noticed a steady increase in the run time of my EMR jobs. In a batch setup this obviously becomes a problem. I allocated a couple of m1.large instances, but I can’t believe I need more compute resources to process “only” 3M events (265K peak/hour).
The only enrichment I’ve set up is the GeoIP lookup.
Am I missing some point?
Thank you for your time,
Hi @Federico1 - can you share:
- The EMR configuration section of your EmrEtlRunner config.yml
- The EMR run times of your last 5 runs
- The number of events processed in each of your last 5 runs
Thanks for your answer. Here is what you asked for:
EMR configuration
emr:
  ami_version: 4.5.0 # Don't change this
  region: eu-west-1 # Always set this
  jobflow_role: EMR_EC2_DefaultRole # Created using aws emr create-default-roles
  service_role: EMR_DefaultRole # Created using aws emr create-default-roles
  placement: # Set this if not running in VPC. Leave blank otherwise
  ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
  bootstrap: # Set this to specify custom bootstrap actions. Leave empty otherwise
  software:
    hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
    lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
  # Adjust your Hadoop cluster below
  jobflow:
    task_instance_count: 0 # Increase to use spot instances
    task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
  bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
  additional_info: # Optional JSON string for selecting additional features
Number of events
The last 5 runs happen to process an increasing number of events; 2 days ago, however, I got a much more irregular series, such as:
These are only page view events, enriched with nothing but ip_lookup.json.
Thanks in advance
Thanks for sharing @Federico1. I would suggest updating to:
Give that a go and please share your new job times.
Thanks for your answer. I’ve applied your suggested changes and will keep track of the new execution times.
May I ask which misconfiguration you spotted?
I’ll write back in a while.
- Your master instance type was a little overprovisioned
- Your core instance cluster was a little underprovisioned (see the sketch below)
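For illustration, a rebalanced jobflow section could look something like this. Treat the instance types and counts as assumptions that depend on your event volume, not a recommendation:

  jobflow:
    master_instance_type: m1.medium # the master only coordinates the job, so a small instance is usually enough
    core_instance_count: 3          # illustrative value: the core nodes do the actual processing, so scale these with volume
    core_instance_type: m1.large
    task_instance_count: 0          # increase to add spot task instances on top
    task_instance_bid: 0.015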
Let us know how you get on!
After a few days I’m still encountering the problem.
I’ve noticed that when a run fails, I can find already-processed data in the processing folder; it seems to re-process every event since the pipeline was first set up. I don’t remember ever configuring such behaviour.
Have you ever come across this (probable) misconfiguration?
Hi @Federico1, I’m wondering if you are falling foul of this:
Important 2: do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.
From this EmrEtlRunner configuration documentation.
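To make that concrete, here is a minimal sketch of a bucket layout that avoids those circular references. The bucket names are hypothetical; the point is that raw:processing sits outside raw:in, and enriched:good sits outside raw:processing:

  aws:
    s3:
      buckets:
        raw:
          in: s3://acme-collector-logs         # where the collector writes
          processing: s3://acme-etl/processing # outside raw:in
          archive: s3://acme-etl/archive/raw
        enriched:
          good: s3://acme-etl/enriched/good    # outside raw:processing
          bad: s3://acme-etl/enriched/bad
        shredded:
          good: s3://acme-etl/shredded/good
          bad: s3://acme-etl/shredded/bad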
Let me know?
Thanks for your suggestion; you were right.
Sorry for wasting your time.
No worries @Federico1 - thanks for letting us know what the problem was!