Increasing execution time on batch mode instance


I’ve set up a Snowplow batch pipeline on a 1-hour schedule and everything seems to work. I can collect page view events with my Clojure collector, and I can shred and store them in a Redshift cluster.

However, over the last 2 days I’ve noticed a constant increase in the run time of my EMR jobs. In a batch setup this obviously becomes a problem. I allocated a couple of m1.large instances, but I can’t believe I need more compute resources to process “only” 3M events (265K peak/hour).
The only enrichment I’ve set up is GeoIP lookup.
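For context, my ip_lookups enrichment JSON looks roughly like this — the database filename and hosted-assets URI are from my setup, so treat the exact values as illustrative rather than canonical:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
  "data": {
    "name": "ip_lookups",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "geo": {
        "database": "GeoLiteCity.dat",
        "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
      }
    }
  }
}
```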

Am I missing some point?

Thank you for your time,

Hi @Federico1 - can you share:

  • The EMR configuration part of your config.yml
  • The EMR run times of your last 5 runs
  • The number of events processed in each of your last 5 runs


Thanks for your answer, following what you asked:


      ami_version: 4.5.0 # Don't change this
      region: eu-west-1 # Always set this
      jobflow_role: EMR_EC2_DefaultRole # Created using aws emr create-default-roles
      service_role: EMR_DefaultRole # Created using aws emr create-default-roles
      placement: # Set this if not running in VPC. Leave blank otherwise
      ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
      ec2_key_name: tag
      bootstrap: # Set this to specify custom bootstrap actions. Leave empty otherwise
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
      # Adjust your Hadoop cluster below
      master_instance_type: m1.large
      core_instance_count: 2
      core_instance_type: m1.large
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.large
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
      bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
      additional_info: # Optional JSON string for selecting additional features

Number of events
The last 5 runs happen to have processed an increasing number of events; however, 2 days ago I saw a much more irregular series, such as:

These are only page view events, enriched only with ip_lookup.json.

Thanks in advance

Thanks for sharing @Federico1. I would suggest updating to:

      master_instance_type: m1.medium
      core_instance_count: 3
      core_instance_type: m3.xlarge
      task_instance_count: 0
      task_instance_type: m1.small
      task_instance_bid: 0.25

Give that a go and please share your new job times.

Thanks for your answer. I’ve applied your suggested changes and will keep track of the new execution times.
May I ask which misconfiguration you spotted?

I’ll write back in a while.

Thanks again

Hey @Federico1:

  • Your master instance type was a little overprovisioned
  • Your core instance cluster was a little underprovisioned

Let us know how you get on!


After a few days I’m still encountering the problem.

I’ve noticed that when a run fails, I find already-processed data in the processing folder; the job seems to re-process every event since the pipeline was first set up. I don’t remember ever configuring it that way.

Have you ever encountered this (probable) misconfiguration?

Hi @Federico1, I’m wondering if you are falling foul of this:

Important 2: do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.

From this EmrEtlRunner configuration documentation.
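For reference, a safe buckets layout keeps each stage as a sibling S3 prefix rather than nesting one inside another. A sketch along these lines (bucket names are placeholders, not your actual paths):

```yaml
# Sketch only - bucket names are placeholders. The key point is that
# raw:processing is NOT inside raw:in, and enriched:good is NOT inside
# raw:processing; each stage gets its own sibling prefix.
buckets:
  raw:
    in: s3://my-collector-logs/           # collector writes here
    processing: s3://my-etl/processing/   # sibling of raw:in, not nested
    archive: s3://my-etl/archive/raw/
  enriched:
    good: s3://my-etl/enriched/good/
    bad: s3://my-etl/enriched/bad/
  shredded:
    good: s3://my-etl/shredded/good/
    bad: s3://my-etl/shredded/bad/
```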

Let me know?

Thanks for your suggestion, you were right.

Sorry for wasting your time.

No worries @Federico1 - thanks for letting us know what the problem was!