I’ve set up a Snowplow batch pipeline on a 1-hour schedule and everything seems to work: I can collect page view events with my Clojure Collector, and I can shred and store them in a Redshift cluster.
However, over the last 2 days I’ve noticed a steady increase in the run time of my EMR jobs. In a batch setup this obviously becomes a problem. I allocated a couple of m1.large instances, but I can’t believe I need more compute resources to process “only” 3M events (265K peak/hour).
The only enrichment I’ve set up is the GeoIP lookup.
Am I missing some point?
Thank you for your time,
Hi @Federico1 - can you share:
- The EMR configuration section of your EmrEtlRunner config.yml
- The EMR run times of your last 5 runs
- The number of events processed in each of your last 5 runs
Thanks for your answer. Here is what you asked for:
EMR configuration
emr:
  ami_version: 4.5.0 # Don't change this
  region: eu-west-1 # Always set this
  jobflow_role: EMR_EC2_DefaultRole # Created using aws emr create-default-roles
  service_role: EMR_DefaultRole # Created using aws emr create-default-roles
  placement: # Set this if not running in VPC. Leave blank otherwise
  ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
  bootstrap: # Set this to specify custom bootstrap actions. Leave empty otherwise
  software:
    hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
    lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
  # Adjust your Hadoop cluster below
  jobflow:
    task_instance_count: 0 # Increase to use spot instances
    task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
  bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
  additional_info: # Optional JSON string for selecting additional features
Number of events
The last 5 runs happen to process an increasing number of events; 2 days ago, however, I got a much more irregular series, such as:
These are only page view events, enriched with nothing but ip_lookup.json.
Thanks in advance
Thanks for sharing @Federico1. I would suggest updating to:
Give that a go and please share your new job times.
Thanks for your answer. I’ve applied your suggested changes and will keep track of the new execution times.
May I ask which misconfiguration you spotted?
I’ll write back in a while.
- Your master instance type was a little overprovisioned
- Your core instance cluster was a little underprovisioned (see the sketch below)
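For illustration, a rebalanced jobflow section could look something like this. Treat the instance types and counts as assumptions that depend on your event volume, not a recommendation:

  jobflow:
    master_instance_type: m1.medium # the master only coordinates the job, so a small instance is usually enough
    core_instance_count: 3          # illustrative value: the core nodes do the actual processing, so scale these with volume
    core_instance_type: m1.large
    task_instance_count: 0          # increase to add spot task instances on top
    task_instance_bid: 0.015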
Let us know how you get on!
After a few days I’m still encountering the problem.
I’ve noticed that when a run fails, I can find already-processed data in the processing folder; it seems to re-process every event since the pipeline was first set up. I don’t remember ever configuring such behaviour.
Have you ever come across this (probable) misconfiguration?
Hi @Federico1, I’m wondering if you are falling foul of this:
Important 2: do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.
From this EmrEtlRunner configuration documentation.
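To make that concrete, here is a minimal sketch of a bucket layout that avoids those circular references. The bucket names are hypothetical; the point is that raw:processing sits outside raw:in, and enriched:good sits outside raw:processing:

  aws:
    s3:
      buckets:
        raw:
          in: s3://acme-collector-logs         # where the collector writes
          processing: s3://acme-etl/processing # outside raw:in
          archive: s3://acme-etl/archive/raw
        enriched:
          good: s3://acme-etl/enriched/good    # outside raw:processing
          bad: s3://acme-etl/enriched/bad
        shredded:
          good: s3://acme-etl/shredded/good
          bad: s3://acme-etl/shredded/bad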
Let me know?
Thanks for your suggestion; you were right.
Sorry for wasting your time.
No worries @Federico1 - thanks for letting us know what the problem was!