I am setting up the latest Snowplow ETL batch processing. While running the batch process for a large data size (around ~20 GB), it is failing at the enrich step. I also noticed that even though I am provisioning 6 core nodes, it is not provisioning that many.
Below are my config.yml details:
emr:
  ami_version: 5.9.0
  region: us-east-1
  jobflow_role: EMR_EC2_DefaultRole
  service_role: EMR_DefaultRole
  placement:
  ec2_subnet_id: XXX
  ec2_key_name: XX
  security_configuration:
  bootstrap:
  software:
    hbase:
    lingual:
  # Adjust your Hadoop cluster below
  jobflow:
    job_name: Snowplow ETL QA
    master_instance_type: r4.8xlarge
    core_instance_count: 6
    core_instance_type: r4.8xlarge
    core_instance_bid:
    core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
    #  volume_size: 100   # Gigabytes
    #  volume_type: "gp2"
    #  volume_iops: 400   # Optional. Will only be used if volume_type is "io1"
    task_instance_count: 0
    task_instance_type: m3.4xlarge
  bootstrap_failure_tries: 2
  configuration:
    yarn-site:
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "false"
  additional_info:
collectors:
  format: thrift
enrich:
  versions:
    spark_enrich: 1.18.0
  continue_on_unexpected_error: true
  output_compression: NONE
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1
    hadoop_elasticsearch: 0.1.0
Not sure why it is failing for the large data set (the small data set works fine). Also not sure why it is not provisioning more core instances even though 6 are configured.
Please help with the correct configuration to process large data sets of 30-40 GB.
As for the correct configuration for large data sets: you will need to specify additional configuration settings to utilize as many resources as possible. I would recommend reading this thread to get a sense of how this can be done. You may consider using one of the configurations provided in the thread (e.g. 1x m4.xlarge & 5x r4.8xlarge).
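For illustration only, here is a rough sketch of what those extra settings could look like in the jobflow: and configuration: sections of config.yml, assuming the 1x m4.xlarge master / 5x r4.8xlarge core layout mentioned above. The property names are standard YARN/Spark settings, but every memory, core and parallelism number below is a placeholder, not a recommendation, and would need tuning against your own data volumes:

  jobflow:
    job_name: Snowplow ETL QA
    master_instance_type: m4.xlarge
    core_instance_count: 5
    core_instance_type: r4.8xlarge
    task_instance_count: 0
  configuration:
    yarn-site:
      yarn.nodemanager.vmem-check-enabled: "false"
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "false"
    spark-defaults:
      spark.dynamicAllocation.enabled: "false"
      spark.executor.instances: "44"             # placeholder: total executors across the 5 core nodes
      spark.executor.cores: "3"                  # placeholder
      spark.executor.memory: "23G"               # placeholder
      spark.yarn.executor.memoryOverhead: "3072" # MB, placeholder
      spark.driver.memory: "23G"                 # placeholder
      spark.driver.cores: "3"                    # placeholder
      spark.yarn.driver.memoryOverhead: "3072"   # MB, placeholder
      spark.default.parallelism: "264"           # placeholder: roughly 2x total executor cores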
Overall, it's better to run the job more often and process less data per run. That is a more robust and cost-efficient model.
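As a sketch of that model, you could simply schedule EmrEtlRunner more frequently instead of one big daily run, e.g. via cron. The 4-hour schedule, install paths and log location below are assumptions to adjust for your own setup; the run/--config/--resolver flags are the usual EmrEtlRunner invocation:

# crontab entry: run the batch pipeline every 4 hours so each run processes a smaller slice of data
# (paths are illustrative; point them at wherever your runner, config and resolver actually live)
0 */4 * * * /opt/snowplow/snowplow-emr-etl-runner run --config /opt/snowplow/config/config.yml --resolver /opt/snowplow/config/iglu_resolver.json >> /var/log/snowplow/emr-etl-runner.log 2>&1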
I have upgraded to R119 and the core instance count issue got resolved.
Just a couple of further queries related to configuration:
– What instance type and count would be useful to process 40 GB within a 2-hour window?
– Does any Spark/YARN configuration need to be done? If so, please help me out with that.
@sp_user, I assume your enriched data is gzipped, which means that uncompressed it could take up around 400 GB, which is too much to process with your EMR cluster within 2 hours. It's best to split the payload into a few batches.
With that much data being collected, I would advise running your EMR job more frequently to reduce the volume of data processed in one go. We typically do not go above the following configuration (the largest EMR cluster we have ever used).