I am setting up the latest Snowplow ETL batch processing. While running the batch process for a large data size (around ~20 GB), it is failing at the enrich step. I also noticed that even though I am provisioning 6 core nodes, it is not provisioning that many.
Below are my config.yml details:
emr:
  ami_version: 5.9.0
  region: us-east-1
  jobflow_role: EMR_EC2_DefaultRole
  service_role: EMR_DefaultRole
  placement:
  ec2_subnet_id: XXX
  ec2_key_name: XX
  security_configuration:
  bootstrap:
  software:
    hbase:
    lingual:
  # Adjust your Hadoop cluster below
  jobflow:
    job_name: Snowplow ETL QA
    master_instance_type: r4.8xlarge
    core_instance_count: 6
    core_instance_type: r4.8xlarge
    core_instance_bid:
    core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
    #  volume_size: 100   # Gigabytes
    #  volume_type: "gp2"
    #  volume_iops: 400   # Optional. Will only be used if volume_type is "io1"
    task_instance_count: 0
    task_instance_type: m3.4xlarge
  bootstrap_failure_tries: 2
  configuration:
    yarn-site:
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "false"
  additional_info:
collectors:
  format: thrift
enrich:
  versions:
    spark_enrich: 1.18.0
  continue_on_unexpected_error: true
  output_compression: NONE
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1
    hadoop_elasticsearch: 0.1.0
Not sure why it is failing for the large data set (the small data set works fine). Also not sure why it is not provisioning more core instances even though 6 are configured.
Please help with the correct configuration to process large data sets of 30-40 GB.
As for the correct configuration for large data sets: you will need to specify additional configuration settings to utilize as many resources as possible. I would recommend reading this thread to get a sense of how this can be done. You may consider using one of the configurations provided in the thread (e.g. 1x m4.xlarge & 5x r4.8xlarge).
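For illustration only, here is a rough sketch of what those extra settings could look like in the jobflow: and configuration: sections of config.yml, assuming the 1x m4.xlarge master / 5x r4.8xlarge core layout mentioned above. The property names are standard YARN/Spark settings, but every memory, core and parallelism number below is a placeholder, not a recommendation, and would need tuning against your own data volumes:

  jobflow:
    job_name: Snowplow ETL QA
    master_instance_type: m4.xlarge
    core_instance_count: 5
    core_instance_type: r4.8xlarge
    task_instance_count: 0
  configuration:
    yarn-site:
      yarn.nodemanager.vmem-check-enabled: "false"
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "false"
    spark-defaults:
      spark.dynamicAllocation.enabled: "false"
      spark.executor.instances: "44"             # placeholder: total executors across the 5 core nodes
      spark.executor.cores: "3"                  # placeholder
      spark.executor.memory: "23G"               # placeholder
      spark.yarn.executor.memoryOverhead: "3072" # MB, placeholder
      spark.driver.memory: "23G"                 # placeholder
      spark.driver.cores: "3"                    # placeholder
      spark.yarn.driver.memoryOverhead: "3072"   # MB, placeholder
      spark.default.parallelism: "264"           # placeholder: roughly 2x total executor cores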
Overall, it's better to run the job more often and process less data per run. That is a more robust and cost-efficient model.
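As a sketch of that model, you could simply schedule EmrEtlRunner more frequently instead of one big daily run, e.g. via cron. The 4-hour schedule, install paths and log location below are assumptions to adjust for your own setup; the run/--config/--resolver flags are the usual EmrEtlRunner invocation:

# crontab entry: run the batch pipeline every 4 hours so each run processes a smaller slice of data
# (paths are illustrative; point them at wherever your runner, config and resolver actually live)
0 */4 * * * /opt/snowplow/snowplow-emr-etl-runner run --config /opt/snowplow/config/config.yml --resolver /opt/snowplow/config/iglu_resolver.json >> /var/log/snowplow/emr-etl-runner.log 2>&1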
I have upgraded to R119 and the core instance count issue got resolved.
Just a couple of further queries related to configuration:
– What instance type and count would be useful to process 40 GB within a 2-hour window?
– Does any Spark/YARN configuration need to be done? If so, please help me out with that.
@sp_user, I assume your enriched data is gzipped, which means that uncompressed it could take up around 400 GB, which is too much to process with your EMR cluster within 2 hours. It's best to split the payload into a few batches.
With that much data being collected, I would advise running your EMR job more frequently to reduce the volume of data processed in one go. We typically do not go above the following configuration (the largest EMR cluster we have ever used).