Hello, could you please help me understand why EmrEtlRunner is (presumably) unable to create an EMR cluster, failing with the error “AccessDeniedException”?
Is there a way I can pinpoint exactly what is blocking it? That is, is it an AWS configuration (permissions) issue, a config.yml issue (not connecting to AWS/EMR properly), or something else?
I am executing EmrEtlRunner using the following command:
./snowplow-emr-etl-runner run -c config/config.yml -n config/enrichments/ -r config/iglu_resolver.json --debug
The error response is:
uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
D, [2020-05-12T09:17:26.010199 #14847] DEBUG -- : Initializing EMR jobflow
ArgumentError: AWS EMR API Error (AccessDeniedException):
submit at uri:classloader:/gems/elasticity-6.0.14/lib/elasticity/aws_session.rb:44
run_job_flow at uri:classloader:/gems/elasticity-6.0.14/lib/elasticity/emr.rb:302
run at uri:classloader:/gems/elasticity-6.0.14/lib/elasticity/job_flow.rb:176
run at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:791
send_to at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43
call_with at uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76
block in redefine_method at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138
run at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:138
send_to at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43
call_with at uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76
block in redefine_method at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138
<main> at uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41
load at org/jruby/RubyKernel.java:994
<main> at uri:classloader:/META-INF/main.rb:1
require at org/jruby/RubyKernel.java:970
(root) at uri:classloader:/META-INF/main.rb:1
<main> at uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1
ERROR: org.jruby.embed.EvalFailedException: (ArgumentError) AWS EMR API Error (AccessDeniedException)
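To try to pinpoint whether the denial comes from the credentials themselves rather than from how EmrEtlRunner builds the jobflow request, I figure I can call the EMR API directly with the same keys (a minimal sketch, assuming the aws CLI is installed; the key values below are placeholders for the ones in config.yml):

# Export the same credentials EmrEtlRunner is using (placeholders)
export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXX
# Confirm which IAM identity (account/user) these keys resolve to
aws sts get-caller-identity
# Attempt a read-only EMR call in the same region as config.yml
aws emr list-clusters --region ap-southeast-2

If list-clusters is also denied, that would point at the IAM side rather than at config.yml.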
This is the config.yml being used (using snowplow_emr_r117_biskupin version of EmrEtlRunner):
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: XXXXXXXXXXXXXXXX
  secret_access_key: XXXXXXXXXXXXXXXX
  s3:
    region: ap-southeast-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://cc-snowplow-enrich-logs
      encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
      raw:
        in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://cc-snowplow-logs
        processing: s3://cc-snowplow-enrich-processing
        archive: s3://cc-snowplow-enrich-archive
      enriched:
        good: s3://cc-snowplow-enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://cc-snowplow-enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://cc-snowplow-enriched/archive # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://cc-snowplow-shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://cc-snowplow-shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://cc-snowplow-shredded/archive # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
    consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3
  emr:
    ami_version: 5.9.0
    region: ap-southeast-2 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: ap-southeast-2a # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-20ea7578 # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow.etl.runner
    security_configuration: # Specify your EMR security configuration if needed. Leave blank otherwise
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: SnowplowETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_bid: 0.015
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: tsv/com.amazon.aws.cloudfront/wd_access_log # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for CloudFront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.18.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    protocol: http
    port: 80
    app_id: HIC # e.g. snowplow
    collector: D2jxi0nwlekbfj.cloudfront.net # e.g. d3rkrsqld9gmqf.cloudfront.net
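To rule out a malformed config.yml (since the failure happens right after “Initializing EMR jobflow”), a quick parse check I can run (a sketch, assuming a local Ruby is available; EmrEtlRunner itself bundles JRuby):

# Parse config.yml and print its top-level keys; a YAML error here
# would point to the config file rather than to AWS permissions
ruby -ryaml -e 'puts YAML.load_file("config/config.yml").keys'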
IAM permissions and the key pair have been set up according to the setup documentation. The IAM permissions have been checked and even extended to the equivalent of full administrator access, but it still didn’t help. I also tried creating a whole new IAM key/secret pair and using that, but it didn’t work either.
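To see the exact EMR action being denied for this user, I understand the IAM policy simulator can be queried from the CLI (a sketch; the ARN below is a placeholder for my actual IAM user):

# Check whether this principal is allowed to call RunJobFlow, the API
# behind EMR cluster creation (the ARN is a placeholder)
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:user/snowplow-etl \
  --action-names elasticmapreduce:RunJobFlow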
This is what I see when running aws ec2 describe-key-pairs:
{
    "KeyPairs": [
        {
            "KeyName": "snowplow.etl.runner",
            "KeyFingerprint": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
        }
    ]
}
The aws emr create-default-roles command has been executed successfully from the EC2 instance. I checked in the AWS console and the three roles created are visible (EMR_AutoScaling_DefaultRole, EMR_DefaultRole, EMR_EC2_DefaultRole). I also tried deleting and recreating the default roles, which didn’t help.
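To double-check the roles beyond seeing them in the console, something like this should confirm the service role exists and still has its managed policies attached (a sketch using the same credentials):

# Confirm the EMR service role exists and inspect its attached policies
aws iam get-role --role-name EMR_DefaultRole
aws iam list-attached-role-policies --role-name EMR_DefaultRole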
Our AWS environment runs in a default VPC.
In the emr section of the config file, I tried setting a placement value both with and without an ec2_subnet_id value, and I also tried setting ec2_subnet_id without placement. I also tried removing the AWS config files and re-running aws configure on the EC2 instance, still with no luck.
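One more check I can do is confirm the subnet referenced in config.yml actually exists in ap-southeast-2 and note its availability zone (a sketch; the sample config’s own comments suggest placement and ec2_subnet_id should not both be set):

# Look up the subnet from config.yml and its AvailabilityZone
aws ec2 describe-subnets --subnet-ids subnet-20ea7578 --region ap-southeast-2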
I checked the EMR section of the AWS management console, and there are no logs or records of any EMR cluster being created, or even attempted. This makes me think there is possibly something wrong with the emr section of the config.yml file?
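Since nothing shows up in the console at all, a next step could be to try creating a throwaway cluster by hand with settings mirroring config.yml (a sketch; if this is denied too, it would confirm RunJobFlow itself is blocked, independently of EmrEtlRunner):

# Minimal manual cluster creation mirroring the config.yml settings;
# the cluster auto-terminates once it has no steps to run
aws emr create-cluster \
  --name permission-smoke-test \
  --release-label emr-5.9.0 \
  --use-default-roles \
  --instance-type m1.medium \
  --instance-count 3 \
  --applications Name=Hadoop Name=Spark \
  --ec2-attributes SubnetId=subnet-20ea7578,KeyName=snowplow.etl.runner \
  --region ap-southeast-2 \
  --auto-terminate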
I’ve trawled the documentation, forums, blogs and the internet in general for two days straight and have been defeated by whatever the problem is.
I’d really appreciate your help so I can move on to the next (Storage) step and get our enriched events into Snowflake so we can do some analytics magic with them. The suspense is killing me!
Cheers,
Ryan