Hey,
We are currently setting up the shredder on EMR. However, the S3DistCp step fails on EMR with an “S3 access denied” error. Our playbook looks like this:
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp enriched data archiving",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src", "SP_LOADER_URI",
          "--dest", "SP_ENRICHED_URI"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "RDB Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
          "spark-submit",
          "--class", "com.snowplowanalytics.snowplow.shredder.Main",
          "--master", "yarn",
          "--deploy-mode", "cluster",
          "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0",
          "--iglu-config", "resolver",
          "--config", "config"
        ]
      }
    ],
    "tags": [ ]
  }
}
We checked:
- the IAM permissions of the ECS task that runs the job: it should have full S3 access
- the buckets should be accessible from all resources within our account
- our EMR cluster is not currently using any VPC endpoints for S3
However, we are a bit unsure about this part of the JSON:
"/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
Is this the correct location? Is it not an S3-hosted asset like the RDB Shredder?
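
In case it helps narrow things down, here is roughly how the S3 access from the checklist above could be verified for a given set of credentials. This is only a minimal sketch: it assumes boto3 is installed wherever it runs, the helper names (split_s3_uri, can_list, main) are made up for illustration, and listing a single key is just a proxy for the permissions S3DistCp actually needs. It should be run with the credentials of whichever role is in question, with the real bucket URIs in place of SP_LOADER_URI and SP_ENRICHED_URI.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a", "s3n") or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def can_list(client, uri):
    """Return (True, None) if one key under the prefix can be listed,
    otherwise (False, <error message>)."""
    bucket, prefix = split_s3_uri(uri)
    try:
        client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        return True, None
    except Exception as exc:  # boto3 reports AccessDenied as a ClientError
        return False, str(exc)


def main(uris):
    """Check every URI with the default boto3 credential chain."""
    import boto3  # assumption: available in the environment being tested

    client = boto3.client("s3")
    for uri in uris:
        ok, reason = can_list(client, uri)
        print(uri, "OK" if ok else f"FAILED: {reason}")
```

Calling main(["s3://<source-bucket>/<prefix>/", "s3://<dest-bucket>/<prefix>/"]) from a Python shell prints one line per URI. A passing list check does not prove that writes or deletes are allowed, so it only rules out the most basic case.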