S3DistCp S3 access denied error with Dataflow Runner

We are currently setting up the shredder on EMR. However, we get an “S3 access denied error” on the S3DistCp job. Our playbook looks like this:

  {
    "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
    "data": {
      "region": "eu-west-1",
      "credentials": {
        "accessKeyId": "AWS_ACCESS_KEY_ID",
        "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
      },
      "steps": [
        {
          "type": "CUSTOM_JAR",
          "name": "S3DistCp enriched data archiving",
          "actionOnFailure": "CANCEL_AND_WAIT",
          "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
          "arguments": [
            "--src", "SP_LOADER_URI",
            "--dest", "SP_ENRICHED_URI"
          ]
        },
        {
          "type": "CUSTOM_JAR",
          "name": "RDB Shredder",
          "actionOnFailure": "CANCEL_AND_WAIT",
          "jar": "command-runner.jar",
          "arguments": [
            "--class", "com.snowplowanalytics.snowplow.shredder.Main",
            "--master", "yarn",
            "--deploy-mode", "cluster",
            "--iglu-config", "resolver",
            "--config", "config"
          ]
        }
      ],
      "tags": [ ]
    }
  }

We checked:

  • the IAM permissions of the ECS task that runs the job. It should have full S3 access
  • the buckets should be accessible from all resources within our account
  • our EMR cluster is not currently using any VPC endpoints for S3

However, we were a bit uncertain about this part of the JSON:

  "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",

Is it the correct location? It’s not an S3-hosted asset like the RDB shredder?

Yes, this looks correct. S3DistCp is an AWS utility rather than a Snowplow one, so it will already be on the cluster.

It looks to me like your S3 path to the RDB loader may be incorrect


is just missing the .jar extension
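
A missing extension like this can be caught before the playbook is submitted. The following is a rough sketch in Python, not part of Dataflow Runner: `spark_app_resource` and `find_suspect_jars` are hypothetical helper names, the S3 path in the sample is a made-up placeholder, and the parsing assumes every `spark-submit` option takes exactly one value (which holds for `--class`, `--master` and `--deploy-mode`).

```python
def spark_app_resource(arguments):
    """Return the first positional token after 'spark-submit', or None.

    Assumption: every '--option' is followed by exactly one value, so the
    first token that is neither an option nor its value is the application jar.
    """
    if "spark-submit" not in arguments:
        return None
    tokens = arguments[arguments.index("spark-submit") + 1:]
    i = 0
    while i < len(tokens):
        if tokens[i].startswith("--"):
            i += 2  # skip the option and its value
        else:
            return tokens[i]
    return None


def find_suspect_jars(playbook):
    """List (step name, path) pairs whose jar path does not end in '.jar'."""
    suspects = []
    for step in playbook.get("data", {}).get("steps", []):
        name = step.get("name", "<unnamed>")
        candidates = (step.get("jar"), spark_app_resource(step.get("arguments", [])))
        for path in candidates:
            if path and not path.endswith(".jar"):
                suspects.append((name, path))
    return suspects


# Sample playbook fragment; the S3 path below is a hypothetical placeholder.
playbook = {
    "data": {
        "steps": [
            {
                "name": "S3DistCp enriched data archiving",
                "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
                "arguments": ["--src", "SP_LOADER_URI", "--dest", "SP_ENRICHED_URI"],
            },
            {
                "name": "RDB Shredder",
                "jar": "command-runner.jar",
                "arguments": [
                    "spark-submit",
                    "--class", "com.snowplowanalytics.snowplow.shredder.Main",
                    "--master", "yarn",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/hypothetical/shredder-artifact",
                    "--iglu-config", "resolver",
                    "--config", "config",
                ],
            },
        ]
    }
}

suspects = find_suspect_jars(playbook)
for step_name, path in suspects:
    print(f"{step_name}: '{path}' does not end in .jar")
```

With the sample above, only the RDB Shredder step is reported, since the S3DistCp jar and `command-runner.jar` both carry the extension.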



Thanks Mike! There was a typo in the migration guide. Fixed it now.
