S3DistCp S3 access denied error with Dataflow Runner

Hey,
we are currently setting up the shredder on EMR. However, we get an “S3 access denied” error on the S3DistCp job on EMR. Our playbook looks like this:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp enriched data archiving",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
            "--src", "SP_LOADER_URI",
            "--dest", "SP_ENRICHED_URI"
        ]
      },

      {
        "type": "CUSTOM_JAR",
        "name": "RDB Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
            "spark-submit",
            "--class", "com.snowplowanalytics.snowplow.shredder.Main",
            "--master", "yarn",
            "--deploy-mode", "cluster",
            "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0",
            "--iglu-config", "resolver",
            "--config", "config"
        ]
      }
    ],
    "tags": [ ]
  }
}

We checked:

  • the IAM permissions of the ECS task that runs the job: it should have full S3 access
  • the buckets, which should be accessible from all resources within our account
  • that our EMR cluster is not currently using any VPC endpoints for S3

However, we were a bit uncertain about this part of the JSON:

“/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar”,

Is this the correct location? It’s not an S3-hosted asset like the RDB Shredder?

Yes, this looks correct. S3DistCp is an AWS utility rather than a Snowplow one, so it will already be on the cluster.

It looks to me like your S3 path to the RDB Shredder jar may be incorrect:

s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0

is just missing the .jar extension

s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar
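If it helps to catch this kind of typo before submitting a playbook, here is a rough sketch of a check in Python. It is not part of Dataflow Runner; the helper names `spark_app_jar` and `steps_with_bad_jar` are made up for illustration, and it assumes every option before the application jar is a `--flag value` pair, as in the playbook above.

```python
def spark_app_jar(arguments):
    """Return the application jar from a spark-submit argument list, or None.

    Assumption: every option before the jar is a `--flag value` pair.
    """
    if not arguments or arguments[0] != "spark-submit":
        return None
    i = 1
    while i < len(arguments):
        if arguments[i].startswith("--"):
            i += 2  # skip the flag and its value
        else:
            return arguments[i]  # first positional argument is the jar
    return None


def steps_with_bad_jar(playbook):
    """List (step name, jar path) pairs whose jar path lacks a .jar suffix."""
    bad = []
    for step in playbook["data"]["steps"]:
        # For spark-submit steps check the application jar, otherwise the step's own jar.
        jar = spark_app_jar(step.get("arguments", [])) or step.get("jar", "")
        if not jar.endswith(".jar"):
            bad.append((step["name"], jar))
    return bad


# Trimmed copy of the two steps from the playbook in the question.
playbook = {
    "data": {
        "steps": [
            {
                "name": "S3DistCp enriched data archiving",
                "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
                "arguments": ["--src", "SP_LOADER_URI", "--dest", "SP_ENRICHED_URI"],
            },
            {
                "name": "RDB Shredder",
                "jar": "command-runner.jar",
                "arguments": [
                    "spark-submit",
                    "--class", "com.snowplowanalytics.snowplow.shredder.Main",
                    "--master", "yarn",
                    "--deploy-mode", "cluster",
                    "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0",
                    "--iglu-config", "resolver",
                    "--config", "config",
                ],
            },
        ]
    }
}

for name, jar in steps_with_bad_jar(playbook):
    print(f"Step '{name}' references a jar without a .jar extension: {jar}")
```

This only checks the file extension; it does not verify that the object exists on S3 or that the cluster role can read it.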


Thanks Mike! There was a typo in the migration guide. Fixed it now.
