S3distcp s3 access denied error dataflow runner

mgloel · February 8, 2021, 8:23pm

Hey,
we are currently setting up the shredder on EMR. However, we get an “S3 access denied” error on the S3DistCp job on EMR. Our playbook looks like this:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp enriched data archiving",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
            "--src", "SP_LOADER_URI",
            "--dest", "SP_ENRICHED_URI"
        ]
      },

      {
        "type": "CUSTOM_JAR",
        "name": "RDB Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
            "spark-submit",
            "--class", "com.snowplowanalytics.snowplow.shredder.Main",
            "--master", "yarn",
            "--deploy-mode", "cluster",
            "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0",
            "--iglu-config", "resolver",
            "--config", "config"
        ]
      }
    ],
    "tags": [ ]
  }
}

We checked:

the IAM permissions of our ECS task that runs the job: it should have full S3 access
the buckets should be accessible from all resources within our account
our EMR cluster is not currently using any VPC endpoints for S3

However, we were a bit uncertain about this part of the JSON:

"/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",

Is this the correct location? It’s not an S3-hosted asset like the RDB Shredder?

mike · February 8, 2021, 9:09pm

Yes, this looks correct. S3DistCp is an AWS utility rather than a Snowplow one, so it will already be on the cluster.

It looks to me like your S3 path to the RDB Shredder may be incorrect:

s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0

It is just missing the .jar extension:

s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar
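A typo like this can be caught before the playbook is submitted. Note that a missing object can surface as "access denied": S3 returns 403 rather than 404 when the caller lacks s3:ListBucket on the bucket. Below is a rough Python sketch, not part of Dataflow Runner, that scans a playbook for spark-submit steps whose S3 argument does not end in .jar. The function name `find_suspect_jar_args` is made up for illustration, and the check is a simple heuristic that assumes the only s3:// argument in a spark-submit step is the application jar.

```python
def find_suspect_jar_args(playbook):
    """Return (step name, argument) pairs for spark-submit steps
    whose s3:// argument does not end in .jar."""
    suspects = []
    for step in playbook.get("data", {}).get("steps", []):
        args = step.get("arguments", [])
        # Heuristic (an assumption of this sketch): only inspect steps
        # that invoke spark-submit, e.g. through command-runner.jar.
        if not args or args[0] != "spark-submit":
            continue
        for arg in args:
            if arg.startswith("s3://") and not arg.endswith(".jar"):
                suspects.append((step.get("name"), arg))
    return suspects


# Trimmed-down copy of the shredder step from the playbook above.
playbook = {
    "data": {
        "steps": [
            {
                "type": "CUSTOM_JAR",
                "name": "RDB Shredder",
                "jar": "command-runner.jar",
                "arguments": [
                    "spark-submit",
                    "--class", "com.snowplowanalytics.snowplow.shredder.Main",
                    "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0",
                ],
            }
        ]
    }
}

for name, arg in find_suspect_jar_args(playbook):
    print(f"{name}: {arg} does not end in .jar")
```

With the .jar extension added to the path, the function returns an empty list for this step.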

anton · February 9, 2021, 11:35am

Thanks Mike! There was a typo in the migration guide. Fixed it now.
