Using IAM roles for authentication with dataflow-runner not working

Hey everyone, I’ve been trying to use an AWS IAM role instead of passing secrets around when launching EMR with dataflow-runner. For more details: Support IAM roles · Issue #34 · snowplow/dataflow-runner · GitHub

I checked the codebase and it looks like IAM roles are supported (there are explicit tests for it here and the implementation here). I also used this example config as a reference. But I get a nil pointer dereference error (the full stack trace is further down in the thread):

My cluster.json config:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "com.oneapp",
    "logUri": "LOGURI",
    "region": "AWS_DEFAULT_REGION",
    "credentials": {
      "accessKeyId": "iam",
      "secretAccessKey": "iam"
    },
    "roles": {
      "jobflow": "EMR_EC2_DefaultRole",
      "service": "EMR_DefaultRole"
    },
    "ec2": {
      "amiVersion": "6.10.0",
      "instances": {
        "core": {
          "count": 1,
          "type": "r5.12xlarge"
        },
        "master": {
          "ebsConfiguration": {
            "ebsBlockDeviceConfigs": [],
            "ebsOptimized": true
          },
          "type": "m4.large"
        },
        "task": {
          "bid": "0.015",
          "count": 0,
          "type": "m4.large"
        }
      },
      "keyName": "EMR_ECS_KEY_PAIR",
      "location": {
        "vpc": {
          "subnetId": "AWS_PUBLIC_SUBNET_ID"
        }
      }
    },
    "tags": [
      {
        "key": "client",
        "value": "com.oneapp"
      },
      {
        "key": "job",
        "value": "main"
      }
    ],
    "bootstrapActionConfigs": [],
    "configurations": [
      {
        "classification": "spark",
        "configurations": [],
        "properties": {
          "maximizeResourceAllocation": "false"
        }
      },
      {
        "classification": "spark-defaults",
        "configurations": [],
        "properties": {
          "spark.default.parallelism": "80",
          "spark.driver.cores": "5",
          "spark.driver.memory": "37G",
          "spark.dynamicAllocation.enabled": "false",
          "spark.executor.cores": "5",
          "spark.executor.instances": "8",
          "spark.executor.memory": "37G",
          "spark.yarn.driver.memoryOverhead": "5G",
          "spark.yarn.executor.memoryOverhead": "5G"
        }
      },
      {
        "classification": "yarn-site",
        "configurations": [],
        "properties": {
          "yarn.nodemanager.resource.memory-mb": "385024",
          "yarn.nodemanager.vmem-check-enabled": "false",
          "yarn.scheduler.maximum-allocation-mb": "385024"
        }
      }
    ],
    "applications": [
      "Hadoop",
      "Spark"
    ]
  }
}

My playbook.json config:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "region": "AWS_DEFAULT_REGION",
    "credentials": {
      "accessKeyId": "iam",
      "secretAccessKey": "iam"
    },
    "roles": {
      "jobflow": "EMR_EC2_DefaultRole",
      "service": "EMR_DefaultRole"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp enriched data archiving",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src", "SP_LOADER_URI",
          "--dest", "SP_ENRICHED_URIrun={{nowWithFormat "2006-01-02-15-04-05"}}/",
          "--srcPattern", ".*",
          "--outputCodec", "gz",
          "--deleteOnSuccess"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "RDB Transformer Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
          "spark-submit",
          "--class", "com.snowplowanalytics.snowplow.rdbloader.transformer.batch.Main",
          "--master", "yarn",
          "--deploy-mode", "cluster",
          "s3://snowplow-hosted-assets/4-storage/transformer-batch/snowplow-transformer-batch-4.1.0.jar",
          "--iglu-config", "{{base64File "resolver.json"}}",
          "--config", "{{base64File "config.hocon"}}"
        ]
      }
    ],
    "tags": []
  }
}
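
As an aside for anyone reproducing this: the `{{base64File "resolver.json"}}` calls in the playbook are rendered by dataflow-runner’s templating into the base64-encoded contents of the named file. A quick shell sketch of the same transformation (the file path and contents below are made up for illustration):

```shell
# Stand-in resolver.json (hypothetical contents, for illustration only)
printf '{"schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1", "data": {}}' > /tmp/resolver.json

# What {{base64File "resolver.json"}} expands to: the file's bytes, base64-encoded
ENCODED=$(base64 < /tmp/resolver.json | tr -d '\n')
echo "$ENCODED"

# Round-trip: decoding must give back the original file
echo "$ENCODED" | base64 --decode
```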

Note that some values in caps, like AWS_DEFAULT_REGION, are replaced with the values of the respective environment variables via a sed command.

It looks like it might be an issue with the AWS SDK, but I lack the experience to pin it down. Can someone help me with this?

Maybe a little more context: originally we baked the credentials into the config files, but this is obviously bad practice. The idea is to use IAM roles for authenticating against AWS services instead of passing secrets around.

I also came across a similar issue here on the forum, but from the looks of it, the bad practice of hardcoding your credentials is the recommended approach there. We would like to avoid that now.

So the nil pointer dereference error suggests that a value we expect to be there is not provided (or at least that’s what my gut tells me).

I haven’t dug into the code, but that’s normally indicative of something we should be handling when we parse the config - i.e. the code should give you a helpful error here rather than this panic.

But to look at unblocking you in the short term - I suspect the cause might be something to do with the variable substitution. Is there a way you can test the sed part and check whether it is in fact templating in values as expected? Perhaps something is being passed as "" where a credential has a special character, or a similar corruption of the value is happening.
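
For instance, something like this (a sketch with made-up placeholder values, using a temp file rather than your real config) would catch both unresolved and empty substitutions:

```shell
# Stand-in template with the same ALL_CAPS placeholders as the real cluster.json
cat > /tmp/cluster.json.tmpl <<'EOF'
{"region": "AWS_DEFAULT_REGION", "ec2": {"keyName": "EMR_ECS_KEY_PAIR"}}
EOF

# Substitute the placeholders the same way the sed step would
AWS_DEFAULT_REGION="eu-west-1"
EMR_ECS_KEY_PAIR="emr-ecs-key-pair"
sed -e "s/AWS_DEFAULT_REGION/${AWS_DEFAULT_REGION}/g" \
    -e "s/EMR_ECS_KEY_PAIR/${EMR_ECS_KEY_PAIR}/g" \
    /tmp/cluster.json.tmpl > /tmp/cluster.json

cat /tmp/cluster.json

# A surviving ALL_CAPS placeholder (or an empty "" value) means the
# substitution silently failed for that variable
if grep -qE 'AWS_DEFAULT_REGION|EMR_ECS_KEY_PAIR|""' /tmp/cluster.json; then
  echo "bad substitution!" >&2
fi
```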


And actually, AWS_DEFAULT_REGION is a suspect here based on the logs you shared - but it might just be a coincidence.

Thank you for your prompt reply @Colm !
We set up the module a couple of years ago, hardcoding the values via sed commands, so I would assume that is not the problem here. We got the error when we tried to use IAM roles instead of passing the credentials to the config directly.

Oh ok I misread you originally. Yeah I suspect you’re right then.

Struggling to pin down what this is - clutching at straws a little, but what happens if you change the credentials values in your configs from "iam" to "default"?


From the error message it seems like you might not have an IAM instance profile attached to the EC2 instance you are trying to launch this job from. It appears to be failing on the instance metadata lookup, which would indicate to me that the process either cannot access the metadata service or an IAM role has not been properly attached to the server.


Something I forgot to mention is that we deploy a customized Docker image that runs the dataflow-runner binary on AWS ECS with the FARGATE launch type. From the logs, it looks like the service fails to even launch an EC2 instance, and no EMR cluster is spun up.

We currently do have an EC2 instance profile with the EMR_EC2_DefaultRole attached to the EC2 instance group launched within the cluster. They all had this role attached, so no changes were made here. It worked before with the credentials set to hardcoded secrets:

"credentials": {
  "accessKeyId": "AWS_SECERET_ACCESS_ID",
  "secretAccessKey": "AWS_SECERET_ACCESS_KEY"
},

The error was introduced by trying to authenticate the EMR service using the IAM roles already in place, since they are attached to both the EMR and EC2 instances anyway. We just wanted to eliminate the redundancy (and the vulnerability) of hardcoding AWS secrets.

Is there something I’m doing wrong in the configs (the playbook and the EMR config)?

P.S.: I also tried setting the credentials values to "default", but it results in the following error message in ECS:

level=error msg="NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors"
NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors

Hi @Kristina_Pianykh, it looks like your ECS pod cannot access an IAM role to assume. You need to bind an IAM role to the ECS task definition so that your pod can obtain the credentials it needs.

Specifically, the “task_role” needs to have the permissions required to launch an EMR cluster (i.e. the same as what you had before). If you are using Terraform, this is the role needed: Terraform Registry
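
One way to verify from inside the running container that the task role actually got attached (a sketch; on ECS/Fargate the SDKs discover the task-role credentials endpoint via the env var below):

```shell
# On ECS/Fargate an attached task role is surfaced to the container via
# AWS_CONTAINER_CREDENTIALS_RELATIVE_URI; the SDK's credential chain then
# fetches temporary credentials from http://169.254.170.2 + that path.
check_task_role() {
  if [ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI:-}" ]; then
    echo "task role endpoint: http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}"
  else
    echo "no task role attached"
  fi
}
check_task_role
```

If it prints “no task role attached” inside your container, the role never made it onto the task.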

Thank you for the suggestion @josh, I was indeed missing this role in my task definition. Unfortunately, this still doesn’t resolve the issue, though:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x843a3c]
goroutine 20 [running]:
github.com/aws/aws-sdk-go/aws/ec2metadata.(*EC2Metadata).GetMetadataWithContext(0x0, {0xd161d0, 0xc0002eb5f0}, {0xba3365?, 0x19?})
	/home/runner/go/pkg/mod/github.com/aws/aws-sdk-go@v1.34.5/aws/ec2metadata/api.go:69 +0x13c
github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds.requestCredList({0xd161d0?, 0xc0002eb5f0?}, 0x126?)
	/home/runner/go/pkg/mod/github.com/aws/aws-sdk-go@v1.34.5/aws/credentials/ec2rolecreds/ec2_role_provider.go:142 +0x59
github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds.(*EC2RoleProvider).RetrieveWithContext(0xc0002eeb40, {0xd161d0, 0xc0002eb5f0})
	/home/runner/go/pkg/mod/github.com/aws/aws-sdk-go@v1.34.5/aws/credentials/ec2rolecreds/ec2_role_provider.go:98 +0x77
github.com/aws/aws-sdk-go/aws/credentials.(*Credentials).singleRetrieve(0xc0002eeb70, {0xd161d0, 0xc0002eb5f0})
	/home/runner/go/pkg/mod/github.com/aws/aws-sdk-go@v1.34.5/aws/credentials/credentials.go:261 +0x40d
github.com/aws/aws-sdk-go/aws/credentials.(*Credentials).GetWithContext.func1()
	/home/runner/go/pkg/mod/github.com/aws/aws-sdk-go@v1.34.5/aws/credentials/credentials.go:244 +0x79
github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).doCall(0xc0002eeb80, 0xc0000a1200, {0x0, 0x0}, 0x0?)
	/home/runner/go/pkg/mod/github.com/aws/aws-sdk-go@v1.34.5/internal/sync/singleflight/singleflight.go:97 +0x3b
created by github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).DoChan
	/home/runner/go/pkg/mod/github.com/aws/aws-sdk-go@v1.34.5/internal/sync/singleflight/singleflight.go:90 +0x315
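
For what it’s worth, the trace shows GetMetadataWithContext being called on a nil *EC2Metadata client (the 0x0 receiver), i.e. with credentials set to "iam" the runner seems to go straight to the EC2 instance-role provider, and on Fargate there is no EC2 instance metadata service to answer. My rough understanding of the order the SDK’s default chain would otherwise walk (a simplified shell sketch, not the SDK’s actual code):

```shell
# Simplified order of the aws-sdk-go default credential chain:
# static env vars -> shared credentials file -> ECS task endpoint -> EC2 IMDS
credential_source() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "environment variables"
  elif [ -f "${HOME}/.aws/credentials" ]; then
    echo "shared credentials file"
  elif [ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI:-}" ]; then
    echo "ECS task role endpoint"
  else
    echo "EC2 instance metadata (not available on Fargate)"
  fi
}
credential_source
```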

My ECS task definition:

{
    "taskDefinitionArn": "arn:aws:ecs:eu-west-1:585714527727:task-definition/sp-shredder-task:3692",
    "containerDefinitions": [
        {
            "name": "sp_shredder",
            "image": "585714527727.dkr.ecr.eu-west-1.amazonaws.com/sp-shredder-gitc-dev:v1.0.0-e6d025214729aaed5425f35176696fe3b88c3fe1-20230609-082150",
            "cpu": 256,
            "memory": 2048,
            "portMappings": [
                {
                    "containerPort": 8000,
                    "hostPort": 8000,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "AWS_PUBLIC_SUBNET_ID",
                    "value": "subnet-0bffbfe8d66f83643"
                },
                {
                    "name": "SP_SHREDDED_URI",
                    "value": "s3n://sp-shredded-4-2-1-gitc-dev/archive/"
                },
                {
                    "name": "LOGURI",
                    "value": "s3n://aws-logs-585714527727-eu-west-1/"
                },
                {
                    "name": "STAGE",
                    "value": "gitc-dev"
                },
                {
                    "name": "SP_ENRICHED_URI",
                    "value": "s3n://sp-enriched-4-2-1-gitc-dev/archive/"
                },
                {
                    "name": "AWS_DEFAULT_REGION",
                    "value": "eu-west-1"
                },
                {
                    "name": "SP_LOADER_URI",
                    "value": "s3n://sp-loader-gitc-dev/"
                },
                {
                    "name": "EMR_ECS_KEY_PAIR",
                    "value": "emr-ecs-key-pair-shredder-gitc-dev"
                },
                {
                    "name": "SQS_QUEUE",
                    "value": "sp-sqs-queue-gitc-dev.fifo"
                },
                {
                    "name": "SP_SCHEMA_URI",
                    "value": "https://d1n5vzd3nnfpvo.cloudfront.net/"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/sp-shredder-gitc-dev",
                    "awslogs-region": "eu-west-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "family": "sp-shredder-task",
    "taskRoleArn": "arn:aws:iam::585714527727:role/EMR_DefaultRole",
    "executionRoleArn": "arn:aws:iam::585714527727:role/ecsTaskExecutionRole",
    "networkMode": "awsvpc",
    "revision": 3692,
    "volumes": [],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
            "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "name": "ecs.capability.task-eni"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EC2",
        "FARGATE"
    ],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "256",
    "memory": "2048",
    "registeredAt": "2023-06-09T08:25:14.046Z",
    "registeredBy": "arn:aws:iam::585714527727:user/cicd-user-gitc-dev",
    "tags": [
        {
            "key": "SnowplowModule",
            "value": "shredder"
        },
        {
            "key": "Project",
            "value": "Cariad Analytics"
        },
        {
            "key": "Environment",
            "value": "gitc-dev"
        },
        {
            "key": "Costs",
            "value": "Snowplow"
        }
    ]
}

I understand now why the EC2 instances cannot be accessed: they are simply never launched. This setup doesn’t even start the EMR cluster, which explains why EC2Metadata cannot be accessed. The question is: why doesn’t my ECS task trigger the EMR cluster?

EDIT: I also tried setting task_role_arn to a role with full EMR access so that my ECS task can actually access EMR and create a cluster there, but it results in exactly the same error :frowning: