It’s up to you. You could for instance use cron jobs or Nomad.
In cluster.json and playbook.json we can't use environment variables except for the AWS credentials, so if you don't want to store the values directly in the files, you would need to retrieve them and update the configs dynamically just before using them.
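One possible way to do that (just a sketch of the "update the configs dynamically" idea; the template file names and variables are illustrative, not something Dataflow Runner provides) is to keep placeholder copies of the files in the repo and render the real ones right before running:

```bash
# Sketch: render cluster.json / playbook.json from templates just before use,
# so no real values ever live in the repo.
export EMR_LOG_URI="s3://my-logs-bucket/emr/"
export SHREDDED_OUTPUT="s3://my-shredded-bucket/good/"

envsubst < cluster.json.tmpl > cluster.json
envsubst < playbook.json.tmpl > playbook.json
```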
For the loader and shredder configuration files you can use environment variables directly in the HOCON, like this: "host": ${REDSHIFT_HOST}
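For example (field names here are just illustrative; the point is the ${...} substitution, which gets resolved from the environment when the app starts):

```hocon
# Illustrative HOCON snippet: these values are picked up from the
# REDSHIFT_HOST / REDSHIFT_PASSWORD environment variables at startup.
"host": ${REDSHIFT_HOST}
"port": 5439
"password": ${REDSHIFT_PASSWORD}
```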
Indeed, what matters is that the EMR cluster and the IAM role used for the shredder have sufficient permissions. If the shredder is run with Dataflow Runner, then Dataflow Runner needs to know about the secret key.
No, it can't; at the moment only username/password authentication is supported.
Thanks @BenB, that solves most of my doubts. One last thing:
I'm running the shredder with Dataflow Runner, but I can't provide the AWS secret key in the configs, as the code lives in a public repo, and I can't pass or replace it from environment variables dynamically either. Do you see any better way?
I'm not sure I understand. When you run the shredder with dataflow-runner (with either a transient or a persistent EMR cluster), you have to specify the region and credentials:
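For example, in cluster.json (an excerpt showing only the relevant fields, with placeholder values; "env" tells Dataflow Runner to read the credentials from environment variables):

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "RDB shredder",
    "region": "eu-central-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    }
  }
}
```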
And then the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables need to be set when the dataflow-runner command is run. That's the only way Dataflow Runner can know where to create/use the EMR cluster for the shredder (which region and which AWS account).
You have to set these 2 environment variables at some point; you can't run the shredder without them.
Wherever they are stored doesn't matter, but you need to retrieve them and set them on the machine where you run dataflow-runner, just before you run it.
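Something like this, for instance (a sketch only; the retrieval commands are placeholders for whatever secrets store you use, and it assumes a transient cluster):

```bash
# Fetch the credentials from wherever they are stored, export them,
# and run dataflow-runner in the same shell session.
export AWS_ACCESS_KEY_ID="$(fetch-secret aws-access-key-id)"          # hypothetical helper
export AWS_SECRET_ACCESS_KEY="$(fetch-secret aws-secret-access-key)"  # hypothetical helper

dataflow-runner run-transient \
  --emr-config cluster.json \
  --emr-playbook playbook.json
```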