It’s up to you. You could for instance use cron jobs or Nomad.
In cluster.json and playbook.json we can't use environment variables except for the AWS credentials, so if you don't want to store the values directly in the files, you would need to retrieve them and update the configs dynamically just before using them.
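One possible way to do that (just a sketch of the "update the configs dynamically" idea; the template file names and variables are illustrative, not something Dataflow Runner provides) is to keep placeholder copies of the files in the repo and render the real ones right before running:

```bash
# Sketch: render cluster.json / playbook.json from templates just before use,
# so no real values ever live in the repo.
export EMR_LOG_URI="s3://my-logs-bucket/emr/"
export SHREDDED_OUTPUT="s3://my-shredded-bucket/good/"

envsubst < cluster.json.tmpl > cluster.json
envsubst < playbook.json.tmpl > playbook.json
```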
For the loader and shredder configuration files you can use environment variables directly in the HOCON, like this: "host": ${REDSHIFT_HOST}
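For example (field names here are just illustrative; the point is the ${...} substitution, which gets resolved from the environment when the app starts):

```hocon
# Illustrative HOCON snippet: these values are picked up from the
# REDSHIFT_HOST / REDSHIFT_PASSWORD environment variables at startup.
"host": ${REDSHIFT_HOST}
"port": 5439
"password": ${REDSHIFT_PASSWORD}
```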
Indeed, what matters is that the EMR cluster and the IAM role used for the shredder have sufficient permissions. If the shredder is run with Dataflow Runner, then Dataflow Runner needs to know about the secret key.
No, it can't; at the moment only username/password authentication is supported.
Thanks @BenB, that solves most of my doubts. One last thing:
I'm running the shredder with Dataflow Runner, but I can't provide the AWS secret key in the configs, as the code lives in a public repo, and I can't pass or replace it from environment variables dynamically either. Do you see any better way?
I'm not sure I understand. When you run the shredder with dataflow-runner (with either a transient or a persistent EMR cluster), you have to specify the region and credentials:
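For example, in cluster.json (an excerpt showing only the relevant fields, with placeholder values; "env" tells Dataflow Runner to read the credentials from environment variables):

```json
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "RDB shredder",
    "region": "eu-central-1",
    "credentials": {
      "accessKeyId": "env",
      "secretAccessKey": "env"
    }
  }
}
```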
And then the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables need to be set when the dataflow-runner command is run. That's the only way Dataflow Runner can know where to create/use the EMR cluster for the shredder (which region and which AWS account).
You have to set these 2 environment variables at some point; you can't run the shredder without them.
Wherever they are stored doesn't matter, but you need to retrieve them and set them on the machine where you run dataflow-runner, just before you run it.
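Something like this, for instance (a sketch only; the retrieval commands are placeholders for whatever secrets store you use, and it assumes a transient cluster):

```bash
# Fetch the credentials from wherever they are stored, export them,
# and run dataflow-runner in the same shell session.
export AWS_ACCESS_KEY_ID="$(fetch-secret aws-access-key-id)"          # hypothetical helper
export AWS_SECRET_ACCESS_KEY="$(fetch-secret aws-secret-access-key)"  # hypothetical helper

dataflow-runner run-transient \
  --emr-config cluster.json \
  --emr-playbook playbook.json
```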