Enable Ganglia on Snowplow EMR clusters

Hi there,

config.yml takes an additional_info JSON field, but I’m not sure how I could configure the EMR cluster to include Ganglia.

Can you point me in the right direction?


Hi @rgabo,

I don’t think additional_info can be used for this purpose. To add Ganglia to an EMR cluster, you would have to pass the --applications parameter, as described at http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-ganglia.html.

This can be achieved with Dataflow Runner. More info on the latest release is here: https://snowplowanalytics.com/blog/2017/03/31/dataflow-runner-0.2.0-released/.

To be more specific, the configuration file would need to include the value Ganglia on this line: https://github.com/snowplow/dataflow-runner/blob/master/config/cluster.json.sample#L84
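
As a rough illustration of that change, here is a minimal Python sketch that adds Ganglia to the application list of a cluster config. It assumes the list lives under data.applications, as in the sample file linked above; the add_application helper and the trimmed-down sample dictionary are hypothetical and are not part of Dataflow Runner.

```python
import json


def add_application(config: dict, app: str = "Ganglia") -> dict:
    """Return a copy of a cluster config with `app` in its application list.

    Assumes the list is stored under config["data"]["applications"].
    """
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round-trip
    apps = updated.setdefault("data", {}).setdefault("applications", [])
    if app not in apps:  # avoid listing the same application twice
        apps.append(app)
    return updated


# Trimmed-down, hypothetical stand-in for a real cluster.json
sample = {"data": {"name": "example cluster", "applications": ["Hadoop", "Spark"]}}
print(json.dumps(add_application(sample)["data"]["applications"]))
# prints ["Hadoop", "Spark", "Ganglia"]
```

In practice you would probably just edit the JSON file by hand; the helper only shows which field is expected to change.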


Thanks, @ihor, I’ll keep an eye on Dataflow Runner development.

I have a specific requirement that led me to write this script to spin up the cluster and submit steps, but you can use this or a similar script to start the cluster and then pass the cluster ID to Dataflow Runner to submit the steps.
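
For anyone who wants to try the same split (a script starts the cluster, Dataflow Runner submits the steps), here is a minimal Python sketch of the idea. It is not the script mentioned above. It assumes boto3 is installed and AWS credentials are configured, and the release label, instance types, instance count, cluster ID and playbook.json path are all placeholder assumptions. The role names are the AWS default EMR roles, which may not exist in your account.

```python
def build_job_flow_request(name, release_label="emr-5.4.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All values are illustrative placeholders; adjust them to your account.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Ganglia is requested here, alongside the other applications
        "Applications": [{"Name": "Hadoop"}, {"Name": "Ganglia"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # keep the cluster alive so steps can be submitted afterwards
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def start_cluster(name):
    """Start the cluster and return its ID (calls AWS, so it can incur cost)."""
    import boto3  # third-party; imported lazily so the helpers work without it

    emr = boto3.client("emr")
    return emr.run_job_flow(**build_job_flow_request(name))["JobFlowId"]


def dataflow_runner_command(cluster_id, playbook="playbook.json"):
    """Argument list for submitting a playbook to an already running cluster."""
    return ["dataflow-runner", "run",
            "--emr-playbook", playbook,
            "--emr-cluster", cluster_id]


# "j-EXAMPLE" is a made-up cluster ID used only to show the command shape
print(" ".join(dataflow_runner_command("j-EXAMPLE")))
```

The list returned by dataflow_runner_command can be passed to subprocess.run once start_cluster has returned a real cluster ID.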