Application configuration with dataflow-runner

Fair warning: I am new to Snowplow and to AWS EMR.

I have been tasked with evolving an existing project that had previously used EmrEtlRunner. Part of the request was to move the EMR cluster configuration from EmrEtlRunner to Dataflow Runner, but I have run into a problem with the configuration.

Specifically, the Applications key from describe-cluster for the cluster created with EmrEtlRunner shows

    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      },
      {
        "Name": "Spark",
        "Version": "2.1.0"
      },
      {
        "Name": "Spark",
        "Version": "2.1.0"
      }
    ]

but the cluster I configure with Dataflow Runner shows only Hadoop:

    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      }
    ]

but I don’t understand why.

I specifically need Spark.

My cluster.json lists

    "applications": [ "Hadoop", "Spark" ]

Any suggestions on how I should proceed or what I should look at?

This info might help others.

  1. The problem with configuring applications on my EMR cluster went away when I switched from using run-transient to instead using the sequence of up, run, and down.

  2. When using the run command, dataflow-runner would abort, reporting that it had exceeded an AWS rate limit. This appeared to be caused by too-frequent polling of the AWS EMR status. I resolved that issue by using the --async flag to dataflow-runner run. (This required me to implement my own, less-frequent polling so I would know when to issue the dataflow-runner down command.)

I hope this info helps others who might encounter similar issues. If I get a spare moment, I might dig into the code and see if I can find a good fix for either issue. If so, I’ll submit a pull request.

Hi @cfraizer,

  1. run-transient relies on up, run and down behind the scenes, it would be interesting to track down the issue you had as it seems quite weird
  2. I logged #37 to make requests with an exponential backoff

Pull requests are always welcome!

Thanks, Ben. I agree that it’s quite weird.

Quick summary of run-transient issue:

  • With run-transient if I ran the AWS CLI command aws emr describe-cluster, it would show
    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      }
    ]
  • When I used the same cluster.json and playbook.json, the aws emr describe cluster results looked as I expected:
    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      },
      {
        "Name": "Spark",
        "Version": "2.1.0"
      }
    ]

The relevant portion of the cluster.json is

    "applications": [ "Hadoop", "Spark" ]