Application configuration with dataflow-runner

cfraizer · December 18, 2017, 3:17pm

Fair warning: I am new to Snowplow and to AWS EMR.

I have been tasked with evolving an existing project that had previously used EmrEtlRunner. Part of the request was to move the EMR cluster configuration from EmrEtlRunner to Dataflow Runner, but I have run into a problem with the configuration.

Specifically, the Applications key from describe-cluster for the cluster created with EmrEtlRunner shows

    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      },
      {
        "Name": "Spark",
        "Version": "2.1.0"
      },
      {
        "Name": "Spark",
        "Version": "2.1.0"
      }
    ]

but the cluster I configure with Dataflow Runner shows only Hadoop:

    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      }
    ]

but I don’t understand why.

I specifically need Spark.

My cluster.json lists

    "applications": [ "Hadoop", "Spark" ]

Any suggestions on how I should proceed or what I should look at?

cfraizer · December 22, 2017, 1:31pm

This info might help others.

The problem with configuring applications on my EMR cluster went away when I switched from using run-transient to instead using the sequence of up, run, and down.
When using the run command, dataflow-runner would abort, reporting that it had exceeded an AWS rate limit. This appeared to be caused by too-frequent polling of the AWS EMR status. I resolved that issue by using the --async flag to dataflow-runner run. (This required me to implement my own, less-frequent polling so I would know when to issue the dataflow-runner down command.)

I hope this info helps others who might encounter similar issues. If I get a spare moment, I might dig into the code and see if I can find a good fix for either issue. If so, I’ll submit a pull request.

BenFradet · December 22, 2017, 3:23pm

Hi @cfraizer,

run-transient relies on up, run and down behind the scenes, it would be interesting to track down the issue you had as it seems quite weird
I logged #37 to make requests with an exponential backoff

Pull requests are always welcome!

cfraizer · December 22, 2017, 3:44pm

Thanks, Ben. I agree that it’s quite weird.

Quick summary of run-transient issue:

With run-transient if I ran the AWS CLI command aws emr describe-cluster, it would show

    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      }
    ]

When I used the same cluster.json and playbook.json, the aws emr describe cluster results looked as I expected:

    "Applications": [
      {
        "Name": "Hadoop",
        "Version": "2.7.3"
      },
      {
        "Name": "Spark",
        "Version": "2.1.0"
      }
    ]

The relevant portion of the cluster.json is

    "applications": [ "Hadoop", "Spark" ]

Topic		Replies	Views
Dataflow Runner released New releases	2	1576	February 11, 2017
Splitting EmrEtlRunner into snowplowctl and Dataflow Runner RFCs	0	3569	June 15, 2016
Recommended/Supported EMR Versions? Enrichment	3	1197	March 31, 2021
Does Dataflow Runner replace EmrEtlRunner For engineers	6	2568	August 16, 2017
Dataflow Runner run-transient not working For engineers	6	819	February 17, 2022

Application configuration with dataflow-runner

Related topics