Fair warning: I am new to Snowplow and to AWS EMR.
I have been tasked with evolving an existing project that had previously used EmrEtlRunner. Part of the request was to move the EMR cluster configuration from EmrEtlRunner to Dataflow Runner, but I have run into a problem with the configuration.
Specifically, the Applications key from describe-cluster for the cluster created with EmrEtlRunner shows
"Applications": [
  {
    "Name": "Hadoop",
    "Version": "2.7.3"
  },
  {
    "Name": "Spark",
    "Version": "2.1.0"
  },
  {
    "Name": "Spark",
    "Version": "2.1.0"
  }
]
but the cluster I configure with Dataflow Runner shows only Hadoop:
"Applications": [
  {
    "Name": "Hadoop",
    "Version": "2.7.3"
  }
]
but I don’t understand why.
I specifically need Spark.
My cluster.json lists
"applications": [ "Hadoop", "Spark" ]
Any suggestions on how I should proceed or what I should look at?
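One way to make this kind of mismatch easy to spot is to compare the requested application names against the output of aws emr describe-cluster. The following is only a minimal Python sketch: the function name missing_applications is hypothetical, and it assumes the describe-cluster JSON has already been loaded, with the Applications list nested under the top-level Cluster key.

```python
import json


def missing_applications(requested, describe_cluster_output):
    """Return the requested application names that the cluster does not report."""
    installed = {
        app["Name"].lower()
        for app in describe_cluster_output["Cluster"].get("Applications", [])
    }
    return [name for name in requested if name.lower() not in installed]


# Example input mirroring the Hadoop-only result shown above.
described = json.loads("""
{"Cluster": {"Applications": [{"Name": "Hadoop", "Version": "2.7.3"}]}}
""")
print(missing_applications(["Hadoop", "Spark"], described))  # ['Spark']
```

An empty list would mean every requested application was installed.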
This info might help others.
- The problem with configuring applications on my EMR cluster went away when I switched from using run-transient to instead using the sequence of up, run, and down.
- When using the run command, dataflow-runner would abort, reporting that it had exceeded an AWS rate limit. This appeared to be caused by too-frequent polling of the AWS EMR status. I resolved that issue by using the --async flag to dataflow-runner run. (This required me to implement my own, less-frequent polling so I would know when to issue the dataflow-runner down command.)
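For anyone wanting to do something similar, here is a minimal Python sketch of less-frequent polling. It is not the script used above. It assumes the AWS CLI is installed and configured, and it reads step states from aws emr list-steps; the function names, the 60-second default interval, and the injectable fetch and sleep parameters are illustrative choices.

```python
import json
import subprocess
import time

# States in which an EMR step will not change any further.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def fetch_step_states(cluster_id):
    """Return the state of every step on the cluster via the AWS CLI."""
    out = subprocess.check_output(
        ["aws", "emr", "list-steps", "--cluster-id", cluster_id, "--output", "json"]
    )
    return [step["Status"]["State"] for step in json.loads(out)["Steps"]]


def wait_for_steps(cluster_id, interval=60, fetch=fetch_step_states, sleep=time.sleep):
    """Poll every `interval` seconds until all steps reach a terminal state."""
    while True:
        states = fetch(cluster_id)
        if states and all(state in TERMINAL_STATES for state in states):
            return states
        sleep(interval)
```

Once wait_for_steps returns, the caller can inspect the returned states for failures and then issue dataflow-runner down.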
I hope this info helps others who might encounter similar issues. If I get a spare moment, I might dig into the code and see if I can find a good fix for either issue. If so, I’ll submit a pull request.
Hi @cfraizer,
- run-transient relies on up, run and down behind the scenes, so it would be interesting to track down the issue you had, as it seems quite weird.
- I logged #37 to make requests with an exponential backoff
Pull requests are always welcome!
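To illustrate what exponential backoff means here, the following is a generic Python sketch, not the Dataflow Runner implementation and not the change tracked in #37. The helper name call_with_backoff, the is_throttled predicate, and the default delays are all assumptions made for the example.

```python
import random
import time


def call_with_backoff(request, is_throttled, max_attempts=6, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request(), retrying with exponentially growing delays when throttled."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as err:
            # Re-raise anything that is not throttling, and give up on the last attempt.
            if not is_throttled(err) or attempt == max_attempts - 1:
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the computed ceiling.
            sleep(random.uniform(0, delay))
```

The random jitter spreads retries out so that several clients hitting the same rate limit do not all retry at the same moment.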
Thanks, Ben. I agree that it’s quite weird.
Quick summary of run-transient issue:
- With run-transient, if I ran the AWS CLI command aws emr describe-cluster, it would show
"Applications": [
  {
    "Name": "Hadoop",
    "Version": "2.7.3"
  }
]
- When I used the same cluster.json and playbook.json with the up, run, and down sequence, the aws emr describe-cluster results looked as I expected:
"Applications": [
  {
    "Name": "Hadoop",
    "Version": "2.7.3"
  },
  {
    "Name": "Spark",
    "Version": "2.1.0"
  }
]
The relevant portion of the cluster.json is
"applications": [ "Hadoop", "Spark" ]