Hi all,
I’m new to Snowplow and dataflow-runner, so this may be a misunderstanding on my part, but I’ve followed the setup guide from here exactly for loading data into Snowflake:
Yet I keep getting an error when I run this command:
./dataflow-runner run-transient --emr-config=cluster.json --emr-playbook=playbook.json
The error is "Cannot run program "spark-submit" (in directory "."): error=2, No such file or directory"
I can verify through CloudTrail that the command is sent to AWS EMR and that the cluster starts, but for some reason some of the parameters in my config files are not making their way into EMR, so the job keeps failing.
My cluster.json is almost an exact copy of the tutorial
"name":"dataflow-runner - snowflake transformer",
"subnetId": "test"
"ebs_optimized": false,
"ebsBlockDeviceConfigs": [
"volumesPerInstance" : 1
"tags":[ ],
"bootstrapActionConfigs":[ ],
"applications":[ "Hadoop", "Spark" ]
My playbook.json is also an exact copy
"accessKeyId":"<%= ENV['AWS_ACCESS_KEY'] %>",
"secretAccessKey":"<%= ENV['AWS_SECRET_KEY'] %>"
"name":"Snowflake Transformer",
"{{base64File "./targets/snowflake.json"}}",
"{{base64File "resolver.json"}}",
"{{base64File "dynamodb.json"}}"
"name":"Snowflake Loader",
"{{base64File "./targets/snowflake.json"}}",
"{{base64File "./resolver.json"}}"
"tags":[ ]
In the end, dataflow-runner has full control over setting up EMR, so I have no visibility into what’s going wrong. The only thing I have is the following request, taken from CloudTrail, that dataflow-runner sent. It does not mention Spark anywhere, even though I’ve added it to my cluster.json:
"requestParameters": {
"name": "dataflow-runner - snowflake transformer",
"logUri": "s3://logs/data-snowplow-emr-etl-runner/",
"releaseLabel": "emr-6.1.0",
"instances": {
"instanceGroups": [
"instanceRole": "MASTER",
"instanceType": "m4.large",
"instanceCount": 1
"instanceRole": "CORE",
"instanceType": "r4.xlarge",
"instanceCount": 1,
"ebsConfiguration": {
"ebsOptimized": false
"ec2KeyName": "test",
"placement": {
"availabilityZone": ""
"keepJobFlowAliveWhenNoSteps": true,
"terminationProtected": false,
"ec2SubnetId": "test"
"visibleToAllUsers": true,
"jobFlowRole": "x",
"serviceRole": "x"
Did I mess something up? Any help would be really appreciated.
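
For anyone who wants to reproduce the check, here is a minimal sketch of how the applications listed in cluster.json could be compared against the request seen in CloudTrail. It assumes both have been saved locally as JSON files and that the application list sits under a key named "applications" somewhere in each document. The function names and file paths are hypothetical and are not part of dataflow-runner.

```python
import json
from typing import Any, List


def find_applications(node: Any) -> List[str]:
    """Collect application names from every "applications" key in a JSON tree."""
    found: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == "applications" and isinstance(value, list):
                for app in value:
                    # Entries may be plain strings or objects with a name field.
                    if isinstance(app, str):
                        found.append(app)
                    elif isinstance(app, dict):
                        name = app.get("name") or app.get("Name")
                        if name:
                            found.append(name)
            else:
                found.extend(find_applications(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_applications(item))
    return found


def missing_applications(cluster_config: Any, request: Any) -> List[str]:
    """Return applications present in the config but absent from the request."""
    requested = {name.lower() for name in find_applications(request)}
    return [
        name
        for name in find_applications(cluster_config)
        if name.lower() not in requested
    ]


def check_files(cluster_path: str, event_path: str) -> List[str]:
    """Load both JSON files (paths are assumptions) and compare them."""
    with open(cluster_path) as cluster_file:
        cluster_config = json.load(cluster_file)
    with open(event_path) as event_file:
        request = json.load(event_file)
    return missing_applications(cluster_config, request)
```

Calling `check_files("cluster.json", "cloudtrail-event.json")` would list the applications that were configured but never requested. With the request shown above, I would expect both Hadoop and Spark to be reported as missing.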