Dataflow runner docker container

Hey all,
Strange issue here: I have installed the Snowplow collector, enrich, redshift, s3-sink and rdb-loader with Terraform and Docker containers (AWS Fargate).

To my understanding, to run RDB Loader (post R35) you recommend either Dataflow Runner or a boto3 script (R35 Upgrade Guide - Snowplow Docs).

For now, due to external factors, I can't use a boto3 script (or Lambda for that matter), so instead of an EC2 server to run the script I planned to run dataflow-runner in a simple Docker container.

Based on the documentation, if I use 64-bit Linux I should be able to run Dataflow Runner with no other dependencies.

Although I have no issues getting the container up and running, and I use a bash script as the launch script (I do that for all my Docker images), I can't get dataflow-runner to run. The response I get is "not found", even though I have verified that it is executable and that I have the right permissions. Some suggestions on Stack Overflow are that this happens because of missing dependencies.

Has anyone got Dataflow Runner to work in a Docker image? Any ideas would be welcome… Thanks, F

Hi @fwahlqvist ,

That’s correct.

Could you share your script running Dataflow Runner, please? Have you tried doing all the steps manually in a Docker container to make sure that it's working?

Hey Ben, thanks for getting back to me

My docker-compose.yml is

```yaml
version: "3"
services:
  dataflow-runner:  # service name stripped from the original post
    container_name: snowplow-dataflow-runner
    # volumes:
    #   - .:/snowplow
    # restart: "unless-stopped"
    build:
      context: ./
      dockerfile: Dockerfile
```

My Dockerfile is

```dockerfile
FROM snowplow/base-alpine as builder

# RUN apk update && apk upgrade && apk add bash && apk add bash-completion
WORKDIR /snowplow
COPY /snowplow/
COPY playbook.json /snowplow/playbook.json
COPY cluster.json /snowplow/cluster.json
RUN wget
RUN unzip

FROM snowplow/base-alpine
RUN apk update && apk upgrade && apk add bash

WORKDIR /snowplow
COPY --from=builder /snowplow /snowplow
RUN chmod +x
# RUN chown snowplow:snowplow
# RUN echo ${PATH}
# RUN ls -la
ENTRYPOINT [ "./" ]
```

and the entrypoint script is

```bash
echo "in script"
ls -la
./dataflow-runner help
# ./dataflow-runner run-transient --emr-config=cluster.json --emr-playbook=playbook.json
# run-transient: launches, runs and then terminates an EMR cluster
```

And finally, the output from the script is

```
snowplow-dataflow-runner | in script
snowplow-dataflow-runner | ./ line 5: ./dataflow-runner: not found
snowplow-dataflow-runner | total 28652
snowplow-dataflow-runner | drwxr-xr-x    1 snowplow snowplow      4096 Feb 20 15:17 .
snowplow-dataflow-runner | drwxr-xr-x    1 root     root          4096 Feb 20 15:17 ..
snowplow-dataflow-runner | drwxr-xr-x    1 snowplow snowplow      4096 Oct 29 15:47 bin
snowplow-dataflow-runner | -rw-r--r--    1 root     root          1987 Feb 17 17:43 cluster.json
snowplow-dataflow-runner | drwxr-xr-x    2 snowplow snowplow      4096 Oct 29 15:47 config
snowplow-dataflow-runner | -rwxr-xr-x    1 root     root      20789708 Feb 20 15:17 dataflow-runner
snowplow-dataflow-runner | -rw-r--r--    1 root     root       8518063 Aug 24 15:55
snowplow-dataflow-runner | -rwxr-xr-x    1 root     root           214 Feb 20 15:17
snowplow-dataflow-runner | -rw-r--r--    1 root     root          1483 Feb 17 18:23 playbook.json
snowplow-dataflow-runner | /snowplow
snowplow-dataflow-runner exited with code 127
```

Hopefully it is something simple I am missing, but I have tried changing directories, permissions, PATH, etc.
Any insight is welcome…


Hey @fwahlqvist ,

I suspect that the Docker image you are using does not contain all the system libraries required by the Dataflow Runner binary.

You could either use a bigger Linux image, or troubleshoot which library is missing:

```
$ ldd dataflow-runner
        (0x00007fff7198c000)
        => /lib/x86_64-linux-gnu/ (0x00007fcec4640000)
        => /lib/x86_64-linux-gnu/ (0x00007fcec42a1000)
        /lib64/ (0x00007fcec485d000)
```

Maybe one of these libs is missing. Could you run `strace -f -e open ./dataflow-runner` and see what the output says, please?
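As a side note, the `exited with code 127` in your log is consistent with this: 127 is the shell's status for a command that cannot be located, and a glibc-linked binary on a musl-based image such as Alpine fails the same way, because the ELF interpreter it requests is absent even though the file itself exists and is executable. A minimal illustration with a deliberately nonexistent path:

```shell
#!/bin/sh
# Executing a path the shell cannot resolve yields status 127. A
# dynamically linked glibc binary on Alpine hits the same status,
# since the loader named in its ELF header is missing on the image.
./no-such-binary 2>/dev/null
echo "exit status: $?"
# prints: exit status: 127
```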

Hey Ben,
Many thanks for the help. It turns out that the "" is not in the kernel of base-alpine, so I updated my Dockerfile to use base-debian; when running the code now, it prints out the help command…
Many thanks !



I'm facing the same problem, and it does work with base-debian, but is there any smaller image that could run dataflow-runner?
I've tried busybox:stable-glibc, but that wasn't enough.

Hey @NirKirshbloom, what has worked in the past for some other Go projects is to build the binary with `CGO_ENABLED=0`.

Essentially you would edit this line to include this extra environment variable: dataflow-runner/Makefile at master · snowplow/dataflow-runner · GitHub

```makefile
gox -osarch=linux/amd64 -output=$(bin_linux) ./$(merge_src_dir)

# becomes

CGO_ENABLED=0 gox -osarch=linux/amd64 -output=$(bin_linux) ./$(merge_src_dir)
```

You should then be able to run Dataflow Runner inside something like alpine:3.14 to keep the size of your image down.

I have not tried this with this project, but if you have time to give it a try and report back, that would be great!

Hey @josh

Yes, it works, thanks!
I added it to the make command instead of tampering with the Makefile content:

```
make cli-linux -e CGO_ENABLED=0
```

Now it's able to run on alpine:3.15 using a multi-stage build.
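For anyone wondering why passing the variable on the make command line is enough: a variable assigned there overrides an assignment inside the Makefile and is exported into the recipe's environment, so it reaches gox and the Go toolchain. A toy sketch (the file name `demo.mk` is just for illustration):

```shell
#!/bin/sh
# Write a throwaway Makefile that sets CGO_ENABLED itself, then show
# that a command-line assignment wins over the in-file default.
printf 'CGO_ENABLED = 1\nshow:\n\t@echo CGO_ENABLED=$(CGO_ENABLED)\n' > demo.mk
make -f demo.mk show CGO_ENABLED=0
# prints: CGO_ENABLED=0
```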

Adding the Dockerfile:

```dockerfile
FROM golang:bullseye as Builder

RUN apt update
RUN apt install unzip
RUN apt install zip

RUN unzip *

RUN make --directory=dataflow-runner-${DATAFLOW_RUNNER_VERSION} cli-linux -e CGO_ENABLED=0
RUN unzip dataflow-runner-${DATAFLOW_RUNNER_VERSION}/build/bin/*.zip

FROM alpine:3.15

WORKDIR /snowplow

COPY ./config /snowplow/config
COPY --from=Builder /src/dataflow-runner ./

ENTRYPOINT ./dataflow-runner run-transient --emr-config ./config/cluster.json --emr-playbook ./config/playbook.json --vars JSON_RESOLVER_VALUE,${RESOLVER_BASE64},CONFIG_HOCON_VALUE,${CONFIG_HOCON_BASE64},ENV_VAR,${ENVIRONMENT}
```

Glad to hear it's working! I have opened a ticket to look at adding a container build to the project as well now.