System upgrade on snowplow/base-debian required

Dear Snowplow team,

I’m reaching out to request a system upgrade for the base image snowplow/base-debian. We use it to customize the Snowplow modules (loader, collector, etc.), and the image was last updated 3 years ago. Since then, many vulnerabilities in the Debian system packages (and in the package management system itself) have been detected and resolved. Running the system upgrade on every build of our custom image is costly. Please update the image by upgrading the system.

Which Docker images are you using specifically?

Most (including collector, enrich etc) will have Ubuntu 20.04 or distroless versions of the Docker images available.

Hey Mike, thanks for getting back to me. We’re using snowplow/base-debian to run the dataflow-runner binary used for the shredding step.
Here’s the link to the image: Docker

Here’s what our Dockerfile looks like:

FROM snowplow/base-debian:0.2.2

ARG STAGE
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
ARG AWS_DEFAULT_REGION
ARG AWS_ACCOUNT_NUMBER
ARG SP_SCHEMA_JSONPATH_URI
ARG SP_SCHEMA_URI
ARG AWS_PUBLIC_SUBNET_ID
ARG SP_LOADER_URI
ARG SP_ENRICHED_URI
ARG SP_SHREDDED_URI
ARG EMR_ECS_KEY_PAIR
ARG LOGURI
ARG SQS_QUEUE
ARG SNS_TOPIC


ENV STAGE=$STAGE
ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
ENV AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
ENV AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
ENV AWS_ACCOUNT_NUMBER=$AWS_ACCOUNT_NUMBER
ENV SP_SCHEMA_JSONPATH_URI=$SP_SCHEMA_JSONPATH_URI
ENV SP_SCHEMA_URI=$SP_SCHEMA_URI
ENV AWS_PUBLIC_SUBNET_ID=$AWS_PUBLIC_SUBNET_ID
ENV SP_LOADER_URI=$SP_LOADER_URI
ENV SP_ENRICHED_URI=$SP_ENRICHED_URI
ENV EMR_ECS_KEY_PAIR=$EMR_ECS_KEY_PAIR
ENV LOGURI=$LOGURI
ENV SQS_QUEUE=$SQS_QUEUE
ENV SNS_TOPIC=$SNS_TOPIC

WORKDIR /app
COPY src/ /app/

# hadolint ignore=SC1068, DL3008
RUN apt-get update && \
    apt-get install -yqq --no-install-recommends wget unzip git tar vim awscli &&\
    wget -q https://github.com/snowplow/dataflow-runner/releases/download/0.7.3/dataflow_runner_0.7.3_linux_amd64.zip && \
    unzip dataflow_runner_0.7.3_linux_amd64.zip && \
    apt-get clean && \
    rm -fr /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
    sh modify_configs.sh


CMD ["./dataflow-runner", "run-transient", "--emr-config", "emr-config.json", "--emr-playbook", "playbook.json"]

Where the modify_configs shell script does the variable substitution (environment variables in config files):

#!/bin/bash
sed -i -e 's|SP_SCHEMA_URI|'"$SP_SCHEMA_URI"'|' resolver.json
sed -i -e 's|STAGE|'"$STAGE"'|' emr-config.json
sed -i -e 's|AWS_ACCESS_KEY_ID|'"$AWS_ACCESS_KEY_ID"'|' emr-config.json
sed -i -e 's|AWS_SECRET_ACCESS_KEY|'"$AWS_SECRET_ACCESS_KEY"'|' emr-config.json
sed -i -e 's|AWS_PUBLIC_SUBNET_ID|'"$AWS_PUBLIC_SUBNET_ID"'|' emr-config.json
sed -i -e 's|LOGURI|'"$LOGURI"'|' emr-config.json
sed -i -e 's|AWS_DEFAULT_REGION|'"$AWS_DEFAULT_REGION"'|' emr-config.json
sed -i -e 's|EMR_ECS_KEY_PAIR|'"$EMR_ECS_KEY_PAIR"'|' emr-config.json
sed -i -e 's|AWS_DEFAULT_REGION|'"$AWS_DEFAULT_REGION"'|' playbook.json
sed -i -e 's|AWS_ACCESS_KEY_ID|'"$AWS_ACCESS_KEY_ID"'|' playbook.json
sed -i -e 's|AWS_SECRET_ACCESS_KEY|'"$AWS_SECRET_ACCESS_KEY"'|' playbook.json
sed -i -e 's|SP_LOADER_URI|'"$SP_LOADER_URI"'|' playbook.json
sed -i -e 's|SP_SHREDDED_URI|'"$SP_SHREDDED_URI"'|' playbook.json
sed -i -e 's|SP_ENRICHED_URI|'"$SP_ENRICHED_URI"'|' playbook.json
sed -i -e 's|SP_ENRICHED_URI|'"$SP_ENRICHED_URI"'|' config.hocon
sed -i -e 's|SP_SHREDDED_URI|'"$SP_SHREDDED_URI"'|' config.hocon
sed -i -e 's|SQS_QUEUE|'"$SQS_QUEUE"'|' config.hocon
sed -i -e 's|SNS_TOPIC|'"$SNS_TOPIC"'|' config.hocon
sed -i -e 's|AWS_ACCOUNT_NUMBER|'"$AWS_ACCOUNT_NUMBER"'|' config.hocon
sed -i -e 's|AWS_DEFAULT_REGION|'"$AWS_DEFAULT_REGION"'|' config.hocon
sed -i -e 's|STAGE|'"$STAGE"'|' config.hocon
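
The repeated sed calls above all follow one pattern: a placeholder in a config file has the same name as the environment variable that replaces it. As a minimal sketch of that pattern (not a change to what the script does), the loop below uses a hypothetical helper, `substitute_vars`, which is not part of any Snowplow tooling. It assumes bash and GNU sed, and that the values contain no `|`, `&` or `\` characters, since those are special in a sed replacement.

```bash
#!/bin/bash
# Hypothetical helper (illustration only): substitute_vars FILE VAR [VAR...]
# For each VAR, replaces the first occurrence per line of the literal text
# VAR in FILE with the value of the environment variable of the same name,
# which is what each sed line in the original script does.
substitute_vars() {
    local file=$1
    shift
    local var
    for var in "$@"; do
        # ${!var} is bash indirect expansion: the value of the variable
        # whose name is stored in $var (empty if that variable is unset).
        sed -i -e 's|'"$var"'|'"${!var}"'|' "$file"
    done
}
```

With this helper, the emr-config.json lines could be written as a single call, for example `substitute_vars emr-config.json STAGE AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_PUBLIC_SUBNET_ID LOGURI AWS_DEFAULT_REGION EMR_ECS_KEY_PAIR`.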

Hi @Kristina_Pianykh

As your intention is just to run dataflow-runner inside a Docker image, my advice is to use our official dataflow-runner image. It is available on Docker Hub here.

docker pull snowplow/dataflow-runner:0.7.3

That image is built upon a fairly new version of Alpine Linux.

Historically, we maintained the snowplow/base-debian image because we used it internally as a base for all our other applications (collector, enrich, loaders, etc.). However, over the last few years we have switched to using 3rd party base images for those applications, e.g. alpine, eclipse-temurin and distroless.

For this reason, I think it is very unlikely that we will ever update snowplow/base-debian again. We have ceased to need it ourselves, and I don’t think we would provide any real benefit to the community by continuing to maintain it.

If you need any help finding or configuring our supported Snowplow Docker images, then I’d be happy to help you further.

Hi Mike,

Since this topic is hot, I am wondering whether Snowplow is considering a fix for the current vulnerability CVE-2023-0286 in streamloader, mutator and repeater in the next release.

Thanks!

Our internal scans show that the latest image for streamloader (1.6.5-distroless) is not subject to this vulnerability, as it uses openssl/libssl1.1@1.1.1n-0+deb11u4 on a Debian 11 (Bullseye) base rather than the vulnerable version (Buster, OpenSSL 1.1.1n-0+deb10u3).

Thanks for the clarification. We were using 1.6.4, hence the confusion.

Many thanks for the tip, we’ll definitely check out the dataflow-runner image. Appreciate the detailed explanation!