Hey, so I’m trying to automate the final step of the Snowplow pipeline: pushing the shredded data into my Redshift DWH using RDB Loader 1.0.0. So far I’ve always run the same Docker command locally:
docker run snowplow/snowplow-rdb-loader:1.0.0 --iglu-config $(cat ./resolver.json | base64) --config $(cat ./config.hocon | base64)
and it has always worked. Now I’m trying to make it work within an Airflow task, which means I need to retrieve the codebase from GitLab and replace the environment variables in config.hocon.
I’ve essentially tried to build a custom image derived from snowplow/snowplow-rdb-loader:1.0.0:
Dockerfile
FROM snowplow/snowplow-rdb-loader:1.0.0
USER root
RUN apt-get update \
&& apt-get install -y gettext git
COPY snowplow_loader.sh .
USER snowplow
ENTRYPOINT ["/bin/sh"]
The script then clones the repo from GitLab, substitutes the env vars, and invokes the loader:
snowplow_loader.sh
# get gitlab code for snowplow
git clone https://$GITLAB_USERNAME:$GITLAB_ACCESS_TOKEN@gitlab.com/etc/etc.git
# replace env vars
envsubst < ./snowplow/sink_redshift/loader/config.hocon > ./snowplow/sink_redshift/loader/config_prod.hocon
cp -f ./snowplow/sink_redshift/loader/config_prod.hocon ./snowplow/sink_redshift/loader/config.hocon
# run function
/home/snowplow/bin/snowplow-rdb-loader --config $(cat ./snowplow/sink_redshift/loader/config.hocon | base64) --iglu-config $(cat ./snowplow/sink_redshift/loader/resolver.json | base64)
to which I get an error:
For reference these are my config files:
config.hocon
{
# Human-readable identifier, can be random
"name": "Snowplow Redshift Loader",
# Machine-readable unique identifier, must be UUID
"id": "fake",
# Data Lake (S3) region
"region": "fake",
# SQS topic name used by Shredder and Loader to communicate
"messageQueue": "fake",
# Shredder-specific configs
"shredder": {
"type": "batch",
# Path to enriched archive (must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories)
"input": "fake",
# Path to shredded output
"output": {
"path": "fake",
# Shredder output compression, GZIP or NONE
"compression": "GZIP"
}
},
# Schema-specific format settings (recommended to leave all three groups empty and use TSV as default)
"formats": {
# Format used by default (TSV or JSON)
"default": "TSV",
# Schemas to be shredded as JSONs, corresponding JSONPath files must be present. Automigrations will be disabled
"json": [ ],
# Schemas to be shredded as TSVs, presence of the schema on Iglu Server is necessary. Automigrations enabled
"tsv": [ ],
# Schemas that won't be loaded
"skip": [ ]
},
# Warehouse connection details
"storage" = {
# Database, redshift is the only acceptable option
"type": "redshift",
# Redshift hostname
"host": "fake",
# Database name
"database": "dev",
# Database port
"port": 5439,
# AWS Role ARN allowing Redshift to load data from S3
"roleArn": "fake",
# DB schema name
"schema": "atomic",
# DB user with permissions to load data
"username": "fake",
# DB password
"password": "$SNOWPLOW_DWH_PASSWORD",
# Custom JDBC configuration
"jdbc": {"ssl": false},
# MAXERROR, amount of acceptable loading errors
"maxError": 10
},
# Additional steps. analyze, vacuum and transit_load are valid values
"steps": ["analyze"],
# Observability and logging options
"monitoring": {
# Snowplow tracking (optional)
"snowplow": null,
# Sentry (optional)
"sentry": null
}
}
resolver.json
{
"schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3",
"data": {
"cacheSize": 500,
"repositories": [
{
"name": "Iglu Central",
"priority": 0,
"vendorPrefixes": [ "com.snowplowanalytics" ],
"connection": {
"http": {
"uri": "https://myserver.com/api/"
}
}
}
]
}
}
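For completeness, the envsubst step can be sanity-checked by hand like this (dummy password value; paths assume the cloned repo layout from the script above):
# substitute the only placeholder ($SNOWPLOW_DWH_PASSWORD) and inspect the result
export SNOWPLOW_DWH_PASSWORD=dummy-password
envsubst < ./snowplow/sink_redshift/loader/config.hocon | grep '"password"'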
Also, if I invert the order of the arguments, I get a similar error, but related to the resolver.json instead.
So I suspect this is either something related to the base64 encoding (which is somehow different from my local base64 encoding) or the way I’m invoking the loader, but I haven’t been able to figure out the issue.
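One way to compare the two encodings is to check whether the output gets wrapped onto multiple lines, since an unquoted $(...) would then split it into several arguments (just a sketch, run both locally and inside the container):
# if the count is greater than 1, the encoder wraps its output and the
# unquoted $(...) breaks the --config value apart into extra arguments
cat ./snowplow/sink_redshift/loader/config.hocon | base64 | wc -l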
EDIT: after trying some things I found scattered around the web, I’ve managed to get it running by adding -w 0 to the base64 command. I’m waiting for my Shredder to finish running so I can test it fully, but in theory this should be resolved. Not sure why I don’t need this extra flag locally, though…
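For reference, this is the invocation that works inside the container now. GNU coreutils base64 (which the Debian-based image ships) wraps its output at 76 characters by default and -w 0 disables the wrapping, which is presumably why the flag is needed there but not locally:
# same call as before, but with unwrapped base64 output so each option
# receives a single argument
/home/snowplow/bin/snowplow-rdb-loader \
  --config $(cat ./snowplow/sink_redshift/loader/config.hocon | base64 -w 0) \
  --iglu-config $(cat ./snowplow/sink_redshift/loader/resolver.json | base64 -w 0)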