Summary
In R90 and R91, if you pass --skip shred to EmrEtlRunner, for example during a recovery after a failed database load, then RDB Loader loads only atomic.events, skipping the tables for self-describing events and contexts.
The bug was introduced in R90 and is also present in R91; the relevant report in snowplow/snowplow is issue #3403.
Who is affected
If you believe you could have resumed a failed pipeline in this way, the first step is to check your atomic.manifest table:
SELECT * FROM atomic.manifest WHERE etl_tstamp BETWEEN '2017-07-27' AND '2017-08-24' AND shredded_cardinality = 0;
Here 2017-07-27 stands for the date your pipeline was upgraded to R90+. This query returns all loads performed by RDB Loader that recorded no shredded types.
If this query returns no rows, then you have not been affected by this bug - but read on to avoid being caught by it before you upgrade to R92+.
If this query does return rows, if having zero shredded types is uncommon for your data pipeline, and if you have done a pipeline recovery since the upgrade, then it's likely that you have been affected by this bug.
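To see exactly which loads need recovering, you can list the affected rows - a minimal sketch, assuming your atomic.manifest has the standard etl_tstamp, event_count and shredded_cardinality columns created by RDB Loader:
SELECT etl_tstamp, event_count, shredded_cardinality
FROM atomic.manifest
WHERE etl_tstamp BETWEEN '2017-07-27' AND '2017-08-24'
  AND shredded_cardinality = 0
ORDER BY etl_tstamp;
Each returned etl_tstamp identifies one flawed load, which you can then match to a run= folder in your shredded archive.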
How to recover
The easiest way to recover is to use the new RDB Loader, version 0.13.0-rc4. With this version, you can launch a “dry run” for a specific folder in your shredded archive, which generates all of the COPY SQL statements that would be executed for a regular load into Redshift.
First, you need to download the RDB Loader to your local machine:
$ aws s3 cp s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.13.0-rc4.jar .
Then, the following script should allow you to launch RDB Loader from your local machine:
#!/bin/bash
set -e
# Fail-fast with error-message
function die() {
echo "$@" 1>&2 ; exit 1;
}
# Checking what uuid generator is available
if command -v uuid >/dev/null 2>&1; then
UUID_APP=uuid
elif command -v uuidgen >/dev/null 2>&1; then
UUID_APP=uuidgen
else
die "Either uuid or uuidgen must be installed"
fi
JARFILE="$(pwd)/snowplow-rdb-loader-0.13.0-rc4.jar"
[ "$#" -eq 6 ] || die "Please specify paths to: config.yml, redshift.json, resolver.json, s3://archive-shredded-path, s3://log-path, run=id"
# Positional arguments
config_yml=${1}
redshift_json=${2}
resolver_json=${3}
folder_path=${4}/${6}
log_path=${5}/rdb-loader/${6}
[ ! -f "$config_yml" ] && die "Cannot find EmrEtlRunner config file ${config_yml}"
[ ! -f "$redshift_json" ] && die "Cannot find Redshift target file ${redshift_json}"
[ ! -f "$resolver_json" ] && die "Cannot find Iglu resolver config file ${resolver_json}"
# RDB Loader expects its configuration base64-encoded (-w requires GNU base64)
config_yml_b64=$(base64 -w 0 "${config_yml}")
redshift_json_b64=$(base64 -w 0 "${redshift_json}")
resolver_json_b64=$(base64 -w 0 "${resolver_json}")
# Unique key for this invocation's log file
UUID=$($UUID_APP)
logkey=$log_path/$UUID
java -jar "$JARFILE" \
  --dry-run \
  --skip consistency_check \
  --config "$config_yml_b64" \
  --target "$redshift_json_b64" \
  --folder "$folder_path" \
  --logkey "$logkey" \
  --resolver "$resolver_json_b64"
This script passes arguments to RDB Loader just as EmrEtlRunner would, but it is completely safe to invoke: thanks to --dry-run, it performs no actions against Redshift and simply prints the load statements to stdout. It accepts six required positional arguments:
- Path to config.yml
- Path to the Redshift storage target configuration. RDB Loader won’t try to connect to Redshift, but it will complain if the configuration is invalid.
- Path to the Iglu Resolver configuration JSON
- Path to the shredded archive. This can be extracted from config.yml as aws.s3.buckets.shredded.archive
- Path to a log folder. This can be extracted from config.yml as aws.s3.buckets.log, but any available bucket can be used
- Run id (folder name) of the flawed load, such as run=2017-08-13-14-01-11
For example:
$ shredded-recovery.sh snowplow/config.yml snowplow/targets/redshift.json snowplow/resolver.json s3://acme-snowplow/archive/shredded/good s3://acme-snowplow/logs run=2017-08-13-14-01-11
Run this and you will see the “Performed SQL Queries” listed (don’t worry: thanks to the dry run, nothing was actually performed). Now:
- Copy this SQL into your editor
- Delete everything but the COPY statements for the shredded types - i.e. omit the SQL statements for atomic.events and atomic.manifest, otherwise you will double-load the events into atomic.events (see the sketch after this list for one way to automate this)
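If the dry-run output is long, a small filter can save some editing. A rough sketch, assuming you have saved the printed statements to a file called dry-run.sql and that each COPY statement starts on its own line and ends with a semicolon, as in the example below - always review the result before running it:
awk '
  /^COPY atomic\./ {
    # Keep only shredded-type tables; skip the two already-loaded tables
    keep = ($2 != "atomic.events" && $2 != "atomic.manifest")
  }
  keep { print }
  keep && /;[[:space:]]*$/ { keep = 0 }
' dry-run.sql > shred-recovery.sql
You can then wrap the resulting file in BEGIN; and COMMIT; yourself if you want the whole recovery to run as a single transaction.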
Assuming your load had only two shredded types (in a real-world scenario there will usually be many more), the statements should look like the following:
BEGIN;
COPY atomic.com_snowplowanalytics_snowplow_submit_form_1 FROM 's3://acme-snowplow/archive/shredded/good/run=2017-08-13-14-01-11/shredded-types/vendor=com.snowplowanalytics.snowplow/name=submit_form/format=jsonschema/version=1-'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123454468890:role/RedshiftLoadRole' JSON AS 's3://snowplow-hosted-assets-us-east-1/4-storage/redshift-storage/jsonpaths/com.snowplowanalytics.snowplow/submit_form_1.json'
REGION AS 'us-east-1'
MAXERROR 1 TRUNCATECOLUMNS TIMEFORMAT 'auto'
ACCEPTINVCHARS ;
COPY atomic.com_snowplowanalytics_snowplow_change_form_1 FROM 's3://acme-snowplow/archive/shredded/good/run=2017-08-13-14-01-11/shredded-types/vendor=com.snowplowanalytics.snowplow/name=change_form/format=jsonschema/version=1-'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123454468890:role/RedshiftLoadRole' JSON AS 's3://snowplow-hosted-assets-us-east-1/4-storage/redshift-storage/jsonpaths/com.snowplowanalytics.snowplow/change_form_1.json'
REGION AS 'us-east-1'
MAXERROR 1 TRUNCATECOLUMNS TIMEFORMAT 'auto'
ACCEPTINVCHARS ;
COMMIT;
It’s also recommended to update the manifest table to reflect the new state, by setting the shredded_cardinality column for the affected run:
UPDATE atomic.manifest SET shredded_cardinality = 2 WHERE etl_tstamp = '2017-07-03 06:58:25.891';
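To double-check that the manifest now reflects reality, a quick follow-up query, reusing the etl_tstamp from above:
SELECT etl_tstamp, shredded_cardinality FROM atomic.manifest WHERE etl_tstamp = '2017-07-03 06:58:25.891';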
You now have all the statements that RDB Loader should have run to load the data into Redshift. You can save them into a file, shred-recovery.sql, and load it manually via psql:
$ psql -h acme-cluster.abcdefgh01xnr.us-east-1.redshift.amazonaws.com -p 5439 snowplow -U analyst -f shred-recovery.sql
After this load completes, Redshift should be in a correct state, with all shredded types loaded. If you discovered more than one load with missing shredded types, it is safe to repeat the above steps for each affected folder.
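As a final check, you can verify that the recovered rows line up with their parent events. A minimal sketch, reusing the submit_form table from the example above - root_id in a shredded table references the parent event’s event_id in atomic.events:
SELECT COUNT(*)
FROM atomic.com_snowplowanalytics_snowplow_submit_form_1 sf
JOIN atomic.events e ON sf.root_id = e.event_id
WHERE e.etl_tstamp = '2017-07-03 06:58:25.891';
The count should match the number of submit_form events in the recovered run.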
How to avoid this issue
If you are running R90 or R91, do not attempt to run the database load with an EmrEtlRunner command which:
- Either explicitly includes --skip shred, or
- Implicitly skips shred, by resuming from the rdb_load or archive_raw steps
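For example, on R90 or R91 both of the following invocations would trigger the bug - a sketch assuming the standard EmrEtlRunner command-line syntax; double-check the flags against your installed version:
$ ./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --skip shred
$ ./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --resume-from rdb_load
Both become safe again once you have upgraded to R92+.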
This bug will be addressed in the upcoming R92 Maiden Castle release.