Hi, we are testing an upgrade to our Snowplow pipeline using the new stream enrichments, transformer and RDB loader. The previous version of the pipeline used EMR for enrichments, shredding and loading data into Redshift.
We’re starting to see these errors in Cloudwatch for the RDB loader:
Aug 23 11:00:02 ip-10-0-1-138 sp-rdbloader: INFO Load: COPY com_snowplowanalytics_snowplow_ua_parser_context_1 FROM s3://<obfuscated>/transformed/run=2023-08-23-10-55-00-f1bc24ff-4150-4c38-abff-6cd3aca0c12e/output=good/vendor=com.snowplowanalytics.snowplow/name=ua_parser_context/format=tsv/model=1
Aug 23 11:00:04 ip-10-0-1-138 sp-rdbloader: Caused by: com.amazon.support.exceptions.ErrorException: [Amazon](500310) Invalid operation: Load into table 'com_snowplowanalytics_snowplow_ua_parser_context_1' failed. Check 'stl_load_errors' system table for details.;
It seems the reason why this is failing is due to null values in the fields for the os_family and useragent_version. When I ran igluctl table-check, there’s an obvious issue with the Redshift table schema:
Table for iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0 is not matched. Issues:
* Column doesn't match, expected: 'os_patch_minor VARCHAR(4096)', actual: 'os_patch_minor VARCHAR(64)'
* Column doesn't match, expected: 'os_version VARCHAR(4096)', actual: 'os_version VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'os_patch VARCHAR(4096)', actual: 'os_patch VARCHAR(64)'
* Column doesn't match, expected: 'useragent_minor VARCHAR(4096)', actual: 'useragent_minor VARCHAR(64)'
* Column doesn't match, expected: 'useragent_version VARCHAR(4096)', actual: 'useragent_version VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'os_major VARCHAR(4096)', actual: 'os_major VARCHAR(64)'
* Column doesn't match, expected: 'useragent_family VARCHAR(4096) NOT NULL', actual: 'useragent_family VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'useragent_major VARCHAR(4096)', actual: 'useragent_major VARCHAR(64)'
* Column doesn't match, expected: 'device_family VARCHAR(4096) NOT NULL', actual: 'device_family VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'useragent_patch VARCHAR(4096)', actual: 'useragent_patch VARCHAR(64)'
* Column doesn't match, expected: 'os_minor VARCHAR(4096)', actual: 'os_minor VARCHAR(64)'
* Column doesn't match, expected: 'os_family VARCHAR(4096) NOT NULL', actual: 'os_family VARCHAR(255) NOT NULL'
I believe the above might have been “human introduced”, but what is perplexing is how our current pipeline was loading data fine. Was there a change in the ua_parser behavior where it will now send nulls for some fields? Our hunch is “Other” was used as a catch-all value previously, but not 100% sure.
Another question to confirm my understanding of RDB loader. RDB loader will not attempt a schema change unless there’s a bump in the schema version? Specifically referring to the column size differences, but I believe that will be a manual DDL alter for us