Load into table *_ua_parser_context_1 failed

Hi, we are testing an upgrade to our Snowplow pipeline using the new stream enrichments, transformer and RDB loader. The previous version of the pipeline used EMR for enrichments, shredding and loading data into Redshift.

We’re starting to see these errors in Cloudwatch for the RDB loader:

Aug 23 11:00:02 ip-10-0-1-138 sp-rdbloader: INFO Load: COPY com_snowplowanalytics_snowplow_ua_parser_context_1 FROM s3://<obfuscated>/transformed/run=2023-08-23-10-55-00-f1bc24ff-4150-4c38-abff-6cd3aca0c12e/output=good/vendor=com.snowplowanalytics.snowplow/name=ua_parser_context/format=tsv/model=1

Aug 23 11:00:04 ip-10-0-1-138 sp-rdbloader: Caused by: com.amazon.support.exceptions.ErrorException: [Amazon](500310) Invalid operation: Load into table 'com_snowplowanalytics_snowplow_ua_parser_context_1' failed.  Check 'stl_load_errors' system table for details.;
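As the error message suggests, the row-level detail is in Redshift's `stl_load_errors` system table. A query like this (standard Redshift columns; the `LIMIT` is arbitrary) surfaces the most recent COPY failures:

```sql
-- Most recent COPY failures, newest first.
SELECT starttime,
       filename,
       line_number,
       colname,
       type,
       raw_field_value,
       err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```

The `colname` and `err_reason` columns should point at exactly which field and constraint the load is tripping over.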

It seems this is failing because of null values in the os_family and useragent_version fields. When I ran igluctl table-check, there’s an obvious issue with the Redshift table schema:

Table for iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0 is not matched. Issues:

* Column doesn't match, expected: 'os_patch_minor VARCHAR(4096)', actual: 'os_patch_minor VARCHAR(64)'
* Column doesn't match, expected: 'os_version VARCHAR(4096)', actual: 'os_version VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'os_patch VARCHAR(4096)', actual: 'os_patch VARCHAR(64)'
* Column doesn't match, expected: 'useragent_minor VARCHAR(4096)', actual: 'useragent_minor VARCHAR(64)'
* Column doesn't match, expected: 'useragent_version VARCHAR(4096)', actual: 'useragent_version VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'os_major VARCHAR(4096)', actual: 'os_major VARCHAR(64)'
* Column doesn't match, expected: 'useragent_family VARCHAR(4096) NOT NULL', actual: 'useragent_family VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'useragent_major VARCHAR(4096)', actual: 'useragent_major VARCHAR(64)'
* Column doesn't match, expected: 'device_family VARCHAR(4096) NOT NULL', actual: 'device_family VARCHAR(255) NOT NULL'
* Column doesn't match, expected: 'useragent_patch VARCHAR(4096)', actual: 'useragent_patch VARCHAR(64)'
* Column doesn't match, expected: 'os_minor VARCHAR(4096)', actual: 'os_minor VARCHAR(64)'
* Column doesn't match, expected: 'os_family VARCHAR(4096) NOT NULL', actual: 'os_family VARCHAR(255) NOT NULL'

I believe the above might have been “human introduced”, but what’s perplexing is that our current pipeline was loading data fine. Was there a change in ua_parser behavior such that it now sends nulls for some fields? Our hunch is that “Other” was previously used as a catch-all value, but we’re not 100% sure.

Another question to confirm my understanding of RDB loader. RDB loader will not attempt a schema change unless there’s a bump in the schema version? Specifically referring to the column size differences, but I believe that will be a manual DDL alter for us :slight_smile:

Wow, that’s a pipeline that’s been running since before April 2015!

I think it’s too far back to have any confidence in an explanation, but I can think of a few possibilities, one or more of which might explain it:

  • The version of the pipeline you’re using may have hardcoded the old schema before the patch in the above commit. Back then the loader was agnostic to schemas (I believe!), and tables were created manually.

  • The ua parser behaviour may have changed: it uses this library under the hood, so the old pipeline you’re using might pin a version of that library with different behaviour. (By the way, we prefer YAUAA these days for this same use case.)

  • It is possible that the old loader also had these failures to load because of the same restriction, but just did so quietly. I believe there’s a configuration option in the old batch EMR loader that sets the number of rows allowed to fail before the job is marked as failed. Back in those days one could easily have set that to a high value to bypass some issue or jankiness, and then left it that way.

Having said that, those are just theories off the top of my head.

Another question to confirm my understanding of RDB loader. RDB loader will not attempt a schema change unless there’s a bump in the schema version?

Correct! More specifically, it will attempt it when there is a bump in schema version, and it has received data against that schema.

Specifically referring to the column size differences, but I believe that will be a manual DDL alter for us :slight_smile:

Yeah that’ll be a manual DDL change.
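For the size differences, Redshift can widen a VARCHAR column in place. A sketch, assuming the table lives in the `atomic` schema (adjust the schema and repeat per mismatched column):

```sql
-- Widen one of the mismatched columns to match the expected DDL.
ALTER TABLE atomic.com_snowplowanalytics_snowplow_ua_parser_context_1
  ALTER COLUMN os_family TYPE VARCHAR(4096);
```

One caveat: as far as I know, Redshift has no `ALTER COLUMN ... DROP NOT NULL`, so the nullability mismatches (e.g. os_version, useragent_version) can’t be fixed with a simple ALTER and may require a deep copy into a recreated table.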

One way to approach this would be to configure the new loader to output to a different schema; then it’ll create all the tables and columns automatically for you. The advantage is that you won’t ever have to deal with conflicts between the old tables and the new. The disadvantage is that you’d need to either migrate your history to the new tables, or have your data modeling consume from both until there’s enough history in the new tables to disregard the old ones.
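If you go that route, the only manual step on the Redshift side is creating the target schema and granting the loader’s user access to it. A minimal sketch (`atomic_new` and `loader_user` are hypothetical names; substitute your own):

```sql
-- Create a fresh schema for the new loader to populate.
CREATE SCHEMA IF NOT EXISTS atomic_new;

-- Let the loader's database user create and load tables there.
GRANT ALL ON SCHEMA atomic_new TO loader_user;
```

After that, pointing the loader’s configured output schema at the new name should let it create the tables itself.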

You might find these resources useful too:


Thanks @Colm … the original issue is resolved, but loading data from our new pipeline now seems to be broken. The new pipeline appears to depend on the old column ordering. How does the transformer determine the correct column order? I read that new versions of the transformer/RDB loader no longer depend on JSONPaths, but I think I’m missing something. The order of columns currently does match the order of fields in the Iglu schema for ua_parser_context_1.

Is it possible the older table schema is cached in the transformer or elsewhere?
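For reference, here’s how I’m checking the physical column order on the Redshift side, using the standard `pg_table_def` view (which only shows schemas on the search path):

```sql
-- pg_table_def lists columns in their physical order.
SET search_path TO atomic;

SELECT "column", type, notnull
FROM pg_table_def
WHERE tablename = 'com_snowplowanalytics_snowplow_ua_parser_context_1';
```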

@Colm , snowplowers… any ideas? From the migration guides it seems like I might need to blacklist our legacy tables. Will the new loader work ok with that?