Hi,
We’ve finalized our Snowplow pipeline and were doing some QA to validate the process of creating new schema versions. As part of this, we took an existing schema that worked perfectly and added a 1-0-1 version of it.
Afterwards, we published the schema to the registry and fired off an event to see the result.
Our Redshift table correctly migrated to add the new column, and we could see the incoming event in our atomic events table. But when looking at the relational table for the schema, no new rows were added. We then tried pushing events on the old 1-0-0 version again, and these were not added to the relational table either.
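In case it helps, the checks we ran looked roughly like the queries below (assuming the atomic events table lives in the same qa schema as the relational table; treat the exact names as a sketch):

-- the new event is visible in the atomic events table...
SELECT COUNT(*) FROM qa.events WHERE event_name = 'login';

-- ...but nothing new arrives in the relational table for the schema
SELECT COUNT(*) FROM qa.com_xyz_login_1;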
So after the change, events are landing in the atomic events table but not in the relational table for this schema. This is the loader log:
INFO DataDiscovery: Received a new message
INFO DataDiscovery: Total 3 messages received, 2 loaded
INFO DataDiscovery: New data discovery at run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3 with following shredded types:
* iglu:com.xyz/context_platform_user/jsonschema/1-*-* TSV
* iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-* TSV
* iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-*-* TSV
* iglu:com.xyz/login/jsonschema/1-*-* TSV
INFO Load: Loading transaction for s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/ has started
INFO Migration: Migrating qa.com_xyz_login_1 (in-transaction)
INFO Migration: qa.com_xyz_login_1 migration completed
INFO Load: Loading s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/
INFO Load: COPY events FROM s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/
INFO Load: COPY com_xyz_context_platform_user_1 FROM s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/output=good/vendor=com.xyz/name=context_platform_user/format=tsv/model=1
INFO Load: COPY com_snowplowanalytics_snowplow_web_page_1 FROM s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/output=good/vendor=com.snowplowanalytics.snowplow/name=web_page/format=tsv/model=1
INFO Load: COPY com_snowplowanalytics_snowplow_ua_parser_context_1 FROM s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/output=good/vendor=com.snowplowanalytics.snowplow/name=ua_parser_context/format=tsv/model=1
INFO Load: COPY com_xyz_login_1 FROM s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/output=good/vendor=com.xyz/name=login/format=tsv/model=1
INFO Load: Folder [s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/] has been loaded (not committed yet)
INFO Load: Folder s3://xyz-analytics-dev-transform-output/transformed/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3/ loaded successfully
INFO snowplow.rdbloader.count_good = 1
INFO snowplow.rdbloader.latency_collector_to_load_min = 307
INFO snowplow.rdbloader.latency_collector_to_load_max = 307
INFO snowplow.rdbloader.latency_shredder_start_to_load = 312
INFO snowplow.rdbloader.latency_shredder_end_to_load = 12
INFO Load: 1 good events were loaded. It took minimum 307 seconds and maximum 307 seconds between the collector and warehouse for these events. It took 312 seconds between the start of transformer and warehouse and 12 seconds between the completion of transformer and warehouse
INFO Acking SQS message because processing is complete.
INFO Loader: Total 3 messages received, 3 loaded; Loader is in Idle state; Last state update at 2022-11-01 09:40:12.896
INFO snowplow.rdbloader.minimum_age_of_loaded_data = 482
I then queried the load errors table (STL_LOAD_ERRORS) in Redshift and found the following:
{
  "select * from STL_LOAD_ERRORS": [
    {
      "userid": 100,
      "slice": 1,
      "tbl": 28827334,
      "starttime": "2022-11-01T08:40:04.346Z",
      "session": 1075385666,
      "query": 21167143,
      "filename": "s3:\/\/xyz-analytics-dev-transform-output\/transformed\/run=2022-11-01-09-35-00-30da18b8-26df-43e1-9e0f-a74d90c912e3\/output=good\/vendor=com.xyz\/name=login\/format=tsv\/model=1\/sink-d69eafd2-65e7-4e8a-ae6d-5db9e1596721.txt.gz ",
      "line_number": 1,
      "colname": "type ",
      "type": "varchar ",
      "col_length": "17 ",
      "position": 130,
      "raw_line": "com.xyz\tlogin\tjsonschema\t1-0-1\te27216be-7d0c-4e6e-b70e-53784a848c75\t2022-11-01 09:35:05.547\tevents\t[\"events\",\"login\"]\tevents\txyz_ADMIN\tUSERNAME_PASSWORD ",
      "raw_field_value": "USERNAME_PASSWORD ",
      "err_code": 1214,
      "err_reason": "Delimiter not found ",
      "is_partial": 0,
      "start_offset": 0
    }
  ]
}
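(Side note: the tbl id in that row can be mapped back to a table name to confirm which table the COPY failed on — something like the query below, assuming tbl is Redshift’s internal table id:)

SELECT "schema", "table"
FROM svv_table_info
WHERE table_id = 28827334;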
For some reason it is complaining about a missing delimiter, but it didn’t before we introduced the new schema version. This is the new version, in which we added an ‘object’ property:
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Login event data schema",
  "self": {
    "vendor": "com.xyz",
    "name": "login",
    "format": "jsonschema",
    "version": "1-0-1"
  },
  "type": "object",
  "properties": {
    "platform": {
      "enum": [
        "XYZ_ADMIN",
        "ACADEMY"
      ]
    },
    "type": {
      "enum": [
        "USERNAME_PASSWORD",
        "OAUTH",
        "SAML"
      ]
    },
    "object": {
      "type": "string"
    }
  },
  "required": ["type", "platform"],
  "additionalProperties": false
}
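(As a sanity check, the migrated table’s column definitions can be compared against this schema with something like the query below — relevant here because the load error above flags colname “type” with col_length 17. This is a sketch using Redshift’s system view; names as above:)

SELECT column_name, data_type, character_maximum_length
FROM svv_columns
WHERE table_schema = 'qa'
  AND table_name = 'com_xyz_login_1'
ORDER BY ordinal_position;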
I’m not entirely sure what we’ve done wrong. Events continue to come in for the other tables, but this one specifically started causing issues in the COPY phase after we introduced the new version.
Does anyone have any hints on where we could start looking?