Hi guys,
We are using the Lake Loader to write our events to S3 every 3 minutes in the Iceberg format (using AWS Glue; we slightly adapted your solution to make this work).
We noticed that the table schema contains duplicated custom context columns with "recovered" in their names. They correspond to newer versions of the custom context, but we don't know where they come from.
Example:
Lake Loader Output:
[io-compute-0] INFO com.snowplowanalytics.snowplow.lakes.processing.Processing - Non atomic columns: [contexts_com_snowplowanalytics_snowplow_web_page_1,contexts_de_mycompany_video_context_3,contexts_de_mycompany_video_context_3_recovered_3_0_2_6978f4f7,contexts_de_mycompany_video_context_3_recovered_3_1_0_4000cbbd,contexts_de_mycompany_video_context_3_recovered_3_1_1_96f4c485,contexts_de_mycompany_video_context_3_recovered_3_1_2_6f06a2f4]
In addition to the contexts_de_mycompany_video_context_3 column, this creates further columns such as contexts_de_mycompany_video_context_3_recovered_3_1_0_4000cbbd. The schemas of these two Iceberg columns are identical, so I don't understand why the Lake Loader splits the traffic here.
Interestingly, some schema versions are stored in the original column, while for others the loader creates new columns with "recovered" in the name.
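To see at a glance which versions ended up in separate columns, the column list from the log line above can be grouped by base column. This is only a minimal sketch: the naming pattern (base column, then `_recovered_`, then the version, then an 8-character hex suffix) is inferred from the log output and not taken from the Lake Loader documentation, and `group_recovered` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred only from the log line in this post:
# <base column>_recovered_<model>_<revision>_<addition>_<8 hex characters>
RECOVERED = re.compile(
    r"^(?P<base>.+)_recovered_(?P<version>\d+_\d+_\d+)_(?P<suffix>[0-9a-f]{8})$"
)


def group_recovered(columns):
    """Map each base column to the schema versions found in its recovered columns."""
    grouped = defaultdict(list)
    for name in columns:
        match = RECOVERED.match(name)
        if match:
            version = match.group("version").replace("_", "-")
            grouped[match.group("base")].append(version)
    return dict(grouped)


# Column names copied from the "Non atomic columns" log line
columns = [
    "contexts_com_snowplowanalytics_snowplow_web_page_1",
    "contexts_de_mycompany_video_context_3",
    "contexts_de_mycompany_video_context_3_recovered_3_0_2_6978f4f7",
    "contexts_de_mycompany_video_context_3_recovered_3_1_0_4000cbbd",
    "contexts_de_mycompany_video_context_3_recovered_3_1_1_96f4c485",
    "contexts_de_mycompany_video_context_3_recovered_3_1_2_6f06a2f4",
]

recovered = group_recovered(columns)
print(recovered)
```

For our columns this lists versions 3-0-2, 3-1-0, 3-1-1 and 3-1-2 under contexts_de_mycompany_video_context_3.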
Why does this happen? Is the schema wrong here? If so, shouldn't these events end up in the bad stream rather than in new columns?
Btw: we are running 6 Lake Loaders in parallel. The custom context version that causes the additional column looks like this:
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Context for MyCompany video tracking",
  "self": {
    "vendor": "de.mycompany",
    "name": "video_context",
    "version": "3-1-2",
    "format": "jsonschema"
  },
  "type": "object",
  "properties": {
    "video_xymatic_id": {
      "type": "string",
      "description": "Unique ID of the video",
      "maxLength": 256
    },
    "video_name_vms": {
      "type": "string",
      "description": "Human readable name of the video",
      "maxLength": 2048
    },
    "video_interaction": {
      "type": "string",
      "description": "How was the video started? (e.g. 'autostart' | 'clicktoplay')",
      "maxLength": 64
    },
    "video_status": {
      "type": "string",
      "description": "What is the status of the video? (e.g. 'vod')",
      "maxLength": 64
    },
    "video_play_type": {
      "type": "string",
      "description": "Play type of the video (e.g. 'firstplay')",
      "maxLength": 64
    },
    "video_sound": {
      "type": "boolean",
      "description": "The video sound is available. Has to be a boolean, i.e. true or false"
    },
    "video_player_version": {
      "type": "string",
      "description": "Version of the video player",
      "maxLength": 64
    },
    "video_player_type": {
      "type": "string",
      "description": "Type of video player ('widget' | 'standard')",
      "maxLength": 64
    },
    "video_publish_date": {
      "type": "string",
      "description": "Publish date of the video",
      "pattern": "^$|^\\d{4}\\-(0?[1-9]|1[012])\\-(0?[1-9]|[12][0-9]|3[01])$",
      "maxLength": 10
    },
    "video_update_date": {
      "type": "string",
      "description": "Update date of the video",
      "pattern": "^$|^\\d{4}\\-(0?[1-9]|1[012])\\-(0?[1-9]|[12][0-9]|3[01])$",
      "maxLength": 10
    },
    "video_salesforce_partner_id": {
      "type": "string",
      "description": "Id of the salesforce partner",
      "maxLength": 128
    },
    "video_creator_job_id": {
      "type": "string",
      "description": "Job Team Id of the video creator",
      "maxLength": 64
    },
    "video_autoplay_setting_page": {
      "type": "string",
      "description": "Autoplay Setting in CMS active on current page (1 if true)",
      "maxLength": 1
    },
    "video_ab_test_id": {
      "type": "string",
      "description": "A/B test id of the video",
      "maxLength": 128
    },
    "video_player_template_name": {
      "type": "string",
      "description": "Name of player integration",
      "maxLength": 256
    }
  },
  "additionalProperties": true
}
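One way to double-check whether two versions of the context really are identical where it matters would be to diff their properties. The following is a rough sketch under assumptions: `property_diff` is a hypothetical helper, the two small schemas are made-up stand-ins (not our real 3-1-1 and 3-1-2 definitions), and I have not confirmed which kinds of differences the Lake Loader treats as incompatible.

```python
def property_diff(old_schema, new_schema):
    """List properties added, removed, or changed in type between two schema versions."""
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    shared = sorted(set(old_props) & set(new_props))
    return {
        "added": sorted(set(new_props) - set(old_props)),
        "removed": sorted(set(old_props) - set(new_props)),
        # Only the JSON Schema "type" keyword is compared here; other
        # keywords such as maxLength or pattern are ignored in this sketch.
        "type_changed": {
            name: (old_props[name].get("type"), new_props[name].get("type"))
            for name in shared
            if old_props[name].get("type") != new_props[name].get("type")
        },
    }


# Made-up example schemas for illustration only (NOT our real versions)
older = {
    "properties": {
        "video_sound": {"type": "string"},
        "video_status": {"type": "string"},
    }
}
newer = {
    "properties": {
        "video_sound": {"type": "boolean"},
        "video_status": {"type": "string"},
        "video_ab_test_id": {"type": "string"},
    }
}

diff = property_diff(older, newer)
print(diff)
```

Running this against the actual JSON files of each version (loaded with `json.load`) would show whether any field changed type between versions that share the same major version.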
Thank you for any advice
Cheers, Christoph