Duplicate user_identifier in snowplow_unified_users

Damien_Harley · July 31, 2024, 12:34pm

We are encountering duplicate user_identifier values in our derived.snowplow_unified_users table. This only started several days after a successful implementation of the Snowplow Unified package. There were no changes to the dbt project around the time the issue presented itself.

dbt error

Failure in test unique_snowplow_unified_users_user_identifier (models/users/users.yml)
[2024-07-25, 07:56:12 BST] {{ecs.py:131}} INFO - [2024-07-25 06:55:59,577] 06:55:59 Got 8792 results, configured to fail if != 0
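For reference, dbt's built-in `unique` test compiles to roughly the query below, so the 8792 results should correspond to the number of distinct user_identifier values that appear more than once, not the number of surplus rows. This is a sketch of the generic test written out by hand against our table, not the exact compiled SQL from the run.

```sql
-- Approximation of dbt's generic `unique` test for user_identifier.
-- Each returned row is one user_identifier that occurs more than once.
select
    user_identifier as unique_field,
    count(*) as n_records
from
    derived.snowplow_unified_users
where
    user_identifier is not null
group by
    user_identifier
having
    count(*) > 1
```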

Snowplow Unified: `version 0.4.4`

Warehouse: `snowflake`

snowplow__user_identifiers and snowplow__session_identifiers left as default

snowplow_unified:
  snowplow__conversion_events: redacted
  snowplow__backfill_limit_days: 10
  snowplow__start_date: '2023-12-01'
  snowplow__atomic_schema: redacted
  snowplow__database: redacted
  snowplow__enable_ua: true
  snowplow__enable_yauaa: true 
  snowplow__enable_mobile_context: true 
  snowplow__enable_screen_summary_context: true

Identifying where the duplicates occur

with user_aggs as (

    select
        user_identifier,
        count(user_identifier) over (
            partition by 
                user_identifier
        ) as count_user_identifier
    from
        scratch.snowplow_unified_users_aggs 
    qualify count_user_identifier > 1
),

user_sess_this_run as (

    select
        user_identifier,
        count(user_identifier) over (
            partition by 
                user_identifier
        ) as count_user_identifier
    from 
        scratch.snowplow_unified_users_sessions_this_run 
    qualify count_user_identifier > 1
),

unified_users_this_run as (

    select
        user_identifier,
        count(user_identifier) over (
            partition by 
                user_identifier
        ) as count_user_identifier
    from 
        scratch.snowplow_unified_users_this_run
    qualify count_user_identifier > 1
),

unified_users as (

    select
        user_identifier,
        count(user_identifier) over (
            partition by 
                user_identifier
        ) as count_user_identifier
    from 
        derived.snowplow_unified_users 
    qualify count_user_identifier > 1
)



select 
    'duplicates from snowplow_unified_users_aggs' as table_name,
    count(distinct user_identifier) as duplicates
from 
    user_aggs 

    union all 

select 
    'duplicates from snowplow_unified_users_sessions_this_run' as table_name,
    count(distinct user_identifier) as duplicates
from 
    user_sess_this_run  

    union all 

select 
    'duplicates from snowplow_unified_users_this_run' as table_name,
    count(distinct user_identifier) as duplicates
from 
    unified_users_this_run 

    union all

select 
    'duplicates from derived.snowplow_unified_users' as table_name,
    count(distinct user_identifier) as duplicates
from 
    unified_users

The current run in scratch.snowplow_unified_users_sessions_this_run naturally contains multiple sessions for the same user over the period the run covers, so duplicates there are expected. Downstream, scratch.snowplow_unified_users_aggs appears to group by user_identifier without issue, as no duplicates are found in that table. It appears that all rows in snowplow_unified_users_this_run are being appended to derived.snowplow_unified_users rather than updating the existing rows for those users.
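To see how the duplicated rows differ from one another, a query along these lines could be used. It is only a sketch: it selects every column so that no column names beyond user_identifier are assumed, and the limit of 100 is an arbitrary choice to keep the output readable.

```sql
-- Sketch only: return the full rows for every user_identifier that
-- occurs more than once, so the duplicates can be compared side by side.
select
    *
from
    derived.snowplow_unified_users
qualify
    count(*) over (partition by user_identifier) > 1
order by
    user_identifier
limit 100
```

If the rows for a given user_identifier differ only in their timestamp columns, that would be consistent with each run appending a fresh row for the user.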
