We are encountering duplicate user_identifier
values in our derived.snowplow_unified_users
table. This only started several days after a successful implementation of the Snowplow Unified package. There were no changes to the dbt project around the time the issue presented itself.
dbt error
[31mFailure in test unique_snowplow_unified_users_user_identifier (models/users/users.yml)[0m
[2024-07-25, 07:56:12 BST] {{ecs.py:131}} INFO - [2024-07-25 06:55:59,577] [0m06:55:59 Got 8792 results, configured to fail if != 0
Snowplow Unified: version 0.4.4
Warehouse: snowflake
snowplow__user_identifiers
and snowplow__session_identifiers
left as default
snowplow_unified:
snowplow__conversion_events: redacted
snowplow__backfill_limit_days: 10
snowplow__start_date: '2023-12-01'
snowplow__atomic_schema: redacted
snowplow__database: redacted
snowplow__enable_ua: true
snowplow__enable_yauaa: true
snowplow__enable_mobile_context: true
snowplow__enable_screen_summary_context: true
Identifying where the duplicates occur
with user_aggs as (
select
user_identifier,
count(user_identifier) over (
partition by
user_identifier
) as count_user_identifier
from
scratch.snowplow_unified_users_aggs
qualify count_user_identifier > 1
),
user_sess_this_run as (
select
user_identifier,
count(user_identifier) over (
partition by
user_identifier
) as count_user_identifier
from
scratch.snowplow_unified_users_sessions_this_run
qualify count_user_identifier > 1
),
unified_users_this_run as (
select
user_identifier,
count(user_identifier) over (
partition by
user_identifier
) as count_user_identifier
from
scratch.snowplow_unified_users_this_run
qualify count_user_identifier > 1
),
unified_users as (
select
user_identifier,
count(user_identifier) over (
partition by
user_identifier
) as count_user_identifier
from
derived.snowplow_unified_users
qualify count_user_identifier > 1
)
select
'duplicates from snowplow_unified_users_aggs' as table_name,
count(distinct user_identifier) as duplicates
from
user_aggs
union all
select
'duplicates from unified_users_sessions_this_run' as table_name,
count(distinct user_identifier) as duplicates
from
user_sess_this_run
union all
select
'duplicates from snowlow_unified_users_this_run' as table_name,
count(distinct user_identifier) as duplicates
from
unified_users_this_run
union all
select
'duplicates from derived.snowlow_unified_users' as table_name,
count(distinct user_identifier) as duplicates
from
unified_users
The current run from scratch.unified_users_sessions_this_run
naturally shows multiple different sessions by the same users over the period the run took place. Downstream scratch.snowplow_unfied_users_aggs
appears to group by user_identifier
without issue as no duplicates are found in this table. It appears as though all rows in snowplow_unified_users_this_run
are being appended into derived.snowplow_unified_users
.