Due to users restoring sessions from the past, or in-app browsers persisting session cookies, we're seeing duplicate session/user IDs come through months apart from each other. This is causing an issue in our dbt modelling, as we only look at the past 30 days for duplicates and remove them. Anything larger would be beyond our computing capacity limit.
Has anyone experienced this before? Any insight as to how to deal with it?
Hi @ks107, I'm not sure if you're using the Snowplow dbt packages, but our approach to this is to quarantine sessions that are longer than some value (I think our default is 3 days). After that we stop processing any new events for those sessions, to avoid massive table scans, because they are more often than not bots. Unfortunately, if you want to properly process all events in a session without this limit, there is not much you can do to avoid large table scans.
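For anyone reading along: in the snowplow-web dbt package this quarantine window is controlled by project variables. A minimal sketch of what that configuration might look like in `dbt_project.yml` (variable names follow the package's `snowplow__` convention, but do check the docs for your installed version, as names and defaults vary between releases):

```yaml
# dbt_project.yml -- sketch only; confirm exact variable names against
# your dbt-snowplow-web package version's documentation.
vars:
  snowplow__max_session_days: 3         # sessions running longer than this get quarantined
  snowplow__session_lookback_days: 120  # how far back the manifest is scanned for existing sessions
```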
Just to add a little bit of extra context - these can be normal, non-bot events. If a user leaves a web page before an event is sent successfully, for example, the event will be stored and sent along with the next event that gets sent through. So if the user leaves and comes back 6 months later, you can have a legitimate 6 month late event.
Still, dealing with them is the same - they’re not worth the table scan so best left out as Ryan describes.
Thanks @Colm and @Ryan for the replies. We are using the snowplow dbt packages, so we should have this in place already. Will bring this quarantine approach back to my team and see if it’s something we can incorporate if we’re not already doing so.
@Colm and @Ryan, so I took this back to our team. We are indeed using the dbt models and we have the session quarantine set up similarly (3 days); however, our session_lookback_limit is set to 120 days due to performance load. It therefore seems that some sessions beyond that window are squeaking through, because they aren't being flagged as longer than 3 days.
Because of this we are getting a similar error to this bug from 2021 (Fix for unexpired sessions · Issue #47 · snowplow/dbt-snowplow-web · GitHub), where it "causes the snowplow_tests_page_view_in_session_values data test to fail. As not all page views within the session are now being processed together, metrics like page_view_in_session_index will be incorrectly calculated".
If we cannot increase our lookback limit, are there any other workarounds for this?
@ks107 Unfortunately there isn't really a great solution to this case. Can you provide some more details about the events themselves? For example, do they have a large gap between collector_tstamp values that could be used to filter out late data (genuine late-arriving events), or something else that looks off about the events (e.g. something that could be added to the bots filter macro we provide)?
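One way to inspect the late arrivals Colm described earlier (events cached on-device and sent long after the fact) is to compare the device-created timestamp against the collector timestamp. A rough sketch, using the standard atomic events field names and Snowflake-style `datediff` syntax; the table name and the 3-day cutoff are illustrative, not a recommendation:

```sql
-- Sketch: surface events whose device timestamp is far behind the
-- collector timestamp, i.e. candidates for a late-data filter.
select
    event_id,
    domain_sessionid,
    datediff('day', dvce_created_tstamp, collector_tstamp) as days_late
from atomic.events
where datediff('day', dvce_created_tstamp, collector_tstamp) > 3  -- illustrative cutoff
order by days_late desc;
```

If a clear gap pattern shows up, that condition could feed into a filter or the bots macro mentioned above.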
The other option would be to increase that session_lookback_limit value if possible. We already make quite a few optimisations on the snowplow_web_base_sessions_lifecycle_manifest table, but depending on your warehouse you may be able to optimise it even further: for example, on BigQuery we cluster on session_id, but on Snowflake you could alter the table to enable search optimisation on that column, and on Databricks perhaps something with ZORDER (although I am less familiar with this and the benefits it may offer). These will come with tradeoffs: Databricks in particular will become less optimised for date filtering, while Snowflake I think just incurs additional storage costs, but I am not 100% sure.
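For reference, the warehouse-side tweaks suggested above might look roughly like the following. The schema-qualified table name is illustrative (use wherever your manifest model is materialised), and both commands have cost implications, so verify against your warehouse's documentation before running them:

```sql
-- Snowflake: speed up equality lookups on session_id
-- (incurs additional storage/maintenance cost).
alter table derived.snowplow_web_base_sessions_lifecycle_manifest
  add search optimization on equality(session_id);

-- Databricks: co-locate rows by session_id
-- (may reduce the table's pruning efficiency for date filters).
OPTIMIZE derived.snowplow_web_base_sessions_lifecycle_manifest
  ZORDER BY (session_id);
```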