Due to users restoring sessions from the past, or in-app browsers persisting session cookies, we're seeing duplicate session/user IDs come through months apart from each other. This is causing an issue in our dbt modelling, as we only look at the past 30 days for duplicates and remove them. Anything larger would be beyond our computing capacity limit.
Has anyone experienced this before? Any insight as to how to deal with it?
Hi @ks107, I'm not sure if you're using the Snowplow dbt packages, but our approach to this is to quarantine sessions that are longer than some value (I think our default is 3 days). After that we stop processing any new events for those sessions, to avoid massive table scans, because they are more often than not bots. Unfortunately, if you want to properly process all events in a session without this limit, there is not much you can do to avoid large table scans.
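For anyone reading along: in the snowplow-web dbt package this quarantine window is controlled by project variables. A minimal sketch of what that configuration might look like in `dbt_project.yml` (variable names follow the package's `snowplow__` convention, but do check the docs for your installed version, as names and defaults vary between releases):

```yaml
# dbt_project.yml -- sketch only; confirm exact variable names against
# your dbt-snowplow-web package version's documentation.
vars:
  snowplow__max_session_days: 3         # sessions running longer than this get quarantined
  snowplow__session_lookback_days: 120  # how far back the manifest is scanned for existing sessions
```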
Just to add a little bit of extra context - these can be normal, non-bot events. If a user leaves a web page before an event is sent successfully, for example, the event will be stored and sent along with the next event that gets sent through. So if the user leaves and comes back 6 months later, you can have a legitimate 6 month late event.
Still, dealing with them is the same - they’re not worth the table scan so best left out as Ryan describes.
Thanks @Colm and @Ryan for the replies. We are using the snowplow dbt packages, so we should have this in place already. Will bring this quarantine approach back to my team and see if it’s something we can incorporate if we’re not already doing so.
@Colm and @Ryan, so I took this back to our team. We are indeed using the dbt models and we have the session quarantine set up similarly (3 days); however, our session_lookback_limit is set to 120 days due to performance load. It therefore seems that some sessions beyond that window are squeaking through, because they aren't being flagged as longer than 3 days.
Because of this we are getting a similar error to this bug from 2021 (Fix for unexpired sessions · Issue #47 · snowplow/dbt-snowplow-web · GitHub), where it "causes the snowplow_tests_page_view_in_session_values data test to fail. As not all page views within the session are now being processed together, metrics like page_view_in_session_index will be incorrectly calculated".
If we cannot increase our lookback limit, are there any other workarounds for this?
@ks107 Unfortunately there isn't really a great solution to this case. Can you provide some more details about the events themselves? For example, do they have a large gap between collector_tstamp values that could be used to filter out late data (genuine late-arriving events), or something else that looks off about the events (e.g. something that could be added to the bots filter macro we provide)?
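One way to inspect the late arrivals Colm described earlier (events cached on-device and sent long after the fact) is to compare the device-created timestamp against the collector timestamp. A rough sketch, using the standard atomic events field names and Snowflake-style `datediff` syntax; the table name and the 3-day cutoff are illustrative, not a recommendation:

```sql
-- Sketch: surface events whose device timestamp is far behind the
-- collector timestamp, i.e. candidates for a late-data filter.
select
    event_id,
    domain_sessionid,
    datediff('day', dvce_created_tstamp, collector_tstamp) as days_late
from atomic.events
where datediff('day', dvce_created_tstamp, collector_tstamp) > 3  -- illustrative cutoff
order by days_late desc;
```

If a clear gap pattern shows up, that condition could feed into a filter or the bots macro mentioned above.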
The other option would be to increase that session_lookback_limit value if possible. We already make quite a few optimisations on the snowplow_web_base_sessions_lifecycle_manifest table, but depending on your warehouse you may be able to optimise it even further: for example, on BigQuery we cluster on session_id, but on Snowflake you could alter the table to enable search optimisation on that column, and on Databricks perhaps something with ZORDER (although I am less familiar with this and the benefits it may offer). These will come with tradeoffs: Databricks in particular will become less optimised for date filtering, while Snowflake I think just incurs additional storage costs, but I am not 100% sure.
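For reference, the warehouse-side tweaks suggested above might look roughly like the following. The schema-qualified table name is illustrative (use wherever your manifest model is materialised), and both commands have cost implications, so verify against your warehouse's documentation before running them:

```sql
-- Snowflake: speed up equality lookups on session_id
-- (incurs additional storage/maintenance cost).
alter table derived.snowplow_web_base_sessions_lifecycle_manifest
  add search optimization on equality(session_id);

-- Databricks: co-locate rows by session_id
-- (may reduce the table's pruning efficiency for date filters).
OPTIMIZE derived.snowplow_web_base_sessions_lifecycle_manifest
  ZORDER BY (session_id);
```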