We’ve recently set up a Snowplow system to replace an older (custom-built) tracking system. It has mostly been implemented by following the Snowplow documentation. I am on the data science side and getting up to speed with it, and I am curious about the specific point in the title.
With our legacy system we had a boolean/flag where each session (session_id) had a ‘status’ or ‘active’ flag, with:
1 = session is open (ie, still capturing events), or;
0 = session is closed
I depend on this for a few models and analytics needs and want to have a similar understanding with the Snowplow data – is there a way to understand the state of a session (domain_sessionid or domain_sessionidx) in Snowplow? If so, would this be based on some other column/variable in the atomic.events data table (ie, etl_tstamp) or would there need to be some modification to our Snowplow sessions to include such a flag? The desire is to know this as soon as possible, so simply reviewing at an arbitrarily later time would not be ideal.
A real analytics use case: we currently have an analytics report that summarizes session behaviour for our marketing team, and parts of it focus specifically on sessions that are closed/have ended, so the team can understand how web visits are doing. Are people spending less time per session or more? Which visits contain a specific marketing event? Etc.
A real data processing use case: we have a rules-based system for notifications, some of which are based on a completed/closed session, such as notifying a representative when a visit includes a desired event (ie, the visit contained a specific page view and surpassed a desired time on page) – this should only happen once the session is closed, as it is specifically a follow-up.
On web, the Snowplow tracker will measure sessions via the time since the last event. The session cookie has a timeout parameter (which defaults to 30 mins but is configurable on initialisation via the sessionCookieTimeout parameter). This is a pretty standard way of understanding web sessions across the industry as far as I’m aware.
Every time an event is registered, the tracker will check the state of the session cookie, and will either retain the session_id if the timeout hasn’t elapsed, or it’ll rotate the id if it has.
So whatever you do use to model this kind of thing, it can be guided by that definition.
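As a rough illustration, the cookie check described above can be modelled as follows. This is only a sketch of the logic, not the tracker's actual code: the names SessionState and on_event are made up for the example, and treating a gap of exactly 30 minutes as still being inside the session is an assumption.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed to match the 30 minute default mentioned above.
SESSION_TIMEOUT = timedelta(minutes=30)


@dataclass
class SessionState:
    """Hypothetical stand-in for what the session cookie remembers."""
    session_id: str
    session_index: int
    last_event_at: datetime


def on_event(state, now, timeout=SESSION_TIMEOUT):
    """Return the session state to attach to an event firing at `now`."""
    if state is not None and now - state.last_event_at <= timeout:
        # Timeout has not elapsed: keep the id and refresh the last-seen time.
        return SessionState(state.session_id, state.session_index, now)
    # No previous session, or the timeout has elapsed: rotate the id.
    next_index = 1 if state is None else state.session_index + 1
    return SessionState(str(uuid.uuid4()), next_index, now)
```

Note that nothing in this model ever marks a session as closed. A session only ends implicitly, when a later event arrives after the timeout and gets a new id.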
However I’m not sure I quite understand specifically what the legacy system you describe does. At the time any event occurs, by definition it occurs when the session is still open. Until the timeout elapses, at any given moment in time, in order to know whether or not it’s still open, we must know whether or not something will happen in future. So I suspect that what that system measures as ‘open’ and ‘closed’ has some definition that I’m not quite grasping here.
If we take the standard definition of a session as I describe it above, we can make a good guess as to whether or not a session is over once we analyse the data. Take the derived_tstamp of the latest event we have from the property (ie that website) overall, and the derived_tstamp of the last event we have for the session. If the gap between the two is longer than the session timeout we have configured (plus an allowance for potential lag), we probably shouldn’t expect more events for that session.
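A minimal sketch of that comparison, assuming the events have already been pulled out of atomic.events as (app_id, domain_sessionid, derived_tstamp) tuples. The function name and the 5 minute lag allowance are illustrative assumptions, not Snowplow settings.

```python
from datetime import datetime, timedelta


def likely_closed_sessions(events,
                           timeout=timedelta(minutes=30),
                           lag=timedelta(minutes=5)):
    """Return the (app_id, domain_sessionid) pairs that look finished.

    `events` is an iterable of (app_id, domain_sessionid, derived_tstamp).
    `lag` is a hypothetical buffer for pipeline latency.
    """
    latest_for_property = {}
    latest_for_session = {}
    for app_id, session_id, tstamp in events:
        if app_id not in latest_for_property or tstamp > latest_for_property[app_id]:
            latest_for_property[app_id] = tstamp
        key = (app_id, session_id)
        if key not in latest_for_session or tstamp > latest_for_session[key]:
            latest_for_session[key] = tstamp
    # A session looks closed when the property as a whole has moved on by
    # more than the timeout plus the lag allowance since its last event.
    return {
        key
        for key, last_seen in latest_for_session.items()
        if latest_for_property[key[0]] - last_seen > timeout + lag
    }
```

Using the property's latest event as the reference point means the check only advances while the site is receiving traffic, so a quiet property will leave its last few sessions looking open.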
We won’t always be correct that way - since data which fails to reach the collector is cached and sent later - but for the most part we’ll be correct, and we can query the data to ascertain how often late data has an impact (by comparing dvce_created_tstamp to dvce_sent_tstamp).
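To put a number on how often late data has an impact, something along these lines could work. It is a sketch that assumes the two timestamps have been loaded as datetime pairs; the function name and the 1 minute threshold are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def late_event_share(events, threshold=timedelta(minutes=1)):
    """Fraction of events sent more than `threshold` after being created.

    `events` is an iterable of (dvce_created_tstamp, dvce_sent_tstamp).
    Rows with a missing timestamp are skipped.
    """
    total = 0
    late = 0
    for created, sent in events:
        if created is None or sent is None:
            continue
        total += 1
        if sent - created > threshold:
            late += 1
    return late / total if total else 0.0
```

Both timestamps come from the device clock, so the difference between them should not be distorted by the device clock disagreeing with the collector.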
I hope that’s helpful - please do feel free to follow up with more detail of how the legacy system works, we’ll do our best to help.
Agreed, your understanding of a web session is the same as mine and standard.
Yes let me elaborate, perhaps a crucial point I omitted was that the legacy system is managing the sessions as a separate post processing step, not on the browser via a cookie. I believe therefore that the legacy system has somewhat decoupled events (page views and other related events) and sessions, with a dedicated task of reviewing the event time stamps and then determining whether the 30-minute inactivity limit has been reached in order to ‘end the session’ – does this make more sense?
I was thinking similarly, but based on the ‘current time’. Just so I am clear however, you are saying for a given session compare the max of derived_tstamp in that session with the max of derived_tstamp for the property (ie, app_id), and if the difference is greater than what we’ve configured as the sessionCookieTimeout parameter (30-minutes) then we conclude with a high degree of confidence that the session has ended – did I get that correct? If so, where would this logic live, is it a separate piece/job that happens in some other process (ie, ETL)? Any thoughts here would be helpful : )
Finally, yes I was looking at the summary stats of the difference between dvce_created_tstamp and dvce_sent_tstamp, as well as etl_tstamp, to see if there was a need to account for this – so far it does not seem to be an issue though I imagine this will change as the Snowplow system fully replaces the legacy system. In the case where we have delayed data, it will be too late as decisions or dependent events will already have been finalized – am not sure if there is an alternative here other than adding a delay or buffer as you mentioned.
You can definitely do this as a post-processing step, but I think it’s going to come down to the somewhat arbitrary definition of a ‘closed’ session, as @Colm has mentioned, and what the use case is.
The cookie method, like a post-processing method, detects the end of a session from the absence of events rather than from an event. So at best we learn that a session is closed some time after the user has actually finished, sometimes 30 minutes after. And a user who has just left to go make a cup of tea may still be in an open session, but we mistake that for a period of inactivity.
Other signals like closing the browser tab, backgrounding the tab or firing continuous pings while the tab is open (this is different to the JS page ping implementation) might help contribute to determining whether a session has finished.
You’d only need to compare here to the current time rather than the max derived_tstamp of the property itself. If 30 minutes have elapsed between max(derived_tstamp) for a given session and the current time, we can assume that the session has finished (and the sessionCookieTimeout will largely do the same - with some caveats around offline behaviour).
Yes - I’d have this as a separate light ETL. Implementations are slightly different in AWS / GCP but generally consist of:
storing the domain_userid / domain_sessionid and a timestamp of the latest event
updating that timestamp whenever a new event is observed for that user / session
periodically (e.g., once a minute) evicting sessions where timestamp < CURRENT_TIMESTAMP - 30 minutes as your ‘closed’ sessions
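The three steps above can be sketched in a few lines. This is an illustration only: SessionTracker is a hypothetical name, and the in-memory dict stands in for whatever key-value store the real job would use on AWS / GCP.

```python
from datetime import datetime, timedelta


class SessionTracker:
    """Keeps the latest event timestamp per (domain_userid, domain_sessionid)."""

    def __init__(self, timeout=timedelta(minutes=30)):
        self.timeout = timeout
        self._last_seen = {}  # (domain_userid, domain_sessionid) -> timestamp

    def observe(self, domain_userid, domain_sessionid, tstamp):
        """Record an event, keeping only the most recent timestamp."""
        key = (domain_userid, domain_sessionid)
        previous = self._last_seen.get(key)
        if previous is None or tstamp > previous:
            self._last_seen[key] = tstamp

    def evict_closed(self, now):
        """Remove and return sessions whose last event is older than the timeout."""
        cutoff = now - self.timeout
        closed = [key for key, ts in self._last_seen.items() if ts < cutoff]
        for key in closed:
            del self._last_seen[key]
        return closed
```

Calling evict_closed on a schedule (e.g., once a minute) returns each ‘closed’ session exactly once, which suits the notification use case. If a late event arrives for an already evicted session, observe will start tracking it again, so downstream rules may want to guard against a second notification.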