Modelling
From a modelling perspective there’s two things that would be useful to add to this that either enrich the representation or are required for something like a Sankey diagram:
- The notion of a pseudo time-based index for each action, e.g., first page view = 1, 3rd page view = 3 which in a Sankey roughly corresponds to your x-axis.
- If required the time between actions (Sankey diagrams don’t really show this so visualisations such as directed acyclic graphs are often a better option here).
- A scalar value that represents the magnitude (“flow”) from one state to another.
To generate the first two we can use Redshift window functions by partitioning by user and by session and to generate the final element we can simply aggregate our raw data in whatever means required.
A pre-aggregated query that gives us the raw data might look something like this:
SELECT
derived_tstamp,
domain_userid,
domain_sessionidx,
refr_urlpath,
page_urlpath,
event_name,
ROW_NUMBER()
OVER (
PARTITION BY domain_userid, domain_sessionidx
ORDER BY derived_tstamp ASC ) AS event_index,
DATEDIFF(
'seconds',
LAG(derived_tstamp, 1)
OVER (
PARTITION BY domain_userid, domain_sessionidx
ORDER BY derived_tstamp ASC ),
derived_tstamp
) AS last_action
FROM
atomic.events
WHERE
collector_tstamp > SYSDATE - INTERVAL '1 days'
AND
event_name = 'page_view'
We use the ROW_NUMBER()
window function to generate our event index, and the LAG
window function to get the time between subsequent page views. Some things to note here is that we’re limiting ourselves to page_views, but we could just as easily include additional events such as add_to_cart
. This example also uses derived_tstamp
rather than collector_tstamp
, the short answer being that this will give us an accurate representation of time for the users device (there’s a great blogpost on the Snowplow representations of time here).
From here it’s easy to derive by using a group by statement (on refr_urlpath, page_urlpath and event_index) to generate your magnitude. This may be something like a COUNT(*)
or could be something more useful such as a count of unique users COUNT(DISTINCT(domain_userid))
that have performed this action.
Visualisation
Visualisation is a little tricky. As @brucey31 has mentioned above the method in Tableau is a bit hacky and in a graph implementation (such as a bidirectional cyclic graph) you lose some of the inherent value of having an event_index that also plays a role in the layout of the visualisation.
That said there’s a fantastic d3 library that generates Sankey diagrams to do some of the hard parts (layout generation) and provides a simple interface for data.