Summary
Based off of this original Snowplow team recipe, http://snowplowanalytics.com/guides/recipes/catalog-analytics/measuring-and-comparing-content-page-performance.html, which plots out a journey map for a single user, I’ve created a similar visualization that illustrates an ‘average journey’ for a cohort of visitors. In my example, I’ve filtered the data so that only data from a group of people that purchased a certain product on their first visit was analyzed. Feel free to reuse and tweak as you like!
Sample Visualization
Understanding the Visualization:
Here, each URL is represented by a row (y-axis) on the Gantt chart. The start of each bar represents the average start time that the cohort began viewing this URL (in seconds) from the time their session started. The X axis represents the amount of seconds that has passed since the session start. The opacity of each bar represents the number of users that traversed that URL path. That is, bars that are darker were more heavily traversed than URLs with lighter bars.
Methodology:
Using Redshift, I identified the cohort of users of interest based on the products they purchased, and by using the session start time joined up the pageview and ping events for all of the URLs that this cohort viewed. In addition, I also calculate the number of cohort users visited a specific URL path. After an export into R, I do some simple data cleaning and create the visualization that’s shown above.
Redshift SQL:
--The following query calcualtes the average start and end times for pageviews and ping events for URLs of a specifically defined user cohort
--In the case when a user has many visits, the time between those visits is effectively removed. That is, if there was a 1 hour gap between session 1 and session 2, the last event of session 1 will appear directly before the first event of session 2, without a gap
select
userSessionURLTotals.domain_sessionidx,
lower(userSessionURLTotals.page_urlpath) as page_urlpath,
--USE this in place of calculated page_urlpath above if you'd like to only consider html files and not paths substring(userSessionURLTotals.page_urlpath, (len(userSessionURLTotals.page_urlpath))- position('/' in reverse(lower(userSessionURLTotals.page_urlpath))) + 1,len(userSessionURLTotals.page_urlpath)) as page_urlpath4,
avg(userSessionURLTotals.AdjmintimeFromSessionStart) as avgMinTimeFromStartSession,
avg(userSessionURLTotals.AdjmaxtimeFromSessionStart) as avgMinTimeFromEndSession,
avg(AdjTotalTime) as avgTotalTime,
count(distinct userSessionURLTotals.domain_userid) as urlUserCount,
totalUserCount,
cast(count(distinct userSessionURLTotals.domain_userid)as float) / cast(totalUserCount as float) as userRatio
from
(
select
sessions.domain_userid,
sessions.domain_sessionidx,
userEnteranceStartTime,
sessions.sessionStartTime,
sessions.sessionEndTime,
sessions.totalSessionTime,
previousSessions.runTotal,
urls.page_urlpath,
urls.mintstamp,
urls.maxtstamp,
urls.totaltime,
datediff(second, sessionStartTime, urls.mintstamp) mintimeFromSessionStart,
datediff(second, sessionStartTime, urls.maxtstamp) maxtimeFromSessionStart,
datediff(second, sessionStartTime, urls.mintstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end as AdjmintimeFromSessionStart,
datediff(second, sessionStartTime, urls.maxtstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end as AdjmaxtimeFromSessionStart,
(datediff(second, sessionStartTime, urls.maxtstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end) - (datediff(second, sessionStartTime, urls.mintstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end) as AdjTotalTime,
datediff(second, userEnteranceStartTime, urls.mintstamp) mintimeFromUserEnter,
datediff(second, userEnteranceStartTime, urls.maxtstamp) maxtimeFromUserEnter,
custCohort.totalUserCount
from
(
select
user_id,
domain_userid,
purchaseTstamp,
UserSeqOrderNumber,
domain_sessionidx,
sum(count(distinct domain_userid)) over () as totalUserCount
from
(
select
user_id,
domain_userid,
tr_orderid,
domain_sessionidx,
collector_tstamp as purchaseTstamp,
rank() over (partition by user_id order by collector_tstamp) as UserSeqOrderNumber
from atomic.events
where
event_name='transaction'
and user_id is not null
and tr_orderid is not null
--and domain_sessionidx = 1 --uncomment this line if you'd like to filter for a specific number of visits
group by
user_id,
domain_userid,
tr_orderid,
domain_sessionidx,
collector_tstamp
) CustBase
join
(--selects all items that were a part of the order
select
ti_orderid,
LISTAGG(ti_sku,';') as SKUsPurchased,
LISTAGG(ti_quantity,';') as SKUQuantities,
LISTAGG(ti_price,';') as SKUListPrices,
LISTAGG(replace(ti_name,',',''),';') as SKUNames,
sum(ti_quantity) as totalQuantity,
sum(ti_price*ti_quantity) as totalSKUsPrice
from
(
select *
from atomic.events e
where event_name='transaction_item'
order by ti_sku
)
group by
ti_orderid
)items on items.ti_orderid = CustBase.tr_orderid
where
UserSeqOrderNumber =1 --limiting users to their first purchase only
--FILTERING FOR PRODUCT to define COHORT
and
(
SKUsPurchased like '%PART_NUM1%'
or SKUsPurchased like '%PART_NUM2%'
--add additional product filters when necessary
)
group by
user_id,
domain_userid,
purchaseTstamp,
UserSeqOrderNumber,
domain_sessionidx
)custCohort
join
--get the session information
(
select distinct
domain_userid,
domain_sessionidx,
min(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionStartTime,
max(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionEndTime,
min(collector_tstamp) over (partition by domain_userid ) as userEnteranceStartTime,
datediff(second, min(collector_tstamp) over (partition by domain_userid, domain_sessionidx) , max(collector_tstamp) over (partition by domain_userid, domain_sessionidx )) totalSessionTime
FROM "atomic".events
)sessions on sessions.domain_userid = custCohort.domain_userid and sessions.domain_sessionidx <= custCohort.domain_sessionidx
left join
(
--getting previous session's time in seconds
select distinct
domain_userid,
domain_sessionidx as PreviousSessionID,
domain_sessionidx +1 as nextSessionIDX,
sum(totalSessionTime) over (partition by domain_userid order by domain_sessionidx ROWS UNBOUNDED PRECEDING ) runTotal
from
(
select distinct
domain_userid,
domain_sessionidx,
min(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionStartTime,
max(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionEndTime,
min(collector_tstamp) over (partition by domain_userid ) as userEnteranceStartTime,
datediff(second, min(collector_tstamp) over (partition by domain_userid, domain_sessionidx) , max(collector_tstamp) over (partition by domain_userid, domain_sessionidx )) totalSessionTime
FROM "atomic".events
) a
) previousSessions on previousSessions.domain_userid = sessions.domain_userid and previousSessions.nextSessionIDX = sessions.domain_sessionidx
--get the URL visit and times for each URL of each session
join
(
SELECT
domain_userid,
domain_sessionidx,
page_urlpath,
min(collector_tstamp) minTstamp,
max(collector_tstamp) maxTstamp,
datediff(second,min(collector_tstamp), max(collector_tstamp)) totalTime
FROM "atomic".events
WHERE
(event = 'page_ping' OR event = 'page_view')
and
group by
domain_userid,
domain_sessionidx,
page_urlpath
having
datediff(second, min(collector_tstamp), max(collector_tstamp)) > 0
order by
domain_sessionidx,
min(collector_tstamp)
)urls on urls.domain_sessionidx = sessions.domain_sessionidx and sessions.domain_userid = urls.domain_userid
order by sessions.domain_sessionidx, mintimefromsessionstart
)userSessionURLTotals
group by
userSessionURLTotals.domain_sessionidx,
lower(userSessionURLTotals.page_urlpath),
userSessionURLTotals.totalUserCount,
--substring(userSessionURLTotals.page_urlpath, (len(userSessionURLTotals.page_urlpath))- position('/' in reverse(lower(userSessionURLTotals.page_urlpath)))+1,len(userSessionURLTotals.page_urlpath))
having count(distinct userSessionURLTotals.domain_userid) > 1 --filtering out URLS that only have a single cohort user visiting it
order by
domain_sessionidx,
urlUserCount,
avgMinTimeFromStartSession
R CODE:
library(DiagrammeR)
library(tidyr)
library(dplyr)
library(ggplot2)
#creating z-scores for each time column
rawData$scaleMin = scale(rawData$avgmintimefromstartsession)
rawData$scaleMax = scale(rawData$avgmintimefromendsession)
#removing outliers for start or end times (zscores +- 2 stdev)
scaledData = subset(rawData, abs(scaleMax) <=2 & abs(scaleMin) <=2)
#calculating alpha/opacity based on scaled data set
scaledData$alphaRatio = scaledData$urlusercount / max(scaledData$urlusercount)
#converting domain_sessionidx to a factor data type
scaledData$visitCount = factor(scaledData$domain_sessionidx)
par(bg = "transparent")
##plotting multiple user User Journey, using average start from session start and average end from session start metrics
ggplot(scaledData, aes(colour=visitCount)) +
geom_segment(aes(x=avgmintimefromstartsession, xend=avgmintimefromendsession, y=page_urlpath, yend=page_urlpath, alpha=userratio), size=3) +
xlab("Duration in Seconds from Session Start") + ylab("URL Path")