The following are basic SQL recipes that can be useful for anyone starting to get familiar with the structure of Snowplow data collected from the web (Javascript tracker in particular):
Number of unique visitors per day, last 31 days
The number of unique visitors can be calculated by summing the number of distinct domain_userids in a specified time period. (Because each user is assigned a unique domain_userid, based on a lack of Snowplow tracking cookies on their browser):
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) as "Date",
COUNT(DISTINCT(domain_userid)) as "Uniques"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 1;
2. Number of visits per day, last 31 days
Because each user might visit a site more than once, summing the number of domain_userid
s returns the number if visitors, NOT the number of visits. Every time a user visits the site, however, Snowplow assigns that session with a domain_sessionidx
(e.g. 1 for their first visit, 2 for their second.) Hence, to count the number of visits in a time period, we concatenate the unique domain_userid with the domain_sessionidx
and then count the number of distinct concatenated entry in the events table:
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) as "Date",
COUNT(DISTINCT(domain_userid || '-' || domain_sessionidx)) as "Visits"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 1;
3. Number of page views per day
Page views are one type of event that are stored in the Snowplow events table. They can easily be identified using the event field, which is set to ‘page_view’.
To count the number of page views by day, then we simply execute the following query:
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) AS "Date",
COUNT(*) AS "page_views"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
AND event = 'page_view'
GROUP BY 1
ORDER BY 1;
4. Number of events
Although the number of page views is a standard metric in web analytics, this reflects the web’s history as a set of hyperlinked documents rather than the modern reality of web applications that are comprise lots of AJAX events (that need not necessarily result in a page load.)
As a result, counting the total number of events (including page views but also other AJAX events) is actually a more meaningful thing to do than to count the number of page views, as we have done above. We recommend setting up Snowplow so that all events / actions that a user takes are tracked. Hence, running the below queries should return a total sum of events on the site by time period:
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', derived_tstamp) AS "Date",
event,
COUNT(*) AS "Number"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2
ORDER BY 1,2;
5. Pages per visit
The number of pages per visit can be calculated by visit very straightforwardly:
/* Redshift / PostgreSQL */
SELECT
domain_userid || '-' || domain_sessionidx AS "session",
COUNT(*) as "pages_visited"
FROM events
WHERE event = 'page_view'
AND derived_tstamp > current_date - integer '31'
GROUP BY 1
We can then aggregate our data by number of pages per visit, to produce a frequency table:
/* Redshift / PostgreSQL */
CREATE VIEW basic_recipes.pages_per_visit AS
SELECT
pages_visited,
COUNT(*) as "frequency"
FROM (
SELECT
domain_userid || '-' || domain_sessionidx AS "session",
COUNT(*) as "pages_visited"
FROM events
WHERE event = 'page_view'
AND derived_tstamp > current_date - integer '31'
GROUP BY 1
) AS page_view_per_visit
GROUP BY 1
ORDER BY 1;
6. Bounce rate
First we need to identify all the sessions that were ‘bounces’. These are visits where there is only a single event captured: the initial page view:
/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "time_first_touch",
COUNT(*) as "number_of_events",
CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounces
FROM events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2;
This query returns a line of data for every session. For each, it logs a timestamp, the number of events, and a flag that is set to 1 if the visitor bounced.
To calculate bounce rate by day, we take the above table, aggregate the results by day, sum the number of bounces and divide it by the total number of sessions:
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', time_first_touch) AS "Date",
SUM(bounces)::REAL/COUNT(*) as "Bounce rate"
FROM (
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "time_first_touch",
COUNT(*) as "number_of_events",
CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounces
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2
) v
GROUP BY 1
ORDER BY 1;
Note that we have to cast sum(bounces) as a ‘real’ number, to force Redshift / PostgreSQL to output a real number rather than an integer for the bounce rate.
7. % new visits
A new visit is easily identified as a visit where the domain_sessionidx
= 1. Hence, to calculate the % of new visits, we need to sum all the visits where domain_sessionidx
= 1 and divide by the total number of visits, in the time period.
First, we create a table with every visit stored, and identify which visits were “new”:
/* Redshift / PostgreSQL */
SELECT
MIN(derived_tstamp) AS "time_first_touch",
domain_userid,
domain_sessionidx,
CASE WHEN domain_sessionidx = 1 THEN 1 ELSE 0 END AS "first_visit"
FROM events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY domain_userid, domain_sessionidx;
Then we aggregate the visits over our desired time period, and calculate the fraction of them that are new:
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', time_first_touch) AS "Date",
SUM(first_visit)::REAL/COUNT(*) as "fraction_of_visits_that_are_new"
FROM (
SELECT
MIN(derived_tstamp) AS "time_first_touch",
domain_userid,
domain_sessionidx,
CASE WHEN domain_sessionidx = 1 THEN 1 ELSE 0 END AS "first_visit"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY domain_userid, domain_sessionidx) v
GROUP BY 1
ORDER BY 1;
8. Average visitor duration
To calculate this, 1st we need to calculate the duration of every visit:
/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "start_time",
MAX(derived_tstamp) as "finish_time",
MAX(derived_tstamp) - min(derived_tstamp) as "duration"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2;
Then we simply average visit durations over the time period we’re interested e.g. by day:
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', start_time) AS "Date",
AVG(duration)/1000000 as "average_visit_duration_seconds"
FROM (
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "start_time",
MAX(derived_tstamp) as "finish_time",
MAX(derived_tstamp) - min(derived_tstamp) as "duration"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2
) v
group by 1
order by 1;
9. Demographics: language
For each event the browser language is stored in the br_language
field. As a result, counting the number of visitors in a time period by language is trivial:
/* Redshift / PostgreSQL */
SELECT
br_lang,
COUNT(DISTINCT(domain_userid)) as "visitors"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY br_lang
ORDER BY 2 DESC;
10. Demographics: location
We can identify the geographic location of users using the geo_country
, geo_region
, geo_city
, geo_zipcode
, geo_latitude
and geo_longitude
fields.
To calculate the number of visitors in the last month by country, simply execute:
/* Redshift / PostgreSQL */
SELECT
geo_country,
COUNT(DISTINCT(domain_userid)) as "visitors"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2 DESC;
11. Behavior: new vs returning
Within a given time period, we can compare the number of new visitors (for whom domain_sessionidx
= 1) with returning visitors (for whom domain_sessionidx
> 1):
/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as time_first_touch,
CASE WHEN domain_sessionidx = 1 THEN 'new' ELSE 'returning' END AS "new_vs_returning"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY domain_userid, domain_sessionidx;
Then we can aggregate them by time period, to get the total new vs returning e.g. by day:
/* Redshift / PostgreSQL */
SELECT
DATE_TRUNC('day', time_first_touch) AS "Date",
SUM(first_visit)::REAL/COUNT(*) as "fraction_of_visits_that_are_new"
FROM (
SELECT
MIN(derived_tstamp) AS "time_first_touch",
domain_userid,
domain_sessionidx,
CASE WHEN domain_sessionidx = 1 THEN 1 ELSE 0 END AS "first_visit"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY domain_userid, domain_sessionidx) v
GROUP BY 1
ORDER BY 1;
12. Behavior: frequency
We can plot the distribution of visits in a time period by the number of visits each visitor has performed:
SELECT
domain_sessionidx as "Number of visits",
COUNT(DISTINCT(domain_userid)) as "Frequency"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2;
13. Behavior: recency
We can plot the distribution of visits by the number of days since the previous visit. To do this, we first identify all the visits in our time period:
/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
domain_sessionidx - 1 as "previous_domain_sessionidx",
MIN(derived_tstamp) as "time_first_touch"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1, 2;
We can join the above table with a similar table, but join each visit to data for the previous visit, so we can calculate the number of days between visits:
/* Redshift / PostgreSQL */
SELECT
n.domain_userid,
n.domain_sessionidx,
EXTRACT(EPOCH FROM (n.time_first_touch - p.time_first_touch))/3600/24 as "days_between_visits",
CASE
WHEN n.domain_sessionidx = 1 THEN '0'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 1 THEN '1'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 2 THEN '2'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 3 THEN '3'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 4 THEN '4'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 5 THEN '5'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 10 THEN '6-10'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 25 THEN '11-25'
ELSE '25+' END as "Days between visits"
FROM (
SELECT
domain_userid,
domain_sessionidx,
domain_sessionidx - 1 as "previous_domain_sessionidx",
MIN(derived_tstamp) as "time_first_touch"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2
) n
LEFT JOIN (
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "time_first_touch"
FROM "atomic".events
GROUP BY 1,2
) p on n.previous_domain_sessionidx = p.domain_sessionidx
and n.domain_userid = p.domain_userid;
Finally, we group the results by the number of days between visits, to plot a frequency table:
/* Redshift / PostgreSQL */
SELECT
"Days between visits",
COUNT(*) as "Number of visits"
FROM (
SELECT
n.domain_userid,
n.domain_sessionidx,
EXTRACT(EPOCH FROM (n.time_first_touch - p.time_first_touch))/3600/24 as "days_between_visits",
CASE
WHEN n.domain_sessionidx = 1 THEN '0'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 1 THEN '1'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 2 THEN '2'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 3 THEN '3'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 4 THEN '4'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 5 THEN '5'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 10 THEN '6-10'
WHEN extract(epoch FROM (n.time_first_touch - p.time_first_touch))/3600/24 < 25 THEN '11-25'
ELSE '25+' END as "Days between visits"
FROM (
SELECT
domain_userid,
domain_sessionidx,
domain_sessionidx - 1 as "previous_domain_sessionidx",
MIN(derived_tstamp) as "time_first_touch"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2
) n
LEFT JOIN (
SELECT
domain_userid,
domain_sessionidx,
MIN(derived_tstamp) as "time_first_touch"
FROM "atomic".events
GROUP BY 1,2
) p ON n.previous_domain_sessionidx = p.domain_sessionidx
AND n.domain_userid = p.domain_userid
) t
GROUP BY 1
ORDER BY 1;
14. Behavior: engagement
Google Analytics provides two sets of metrics to indicate engagement:
- Visit duration
- Page depth (i.e. number of pages visited per session)
Both of these are flakey and unsophisticated measures of engagement. Nevertheless, they are easy to report on in Snowplow. To plot visit duration, we execute the following query:
/* Redshift / PostgreSQL */
SELECT
domain_userid,
domain_sessionidx,
CASE
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 1800 THEN 'g. 1801+ seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 600 THEN 'f. 601-1800 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 180 THEN 'e. 181-600 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 60 THEN 'd. 61 - 180 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 30 THEN 'c. 31-60 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 10 THEN 'b. 11-30 seconds'
ELSE 'a. 0-10 seconds' END AS "Visit duration"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2;
Then we aggregate the results for each bucket, so we have frequency by bucket:
/* Redshift / PostgreSQL */
SELECT
"Visit duration",
COUNT(*) as "Number of visits"
FROM (
SELECT
domain_userid,
domain_sessionidx,
CASE
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 1800 THEN 'g. 1801+ seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 600 THEN 'f. 601-1800 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 180 THEN 'e. 181-600 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 60 THEN 'd. 61 - 180 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 30 THEN 'c. 31-60 seconds'
WHEN extract(EPOCH FROM (MAX(dvce_tstamp)-MIN(dvce_tstamp))) > 10 THEN 'b. 11-30 seconds'
ELSE 'a. 0-10 seconds' END AS "Visit duration"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1,2
) t
GROUP BY 1
ORDER BY 1;
We can also look at the number of page views per visit:
/* Redshift / PostgreSLQ */
select
domain_userid,
domain_sessionidx,
count(*) as "Page views per visit"
from events
where derived_tstamp > current_date - integer '31'
and event = 'page_view'
group by domain_userid, domain_sessionidx;
We then aggregate those results together by the number of page views per visit
/* Redshift / PostgreSLQ */
SELECT
"Page views per visit",
COUNT(*) as "Number of visits"
FROM (
SELECT
domain_userid,
domain_sessionidx,
COUNT(*) as "Page views per visit"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
AND event = 'page_view'
GROUP BY 1,2
) t
GROUP BY 1
ORDER BY 1;
15. Technology: browser
Browser details are stored in the events table in the br_name
, br_family
, br_version
, br_type
, br_renderingengine
, br_features
and br_cookies
fields.
Looking at the distribution of visits by browser is straightforward:
/* Redshift / PostgreSQL */
SELECT
br_family as "Browser",
COUNT(DISTINCT(domain_userid || domain_sessionidx)) as "Visits"
FROM "atomic"events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2 DESC;
16. Technology: operating system
Operating system details are stored in the events table in the os_name
, os_family
and os_manufacturer
fields.
Looking at the distribution of visits by operating system is straightforward:
/* Redshift / PostgreSQL */
CREATE VIEW basic_recipes.technology_os AS
SELECT
os_name as "Operating System",
COUNT(DISTINCT(domain_userid || domain_sessionidx)) as "Visits"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1
ORDER BY 2 DESC;
17. Technology: mobile
To work out how the number of visits in a given time period splits between visitors on mobile and those not, simply execute the following query:
/* Redshift / PostgreSQL */
CREATE VIEW basic_recipes.technology_mobile AS
SELECT
CASE WHEN dvce_ismobile=1 THEN 'mobile' ELSE 'desktop' END AS "Device type",
COUNT(DISTINCT(domain_userid || domain_sessionidx)) as "Visits"
FROM "atomic".events
WHERE derived_tstamp > current_date - integer '31'
GROUP BY 1;