I was wondering if there is anyway to optimize the query below that executes in ~30sec:
-- :name get_overview :many
WITH sessions_cte AS (
SELECT
e1.domain_userid,
e1.domain_sessionidx,
CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounced,
EXTRACT(EPOCH FROM (MAX(e1.derived_tstamp) - MIN(e1.derived_tstamp))) AS duration
FROM atomic.events e1
WHERE e1.app_id = :app_id
AND e1.event = 'page_view'
AND e1.derived_tstamp BETWEEN :start_date AND :end_date
GROUP BY 1, 2
)
SELECT
COUNT(DISTINCT(domain_userid)) AS visitors,
COUNT(*) AS pageviews,
(SELECT SUM(bounced)::REAL / COUNT(*) FROM sessions_cte) AS bounce_rate,
(SELECT AVG(duration) FROM sessions_cte) AS session_duration
FROM atomic.events
WHERE app_id = :app_id
AND event = 'page_view'
AND derived_tstamp BETWEEN :start_date AND :end_date
ORDER BY 1 DESC
For context, the following query executes in <1sec:
SELECT
COUNT(DISTINCT(domain_userid)) AS visitors,
COUNT(*) AS pageviews
FROM atomic.events
WHERE app_id = :app_id
AND event = 'page_view'
AND derived_tstamp BETWEEN :start_date AND :end_date
ORDER BY 1 DESC
-- :name get_overview_1 :many
SELECT
COUNT(DISTINCT(domain_userid)) AS visitors,
COUNT(*) AS pageviews
FROM atomic.events
WHERE app_id = :app_id
AND event = 'page_view'
AND derived_tstamp BETWEEN :start_date AND :end_date
ORDER BY 1 DESC
-- :name get_overview_2 :many
SELECT
SUM(bounced)::REAL / COUNT(*) AS bounce_rate,
AVG(duration) AS session_duration
FROM (
SELECT
domain_userid,
domain_sessionidx,
CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounced,
EXTRACT(EPOCH FROM (MAX(derived_tstamp) - MIN(derived_tstamp))) AS duration
FROM atomic.events
WHERE app_id = :app_id
AND event = 'page_view'
AND derived_tstamp BETWEEN :start_date AND :end_date
GROUP BY 1, 2
)
ORDER BY 1 DESC
In general your constraint on this query is mostly going to be bound by your database itself as well as your table design. If we assume that this in Redshift the biggest contributors for this query is going to be the sort key for the table which is going to define how many blocks are required to be scanned in order to return the result. If it’s derived_tstamp then this will be reasonably performant - but if it’s not this scan may be larger than it needs to be.
If you want session duration (from a user point of view) I’d be tempted to go with dvce_created_tstamp here which is the local timestamp on the end user device rather than the derived_tstamp which is adjusted based on the collector timestamp. If you have a significant number of single page view sessions I would wrap this in a case as well so that you are not EXTRACT/MIN/MAXing an expression that will be 0 for “bounced” sessions.
It’s hard to know without being familiar with the instrumentation but this will give session time based exclusively on page views. For certain applications this is fine - but if you have users that read articles / events that fire post page view you may want to consider including these events as part of the session time as well (which would give you a more generous result) as this is more akin to an interaction bounce vs a non-interaction bounce. Depending on your application there is a movement away from bounce rates (i.e., they no longer exist in GA4 at all) and more towards defining engaged users and sessions based on these interactions.
I’ll have to check if the sort key is derived_tstamp, I doubt it is.
If you have a significant number of single page view sessions I would wrap this in a case as well so that you are not EXTRACT/MIN/MAXing an expression that will be 0 for “bounced” sessions.