Bounce rate and session duration query optimization

Hi guys!

I was wondering if there is anyway to optimize the query below that executes in ~30sec:

-- :name get_overview :many
WITH sessions_cte AS (
    SELECT
        e1.domain_userid,
        e1.domain_sessionidx,
        CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounced,
        EXTRACT(EPOCH FROM (MAX(e1.derived_tstamp) - MIN(e1.derived_tstamp))) AS duration
    FROM atomic.events e1
    WHERE e1.app_id = :app_id
    AND e1.event = 'page_view'
    AND e1.derived_tstamp BETWEEN :start_date AND :end_date
    GROUP BY 1, 2
)
SELECT
    COUNT(DISTINCT(domain_userid)) AS visitors,
    COUNT(*) AS pageviews,
    (SELECT SUM(bounced)::REAL / COUNT(*) FROM sessions_cte) AS bounce_rate,
    (SELECT AVG(duration) FROM sessions_cte) AS session_duration
FROM atomic.events
WHERE app_id = :app_id
AND event = 'page_view'
AND derived_tstamp BETWEEN :start_date AND :end_date
ORDER BY 1 DESC

For context, the following query executes in <1sec:

SELECT
    COUNT(DISTINCT(domain_userid)) AS visitors,
    COUNT(*) AS pageviews
FROM atomic.events
WHERE app_id = :app_id
AND event = 'page_view'
AND derived_tstamp BETWEEN :start_date AND :end_date
ORDER BY 1 DESC

Splitting the query in two solved the issue:

-- :name get_overview_1 :many
SELECT
    COUNT(DISTINCT(domain_userid)) AS visitors,
    COUNT(*) AS pageviews
FROM atomic.events
WHERE app_id = :app_id
AND event = 'page_view'
AND derived_tstamp BETWEEN :start_date AND :end_date
ORDER BY 1 DESC

-- :name get_overview_2 :many
SELECT
    SUM(bounced)::REAL / COUNT(*) AS bounce_rate,
    AVG(duration) AS session_duration
FROM (
    SELECT
        domain_userid,
        domain_sessionidx,
        CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounced,
        EXTRACT(EPOCH FROM (MAX(derived_tstamp) - MIN(derived_tstamp))) AS duration
    FROM atomic.events
    WHERE app_id = :app_id
    AND event = 'page_view'
    AND derived_tstamp BETWEEN :start_date AND :end_date
    GROUP BY 1, 2
)
ORDER BY 1 DESC

In general your constraint on this query is mostly going to be bound by your database itself as well as your table design. If we assume that this in Redshift the biggest contributors for this query is going to be the sort key for the table which is going to define how many blocks are required to be scanned in order to return the result. If it’s derived_tstamp then this will be reasonably performant - but if it’s not this scan may be larger than it needs to be.

If you want session duration (from a user point of view) I’d be tempted to go with dvce_created_tstamp here which is the local timestamp on the end user device rather than the derived_tstamp which is adjusted based on the collector timestamp. If you have a significant number of single page view sessions I would wrap this in a case as well so that you are not EXTRACT/MIN/MAXing an expression that will be 0 for “bounced” sessions.

It’s hard to know without being familiar with the instrumentation but this will give session time based exclusively on page views. For certain applications this is fine - but if you have users that read articles / events that fire post page view you may want to consider including these events as part of the session time as well (which would give you a more generous result) as this is more akin to an interaction bounce vs a non-interaction bounce. Depending on your application there is a movement away from bounce rates (i.e., they no longer exist in GA4 at all) and more towards defining engaged users and sessions based on these interactions.

1 Like

@mike thanks for the reply!

I’ll have to check if the sort key is derived_tstamp, I doubt it is.

If you have a significant number of single page view sessions I would wrap this in a case as well so that you are not EXTRACT/MIN/MAXing an expression that will be 0 for “bounced” sessions.

Makes sense. I just made the change. :slight_smile: