Why is event data modeling difficult?

Event data modeling is the process of using business logic to aggregate over event-level data to produce ‘modeled’ data that is simpler to query.

Expressing business logic in SQL is hard

Event data modeling is hard primarily because expressing that business logic in SQL is hard.

SQL makes it easy to perform simple aggregations over rows of data. Unfortunately, performing simple aggregations over event-level data is typically not very useful.

To illustrate this, let’s consider an example:

  • We are a media company that records play_video events
  • With each event, we record a video_category indicating the category of video played (e.g. comedy, romance etc.)
  • We want to see if particular categories are becoming more popular over time at the expense of others

This is a simple question to answer with the data. We aggregate play events by day and category as follows:

select
       date_trunc('day', derived_tstamp) as day,
       video_category,
       count(*) as play_count
  from atomic.events
 where event_name = 'play_video'
 group by 1,2
 order by 1,2;

We can then pivot the resulting data set to visualize whether the mix of categories viewed is changing over time.
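The pivot step itself is mechanical; a minimal pure-Python sketch, using hypothetical daily counts of the shape the query above would return, is:

```python
from collections import defaultdict

# Hypothetical query output: (day, video_category, play_count)
daily_counts = [
    ("2016-03-01", "comedy", 120),
    ("2016-03-01", "romance", 80),
    ("2016-03-02", "comedy", 90),
    ("2016-03-02", "romance", 150),
]

# Pivot to one row per day, one column per category
pivoted = defaultdict(dict)
for day, category, count in daily_counts:
    pivoted[day][category] = count

# Share of plays per category per day, to spot a changing mix
shares = {
    day: {cat: n / sum(cats.values()) for cat, n in cats.items()}
    for day, cats in pivoted.items()
}
```

Plotting each category’s share per day then shows directly whether one category is growing at the expense of others.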

Simple! Yes. Unfortunately, most of the questions we want to ask of event data are harder. Consider two typical examples:

  1. How do conversion rates vary by acquisition channel?
  2. At what point in the purchase funnel do most users churn? How does that vary by user type?

Tying patterns of behaviour early in a user’s journey to patterns of behaviour later on is hard to do in SQL

The above two questions require more than simple aggregations. If we want to compare conversion rates by acquisition channel, we have to take an event early on in a user journey (the user engaged with a marketing channel) and tie that to an event that may or may not occur later on in the same user’s journey (the user converted). Only then do we aggregate over users (calculate the % of users converted) and slice that by channels (calculate the % of users converted by each channel) rather than simply aggregating over the underlying marketing touch and conversion events. To recap the computational steps required, we need to:

  1. Group events by users
  2. For each user, identify an acquisition channel
  3. For each user, identify whether or not they converted
  4. Aggregate over users by acquisition channel, and for each channel calculate an aggregated conversion rate
  5. Compare the conversion rate by channel
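The steps above can be sketched outside of SQL in a few lines of Python; the event names, channels and user IDs below are hypothetical, purely to make the shape of the computation concrete:

```python
from collections import defaultdict

# Hypothetical event stream: (user_id, event_name, channel_or_None)
events = [
    ("u1", "marketing_touch", "email"),
    ("u1", "conversion", None),
    ("u2", "marketing_touch", "search"),
    ("u3", "marketing_touch", "email"),
]

# 1. Group events by user
by_user = defaultdict(list)
for user, name, channel in events:
    by_user[user].append((name, channel))

# 2-3. For each user, identify an acquisition channel and whether they converted
channel_of, converted = {}, {}
for user, user_events in by_user.items():
    channel_of[user] = next(c for n, c in user_events if n == "marketing_touch")
    converted[user] = any(n == "conversion" for n, _ in user_events)

# 4. Aggregate over users by channel to get a conversion rate per channel
totals, conversions = defaultdict(int), defaultdict(int)
for user, channel in channel_of.items():
    totals[channel] += 1
    conversions[channel] += converted[user]

rates = {ch: conversions[ch] / totals[ch] for ch in totals}
```

Note that the aggregation is over users, not over the raw touch and conversion events; that intermediate per-user step is exactly what plain SQL aggregations do not give you for free.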

The second example is similar: again we need to aggregate events by users. For each user we identify all the events that describe how they engaged with a particular funnel. For each user we then categorise how far through the funnel they get / where in the funnel they ‘drop off’, based on the event stream. We then aggregate by user, slicing by user type and stage in the funnel that each user drops off, to compare drop off rates by stage by user type.
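The funnel version of this per-user logic can be sketched the same way; the funnel stages, user types and events below are hypothetical:

```python
from collections import defaultdict

FUNNEL = ["view_product", "add_to_basket", "checkout", "purchase"]

# Hypothetical events: (user_id, user_type, event_name)
events = [
    ("u1", "new", "view_product"),
    ("u1", "new", "add_to_basket"),
    ("u2", "returning", "view_product"),
    ("u2", "returning", "add_to_basket"),
    ("u2", "returning", "checkout"),
    ("u3", "new", "view_product"),
]

# Group events by user, recording each user's type
stages, user_type = defaultdict(set), {}
for user, utype, name in events:
    stages[user].add(name)
    user_type[user] = utype

# For each user, find the furthest funnel stage reached (their drop-off point)
furthest = {
    user: max(FUNNEL.index(s) for s in reached if s in FUNNEL)
    for user, reached in stages.items()
}

# Aggregate: count of users dropping off at each stage, sliced by user type
drop_off = defaultdict(lambda: defaultdict(int))
for user, stage_idx in furthest.items():
    drop_off[user_type[user]][FUNNEL[stage_idx]] += 1
```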

As should be clear from both the above examples, when we’re dealing with event data, we’re generally interested in understanding journeys that are composed of series of events. We therefore need to aggregate events over those journeys to generate tables that represent higher-order units of analysis like funnels, workflows or sessions. We’re often interested in the sequence of events, and in the impact that events earlier in a user journey have on the likelihood of particular events later in that same journey. Understanding that sequence is not something that falls out of a standard set of aggregation functions like COUNT, SUM and MAX. That can make modeling event data difficult, because it is not trivial to express the business logic that we wish to apply to event-level data in languages like SQL that were not built around event data modeling.

SQL is hard to read

The below is a SQL query that sessionizes events using a very simple rule: a new session begins wherever there is a gap of 10 minutes or more in user activity:

-- Find the start of every session
SELECT *
  FROM (
       SELECT *,
              LAG(derived_tstamp,1) OVER (PARTITION BY domain_userid ORDER BY derived_tstamp) AS last_event
         FROM "atomic".events
       ) last
 WHERE EXTRACT('EPOCH' FROM derived_tstamp) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
    OR last_event IS NULL

-- Map every event to its session
SELECT *,
       SUM(is_new_session) OVER (ORDER BY domain_userid, derived_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
       SUM(is_new_session) OVER (PARTITION BY domain_userid ORDER BY derived_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
  FROM (
       SELECT *,
              CASE WHEN EXTRACT('EPOCH' FROM derived_tstamp) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
                     OR last_event IS NULL
                   THEN 1 ELSE 0 END AS is_new_session
         FROM (
              SELECT *,
                     LAG(derived_tstamp,1) OVER (PARTITION BY domain_userid ORDER BY derived_tstamp) AS last_event
                FROM "atomic".events
              ) last
       ) final

It is taken from the excellent sessionization post on the Mode Analytics website.
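For contrast, the same rule — a session boundary wherever the gap since a user’s previous event is 10 minutes or more — reads much more directly in a general-purpose language. A minimal sketch, assuming events arrive as hypothetical (user, unix_timestamp) pairs:

```python
from itertools import groupby

# Hypothetical events: (domain_userid, unix_timestamp)
events = [
    ("a", 0), ("a", 300), ("a", 1000),  # 700s gap -> new session
    ("b", 0), ("b", 100),
]

GAP = 60 * 10  # 10 minutes, as in the SQL above

sessionized = []
for user, user_events in groupby(sorted(events), key=lambda e: e[0]):
    last_ts, session_id = None, 0
    for _, ts in user_events:
        # New session: no previous event, or a gap of 10 minutes or more
        if last_ts is None or ts - last_ts >= GAP:
            session_id += 1
        sessionized.append((user, ts, session_id))
        last_ts = ts
```

Here the business rule is a single `if` statement; in the SQL it is spread across a window function, a CASE expression and two nested subqueries.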

This is probably the simplest sessionization logic you could apply to an event data set. As you can see, it is difficult to see the business logic through the SQL. If we look at a more complicated approach to sessionization, the SQL only gets harder to understand. The following applies the Google Analytics approach to sessionization to your Snowplow data:

WITH step_1 AS (

  SELECT
    domain_userid,
    domain_sessionidx, -- to compare with the custom session index
    dvce_tstamp,
    refr_medium,
    LAG(dvce_tstamp, 1) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp) AS previous_dvce_tstamp,
    NVL(page_referrer, '') || NVL(mkt_medium, '') || NVL(mkt_source, '') || NVL(mkt_term, '') || NVL(mkt_content, '') || NVL(mkt_campaign, '') AS referrer

  FROM atomic.events
  WHERE collector_tstamp::date = 'YYYY-MM-DD' -- restrict the dataset
  ORDER BY domain_userid, dvce_tstamp

), step_2 AS (

  SELECT
    *, SUM(CASE WHEN refr_medium = 'internal' OR referrer IS NULL OR referrer = '' THEN 0 ELSE 1 END) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS new_referrer

  FROM step_1
  ORDER BY domain_userid, dvce_tstamp

), step_3 AS (

  SELECT
    *, FIRST_VALUE(referrer) OVER (PARTITION BY domain_userid, new_referrer ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS referrer_partition

  FROM step_2
  ORDER BY domain_userid, dvce_tstamp

), step_4 AS (

  SELECT
    *, LAG(referrer_partition, 1) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp) AS previous_referrer_partition

  FROM step_3
  ORDER BY domain_userid, dvce_tstamp

), step_5 AS (

  SELECT
    *, CASE
      WHEN ((EXTRACT(EPOCH FROM (dvce_tstamp - previous_dvce_tstamp)) < 60*30) AND (referrer_partition = previous_referrer_partition OR (referrer_partition IS NULL AND previous_referrer_partition IS NULL))) THEN 0
      ELSE 1
    END AS new_session

  FROM step_4
  ORDER BY domain_userid, dvce_tstamp

), step_6 AS (

  SELECT
    *, SUM(new_session) OVER (PARTITION BY domain_userid ORDER BY dvce_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS new_sessionidx

  FROM step_5
  ORDER BY domain_userid, dvce_tstamp

)

SELECT *
FROM step_6

What about unit testing?

Data models are a key part of any event data pipeline. So you should be able to unit test your data models prior to putting them live.

Unfortunately, it is not obvious how to do that with SQL.
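One pragmatic workaround is to run your model SQL against an in-memory database seeded with a tiny, hand-crafted event set whose correct output you know in advance. A minimal sketch using Python’s built-in sqlite3 (window functions require SQLite 3.25+), with the 10-minute rule simplified to integer timestamps and hypothetical fixture data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (domain_userid TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 0), ("a", 300), ("a", 1000), ("b", 0)],
)

# The sessionization logic under test: flag a new session after a 10+ minute gap
rows = conn.execute("""
    SELECT domain_userid, ts,
           CASE WHEN ts - LAG(ts) OVER (PARTITION BY domain_userid ORDER BY ts) >= 600
                  OR LAG(ts) OVER (PARTITION BY domain_userid ORDER BY ts) IS NULL
                THEN 1 ELSE 0 END AS is_new_session
    FROM events
    ORDER BY domain_userid, ts
""").fetchall()

# Unit test: assert the expected session boundaries on the fixture data
assert rows == [("a", 0, 1), ("a", 300, 0), ("a", 1000, 1), ("b", 0, 1)]
```

The fixture is small enough that the expected output can be written down by hand, which is exactly what makes it a useful regression test before the model goes live.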

So what should we do?

In spite of these difficulties, modeling event data is still possible with SQL and is worthwhile, because it adds so much value to the data. The following are some basic rules of thumb:

  1. Start simple and iterate
  2. Mirror your atomic tables in a second (e.g. test) schema, so you can test new data models on a subset of your data (copied over from the atomic schema)
  3. Keep only one query in each SQL Runner SQL file, so if a playbook fails, it is straightforward to identify what query failed, and rerun the playbook from that point
  4. Test each step in your playbook. Check simple things, like that the number of rows in the output is correct given the number of rows in the input tables (no fan-out), and that all the values in each column are acceptable (no negative numbers where they don’t make sense, no timestamps from the 18th century)
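The checks in point 4 can themselves be written as assertions that run after each step. A minimal sketch, over a hypothetical list of output rows:

```python
# Hypothetical modeled output: (session_id, domain_userid, session_start_ts)
output_rows = [
    (1, "a", 1457136000),
    (2, "a", 1457139600),
    (3, "b", 1457140000),
]
expected_row_count = 3  # known from the input tables

# No fan-out: the step produced exactly the expected number of rows
assert len(output_rows) == expected_row_count

# Values are sane: positive ids, timestamps in a plausible range
for session_id, user, ts in output_rows:
    assert session_id > 0
    assert 1104537600 < ts < 2524608000  # roughly between 2005 and 2050
```

Cheap checks like these catch the most common modeling bugs (join fan-out, unit mix-ups, bad timestamps) before anyone queries the modeled tables.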

Going forwards, we’re investing a lot of work in building a data modeling toolkit in Spark, so our users can avoid having to do this in SQL.