Hi guys,
We’re beginning to dabble in custom contexts more and we’re running into troubles when joining events to their shredded contexts. I’ve seen a few posts on how duplicate event_ids are caused and how to deal with them in Redshift/Enrichment but I haven’t seen any discussions about how to deal with them during analysis.
We’re reasonably comfortable that when joining shredded unstructured events to atomic.events, we’ll join accurately on event_id = root_id. Mostly because we can constrain unstructured events more tightly in a query:
SELECT event_name, count(*) AS rows, count(DISTINCT event_id) AS event_ids, count(DISTINCT event_fingerprint) AS event_fingerprints
FROM atomic.events e
INNER JOIN com_acme_custom_event_1 s
ON event_id = root_id
WHERE collector_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1 AND root_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1
AND event_name = 'custom_event'
GROUP BY 1
ORDER BY 2 DESC;
Which returns exactly what you would expect:
event_name rows event_ids event_fingerprints
custom_event 2663663872 5320837 5367762
Yes, we get some wonky joins (see that row number), but when further constrained (excluding bots, tracking a certain app_id, filtering by hostname etc) nothing that impacts analysis too badly.
However when we take away the “event_name” constraint (e.g. as the case may be with custom contexts), Redshift joins the custom_event table to all the rows matching the event_id… including some that are completely different events:
SELECT event_name, count(*) AS rows, count(DISTINCT event_id) AS event_ids, count(DISTINCT event_fingerprint) AS event_fingerprints
FROM atomic.events e
INNER JOIN com_acme_custom_event_1 s
ON event_id = root_id
WHERE collector_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1 AND root_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1
GROUP BY 1
ORDER BY 2 DESC;
Out of the above query I’ll normally get something like this:
event_name rows event_ids event_fingerprints
custom_event 2663663872 5320837 5367762
other_event 21337975 21 14058
page_view 5544808 25 6365
page_error 52831 5 72
another_event 21855 2 64
last_event 1 1 1
Custom event rows jump a lot, and it starts to match “other_event”, “page_view” and “page_error” events. Obviously if I were to do a simple event_id = root_id join for custom contexts, they may be subject to joining across many events for which the context wasn’t attached.
How can we constrain contexts so they join back to the precise event they occurred on?
Should we be joining contexts on event_id AND collector_tstamp?
Would event fingerprints be a better key to join the tables in in future?
Cheers,
Rob