We’ve been using Snowplow self-describing events for a long time, and only recently noticed that some of them have no “real” counterparts in the events table (joining on root_id = event_id returns nothing).
The first occurrences date exactly to our upgrade to Snowplow R97 (at the same time we switched from the Clojure collector to the Scala collector with Kinesis streaming). The number of bad events is small compared to normal ones (below 1%), so we cannot tell from overall volumes whether anything is actually missing.
How can this be? Events come in as single lines and should therefore stay together until shredding and loading, so how can they go missing?
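For anyone who wants to reproduce the check, here is a minimal sketch of the orphan lookup as an anti-join. It runs against an in-memory SQLite database rather than Redshift, and the table name `custom_event` and the sample rows are made up for illustration; only the `root_id = event_id` join condition comes from the description above.

```python
import sqlite3

# In-memory stand-in for the warehouse; table contents are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT, collector_tstamp TEXT);
CREATE TABLE custom_event (root_id TEXT, root_tstamp TEXT);
INSERT INTO events VALUES
  ('e1', '2018-01-01 10:00:00'),
  ('e2', '2018-01-01 10:05:00');
INSERT INTO custom_event VALUES
  ('e1', '2018-01-01 10:00:00'),
  ('e9', '2018-01-01 10:07:00');  -- no parent row in events
""")

# Anti-join: keep self-describing rows whose root_id has no matching event_id.
ORPHAN_QUERY = """
SELECT c.root_id, c.root_tstamp
FROM custom_event AS c
LEFT JOIN events AS e ON e.event_id = c.root_id
WHERE e.event_id IS NULL
ORDER BY c.root_tstamp
"""

orphans = conn.execute(ORPHAN_QUERY).fetchall()
print(orphans)
```

Grouping the orphans by the date part of `root_tstamp` is one way to confirm when they started appearing.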
Our MAXERROR=1 so that shouldn’t be a problem, right?
As for the shredded and enriched data, how would you suggest inspecting it conveniently?
If I open some of the archived files manually, the chances of randomly spotting a stray ID (if it even exists) are very slim.
MAXERROR shouldn’t be an issue. As @dilyan has mentioned above, you can use Athena (or the S3 Select API) to query for a single row. Make sure you first filter on the etl_tstamp associated with the event_id, as that will significantly reduce the amount of data you need to scan.
As far as I know, the only way for an event_id to change after the fact is the synthetic deduplication process, in which an event has a new event_id generated (this applies where events share the same event_id but have differing event_fingerprints).
If this is the case a duplicate context should have been attached to the event. The event_id of this event should be the newly generated event_id and the originalEventId should contain the event_id before the random generation. More on this here.
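To make the mechanism concrete, here is a simplified sketch of synthetic deduplication as described above. It is not the actual Snowplow implementation: the function name `synthetic_dedupe`, the dict-based event shape, and the choice to re-ID every event in a colliding group are assumptions for illustration. The duplicate schema URI is written from memory and should be checked against Iglu Central.

```python
import uuid
from collections import defaultdict

# Assumed schema URI for the duplicate context; verify before relying on it.
DUPLICATE_SCHEMA = "iglu:com.snowplowanalytics.snowplow/duplicate/jsonschema/1-0-0"

def synthetic_dedupe(events):
    """Re-ID events that share an event_id but differ in event_fingerprint.

    `events` is a list of dicts with keys event_id, event_fingerprint and
    (optionally) contexts. This dict shape is hypothetical.
    """
    # Step 1: collapse natural duplicates (same id AND same fingerprint).
    seen, unique = set(), []
    for ev in events:
        key = (ev["event_id"], ev["event_fingerprint"])
        if key not in seen:
            seen.add(key)
            unique.append(ev)

    # Step 2: group what is left by event_id.
    by_id = defaultdict(list)
    for ev in unique:
        by_id[ev["event_id"]].append(ev)

    out = []
    for original_id, group in by_id.items():
        if len(group) == 1:
            out.extend(group)  # no collision, event is left untouched
            continue
        for ev in group:
            fixed = dict(ev)
            fixed["event_id"] = str(uuid.uuid4())  # newly generated event_id
            fixed["contexts"] = list(ev.get("contexts", [])) + [
                {"schema": DUPLICATE_SCHEMA,
                 "data": {"originalEventId": original_id}}
            ]
            out.append(fixed)
    return out

sample = [
    {"event_id": "x", "event_fingerprint": "fp1"},
    {"event_id": "x", "event_fingerprint": "fp2"},  # collides with the first
    {"event_id": "y", "event_fingerprint": "fp3"},
]
deduped = synthetic_dedupe(sample)
```

The point relevant to this thread: after this step, any shredded row derived from a re-IDed event has to carry the new event_id as its root_id, otherwise it no longer joins back to the events table.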
Thanks @mike - I looked into the duplicate context and here are the findings.
None of the orphan self-describing events match entries in the duplicate context ON custom_event.root_id = original_event_id.
Some of the orphan events match the duplicate context USING(root_id, root_tstamp).
I’d like to concentrate on the latter, and call those ‘orphan duplicate contexts’.
None of the orphan duplicate contexts match normal events ON original_event_id = event_id AND root_tstamp = collector_tstamp.
All of the orphan duplicate contexts match normal events ON original_event_id = event_id only. However, these are old (older than our orphan problem) and look like bots a lot of the time.
The conclusion is that at least some (over 40%, to be precise) of the orphan self-describing events have something to do with Snowplow’s synthetic deduplication. Unfortunately, their event counterparts (events whose event_id matches either root_id or original_event_id at exactly the same timestamp) are not found.
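The four join checks above can be sketched as follows, again on an in-memory SQLite database with invented rows. The table names `custom_event` and `duplicate` and the views `orphans` and `orphan_duplicates` are placeholders; the join conditions are the ones listed in the findings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT, collector_tstamp TEXT);
CREATE TABLE custom_event (root_id TEXT, root_tstamp TEXT);
CREATE TABLE duplicate (root_id TEXT, root_tstamp TEXT, original_event_id TEXT);

-- Invented sample data: one old event, two orphans, one duplicate context.
INSERT INTO events VALUES ('old1', '2017-06-01 00:00:00');
INSERT INTO custom_event VALUES
  ('new1', '2018-03-01 12:00:00'),
  ('lost', '2018-03-02 09:00:00');
INSERT INTO duplicate VALUES ('new1', '2018-03-01 12:00:00', 'old1');

-- Self-describing rows with no parent in events.
CREATE VIEW orphans AS
  SELECT c.root_id, c.root_tstamp
  FROM custom_event AS c
  LEFT JOIN events AS e ON e.event_id = c.root_id
  WHERE e.event_id IS NULL;

-- Duplicate contexts that share root_id and root_tstamp with an orphan.
CREATE VIEW orphan_duplicates AS
  SELECT d.root_id, d.root_tstamp, d.original_event_id
  FROM orphans AS o
  JOIN duplicate AS d
    ON d.root_id = o.root_id AND d.root_tstamp = o.root_tstamp;
""")

CHECKS = {
    "orphans matching duplicate on original_event_id":
        "SELECT COUNT(*) FROM orphans o "
        "JOIN duplicate d ON o.root_id = d.original_event_id",
    "orphan duplicate contexts":
        "SELECT COUNT(*) FROM orphan_duplicates",
    "orphan duplicates matching events on id and time":
        "SELECT COUNT(*) FROM orphan_duplicates d JOIN events e "
        "ON d.original_event_id = e.event_id "
        "AND d.root_tstamp = e.collector_tstamp",
    "orphan duplicates matching events on id only":
        "SELECT COUNT(*) FROM orphan_duplicates d JOIN events e "
        "ON d.original_event_id = e.event_id",
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
for name, count in results.items():
    print(f"{name}: {count}")
```

The sample rows are arranged to mirror the pattern reported above: the duplicate context points at an old event by ID, but nothing in events carries the new root_id or the matching timestamp.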
P.S. Also attaching a hand-drawn diagram to illustrate the described situation.
Hey @pranas! We’re sorry it took so long, but we have at last fixed this issue in RDB Loader R31 (see “Snowplow RDB Loader R31 released”). The problem was in synthetic deduplication; you can find more information in the corresponding blog post.