Question
Is there any way to hook into (set a callback function for) session_id initialization?
Background
We want to send a field describing the client’s status from clients to our collector endpoint only when a session is initialized. The field is large (about 2KB), so this would minimize the amount of data transferred.
It is possible to send an event whenever the field is updated (for example, a ‘the_field_updated’ event), but we think that having the field on the first event of a session would let the enricher add (enrich) the field to all events within the session.
In this case, we are planning to create a custom enricher. The enricher will store the session_id and the field in a key-value store when it first receives the field for a session_id. It can then enrich subsequent events by looking up the field’s value in the key-value store, using session_id as the key.
So the session ID will be established on the creation of the first event in the session. I don’t see a straightforward way to instrument a callback for that, but it’s probably within the art of the possible.
However, there’s another pitfall here, which is worth considering. The order in which the events are processed is not guaranteed, and is in fact quite likely to be out of order.
The events may not be sent from the client to the collector in order (when there’s a spotty connection, for example). After they reach the collector, they get processed in several multi-threaded asynchronous systems: Kinesis streams with multiple shards, and collector and enrich apps with multiple instances. Even within an instance, each component is multi-threaded and there are no guarantees around event sequencing.
For these reasons, I don’t believe that the enricher is the best place for this logic. The only ways I see to ensure that the data is joined up when events are out of order are either to send the field with every event, or to somehow wait for the value when it’s not there, which would have detrimental effects on the pipeline.
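To make the failure mode concrete, here is a minimal Python sketch of the lookup-based enricher described in the question. It is not Snowplow code: the dict standing in for the key-value store, the `enrich` function, and the `client_status` field name are all hypothetical.

```python
from typing import Optional

# Hypothetical stand-in for the external key-value store.
status_by_session: dict[str, str] = {}


def enrich(event: dict) -> dict:
    """Attach client_status to an event, remembering it per session_id."""
    session_id = event["session_id"]
    if "client_status" in event:
        # This is the event carrying the large field: store it for later lookups.
        status_by_session[session_id] = event["client_status"]
    enriched = dict(event)
    # Events processed before the carrier event find nothing to look up.
    enriched["client_status"] = status_by_session.get(session_id)
    return enriched


def run(events: list[dict]) -> list[Optional[str]]:
    """Process events in arrival order and return the status attached to each."""
    status_by_session.clear()
    return [enrich(e)["client_status"] for e in events]


first = {"session_id": "s1", "name": "session_start", "client_status": "big-2KB-blob"}
second = {"session_id": "s1", "name": "page_view"}
third = {"session_id": "s1", "name": "click"}

in_order = run([first, second, third])
out_of_order = run([second, first, third])
print(in_order)      # every event gets the status
print(out_of_order)  # the event processed before the carrier gets None
```

When the carrier event is processed second, the event ahead of it is emitted without the field, and the enricher has no way to go back and fix it.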
Generally when I’ve needed logic where events depend on other events, I have found that post-enrichment is the solution.
A simple example of this is to send the client status data as its own custom event, and join it to the other events in the session using SQL once it’s in the warehouse. If it needs to be real-time, then it’s more complicated, but an app that consumes the enriched stream is at liberty to implement this more complex logic without affecting the other services in the pipeline.
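As a rough illustration of the warehouse join, here is a sketch that runs the SQL against an in-memory SQLite database from Python. The table and column names (`events`, `client_status`, `session_id`, `status`) are simplified assumptions for the example, not the actual Snowplow warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT, session_id TEXT, event_name TEXT);
CREATE TABLE client_status (session_id TEXT, status TEXT);
INSERT INTO events VALUES
  ('e1', 's1', 'page_view'),
  ('e2', 's1', 'click'),
  ('e3', 's2', 'page_view');
-- Only session s1 has sent its status event so far.
INSERT INTO client_status VALUES ('s1', 'status-for-s1');
""")

# The join does not care in which order the rows were loaded.
rows = conn.execute("""
SELECT e.event_id, e.session_id, e.event_name, c.status
FROM events AS e
LEFT JOIN client_status AS c ON c.session_id = e.session_id
ORDER BY e.event_id
""").fetchall()

for row in rows:
    print(row)
```

The LEFT JOIN keeps events whose session has no status row yet (here `s2`), leaving the status as NULL until the status event lands.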
@Colm Hello, thank you very much for the quick and detailed reply! The comment about the ordering guarantee is particularly helpful.
May I ask some questions about the ordering guarantee?
Question1: I see an option named “useIpAddressAsPartitionKey” for collector+kinesis, which seems to guarantee ordering for events with the same IP. If we set this option to true, we can ensure the order of events (from the same IP address). Is my understanding correct? And is it reasonable to assume that a client keeps the same IP address throughout a session in the real world?
Question2: Does the Snowplow Collector provide a similar option for Kafka?
Neither is my area, so I’m not entirely certain of the specifics, however:
Question1: I see an option named “useIpAddressAsPartitionKey” for collector+kinesis, which seems to guarantee ordering for events with the same IP. If we set this option to true, we can ensure the order of events (from the same IP address). Is my understanding correct? And is it reasonable to assume that a client keeps the same IP address throughout a session in the real world?
I don’t believe this provides anything like a guarantee of the order of events. The partition key just determines which shard an event is allocated to, so all this would do is ensure that this allocation is consistent for events with the same partition key. (Note, however, that how data is allocated to shards is down to how AWS internally provides this allocation, and since shard counts can change I don’t know what the behaviour is in terms of any guarantees there.)
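To illustrate the point about allocation, here is a small Python sketch of how a partition key can map to a shard. Kinesis documents that it hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The `shard_for` function and the assumption that the hash space is split into equal contiguous ranges are simplifications of mine, not how any particular stream is laid out.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash key range contains the key.

    Assumes the hash space is split into equal contiguous ranges; a real
    stream that has been resharded may have uneven ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    return hashed * shard_count // HASH_SPACE


ip = "203.0.113.7"  # documentation-range IP used as a hypothetical partition key
print(shard_for(ip, 4), shard_for(ip, 4))  # same key, same shard count: same shard
print(shard_for(ip, 4), shard_for(ip, 5))  # a different shard count may move the key
```

The mapping is deterministic for a fixed set of shard ranges, which is all the partition key gives you. It carries no information about when each event was sent or in what order it arrived.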
None of that says anything about the order of events.
Question2: Does the Snowplow Collector provide a similar option for Kafka?
I’m not sure of this.
However, on both counts, even if assurances of ordering from the collector to the raw stream did exist, you still wouldn’t have any assurance that the data would be in order, in the way that your solution presupposes. There is no way to assure that the data arrives at the collector in order. So before you even reach those settings, your data will be out of order.
It is quite common for a client connection to be unavailable for the first few events in a session, which then get forwarded to the collector later, when the connection is restored.
The best advice I can give on this topic is to reconsider your proposed architecture - the enricher is not suited to cross-event operations, but I haven’t yet come across a requirement that requires this which isn’t possible using some post-enrich process.
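For the real-time case, a post-enrich process could look roughly like the following Python sketch: a consumer of the enriched stream that holds back events until the status for their session has been seen. The `SessionJoiner` class and the `client_status` field are hypothetical, and the in-memory dicts stand in for whatever state store a real consumer would use.

```python
from collections import defaultdict


class SessionJoiner:
    """Joins client status onto events of the same session, in any arrival order."""

    def __init__(self) -> None:
        self.status: dict[str, str] = {}
        self.pending: dict[str, list[dict]] = defaultdict(list)

    def consume(self, event: dict) -> list[dict]:
        """Take one enriched event; return the events that are now ready to emit."""
        sid = event["session_id"]
        if "client_status" in event:
            self.status[sid] = event["client_status"]
            # Release everything that was waiting for this session's status.
            ready = self.pending.pop(sid, []) + [event]
        elif sid in self.status:
            ready = [event]
        else:
            # Status not seen yet: hold the event instead of emitting it bare.
            self.pending[sid].append(event)
            return []
        return [{**e, "client_status": self.status[sid]} for e in ready]


joiner = SessionJoiner()
print(joiner.consume({"session_id": "s1", "name": "page_view"}))
print(joiner.consume({"session_id": "s1", "name": "status", "client_status": "blob"}))
print(joiner.consume({"session_id": "s1", "name": "click"}))
```

Because the waiting happens in a separate consumer, it does not stall the enricher. A real version would also need a timeout or eviction policy for sessions whose status event never arrives.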
Sorry I haven’t replied for a while, and thank you again for the detailed and quick response.
I agree that it does not seem possible to guarantee the order of events, and that we should avoid implementing enrichments that rely on the order of events in our data pipeline.
I don’t believe this provides anything like a guarantee of the order of events. The partition key just determines which shard an event is allocated to, so all this would do is ensure that this allocation is consistent for events with the same partition key.
But even if a shard guarantees the order of events, I think scaling out the collector to two or more instances will break the guarantee, as we cannot ensure the order of writes between two or more collectors. Also, as you pointed out, operations like changing the number of shards seem to break the guarantee.